GPT-4 Turbo Vision and beyond – a closer look at this LMM 2 – The Future of Generative AI – Trends and Emerging Use Cases

Temporal anticipation is where GPT-4V predicts future events from the beginning frames of an action, for both short-term and long-term horizons. For example, given a soccer penalty kick, GPT-4V can anticipate the next moves of the kicker and goalkeeper because it understands the rules of the game. Similarly, in sushi making, it predicts the next steps in the process by recognizing the current stage and the overall procedure. This ability lets GPT-4V understand and predict actions that unfold over different lengths of time:
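In practice, this kind of temporal-anticipation query is sent as one user message that interleaves a text prompt with several base64-encoded frames. The sketch below only assembles the request payload; the model name (`gpt-4-vision-preview`) and message shape follow OpenAI's vision guide at the time of writing, so treat them as illustrative rather than authoritative:

```python
import base64


def build_anticipation_request(frame_paths, question):
    """Assemble a Chat Completions payload that pairs a
    temporal-anticipation question with a sequence of video frames.

    Note: the model name and the image_url message shape follow
    OpenAI's vision guide as of early 2024 and may change.
    """
    content = [{"type": "text", "text": question}]
    for path in frame_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode("utf-8")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    return {
        "model": "gpt-4-vision-preview",
        "messages": [{"role": "user", "content": content}],
        "max_tokens": 300,
    }
```

The returned dictionary can then be passed to the Chat Completions endpoint (for example, via the official `openai` client); ordering the frames chronologically matters, since the model's prediction is conditioned on the sequence it sees.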

Figure 10.8 – Long-term temporal anticipation: GPT-4V can predict the next moves based on the initial frames

Temporal localization and reasoning refer to GPT-4V’s skill in pinpointing specific moments in time and making logical connections. An example is its ability to identify the exact moment a soccer player hits the ball. Moreover, GPT-4V can understand cause-and-effect relationships, such as figuring out whether a goalkeeper will successfully stop the ball. This involves not just seeing where the goalkeeper and ball are, but also understanding how they interact and predicting what will happen next. This demonstrates a high level of complex reasoning in the model:

Figure 10.9 – Temporal localization and reasoning: GPT-4V exhibits its skill in temporal localization by precisely pinpointing the moment the player hits the ball. Additionally, it showcases its understanding of cause and effect by assessing if the ball was stopped and analyzing the interaction between the goalkeeper and the ball

GPT-4V limitations (as of Jan 2024)

Although GPT-4V is far more capable than its predecessors, we must be aware of its limitations when leveraging it in applications. These limitations are documented on the OpenAI website (https://platform.openai.com/docs/guides/vision):

  • Medical diagnostics: It’s not equipped to interpret specialized medical imagery such as CT scans and is not a source for medical guidance
  • Non-Latin scripts: Performance may falter with image texts in non-Latin scripts such as Japanese or Korean
  • Text size: Enlarging the text within an image can improve readability, but cropping should not exclude important parts of the image
  • Orientation: Misinterpretation is possible with rotated or upside-down text and images
  • Complex visuals: The model might struggle with graphs or texts where there are variations in color or line styles (solid, dashed, dotted, and so on)
  • Spatial analysis: The model has limitations in tasks that require precise spatial understanding, such as identifying chessboard positions
  • Accuracy: In certain contexts, it might generate incorrect image descriptions or captions
  • Unusual image formats: Challenges arise with panoramic and fisheye photographs
  • Metadata and image resizing: Original filenames and metadata are not processed, and images undergo resizing, which alters their original dimensions
  • Object counting: The model may only provide approximate counts of items in an image
  • CAPTCHAs: Due to safety measures, CAPTCHA submissions are blocked
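Several of these limitations (image resizing, small text, unusual formats) argue for validating images client-side before sending them. The following is a minimal sketch; the 20 MB per-image cap, the supported extensions, and the `detail` parameter reflect OpenAI's vision guide as of early 2024 and should be re-checked against the current documentation:

```python
import base64
import os

# Per-image upload cap and supported formats per OpenAI's vision
# guide (as of early 2024) -- verify against current docs.
MAX_BYTES = 20 * 1024 * 1024
SUPPORTED = {".png", ".jpg", ".jpeg", ".webp", ".gif"}


def prepare_image_part(path, detail="high"):
    """Pre-flight checks motivated by the limitations above, then
    base64-encode the image into a Chat Completions content part.

    'detail' ("low" or "high") trades resolution for token cost.
    """
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED:
        raise ValueError(f"unsupported image format: {ext}")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("image exceeds the 20 MB per-image limit")
    mime = "image/jpeg" if ext in {".jpg", ".jpeg"} else f"image/{ext[1:]}"
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime};base64,{b64}", "detail": detail},
    }
```

Catching oversized or unsupported files before the API call avoids wasted round trips, and choosing `detail="low"` for simple images keeps token usage down where full resolution would not help anyway.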

Moving past GPT-4V’s limitations, we expect future models, such as GPT-5, to offer better features for interaction and smarter reasoning, leading to more creative and useful applications. Anticipated improvements include a deeper understanding of language and context, advanced multimodal capabilities for interacting with various types of content, and enhanced reasoning for complex problem-solving. Furthermore, GPT-5 is likely to offer more precise customization options, demonstrate a significant reduction in biases for more ethical responses, and possess an expanded knowledge base that remains current with the latest information, ensuring more accurate and relevant outputs across a wide array of applications.
