GPT-4 Turbo with Vision and beyond – a closer look at this LMM
GPT-4 Turbo with Vision (GPT-4V), released by OpenAI in late 2023, is a newer version of the model that supports a 128,000-token context window (roughly 300 pages of text as an input prompt), is cheaper to use, has updated knowledge, adds image capabilities and text-to-speech offerings, and comes with a copyright shield. It can also accept images as inputs and generate captions and descriptions for them, all while providing detailed analyses of their contents.
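To make this concrete, here is a minimal sketch of how an image can be passed to a vision-capable model for captioning, assuming the OpenAI Python SDK (openai 1.x) and an API key in the environment; the model name and image URL are illustrative placeholders rather than prescribed values:

```python
# Minimal sketch: asking a vision-capable GPT-4 model to describe an image.
# Assumes the OpenAI Python SDK (openai>=1.0) and that OPENAI_API_KEY is set;
# the model name and image URL below are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-turbo",  # any vision-capable deployment works here
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this image and generate a short caption."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/sample-photo.jpg"}},
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)
```

The same message structure (a list mixing text and image parts) is what enables the multimodal prompting patterns discussed throughout this section.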
GPT-4V is an improvement over GPT-4 in terms of its broader general knowledge and advanced reasoning capabilities. The following figures from the research paper The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) (https://export.arxiv.org/pdf/2309.17421) demonstrate the remarkable reasoning capabilities of GPT-4V with different prompting techniques:
Figure 10.2 – Demonstration of GPT-4V following text instructions
Figure 10.3 – Demonstration of GPT-4V with visual referring prompting
GPT-4V also possesses multilingual multimodal understanding: it can read text in different languages within images and answer your questions in English or another language of your choice, as shown here:
Figure 10.4 – GPT-4V’s capabilities regarding multilingual scene text recognition
Figure 10.5 – GPT-4V’s capabilities regarding multimodal multicultural understanding
Video prompts for video understanding
A novel feature not present in earlier GPT models is the capability to comprehend videos. With video prompting, you can prompt the model not only with text but also with video. GPT-4V can analyze brief video clips and produce comprehensive descriptions. Although GPT-4V doesn't process video inputs directly, the Azure OpenAI Chat playground, enhanced with GPT-4V and Azure Vision services, allows for interactive questioning of video content. This system works by identifying key frames from the video that are relevant to your query and then examining those frames in detail to generate a response, bridging the gap between video content and AI-driven insights. For example, you can upload a short video of a boy playing football to the Azure OpenAI Chat playground and ask, "Give me a summary of the video and what sport is being played in the video."
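The following sketch illustrates the key-frame idea described above rather than the playground's internal implementation: it samples a few frames from a short clip with OpenCV and sends them, along with the question, to a vision-capable model. The sampling interval, model name, and file path are assumptions made for the example:

```python
# Illustrative sketch of the key-frame approach: sample frames from a clip and
# send them to a vision-capable model together with the question. This
# approximates, not reproduces, what the Azure OpenAI Chat playground does.
import base64
import cv2
from openai import OpenAI

def sample_frames(video_path: str, every_n_seconds: int = 2, max_frames: int = 8):
    """Grab up to max_frames JPEG-encoded frames, one every few seconds."""
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 30
    step = int(fps * every_n_seconds)
    frames, index = [], 0
    while len(frames) < max_frames:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:
            encoded_ok, jpeg = cv2.imencode(".jpg", frame)
            if encoded_ok:
                frames.append(base64.b64encode(jpeg.tobytes()).decode("utf-8"))
        index += 1
    capture.release()
    return frames

client = OpenAI()
frames = sample_frames("boy_playing_football.mp4")  # placeholder file name

content = [{"type": "text",
            "text": "Give me a summary of the video and what sport is being played."}]
for encoded in frames:
    content.append({
        "type": "image_url",
        "image_url": {"url": f"data:image/jpeg;base64,{encoded}"},
    })

response = client.chat.completions.create(
    model="gpt-4-turbo",  # vision-capable model; the name is a placeholder
    messages=[{"role": "user", "content": content}],
    max_tokens=400,
)
print(response.choices[0].message.content)
```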
GPT-4V examines these frames seamlessly thanks to capabilities such as temporal ordering, temporal anticipation, and temporal localization and reasoning. Let's dig into these concepts in a bit more detail.
Temporal ordering means being able to put things in the right order based on time, and this skill is important for GPT-4V. It's as if you mixed up a bunch of photos from an event, say making sushi, and then asked the AI to put them back in the right order. GPT-4V can look at these shuffled pictures and figure out the correct sequence, showing how the sushi was made step by step. There are two types of temporal ordering: long-term and short-term. Long-term ordering is like the sushi example, where the AI organizes a series of events over a longer period. Short-term ordering is more about quick actions, such as opening or closing a door; GPT-4V can understand these actions and put them in the right order too. These tests are a way to check whether GPT-4V understands how things happen over time, both for long processes and quick actions. It's like testing whether the AI can make sense of a story or an event just by looking at pictures, even if they're all mixed up at first (a small prompting sketch follows the figures below):
Figure 10.6 – Long-term temporal ordering: GPT-4V is shown a series of mixed-up images depicting the process of making sushi. Despite the images being out of order, GPT-4V recognizes the event and arranges the images in the correct chronological sequence (https://arxiv.org/abs/2309.17421)
Figure 10.7 – Short-term temporal ordering: when presented with a specific action, such as opening or closing a door, GPT-4V demonstrates its ability to understand the content of the images and accurately arrange them in the sequence that matches the given action
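As referenced above, here is a small sketch of how such a temporal-ordering test could be prompted: the model is shown a shuffled set of step-by-step photos (making sushi, as in Figure 10.6) and asked to recover the true sequence. The image file names and model name are illustrative assumptions:

```python
# Sketch of a temporal-ordering prompt: present shuffled step photos and ask
# the model to recover the chronological sequence. File names are placeholders.
import base64
from pathlib import Path
from openai import OpenAI

client = OpenAI()

shuffled_steps = ["step_roll.jpg", "step_rice.jpg", "step_slice.jpg", "step_nori.jpg"]

content = [{
    "type": "text",
    "text": ("These photos of making sushi are out of order. "
             "Identify what each image shows and list the correct "
             "chronological order by image number."),
}]
for number, name in enumerate(shuffled_steps, start=1):
    encoded = base64.b64encode(Path(name).read_bytes()).decode("utf-8")
    content.append({"type": "text", "text": f"Image {number}:"})
    content.append({
        "type": "image_url",
        "image_url": {"url": f"data:image/jpeg;base64,{encoded}"},
    })

response = client.chat.completions.create(
    model="gpt-4-turbo",  # vision-capable model; placeholder name
    messages=[{"role": "user", "content": content}],
    max_tokens=300,
)
print(response.choices[0].message.content)
```

The same pattern works for the short-term case in Figure 10.7; only the images and the wording of the question change.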