November 20, 2023
The release of ChatGPT in late 2022 catalyzed a wave of public fascination with AI and made OpenAI a household name seemingly overnight. Now, OpenAI seeks to evolve their model’s capabilities beyond text with the rollout of new multimodal features.
In a September 25, 2023 update, OpenAI added voice and image understanding to its GPT-4 model, giving users new modalities for chatbot interaction. After nearly a year of purely textual conversation, ChatGPT can now process photographs, diagrams, and spoken questions.
Many users may be surprised to learn that ChatGPT supported more than simple text input even before the recent September update, and that not all queries to ChatGPT have the same context length. For convenience, the different model types and their maximum token counts are listed below:
| Name | Maximum Tokens | Description |
| --- | --- | --- |
| Default (GPT-3.5) | 8191 | “Our fastest model, great for most everyday tasks.” |
| GPT-4 | 4095 | “Our most capable model, great for tasks that require creativity and advanced reasoning.” |
| Web Browsing | 8191 | “An experimental model that knows when and how to browse the internet.” |
| Advanced Data Analysis | 8192 | “An experimental model that can solve tasks by generating Python code and executing it in a Jupyter notebook. You can upload any kind of file, and ask model to analyze it, or produce a new file which you can download.” |
| Plugins | 8192 | “An experimental model that knows when and how to use plugins.” |
| DALL·E 3 | 8192 | “Try out GPT-4 with DALL·E 3” |
Some users have reported an increased maximum token count of 8192 for the GPT-4 model, but this is likely the result of silent A/B testing by OpenAI. OpenAI also offers 32k-context models through the Chat Completions API that are unavailable in the ChatGPT site’s UI.
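For readers who need the larger context window, the API is currently the only route. Below is a minimal sketch using the openai Python package (v1 interface); the gpt-4-32k model name assumes your account has been granted access to it, and the prompt is illustrative:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The 32k-context variant is only reachable through the API, not the ChatGPT UI.
response = client.chat.completions.create(
    model="gpt-4-32k",  # assumes your account has access to this model
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the following meeting notes: ..."},
    ],
    max_tokens=512,
)

print(response.choices[0].message.content)
print(response.usage)  # prompt, completion, and total token counts
```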
The only model that currently supports speech and image input is the default GPT-4. GPT-3.5 has not yet received a multimodality update, even though OpenAI’s release blog post states that “image understanding is powered by multimodal GPT-3.5 and GPT-4,” so users can likely expect the smaller GPT-3.5 model to support image processing in the future.
The new image capabilities of ChatGPT mark a significant leap forward in AI interaction. Users can now share images with the model to empower a more visual and intuitive form of communication – whether it’s a photograph of a historical monument, a screenshot of a technical issue, or a scanned document, ChatGPT can see what you see.
The image input supports the image/jpeg, image/png, image/webp, and image/gif MIME types. The option to upload a GIF was interesting, so I tested it against “unexpected” GIFs whose endings cannot be guessed from the first frame. In my testing, the model simply guessed at an ending, which leads me to conclude that it only processes the first frame of the GIF. Disappointing, but understandable: processing a GIF in full is invariably more complex than processing a single image.
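One quick way to sanity-check the first-frame hypothesis is to extract that frame yourself and submit it as a still image; if the model’s answers match, the behavior is consistent with single-frame processing. A minimal sketch using the Pillow library (the filename is illustrative):

```python
from PIL import Image

# Open the animated GIF and save only its first frame, which appears to be
# all the model actually processes.
with Image.open("unexpected.gif") as gif:
    print(f"GIF contains {gif.n_frames} frames")
    gif.seek(0)                        # jump to frame 0
    gif.convert("RGB").save("first_frame.png")
```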
To use this feature, users can tap the photo button in the ChatGPT interface (or paste an image into the chat field from the clipboard) and engage in a conversation about the content. For more specific queries, users can optionally use a drawing tool on the mobile app to draw attention to portions of the image.
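Image input is also available programmatically through the API’s separate vision-capable model, which is handy for repeatable experiments. The sketch below is an illustration distinct from the ChatGPT interface described above, assuming access to the gpt-4-vision-preview model (the filename and prompt are examples): it base64-encodes a local file and sends it alongside a text question.

```python
import base64

from openai import OpenAI

client = OpenAI()

# Encode a local screenshot as a data URL so it can travel in the request body.
with open("screenshot.png", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # vision-capable API model (assumed access)
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What error is shown in this screenshot?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64_image}"}},
        ],
    }],
    max_tokens=300,
)

print(response.choices[0].message.content)
```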
Voice interaction with ChatGPT has taken a transformative step, not by altering the core model’s functionality, but by seamlessly integrating advanced pre- and post-processing layers. At its heart, the voice feature is not truly multimodal but a sophisticated layer wrapped around the base model: spoken questions are transcribed to text by OpenAI’s Whisper speech-recognition system before they ever reach GPT-4, and the model’s text reply is converted back into audio on the way out.
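The same wrap-a-text-model pattern can be reproduced with OpenAI’s public API endpoints. The sketch below uses the whisper-1, gpt-4, and tts-1 models and illustrates the idea only; it is not necessarily what the ChatGPT app does internally, and the filenames are examples.

```python
from openai import OpenAI

client = OpenAI()

# 1) Pre-processing: transcribe the spoken question to text.
with open("question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2) The text-only core model answers as usual.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = reply.choices[0].message.content

# 3) Post-processing: synthesize the text reply back into speech.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",  # one of the preset voices
    input=answer,
)
speech.stream_to_file("answer.mp3")
```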
The true marvel is the text-to-speech model, which generates astonishingly realistic, human-like audio responses. OpenAI developed a voice technology that can craft “realistic synthetic voices from just a few seconds of real speech.” This technology is not expected to be released to the public, as OpenAI noted the “potential for malicious actors to impersonate public figures or commit fraud,” but you may see it come to some of your favorite services. Spotify, for example, is using the technology to pilot a feature that translates podcasts into multiple languages in the podcasters’ own voices.
To engage in a voice conversation, users can navigate to the settings on the mobile app, opt into voice conversations, and tap the headphone button that appears. The user is then presented with a selection of five voices to choose from, each created with professional voice actors.
Some users have observed GPT-4 reporting an updated knowledge cutoff of January 2022 (up from September 2021 at launch). Although OpenAI has made no official comment on the change, specific and clever prompts can give an idea of its legitimacy.
Q: What happened with Bob Saget in January 2022? (Short answer)
A: In January 2022, Bob Saget, the American comedian and actor best known for his role as Danny Tanner on "Full House," was found dead in his hotel room in Orlando, Florida. He was 65 years old. The cause of death was later determined to be a result of head trauma.
Q: What happened with Novak Djokovic in January 2022? (Short answer)
A: In January 2022, Novak Djokovic, the Serbian tennis player, faced controversy when he traveled to Australia for the Australian Open. He was initially granted a medical exemption to enter the country without being vaccinated against COVID-19. However, upon arrival, his visa was canceled due to discrepancies in his application. After a legal battle and public debate, his visa was canceled again, and he was deported, preventing him from competing in the tournament.
These results are compelling but not conclusive; only a direct statement from OpenAI about the knowledge cutoff would settle the question. Users shouldn’t hold their breath for an announcement, however, as OpenAI has become increasingly tight-lipped.
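Anyone who wants to run similar probes can script them against the API; a minimal sketch (the questions and model name are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# Events that occurred after the original September 2021 cutoff.
probes = [
    "What happened with Bob Saget in January 2022? (Short answer)",
    "What happened with Novak Djokovic in January 2022? (Short answer)",
]

for question in probes:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
    )
    print(question)
    print(response.choices[0].message.content, "\n")
```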
ChatGPT’s expansion into new modalities like image and voice represents an advancement for human-AI interaction. While speech still relies on the underlying text foundation, these intuitive interfaces enable users to query AI systems in more natural ways. The addition of image and audio inputs provides exciting new opportunities to enrich ChatGPT’s knowledge with real-world data; however, a fully realized multimodal AI will require prudent, ethical development across the industry.
While this article focused on the excitement surrounding new GPT-4 features, the release has further underscored OpenAI’s lack of commitment to transparency around their models’ capabilities and potential for harm. Without vigorous public scrutiny, we risk sleepwalking into an AI future shaped more by corporate priorities and consolidation than by the common good.