Friday, September 29, 2023

Multimodal GPT-4 is on the way: it will work with text, images, video and music

An even more advanced artificial intelligence system from the workshop of Microsoft and OpenAI could be presented to the public as early as next week. In addition to natural language, it will also work with images and video.



ChatGPT is still the most sought-after artificial intelligence system and application, attracting more and more users, but something new is brewing in the background. OpenAI and Microsoft have continued developing the GPT-3 language model and its successor GPT-3.5, the current version, and as early as next week the public could get a first look at the next iteration, GPT-4.
The news came, albeit unofficially, at the German conference “AI in Focus – Digital Kickoff”, where Andreas Braun, technical director of Microsoft Germany, mentioned it in passing.

Versatile AI
According to him, GPT-4 will not only be an upgrade of the language model but will also gain multimodality, a capability that Microsoft recently demonstrated in its own Kosmos-1 system. The new AI model will accept input from images and video as well as text, and will be able to combine them and understand the context, just as it now “understands” instructions given only in natural language, in almost all of the world's languages.

The system could also work in the opposite direction: instead of only taking multimedia content as input, it will probably be able to produce images, video, and even music from linguistic “prompts” alone. Such capabilities would mean a publicly available AI system that can solve visual intelligence tests designed for people, “read” any multimedia content and use the information obtained in further processing, autonomously narrate a video, discuss it, and the like.

500 times more powerful?
According to unofficial information, GPT-4 will be based on 500 times more parameters than the ChatGPT model, which would put the count in the tens of trillions. That something of this kind is “cooking” is also suggested by a paper published this week describing “Visual ChatGPT”, a combination of an advanced chatbot and visual generative models.


Given the presentation of Kosmos-1 and the already known capabilities of the DALL-E 2 system, it would not be surprising if these technologies merged into one, giving us, under OpenAI, a single comprehensive multimodal system of generative artificial intelligence.





