AI technology is shifting from text-only models, such as ChatGPT, toward multimodal models that can process many kinds of data, including images, audio, and even sensory readings from robots. This shift, showcased at Google's annual developer conference, pushes the technology beyond written language toward a more comprehensive understanding of the world.
Multimodal models are thought to offer a more human-like form of intelligence, approximating the way a child learns by observing the world. They could also help companies build AI that performs a wider range of tasks and can therefore be embedded in more products. Such models can process text, images, audio, infrared readings, and information about motion and position.
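To make the idea of mixed-modality input concrete, here is a minimal, purely illustrative sketch in Python. Everything in it is hypothetical: real multimodal systems use large learned neural encoders, not the toy folding function below, and none of these names correspond to any actual model's API. The sketch shows only the general pattern such systems share, mapping each modality into one shared embedding space and fusing the results.

```python
# Hypothetical sketch of a multimodal input pipeline: each modality is
# mapped into a shared embedding space, then the embeddings are fused.
# The names and the toy "encoder" are illustrative, not a real model's API.

from dataclasses import dataclass
from typing import List, Optional

EMBED_DIM = 8  # toy embedding size, for illustration only


@dataclass
class MultimodalInput:
    text: Optional[str] = None
    image_pixels: Optional[List[float]] = None   # flattened pixel values
    audio_samples: Optional[List[float]] = None  # raw waveform samples


def _toy_encoder(values: List[float]) -> List[float]:
    """Stand-in for a learned encoder: folds any signal into EMBED_DIM numbers."""
    embedding = [0.0] * EMBED_DIM
    for i, v in enumerate(values):
        embedding[i % EMBED_DIM] += v
    return embedding


def encode(inp: MultimodalInput) -> List[float]:
    """Fuse whichever modalities are present by averaging their embeddings."""
    parts = []
    if inp.text is not None:
        parts.append(_toy_encoder([float(ord(c)) for c in inp.text]))
    if inp.image_pixels is not None:
        parts.append(_toy_encoder(inp.image_pixels))
    if inp.audio_samples is not None:
        parts.append(_toy_encoder(inp.audio_samples))
    if not parts:
        raise ValueError("at least one modality is required")
    # Average across modalities: one shared vector regardless of input mix.
    return [sum(dim) / len(parts) for dim in zip(*parts)]


if __name__ == "__main__":
    fused = encode(MultimodalInput(text="a dog barking",
                                   audio_samples=[0.1, -0.2, 0.4]))
    print(fused)  # a single vector built from two different modalities
```

The point of the pattern is that downstream components never need to know which modalities were supplied; they see one vector either way, which is what lets a single model handle text, images, and sound interchangeably.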
The language-only approach to AI appears to be reaching its limits, both in the amount of text available for training and in how large and complex the models can grow while still running efficiently. The transition to multimodal AI aims to make models more capable by drawing on a variety of data types. How effective these multimodal models will be, however, is still under debate.
Even if multimodal models make for a better business proposition, they may be more susceptible to certain kinds of manipulation and could perpetuate existing problems with bias and fabrication. And they remain far from emulating how humans think and learn.