Google turbocharges its genAI engine with Gemini 1.5

February 15, 2024


Only a week after releasing its latest generative artificial intelligence (genAI) model, Google on Thursday unveiled that model's successor, Gemini 1.5. The company claims the new version outperforms its predecessor on almost every front.

Gemini 1.5, a multimodal AI model, is now available for early testing. Google said that, unlike with OpenAI's popular ChatGPT, users can feed much larger amounts of information into its query engine to get more accurate responses.

(OpenAI also announced a new AI model today: Sora, a text-to-video model that can generate complex video scenes with multiple characters, specific types of motion, and accurate details of the subject and background "while maintaining visual quality and adherence to the user's prompt." According to OpenAI, the model understands not only what the user asked for in the prompt, but also how those things exist in the physical world.)

A movie scene generated by Sora.

Google's Gemini models are the industry's only natively multimodal large language models (LLMs); both Gemini 1.0 and Gemini 1.5 can ingest and generate content through text, image, audio, video and code prompts. For example, user prompts to a Gemini model can include JPEG, WEBP, HEIC or HEIF images.
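As a rough illustration of how a mixed text-and-image prompt might be packaged for the Gemini API, the sketch below builds a request body modeled on the JSON shape of the REST `generateContent` method: a `contents` list of `parts`, with the image sent as base64 `inline_data`. It only constructs the payload and never calls the API. The helper name `build_multimodal_request`, the file name `cat.heic` and the placeholder image bytes are hypothetical, and the field names are assumed from Google's REST documentation of the time.

```python
import base64
from pathlib import Path

# MIME types for the image formats named in the article (JPEG, WEBP, HEIC, HEIF).
# Restricted to those formats for illustration; the real API may accept others.
IMAGE_MIME_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".webp": "image/webp",
    ".heic": "image/heic",
    ".heif": "image/heif",
}


def build_multimodal_request(prompt: str, image_name: str, image_bytes: bytes) -> dict:
    """Hypothetical helper: build a text + image request body (no network call)."""
    suffix = Path(image_name).suffix.lower()
    if suffix not in IMAGE_MIME_TYPES:
        raise ValueError(f"unsupported image type: {suffix!r}")
    return {
        "contents": [
            {
                "parts": [
                    {"text": prompt},
                    {
                        "inline_data": {
                            "mime_type": IMAGE_MIME_TYPES[suffix],
                            # Raw image bytes travel as a base64-encoded string.
                            "data": base64.b64encode(image_bytes).decode("ascii"),
                        }
                    },
                ]
            }
        ]
    }


body = build_multimodal_request("What is in this photo?", "cat.heic", b"\x00\x01")
print(body["contents"][0]["parts"][1]["inline_data"]["mime_type"])  # image/heic
```

Sending the payload would additionally require an API key and an HTTP client; both are left out so the example stays self-contained.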

“Both OpenAI and Gemini recognize the importance of multi-modality and are approaching it in different ways. Let us not forget that Sora is a mere preview/limited availability model and not something that will be generally available in the near-term,” said Arun Chandrasekaran, a Gartner distinguished vice president analyst.

OpenAI’s Sora will compete with start-ups such as text-to-video model maker Runway AI, he said.

Gemini 1.0, first announced in December 2023, was released last week. With that move, Google said, it rebuilt its Bard chatbot and renamed it Gemini.

Gemini has the flexibility to run on everything from data centers to mobile devices.

Though GPT-4, OpenAI's latest LLM, is multimodal, it offers only a couple of modalities, such as images and text, or text to video, according to Chirag Dekate, a Gartner vice president analyst.

“Google is seizing its role as the leader as an AI cloud provider. They’re no longer playing catch up. Others are,” Dekate said. “If you’re a registered user of Google Cloud, today you can access more than 132 models. Its breadth of models is insane.”

“Media and entertainment will be the vertical industry that may be early adopters of models like these, while business functions such as marketing and design within technology companies and enterprises could also be early adopters,” Chandrasekaran said.

OpenAI is currently working on its next-generation model, GPT-5, which is also likely to be multimodal. Dekate, however, argued that GPT-5 will consist of many smaller models cobbled together and will not be natively multimodal. That, he said, will likely result in a less efficient architecture.

The first Gemini 1.5 model Google has offered for early testing is Gemini 1.5 Pro, which the company described as “a mid-size multimodal model optimized for scaling across a wide-range of tasks.” The model performs at a similar level to Gemini 1.0 Ultra, its largest model to date, but requires vastly fewer GPU cycles, the company said.

Gemini 1.5 Pro also introduces an experimental long-context understanding feature that lets developers prompt the model with up to 1 million tokens of context.

Developers can sign up for a Private Preview of Gemini 1.5 Pro in Google AI Studio.

Google describes AI Studio as the fastest way to build with Gemini models; it lets developers integrate the Gemini API into their applications and is available in 38 languages across more than 180 countries and territories.

A comparison between Gemini 1.5 and other AI models in terms of token context windows.

Google's Gemini model was built from the ground up to be multimodal and, unlike competitors' models, does not consist of multiple parts layered atop one another. Although Gemini 1.5 Pro performs at a similar level to 1.0 Ultra, it does so by applying many smaller, task-specialized models within a single architecture.

Google achieves the same performance in a smaller LLM by using an increasingly popular technique known as "Mixture of Experts," or MoE. MoE rests on two key architectural elements: a collection of smaller "expert" neural networks layered together, and router networks that dynamically decide which experts handle a given input and therefore shape the model's output.

“Depending on the type of input given, MoE models learn to selectively activate only the most relevant expert pathways in its neural network. This specialization massively enhances the model’s efficiency,” Demis Hassabis, CEO of Google DeepMind, said in a blog post. “Google has been an early adopter and pioneer of the MoE technique for deep learning through research such as Sparsely-Gated MoE, GShard-Transformer, Switch-Transformer, M4 and more.”
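The routing idea Hassabis describes can be illustrated with a toy top-k MoE layer. This is a minimal sketch under stated assumptions, not Gemini's architecture: Google has not published Gemini 1.5's expert count, routing scheme or layer design, so the class name `ToyMoELayer`, the use of 8 experts with top-2 routing, and the random weights are all hypothetical. A router scores every expert for the input, only the k best-scoring experts actually run, and their outputs are mixed using softmax weights.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def make_expert(weights: List[Vector]) -> Callable[[Vector], Vector]:
    """A toy 'expert': one linear layer followed by ReLU."""
    def expert(x: Vector) -> Vector:
        return [max(0.0, dot(row, x)) for row in weights]
    return expert


class ToyMoELayer:
    """Hypothetical top-k Mixture-of-Experts layer (illustrative only)."""

    def __init__(self, dim: int, num_experts: int, k: int, seed: int = 0):
        rng = random.Random(seed)
        self.k = k
        # Router (gating network): one score vector per expert.
        self.router = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
        # Each expert is an independent dim x dim linear layer.
        self.experts = [
            make_expert([[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)])
            for _ in range(num_experts)
        ]
        self.calls = [0] * num_experts  # how often each expert actually ran

    def __call__(self, x: Vector) -> Vector:
        # 1. The router scores every expert for this input.
        logits = [dot(w, x) for w in self.router]
        # 2. Keep only the k highest-scoring experts.
        top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[: self.k]
        # 3. Softmax over the selected experts' scores gives the mixing weights.
        m = max(logits[i] for i in top)
        exps = {i: math.exp(logits[i] - m) for i in top}
        total = sum(exps.values())
        # 4. Run only the selected experts and combine their weighted outputs.
        out = [0.0] * len(x)
        for i in top:
            self.calls[i] += 1
            gate = exps[i] / total
            out = [o + gate * v for o, v in zip(out, self.experts[i](x))]
        return out


layer = ToyMoELayer(dim=4, num_experts=8, k=2)
y = layer([0.5, -1.0, 2.0, 0.1])
print(len(y), sum(layer.calls))  # 4 2
```

With 8 experts and k = 2, each input runs only a quarter of the expert layers, which is the kind of efficiency gain the quote points to. Production MoE models add load balancing and batched routing, which this sketch omits.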

Because only the most relevant experts run for a given input, the MoE architecture lets a user submit an enormous amount of information while that input is processed with vastly fewer compute cycles at the inference stage. The result, Dekate said, is "hyper-accurate responses."

“Their competitors are struggling to keep up, but their competitors don’t have DeepMind or the GPU [capacity] Google has to deliver results,” Dekate said.

With the new long-context understanding feature, Gemini 1.5 Pro can work with a context window of up to 1 million tokens, meaning a user can type in a single sentence or upload several books' worth of information to the chatbot interface and receive a targeted, accurate response. By comparison, Gemini 1.0 had a 32,000-token context window.

Rival LLMs are typically limited to context windows of about 10,000 tokens, with the exception of GPT-4 Turbo, which can accept up to 128,000 tokens.

By default, Gemini 1.5 Pro comes with a standard 128,000-token context window. Google, however, is allowing a limited group of developers and enterprise customers to try it in private preview with a context window of up to 1 million tokens via AI Studio and Vertex AI, and the company said the window will grow from there.
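To make these window sizes concrete, the sketch below estimates which of the context windows mentioned here a given amount of text would fit into. It assumes roughly 4 characters per token, a common rule of thumb for English text that is not Gemini's actual tokenizer, so the counts are approximate. The function names and the 2-million-character input (standing in for several books' worth of text) are hypothetical.

```python
import math

# Context-window sizes cited in the article, in tokens.
CONTEXT_WINDOWS = {
    "Gemini 1.0": 32_000,
    "Gemini 1.5 Pro (standard)": 128_000,
    "Gemini 1.5 Pro (private preview)": 1_000_000,
}


def estimate_tokens(num_chars: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; 4 chars/token is an assumed English-text heuristic."""
    return math.ceil(num_chars / chars_per_token)


def windows_that_fit(num_chars: int, chars_per_token: float = 4.0) -> list[str]:
    """Return the names of the context windows large enough for the estimated tokens."""
    tokens = estimate_tokens(num_chars, chars_per_token)
    return [name for name, size in CONTEXT_WINDOWS.items() if tokens <= size]


# Hypothetical input: about 2 million characters of text.
print(estimate_tokens(2_000_000))   # 500000
print(windows_that_fit(2_000_000))  # ['Gemini 1.5 Pro (private preview)']
```

A short document of 100,000 characters (about 25,000 tokens under the same assumption) would fit in all three windows.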

“As we roll out the full one-million token context window, we’re actively working on optimizations to improve latency, reduce computational requirements and enhance the user experience,” Hassabis said.

This story originally appeared on Computerworld
