In summary
- This has been a good week for open source artificial intelligence.
- Meta announced an update to its large language model, Llama 3.2, which can run on smartphones without losing quality.
- Llama 3.2 comes in four variants: two vision-capable models with 11B and 90B parameters, and two smaller text models with 1B and 3B parameters.
This has been a good week for open source artificial intelligence.
On Wednesday, Meta announced an update to its next-generation large language model, Llama 3.2, and it doesn't just talk anymore: it can also see.
Even more intriguing, some versions can fit on your smartphone without losing quality, meaning you could potentially have local, private AI interactions, apps, and customizations without sending your data to third-party servers.
Unveiled on Wednesday during Meta Connect, Llama 3.2 comes in four variants, each with a different focus. The heavyweight contenders, the 11B and 90B parameter models, flex their muscle with both text and image processing capabilities.
They can tackle complex tasks such as analyzing graphs, describing images, and even identifying objects in photos based on natural language descriptions.
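If you want to try those vision capabilities yourself, the workflow typically looks like the sketch below using Hugging Face's transformers library. Treat it as a rough outline that assumes a recent transformers release with Llama 3.2 (Mllama) support and access to the gated 11B vision checkpoint, not as verified, copy-paste code.

```python
# Minimal sketch: asking the 11B vision model to describe an image.
# Assumes transformers >= 4.45 with Mllama support and approved access
# to the gated meta-llama/Llama-3.2-11B-Vision-Instruct checkpoint.
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("chart.png")  # any local image, e.g. a chart screenshot
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this chart and summarize its main trend."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```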
Llama 3.2 arrived the same week as the Allen Institute's Molmo, which claims to be the best open-source multimodal vision LLM on synthetic benchmarks, performing on par with GPT-4o, Claude 3.5 Sonnet, and Reka Core in our tests.
Zuckerberg’s company also introduced two new lightweight champions: a pair of 1B and 3B parameter models designed for efficiency and speed on limited, repetitive tasks that don’t require much compute.
These small models are multilingual text specialists with “tool calling” abilities, meaning they can integrate with programming tools and external functions. Despite their diminutive size, they boast an impressive 128K token context window, just like GPT-4o and other powerful models, making them ideal for on-device summarization, instruction following, and rewriting tasks.
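As a rough illustration of what that kind of on-device-style workflow could look like, here is a minimal sketch that runs the 3B instruct model for summarization through the transformers pipeline API. The checkpoint name matches Meta's published release, but the file name and hardware assumptions are illustrative, and a real phone deployment would go through a quantized mobile runtime rather than full-precision PyTorch.

```python
# Sketch: local summarization with the 3B instruct model via transformers.
# On an actual phone this would typically run through a quantized runtime
# (e.g. ExecuTorch or llama.cpp) rather than full-precision PyTorch.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-3B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

long_note = open("meeting_notes.txt").read()  # hypothetical input; fits easily in the 128K window
messages = [
    {"role": "system", "content": "You are a concise summarizer."},
    {"role": "user", "content": f"Summarize in three bullet points:\n\n{long_note}"},
]
result = generator(messages, max_new_tokens=200)
# With chat-style input, the pipeline returns the conversation with the
# assistant's reply appended as the last message.
print(result[0]["generated_text"][-1]["content"])
```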
The Meta engineering team performed some serious digital stunts to achieve this. First, they used structured pruning to trim unnecessary parameters from the larger models, then they applied knowledge distillation, transferring knowledge from large models into the smaller ones, to recover capability after pruning.
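Meta doesn't spell out its exact recipe here, but knowledge distillation in this setting generally means training the small "student" model to match a larger "teacher" model's output distribution alongside the usual next-token loss. The PyTorch sketch below shows that generic loss; all names and hyperparameters are illustrative, not Meta's.

```python
# Illustrative knowledge-distillation loss: the student learns both from the
# ground-truth tokens (cross-entropy) and from the teacher's softened
# predictions (KL divergence on temperature-scaled logits). Generic sketch only.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    vocab = student_logits.size(-1)
    # Standard next-token cross-entropy against the hard labels.
    ce = F.cross_entropy(student_logits.view(-1, vocab), labels.view(-1))
    # KL divergence between softened teacher and student distributions.
    s = F.log_softmax(student_logits / temperature, dim=-1).view(-1, vocab)
    t = F.softmax(teacher_logits / temperature, dim=-1).view(-1, vocab)
    kl = F.kl_div(s, t, reduction="batchmean") * (temperature ** 2)
    return alpha * ce + (1 - alpha) * kl
```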
The result was a set of compact models that outperformed rivals in their weight class, such as Google’s Gemma 2 2.6B and Microsoft’s Phi-2 2.7B, on several benchmarks.
Meta is also working hard on on-device AI. It has forged partnerships with hardware titans like Qualcomm, MediaTek, and Arm to ensure that Llama 3.2 works properly with mobile chips from day one. The cloud computing giants aren’t far behind either: AWS, Google Cloud, Microsoft Azure, and a host of others offer instant access to the new models on their platforms.
Under the hood, Llama 3.2’s vision capabilities come from clever architectural tweaks. Meta engineers incorporated adapter weights into the existing language model, creating a bridge between pre-trained image encoders and the text processing core.
In other words, the model’s vision capabilities do not come at the expense of its text processing proficiency, so users can expect similar or better text results compared to Llama 3.1.
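The article doesn't detail the exact architecture, but the pattern it describes, adapter weights that let a frozen text model attend to image-encoder outputs, commonly looks something like this simplified PyTorch module. It is a conceptual sketch with made-up dimensions, not Meta's actual implementation.

```python
# Conceptual sketch of a vision adapter: a cross-attention block that lets
# text hidden states attend to features from a pre-trained image encoder.
# The language model's original weights can stay frozen; only the adapter
# (and a projection for the image features) is trained. Dimensions are illustrative.
import torch
import torch.nn as nn

class VisionAdapter(nn.Module):
    def __init__(self, text_dim=4096, image_dim=1280, num_heads=32):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, text_dim)  # map image features into text space
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # starts as a no-op, learned during training

    def forward(self, text_hidden, image_features):
        img = self.image_proj(image_features)
        attended, _ = self.cross_attn(query=text_hidden, key=img, value=img)
        # Gated residual: with the gate at zero the text pathway is untouched,
        # which is how text-only performance can be preserved.
        return text_hidden + torch.tanh(self.gate) * attended
```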
The Llama 3.2 release is open source, at least by Meta standards. Meta is making the models available for download on Llama.com and Hugging Face, as well as through its extensive partner ecosystem.
Those interested in running it in the cloud can use their own Google Colab notebook or use Groq for text-based interactions, which can generate almost 5,000 tokens in less than three seconds.
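For the Groq route, the request is a standard OpenAI-style chat completion through Groq's Python client, roughly as sketched below; the hosted model identifier shown is an assumption and should be checked against Groq's current model list.

```python
# Sketch: text-only Llama 3.2 inference through Groq's hosted API.
# Requires `pip install groq` and a GROQ_API_KEY environment variable.
# The model ID below is an assumption; check Groq's model list for the
# current Llama 3.2 identifiers.
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

response = client.chat.completions.create(
    model="llama-3.2-3b-preview",
    messages=[
        {"role": "user", "content": "Write a haiku about open-source AI."}
    ],
    max_tokens=100,
)
print(response.choices[0].message.content)
```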
Testing Llama
We put Llama 3.2 to the test, quickly evaluating its capabilities on various tasks.
In text-based interactions, the model performs at the level of its predecessors. However, its coding skills yielded mixed results.
When tested on the Groq platform, Llama 3.2 successfully generated code for popular games and simple programs. However, the smaller 11B model ran into difficulties when trying to create working code for a custom game we designed. The more powerful 90B model, on the other hand, fared much better and generated a functional game on the first try.
You can see the full code generated by Llama 3.2 and all the other models we tested by clicking this link.
Identifying Styles and Subjective Elements in Images
Llama 3.2 stands out at identifying subjective elements in images. When presented with a futuristic, cyberpunk-style image and asked whether it fit the steampunk aesthetic, the model accurately identified the style and its elements. It gave a satisfactory explanation, noting that the image did not align with steampunk due to the absence of key elements associated with that genre.
Chart Analysis (and SD Image Recognition)
Chart analysis is another strength of Llama 3.2, although it requires high-resolution images for optimal performance. When we fed it a screenshot containing a chart, one that other models like Molmo or Reka managed to interpret, Llama’s visual capabilities failed. The model apologized, explaining that it could not read the text correctly due to the quality of the image.
Text Identification in Images
While Llama 3.2 struggled with the small text on our chart, it performed perfectly when reading text in larger images. We showed it a presentation slide featuring a person, and the model correctly understood the context, distinguishing between the name and the title without errors.
Verdict
Overall, Llama 3.2 is a big improvement over the previous generation and a great addition to the open-source AI ecosystem. Its strengths lie in image interpretation and recognizing large text, with room for improvement in handling low-quality images and complex, custom coding tasks.
The promise of on-device compatibility also bodes well for the future of private, local AI tasks, and provides a strong counterweight to closed offerings like Gemini Nano and Apple’s proprietary models.
Edited by Josh Quittner and Sebastian Sinclair