Scrolling through Hacker News while my laptop fan sounds like a jet engine, I stumbled upon a wild "Show HN": distilling Gemini's tool calling into a tiny 26M-parameter model. The mad lads at Cactus just open-sourced this pocket-sized beast, promising it runs blazingly fast on budget phones and wearables.
TL;DR: Honey, I Shrunk the LLM
Henry from the Cactus team just dropped Needle, and its specs are beautifully unhinged:
- Microscopic Size: Exactly 26M parameters (smaller than an average frontend dev's node_modules folder).
- Insane Speed: Hits 6000 tok/s prefill and 1200 tok/s decode directly on consumer devices.
- The Core Insight: Tool calling is essentially a retrieval-and-assembly problem (match query -> grab arguments -> spit out JSON). It doesn't require deep, philosophical reasoning. Massive models are total overkill here.
- The "Heretical" Architecture: Simple attention networks. Zero MLP/FFN layers anywhere. The author argues that FFN parameters mostly store world knowledge, which is wasted space when the facts are already provided in the input (as in RAG or tool-use scenarios).
- Training Montage: Pretrained on 200B tokens (27 hours on 16 TPU v6e), then post-trained for 45 minutes on 2B tokens of synthetic Gemini data.
- The Flex: It straight-up beats larger models like FunctionGemma-270M and Qwen-0.6B in single-shot tool calling.
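The "retrieval-and-assembly" framing is easy to see in miniature. Here's a toy Python sketch of the pipeline the post describes (match query -> grab arguments -> spit out JSON). The tool names, keywords, and extraction heuristic are all illustrative inventions of mine, not anything from Needle itself:

```python
import json

# Hypothetical tool schemas -- the kind of thing a model like Needle would
# see in its prompt. Names and fields here are made up for illustration.
TOOLS = {
    "set_bulb":    {"keywords": {"bulb", "light", "lamp"}, "params": ["state"]},
    "get_weather": {"keywords": {"weather", "forecast"},   "params": ["city"]},
}

def route(query: str) -> str:
    """Tool calling as retrieval-and-assembly: no reasoning, just matching."""
    words = set(query.lower().rstrip("?.!").split())
    # 1. Retrieval: pick the tool whose keywords overlap the query most.
    name = max(TOOLS, key=lambda t: len(TOOLS[t]["keywords"] & words))
    # 2. Assembly: copy argument values straight out of the input. Toy
    #    heuristic: take the word right after "in" or "to" as the value.
    args = {}
    tokens = query.rstrip("?.!").split()
    for i, tok in enumerate(tokens[:-1]):
        if tok.lower() in ("in", "to"):
            args[TOOLS[name]["params"][0]] = tokens[i + 1]
    # 3. Spit out JSON.
    return json.dumps({"tool": name, "arguments": args})
```

The point of the sketch is the shape of the task, not the heuristics: every step is a lookup or a copy from the context, which is exactly why (per the author's argument) you don't need a 70B reasoner to do it.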
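And the "heretical" architecture bit, sketched in NumPy: a transformer block that keeps self-attention and the residual connection but drops the FFN sublayer entirely. This is my reconstruction of the concept described in the post, not Needle's actual code, dimensions, or weight layout:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class AttentionOnlyBlock:
    """A transformer block with the MLP/FFN sublayer removed (illustrative)."""

    def __init__(self, d_model: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(d_model)
        # Every parameter in the block belongs to attention: Q, K, V, output.
        self.wq = rng.normal(0, s, (d_model, d_model))
        self.wk = rng.normal(0, s, (d_model, d_model))
        self.wv = rng.normal(0, s, (d_model, d_model))
        self.wo = rng.normal(0, s, (d_model, d_model))
        self.d = d_model

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # x: (seq_len, d_model). Single-head self-attention plus residual --
        # and that's the whole block. No feed-forward sublayer, since the
        # facts to be routed already live in the input context.
        q, k, v = x @ self.wq, x @ self.wk, x @ self.wv
        attn = softmax((q @ k.T) / np.sqrt(self.d))
        return x + (attn @ v) @ self.wo  # residual; FFN deliberately omitted
```

In a standard block the FFN holds roughly two-thirds of the parameters, so deleting it is where most of the size savings would come from if attention alone can carry the task.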
What the HN armchair experts are saying
(It's fresh off the press, but you can already predict the dev community's split reactions.)
- The Pragmatists: Thank god! Using a 70B-parameter model just to turn on a smart bulb or format a JSON string is a massive waste of VRAM and electricity. Building these small AI tools is the future of mobile apps.
- The Skeptics: Dropping FFNs sounds super sus. Sure, it dominates single-shot function calling, but how does it hold up in multi-turn conversational agents with complex context?
- The Local LLM Junkies: Downloading instantly to finetune on a Mac. Doing this locally means no more renting expensive cloud VPS instances just to parse some data.
The C4F Verdict: Right tool for the job
We devs love our hype trains. We often throw an entire A100 GPU cluster at a problem that could realistically be solved with a regex, or in this case, a 26M model. Needle is a beautiful reminder that architecture and specific use cases matter more than sheer size, especially for edge computing.
Stop forcing an AI to be Shakespeare when you just need a JSON parser. Kudos to Cactus for open-sourcing it. Go download the weights and play around with it, folks!
Source: Show HN: Needle: We Distilled Gemini Tool Calling into a 26M Model