A Redditor got the M5 Max 128GB and tortured it with massive Local LLMs. See the raw MLX benchmarks, the RAM-hogging stats, and the dev drama behind it.

We've all been hearing the whispers about the Apple M5 Max, but the waiting game is over. A madlad on Reddit going by cryingneko just got their hands on the M5 Max 14-inch with a beefy 128GB of RAM and immediately decided to torture-test it with massive Local LLMs. Why bother setting up a cloud VPS when you can literally melt your shiny new laptop, right?
OP came in hot, promising raw numbers, no fluff, no 20-minute YouTube video telling you to hit subscribe. Just straight-up benchmarks. But as any dev knows, the universe hates a cocky programmer.
The numbers got delayed. Why? Because OP initially ran the tests using BatchGenerator, and the token generation speeds were absolute garbage. Instead of posting bogus stats, OP did what any sane developer would do: panicked, trashed the setup, spun up a fresh Python virtual environment, and re-ran everything using pure mlx_lm with stream_generate.
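For the curious, that re-run approach boils down to something like the sketch below: stream tokens with mlx_lm's stream_generate and time the generation yourself. This is a minimal sketch, assuming mlx_lm is installed on an Apple-silicon machine; the function name benchmark_generation is ours, and it is not the exact script OP ran.

```python
# Sketch of timing token generation with mlx_lm's stream_generate.
# Assumes mlx-lm is installed and running on Apple silicon; this is
# an illustration, not OP's actual benchmark script.
import time


def tokens_per_sec(n_tokens: int, elapsed: float) -> float:
    """Throughput in tokens/sec, guarding against a zero elapsed time."""
    return n_tokens / elapsed if elapsed > 0 else 0.0


def benchmark_generation(model_name: str, prompt: str, max_tokens: int = 256) -> float:
    """Load a model and measure streamed generation speed."""
    from mlx_lm import load, stream_generate  # requires Apple silicon

    model, tokenizer = load(model_name)
    n_tokens = 0
    start = time.perf_counter()
    for _response in stream_generate(model, tokenizer, prompt, max_tokens=max_tokens):
        n_tokens += 1  # one streamed response per generated token
    return tokens_per_sec(n_tokens, time.perf_counter() - start)
```

On a real M-series machine you would call benchmark_generation with an MLX-converted model id and a prompt, and get back a generation tokens/sec figure comparable to the numbers OP posted.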
Moral of the story: Your $4,000 machine is only as fast as your spaghetti code and the dependencies you blindly pip install.
Once the environment was sorted, OP dropped the logs. Here's what happens when you push the M5 Max to its limits with AI models:
- One run peaked at 76.397 GB of memory. Prompt processing went brrr at over 1239 tokens/sec, while generation hovered steadily between 54 - 65 tokens/sec.
- Another hit 92.605 GB when dealing with a 65k context window. Prompt speeds spiked to 1887 tokens/sec, but generation dropped to 48 - 79 tokens/sec depending on the load.
- A third pushed prompt processing to 2710 tokens/sec. Generation was smooth at 64 - 87 t/s, and surprisingly, it was gentle on the RAM, peaking at only 65 GB.
- The only slight disappointment was the Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-6bit, which crawled at 14 - 23 tokens/sec.
- OP also wanted to test the Qwen 35B but forgot to download it. Classic.
With over 1.3k upvotes, the post blew up and r/LocalLLaMA went nuts. While OP was fighting with Python packages, the comment section was doing its thing:
- No_Afternoon_4260 brought the sarcasm early: "Been 10 minutes, where are the benchmarks? /S".
- Another chimed in: "Its already 14min without benchmarks. What is OP even doing".
- sammcj was eagerly waiting for the 27B model numbers, crying in the corner because "Mine arrives in two weeks!".

Beyond seeing the ridiculous capabilities of Apple's Unified Memory architecture (which makes running 100B+ parameter models locally actually viable), there's a vital lesson here.
Always double-check your tooling. OP almost published garbage benchmarks just because BatchGenerator wasn't playing nice. If your numbers look weird, don't blame the silicon immediately—check your packages, your environment, and your code.
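In that spirit, here's a quick stdlib-only sanity check you can run before trusting any benchmark: print the interpreter and the versions of the packages you're measuring with, so a stale or missing dependency shows up immediately. The function name environment_report is ours, a sketch rather than anything from OP's setup.

```python
# Environment sanity check before trusting benchmark numbers: report
# the interpreter and the installed version of each package of
# interest. Pure stdlib; environment_report is an illustrative helper.
import sys
from importlib import metadata


def environment_report(packages):
    """Map each package name to its installed version, or 'not installed'."""
    report = {"python": sys.version.split()[0], "executable": sys.executable}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report
```

Running environment_report(["mlx-lm"]) in a supposedly fresh venv tells you in one line whether you're actually benchmarking the package you think you are.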
The M5 Max is clearly a beast for local AI. If you have the budget, go nuts. As for the rest of us mere mortals, we'll just keep paying for API calls and crying into our 16GB of RAM.
Source: Reddit - r/LocalLLaMA