
Mark Zuckerberg announces Muse Spark: what you need to know about the first AI model from Meta Superintelligence Labs

Nine months after founding Meta Superintelligence Labs, Zuckerberg is ready to show his cards.
By Matthews Martins

Credit: David Paul Morris/Bloomberg via Getty Images

Perhaps facing reality head-on is the most honest way to try to escape it.

119 Comments


  1. So cheeky to highlight every score in blue so people who aren't paying attention think they've scored higher on every single benchmark.

    What's the point of the column header if the blue is saying "this is our one"?

    Replies
    1. Yes indeed. Not surprising from Meta. Sometimes they can act cartoonishly evil (not necessarily in this case, but sometimes).

  2. Muse Spark is very good. Meta's new superintelligence AI is what we will see from now on.

  3. I tried it, and it's a better experience than Gemini 3.1 for daily tasks.

  4. Considering it's completely free, Muse Spark is pretty good.

  5. Seems kind of skeezy that they're putting the self-reported numbers on there if they're lower. Just give us the numbers on a level playing field.

  6. Benchmarks aren't the moat; deployment latency, inference cost, and safety evals decide whether this is real or theater.

  7. Did you try it though? Because it's absolute trash

  8. They have a history with benchmarks, don’t they?

  9. The key to winning is simple: no censorship, support for NSFW, and no quantization of the LLM; always deploy the full-precision version.

    Replies
    1. What you call censorship in models is just alignment.
      You know which model didn't have alignment? Microsoft Tay, and it wasn't exactly successful.

      Have you actually ever used an uncensored model?
      I have, since the first Llama models were released and finetuned to be uncensored. Let me tell you, these things will comply with anything, say anything: the most sexist, racist, and even sexually monstrous things, way worse than just "how to build a bomb".

      Now, do you really think that if a model with "no censorship" (i.e., no alignment) were released and started complying with any user prompt, no matter how monstrous, it wouldn't be insta-banned and shut down because of what people would make it do and say just to post the results online?
      That would further tarnish Meta, whose image is already shit. Can you imagine the headlines? "Meta makes a P3d0 AI", "Meta makes a Nazi AI", "Meta makes an AI that teaches you how to beat your dog", and so on. And for once these articles wouldn't be clickbait; they would be true.
      Can you imagine what this would do to their stock, to the users of their products? Elon's "Roman salute (lmao)" would be nothing compared to what it would do to Meta's stock and investors.

      So no, uncensored models are not what Meta or any company should release. They know better; it would mean their ruin.

      And besides the public image and the loss of billions, it's just straight-up dangerous. It was fine with the stupid early small open-weights models, but release a SOTA uncensored model today and the harm it can do is exponentially greater.

    2. Equating "censorship" with alignment is wrong. I've always wondered what people really mean when they say they want uncensored models. Some models, especially the early ones, suffer from over-refusal, but for many it has gotten much better. Fortunately, the top comment made it clear they just want NSFW. You can have a perfectly aligned model that has no problem writing smut but won't help you develop bioweapons.

  10. Got to play around with it; pretty unimpressed. It feels benchmaxxed for sure. It can handle the benchmarks, but it definitely lacks the general competence and the ability to understand context and cut a bit deeper the way Opus 4.6 does.

    Replies
    1. Opus costs the same as Gemini Deep Think. It should be compared to that and Grok Heavy, not to general-use models.

  11. When will Apple get in the game too?

  12. 343 Muse Spark, descendant of 343 Guilty Spark from Halo.

  13. Is it open source?

  14. It's time Meta lived up to its billions of "investment" in AI after poaching talent left and right.

    It's sad that they pioneered the Llama series, then lost it all in the middle of the race and went for a total overhaul.

    Talk is cheap, but Meta definitely has to step up its game now. This is a race to the bottom on price and a race to the top on intelligence.

    Gotta go, my Claude Pro subscription gets its limit reset at 3 AM... can't miss the tokens.

  15. It's one of those weeks, isn't it?

  16. Kudos to Meta for not giving up. It looked hopeless.

  17. Thank God they highlighted the entire first column. I wouldn't be able to tell which scores correspond to their model.

  18. Finally a real fourth contender.

  19. Nice, another model we can't actually use.

  20. So, where is Apple? Siri seems stuck in the 20th century.

  21. Would this be the first Blackwell model? I imagine it is, right? Can't imagine them still using Hoppers.

  22. Who said scaling laws were dead?

  23. Looks like an ass model from the benchmarks.

  24. Good. Disappointing GDPVal score.

    Is there a Mythos GDPVal score anywhere?

  25. How many parameters does this model have?

  26. Pretty funny that it's better than Grok. Zuck can finally teabag Elon after failing so hard.

  27. I’m glad their lab didn’t just implode and actually made something out of all those resources thrown at it

  28. I assume this one isn't OSS...?

  29. Reminder: Meta just lied about all their benchmarks last time with Maverick.

    Replies
    1. Yes, but context is important.
      The lie from the AI team (among other reasons) led to LeCun blowing the whistle. Zuckerberg was allegedly unaware of the manipulation and was furious when the revelation came out. It led to a total restructuring of Meta's AI organization, a change in leadership, and much of the old team being let go.

  30. “Spark” sounds like it’s a relatively small model, maybe similar to “Flash”.

  31. We need products built around it. Claude is Claude because of its product, not just because of their model.

  32. Considering that this would've been SOTA a bit ago, it's highly impressive that they still were able to ship (what seems to be) a good model. Hopefully this isn't a case of benchmaxxing.

  33. This looks like something competitive.

  34. "Meta isn’t positioning Muse Spark as a top-of-the-line model, but is instead highlighting its efficiency and “competitive performance” on various tasks." https://www.cnbc.com/2026/04/08/meta-debuts-first-major-ai-model-since-14-billion-deal-to-bring-in-alexandr-wang.html

  35. It doesn't beat mainstream models from 2 months ago. If it isn't open-sourced, nobody should even care about this model.

  36. Okay, I've got to say, I was dubious about Alexandr, but maybe Zuck saw something. I think Zuck's thing is ruthless execution. He moves forward no matter what. That's how he built the empire. Often messing things up, of course, but he fucking moves.

    Anyway, I digress. Alexandr probably has the same energy. And they both learn shit fast. They might actually understand the problem and its solution space well enough to know how to hire and manage some actual experts, who have now built a pretty decent model in a relatively short time. Most likely benchmaxxed, and it won't replace my Opus 4.6, but still, good job guys lol.

  37. https://imgur.com/a/CnPWDrh

  38. Given the 60 trillion Claude tokens Meta supposedly burned through last month, we know that whatever this model says on benchmarks, it's a generation behind for actual work.

    I suppose the only question is: is it actually better than the Chinese models? But I'm not sure it matters in that comparison if they don't open-weight it.

  39. Especially after the delay to get this right, this seems quite underwhelming. They're just now barely catching up to what others delivered last quarter.

    I'll put them in the Grok pile for now.

  40. But can it pass the carwash benchmark?

  41. Remember how they benchmaxxed last time and the actual experience was garbage? Let's hope this one is not like that.

  42. Pretty solid numbers. So all five big players are in the game.

  43. Impressive, but Gemini and Claude already scored that 2 months ago, so I won't bother with it regardless.

  44. Will it be on OpenRouter?

  45. They put the most impressive number on top, while the rest are either not that good, or just marginally better.

    Replies
    1. Being marginally better than the other 3 in some benchmarks is genuinely impressive though. It's a very high bar. But they're a few months behind so let's see if they can catch up.

    2. Are we still at the point of being impressed with benchmarks?

    3. it's either that or vibes

    4. Basically just good at multimodal

  46. Someone tell me how to feel about this

    Replies
    1. Depends on which company's products you are stanning right now.

    2. Ask your AI of choice to feel for you.

    3. The benchmarks are good and reflect the amount of compute they've put into it. But benchmark runs are cherry-picked, in that they've allocated the highest possible amount of inference compute and the longest reasoning time to achieve the results. That performance is typically not something you as a consumer get to experience, particularly on the lower-priced plans.

      It would be interesting to see how it writes, since the model name sounds like they're going for creativity.

      Edit: I tried it. For writing it's very... basic. For code, same thing. The responses feel like a 20B+ parameter local LLM.

    4. More players = more pressure on pricing to get users

    5. More competition = good for us.

      Not SOTA, but definitely among the top spots. Putting at the very least a bit of pressure on the other top labs.

    6. It's absolutely SOTA, probably the weakest of the 4 though.

    7. Grok is trained on misinformation and should not be considered an LLM because of that.

    8. ARC-AGI 2 is a really important benchmark; most open-source models are not very good at it.

    9. SOTA is Mythos now.

    10. "its absolutely SOTA"

      "probably the weaker of the 4 though."

      That defeats the meaning of SOTA lol.

      Delete
    11. It beats other SOTA models on a number of these benchmarks, so it is SOTA. Whether that's true remains to be seen, given their history, but if it is, it's the best model available on some benchmarks right now.

    12. it can compete with other SOTA models which makes it SOTA.
      State of the art does not colloquially mean "the single best model".
      It allows for a range. If you can compete within that range, you're included in it.

    13. OpenAI: GPT‑5 is a significant leap in intelligence over all our previous models, featuring state-of-the-art performance across coding, math, writing, health, visual perception, and more.

      Google: Gemini 3. Introducing our most intelligent model yet. With state-of-the-art reasoning to help you learn, build, and plan anything.

      Anthropic: Claude Opus 4.6 is state-of-the-art across a wide range of coding and agentic capabilities.

      Meta: No mention of SOTA anywhere. Why?

    14. Could be any number of PR reasons and posturing / positioning reasons, including not wanting to talk about being SOTA until they are not the tail. I wouldn't know, I don't work for Meta. But that doesn't change the valid use of the term.

    15. It beats Opus on a handful of multimodal benchmarks, and even among those Gemini is better on a few. This model is a nothingburger.

    16. The more competition the better - they’re at the races and enough people at the races is what keeps the race running.

    17. Yeah, clearly optimized for vision, but it's going to be a downgrade on code. Unlikely they're even going after that market at this point.

    18. With Instagram, Facebook and perhaps VR/AR (which has heavy CV stuff, not to mention 3D worlds), vision is their strength. Similarly ByteDance has a good video model.

  47. Goddamn, I thought Meta was down and out. Guess they were just gathering themselves.

    Replies
    1. Have you seen their capex spend? That money has to go somewhere...

    2. Weren't they guilty of benchmaxxing in the past though? With Llama 4?

    3. lol, after spending billions poaching talent, I should hope they came out with something competitive. Kinda underwhelming that this is closed-weight and still lagging behind Opus, meanwhile the frontier labs are prepping Mythos and Spud.

      If this were open-weight it'd be sweet, though.

    4. There is no MOAT.

    5. The moat is money and compute.

  48. Looks like ARC-AGI 2 was released just past the benchmaxxing deadline.

  49. Interesting, seems like Meta is back on the front lines. Not leading SOTA, but definitely breathing down the top labs' necks now, if the benchmarks are representative of users' actual experience...

    Competition is good; bring in more.

    Replies
    1. Mythos is the new yardstick, Mark. Not Opus.

      A benchmaxxed model that is worse than a year-old model is still behind, and irrelevant when it's closed-source.

      At least it's not Scout-level embarrassing.

    2. "Not SOTA leading"

      It's literally just Llama 5, doesnt look like they've gained much from all the money they spent

      Delete
    3. Meta's reputation should preclude people from trusting their benchmarks.

      They notoriously benchmaxxed with Llama.

    4. I mean, look at the numbers they reported. I doubt they're lying, as they're pretty modest.

    5. It was also stylemaxxed.

    6. They lied about Llama 4. All the previous versions of Llama were great and used a lot by the open-source community (before the Chinese models took over).

    7. I won't be too concerned. Meta was extremely embarrassed by this, and Mark was personally furious when he found out they were cooking the books. Reputation matters, so I doubt they'll do that again, considering even the CEO is serious about it.

    8. Meta has no reputation left to lose at this point

    9. Which is exactly why they wouldn't benefit from lying: it would be quickly uncovered and would ruin their reputation further, and for good... at a time when they're desperately trying to recover.

    10. "Benchmaxed" 🤣 love the term.

      What next, token mogging? 😂

    11. Benchmaxed just means overfitting to benchmarks

    12. Yeah, I know; it's just funny how the term arrives at the same time people are making fun of the term "looksmaxxing".

    13. YES! It's unclear whether the people who made those decisions are still there (personally, I doubt it was YL or his engineers), but it definitely suggests a culture that's not averse to bending the truth.

    14. The entire team got fired, so... I think management wasn't on board with that.

    15. You think a bunch of researchers would overfit benchmarks by themselves? There was clear top-down pressure. And when things went sideways, the ones at the top sacrificed the ones below. A tale as old as time.

    16. It was the leadership, yes, and they all got fired, leaders included.

    17. Could be people in the middle who capitalized on the ignorance of the top.

    18. I wouldn't rule out the possibility that a team and their PM decided under pressure to do that. This happens at a lot of high-pressure places.

    19. I work on one of the other 4 models in this table. It's never individuals making these decisions; it's either the leadership or the shitty systems they put in place.

    20. I was just going to say that they've had issues with benchmarks before.

    21. It wasn't benchmaxed, they just lied.

    22. Is there evidence that it was a lie vs benchmaxxing, or is this speculation? Not challenging, I'd just be interested in the evidence if so.

    23. Independent testing never showed they did well on any benchmarks. They only got good scores before independent testing.

      "One issue is Llama 4, Meta's flagship language model that shipped in April 2025. LeCun admits the published benchmarks were misleading. "Results were fudged a little bit," he says. The team used different models for different benchmarks to game the numbers. This came out almost immediately after Llama 4's release."

      https://the-decoder.com/you-certainly-dont-tell-a-researcher-like-me-what-to-do-says-lecun-as-he-exits-meta-for-his-own-startup/

      Benchmaxxing is just a model overfitting benchmarks (which is a benchmark problem). What they did was fraud.

    24. Then what do you mean by "It wasn't benchmaxed, they just lied."?

    25. Benchmaxxed means the model is overtrained on the tests, or on problems just like the tests. It's roughly equivalent to passing a test at school by studying the answers to the questions instead of learning the material.

      Fraud is outright lying: the model actually scores 40% → "Our model scored 62.3%".

  50. That ARC-AGI 2 score is rough. I'll have to test it to know more, though.

    Replies
    1. I'm pretty sure most labs RL-train or otherwise train for ARC (with ARC-style questions, etc.), which can inflate scores. ARC is far from immune to benchmaxxing.

      So who knows.

    2. Yes, they do, but ARC is supposed to test true intelligence, like an IQ test, so theoretically training on it shouldn't improve scores very much.

    3. Training on a handful of ARC puzzles isn't supposed to help the models much on unseen puzzles. However, it's suspected the labs trained on thousands and thousands of synthetically generated ARC puzzles. This may have resulted in the models effectively seeing "everything" and understanding the possible ARC puzzle landscape too well.

    4. Training on IQ tests does improve scores very, very much lol.
