So cheeky to highlight every score in blue so people who aren't paying attention think they've scored higher on every single benchmark.
What's the point of the column header if the blue is saying "this is our one"?
Yes indeed
Not surprising from Meta. Sometimes they can act cartoonishly evil (not necessarily in this case, but sometimes).
Muse Spark is very good. It's Meta's new superintelligence AI, and it's what we'll be seeing from them from now on.
I tried it and it's a better experience than Gemini 3.1 in daily tasks
Considering it's completely free, Muse Spark is pretty good.
Seems kind of skeezy that they are putting the self-reported numbers on there if they are lower. Just give us the numbers on a level playing field.
Benchmarks aren't the moat; deployment latency, inference cost, and safety evals decide whether this is real or theater.
Did you try it though? Because it's absolute trash
They have a history with benchmarks, don't they?
The key to winning is simple: no censorship, support for NSFW, and no quantization of the LLM; always deploy the full-precision version.
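(For anyone unfamiliar with the quantization this comment objects to: below is a minimal NumPy sketch of symmetric int8 weight quantization and the rounding error it introduces. Illustrative only, with made-up weight sizes, not any provider's serving pipeline; real deployments often use per-channel scales, which shrink the error further.)

```python
import numpy as np

# Illustrative sketch of symmetric int8 weight quantization: the kind of
# memory/accuracy trade-off the comment is objecting to. Toy numbers only.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4096, 4096)).astype(np.float32)  # fake weights

scale = np.abs(w).max() / 127.0                   # one scale for the tensor
w_int8 = np.round(w / scale).astype(np.int8)      # quantize
w_restored = w_int8.astype(np.float32) * scale    # dequantize

err = np.abs(w - w_restored)
print(f"max abs error:  {err.max():.6f}")
print(f"mean abs error: {err.mean():.6f}")
print(f"memory: fp32 {w.nbytes >> 20} MiB -> int8 {w_int8.nbytes >> 20} MiB")
```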
What you call censorship in models is just alignment
You know what model didn't have alignment? Microsoft Tay, and it wasn't exactly successful.
Have you ever actually used an uncensored model?
I have, since the first Llama models were released and finetuned to be uncensored. Let me tell you, these things will comply with anything, say anything: the most sexist, racist, and even sexually monstrous things, way worse than just "how to build a bomb".
Now, if a model with "no censorship" (i.e., no alignment) were released and started complying with any user prompt no matter how monstrous, do you genuinely think it wouldn't be insta-banned and shut down because of what people would make it do and say just to post the results online?
It would further tarnish Meta, whose image is already shit. Can you imagine the headlines? "Meta makes a P3d0 AI", "Meta makes a nazi AI", "Meta makes an AI that teaches you how to beat your dog", etc., and for once these articles wouldn't be clickbait; they'd be true.
Can you imagine what this would do to their stock, to the users of their products? Elon's "roman salute (lmao)" would be nothing compared to what it would do to Meta's stock and investors.
So no, uncensored models are not what Meta or any company should release. They know better; it would mean their ruin.
And beyond the public image and the loss of billions, it's just outright dangerous. It was fine with the stupid early small open-weights models, but release a SOTA uncensored model today and the harm it can do is exponentially greater.
Equating "censorship" with alignment is wrong. I always wondered what people really mean when they say they want uncensored models. Some models, especially the early ones, suffer from over-refusal, but for many it has gotten much better. Fortunately, the top comment made it clear they just want NSFW. You can have a perfectly aligned model that has no problem writing smut but won't help you develop bioweapons.
ok, Karen
Got to play around with it; pretty unimpressed. It feels benchmaxxed for sure: it can handle benchmark-style tasks but definitely lacks the general competence and ability to understand context and cut a bit deeper like Opus 4.6.
Opus costs the same as Gemini Deep Think. It should be compared to that and Grok Heavy, not to general-use models.
When will Apple get in the game too?
343 Muse Spark, descendant of 343 Guilty Spark from Halo
Impressive
Is it open source?
It's time Meta played up to its billions of "investment" in AI through poaching talent left and right.
It's sad that they pioneered the Llama series and then lost it all in the middle of the race and went for a total overhaul.
Talk is cheap, but Meta definitely has to step up its game now. This is a race to the bottom on price and a race to the top on intelligence.
Gotta go, my Claude Pro subscription gets its limit reset at 3 AM... can't miss the tokens.
It's one of those weeks, isn't it?
Kudos to Meta for not giving up. It looked hopeless.
Thank God they highlighted the entire first column. I wouldn't be able to tell which scores correspond to their model.
Finally, a real fourth contender.
Nice, another model we can't actually use.
So, where is Apple? Siri seems stuck in the 20th century.
Would this be the first Blackwell model? I imagine it is, right? Can't imagine them still using Hoppers.
i like it
Who said scaling laws were dead?
Is this avocado?
Looks like an ass model from the benchmarks
Good. Disappointing GDPVal score.
Is there a Mythos GDPVal score anywhere?
How many parameters does this model have?
Pretty funny that it is better than Grok. Zuck can finally teabag Elon after failing so hard.
I'm glad their lab didn't just implode and actually made something out of all those resources thrown at it
i assume this one isn't OSS...?
Reminder: Meta just lied about all their benchmarks last time with Maverick.
Yes, but context is important.
The lie from the AI team (among other reasons) led to LeCun blowing the whistle. Zuckerberg was allegedly unaware of the manipulation and was furious when the revelation came out. It led to a total restructuring of Meta's AI organization, a change in leadership, and a lot of the old team being let go.
"Spark" sounds like it's a relatively small model, maybe similar to "Flash".
I guess, bro
We need a product built around it. Claude is Claude because of its product, not just because of their model.
Considering that this would've been SOTA not long ago, it's highly impressive that they were still able to ship (what seems to be) a good model. Hopefully this isn't a case of benchmaxxing.
This looks like something competitive.
"Meta isn’t positioning Muse Spark as a top-of-the-line model, but is instead highlighting its efficiency and “competitive performance” on various tasks." https://www.cnbc.com/2026/04/08/meta-debuts-first-major-ai-model-since-14-billion-deal-to-bring-in-alexandr-wang.html
Doesn't beat mainstream models from 2 months ago. If it isn't open-sourced, nobody should even care about this model.
Okay, I've got to say, I was dubious about Alexandr, but maybe Zuck saw something. Like, I think Zuck's thing is ruthless execution. He moves forward no matter what. That's how he built the empire. Often messing things up, of course, but he fucking moves.
Anyways, I digress. Alexandr probably has the same energy. And they both learn shit fast. They might actually understand the problem and its solution space well enough to know how to hire and manage some actual experts, who have now built, in a relatively short time, a pretty decent model. Most likely benchmaxxed and won't replace my Opus 4.6, but still, good job guys lol
https://imgur.com/a/CnPWDrh
Given the 60 trillion tokens Meta supposedly spent on Claude last month, we know that whatever this model says on benchmarks, it's a generation behind for actual work.
I suppose the only question is: is it actually better than the Chinese models? But I'm not sure it matters if they don't release open weights like the Chinese models do.
Especially after the delay to get this right, this seems quite underwhelming. They are just now barely catching up to what others delivered last quarter.
I'll put them in the Grok pile for now.
But can it pass the carwash benchmark?
Remember how they benchmaxed last time and the actual experience was garbage. Let's hope this one is not like that.
Pretty solid numbers. So all five big players are in the game.
Impressive, but Gemini and Claude already scored that 2 months ago, so regardless I won't bother with it.
Will it be on OpenRouter?
They put the most impressive number on top, while the rest are either not that good, or just marginally better.
Being marginally better than the other 3 on some benchmarks is genuinely impressive though. It's a very high bar. But they're a few months behind, so let's see if they can catch up.
Are we still at the point of being impressed with benchmarks?
it's either that or vibes
I am
Basically just good at multimodal
Someone tell me how to feel about this
Depends on which company's products you are stanning right now.
Ask your AI of choice to feel for you
The benchmarks are good and reflect the amount of compute they've put into it. But benchmark numbers are cherry-picked, in that they've allocated the highest possible amount of compute to inference and the longest reasoning time to achieve the result. That performance is typically not something you as the consumer get to experience, particularly on the lower price plans.
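(A toy simulation of the effect described above: best-of-n sampling with majority voting, where the headline score climbs with the inference budget. The 40% per-sample accuracy and 10-answer space are invented numbers, not anyone's measured figures.)

```python
import random
from collections import Counter

# Toy model of test-time compute scaling: each extra sample costs more
# inference compute, and majority voting over samples lifts the score.
# The 40% per-sample accuracy and 10-answer space are invented numbers.
def sample_answer(correct: int, p_correct: float = 0.4) -> int:
    if random.random() < p_correct:
        return correct
    return random.choice([a for a in range(10) if a != correct])

def voted_accuracy(n_samples: int, n_questions: int = 2000) -> float:
    hits = 0
    for _ in range(n_questions):
        correct = random.randrange(10)
        votes = Counter(sample_answer(correct) for _ in range(n_samples))
        hits += votes.most_common(1)[0][0] == correct
    return hits / n_questions

random.seed(0)
for n in (1, 5, 25, 125):
    print(f"{n:>3} samples/question -> accuracy ~ {voted_accuracy(n):.2f}")
```

The same curve run in reverse is why consumer-tier experience lags the press release: fewer samples, shorter reasoning.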
It would be interesting to see how it writes, since the model name sounds like they're going for creativity.
edit: I tried it. For writing it's very... basic. For code, same thing. The responses feel like a 20B+ parameter local LLM.
More players = more pressure on pricing to get users
More competition = good for us.
Not SOTA, but definitely among the top spots. Putting at least a bit of pressure on the other top labs.
it's absolutely SOTA, probably the weakest of the 4 though.
3. Grok is trained on misinformation and should not be considered an LLM because of that.
ARC-AGI 2 is a really important benchmark; most open-source models are not very good at it.
SOTA is Mythos now
"it's absolutely SOTA"
"probably the weakest of the 4 though."
That defeats the meaning of SOTA lol.
It beats other SOTA models on a number of these benchmarks, so it is SOTA. Now, whether that's true remains to be seen given their history, but if it is, it is currently the best model available on some benchmarks.
it can compete with other SOTA models, which makes it SOTA.
State of the art does not colloquially mean "the single best model".
It allows for a range. If you can compete within that range, you're included in it.
OpenAI: GPT‑5 is a significant leap in intelligence over all our previous models, featuring state-of-the-art performance across coding, math, writing, health, visual perception, and more.
Google: Gemini 3. Introducing our most intelligent model yet. With state-of-the-art reasoning to help you learn, build, and plan anything.
Anthropic: Claude Opus 4.6 is state-of-the-art across a wide range of coding and agentic capabilities.
Meta: No mention of SOTA anywhere. Why?
Could be any number of PR and posturing/positioning reasons, including not wanting to talk about being SOTA while they're the one trailing. I wouldn't know; I don't work for Meta. But that doesn't change the valid use of the term.
It beats Opus on a handful of multimodal benchmarks, and even among those, Gemini is better on a few. This model is a nothing burger.
The more competition the better; they're at the races, and enough people at the races is what keeps the race running.
yeah, clearly optimized for vision, but gonna be a downgrade on code. unlikely they're even going after that market atp
With Instagram, Facebook, and perhaps VR/AR (which has heavy CV stuff, not to mention 3D worlds), vision is their strength. Similarly, ByteDance has a good video model.
Goddamn, I thought Meta was down and out. Guess they were just gathering themselves.
Have you seen their capex spend? That money has to go somewhere...
Weren't they guilty of benchmaxxing in the past though? With Llama 4?
lol, after spending billions poaching talent I should hope they came out with something competitive. kinda underwhelming that this is closed weight and still lagging behind Opus, meanwhile the frontier labs are prepping Mythos and Spud
if this were open weight it'd be sweet though
There is no MOAT.
Moat is money and compute
Looks like ARC-AGI 2 was released just past the benchmaxxing deadline.
First thing I saw too.
Yeah, it would look that way
Interesting, seems like Meta is back on the front lines. Not SOTA-leading, but definitely breathing down the top labs' necks now, if the benchmarks are representative of actual user experience....
Competition is good, bring in more.
Mythos is the new yardstick, Mark. Not Opus.
A benchmaxxed model that is worse than a year-old model is still behind, and irrelevant when it's closed source.
At least it's not Scout-level embarrassing.
"Not SOTA-leading"
It's literally just Llama 5. Doesn't look like they've gained much from all the money they spent.
Meta’s reputation should preclude people from trusting their benchmarks
They notoriously benchmaxed with Llama
I mean, look at the numbers they reported. I doubt they're lying, as they're pretty modest.
It was also stylemaxxed.
They lied about Llama 4. All the previous versions of Llama were great and used a lot by the open-source community (before the Chinese models took over).
I won't be too concerned. Meta was extremely embarrassed by this, and Mark was personally furious when he found out they were cooking the books. Reputation matters, so I doubt they'll do that again, considering even the CEO is serious about it.
Meta has no reputation left to lose at this point
Which is exactly why they wouldn't benefit from lying: it would be quickly uncovered and ruin their reputation completely and for good... at a time when they are desperately trying to recover.
Delete"Benchmaxed" 🤣 love the term.
DeleteWhat next, token mogging? 😂
Benchmaxed just means overfitting to benchmarks
Yeah, I know, it's just funny how the term arrives at the same time people are making fun of the term "looksmaxxing".
YES! It's unclear whether the people who made those decisions are still there (personally, I doubt it was YL or his engineers), but it definitely suggests a culture that's not averse to bending the truth.
The entire team got fired, so... I think management wasn't on board with it.
You think a bunch of researchers would overfit benchmarks by themselves? There was clear top-down pressure. And when things went sideways, the ones at the top sacrificed the ones below. Tale as old as time.
It was the leadership, yes, and they all got fired, leaders included.
Could be from the middle, who capitalized on the ignorance of the top.
I wouldn't doubt the possibility of a team and their PM deciding under pressure to do that. This happens at a lot of high-pressure places.
I work on one of the other 4 models in this table. It's never individuals making these decisions; it's either the leadership or the shitty systems they put in place.
I was just going to say that they had issues with benchmarks before
It wasn't benchmaxed; they just lied.
Is there evidence that it was a lie vs. benchmaxxing, or is this speculation? Not challenging, I'd just be interested in the evidence if so.
Independent testing never showed they did well on any benchmarks. They only got good scores before independent testing.
"One issue is Llama 4, Meta's flagship language model that shipped in April 2025. LeCun admits the published benchmarks were misleading. "Results were fudged a little bit," he says. The team used different models for different benchmarks to game the numbers. This came out almost immediately after Llama 4's release."
https://the-decoder.com/you-certainly-dont-tell-a-researcher-like-me-what-to-do-says-lecun-as-he-exits-meta-for-his-own-startup/
Benchmaxing is just a model overfitting benchmarks (which is a benchmark problem). What they did was fraud.
Then what do you mean by "It wasn't benchmaxed; they just lied"?
Benchmaxxed means the model is overtrained on the tests, or on problems just like the tests. It's roughly equivalent to passing a test at school by studying the answers to the questions instead of learning the material.
Fraud is outright lying. The model actually scores 40% → "Our model scored 62.3%"
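(The distinction is easy to see in a toy setup. Here's a hypothetical scikit-learn sketch, not anything Meta actually did: "benchmaxxing" is letting the test questions leak into training, so the score is real but meaningless, while fraud needs no model at all.)

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy illustration of benchmaxxing vs. fraud. Synthetic data stands in
# for a benchmark; nothing here reflects any lab's actual pipeline.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

honest = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("honest holdout score: ", honest.score(X_te, y_te))

# "Benchmaxxed": the benchmark leaked into training, so the model has
# effectively studied the answer key. The score is real but meaningless.
leaked = RandomForestClassifier(random_state=0).fit(X, y)  # X includes X_te
print("contaminated score:   ", leaked.score(X_te, y_te))

# Fraud, by contrast, skips the model entirely:
print("fraudulent score:      0.623  # just typed in")
```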
That ARC-AGI 2 score is rough. Will have to test it to know more though.
I'm pretty sure most labs RL or train for ARC (with ARC-style questions, etc.), which can inflate scores. ARC is far from immune to benchmaxing.
So who knows
Yes they do, but ARC is supposed to test true intelligence, like an IQ test, so theoretically training on it shouldn't improve scores very much.
Training on a handful of ARC puzzles isn't supposed to help the models much on unseen puzzles. However, it's suspected the labs trained on thousands upon thousands of synthetically generated ARC puzzles. This may have resulted in the models having effectively seen "everything" and understanding the possible ARC puzzle landscape too well.
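(A sketch of what such synthetic generation might look like; this is a guess at the shape of it, not any lab's actual generator. Sample a hidden grid transformation, apply it to random grids, and emit input/output pairs in an ARC-like layout.)

```python
import numpy as np

# Hypothetical generator of ARC-style puzzles: pick a hidden rule, apply it
# to random grids. A guess at the shape of such pipelines, nothing more.
rng = np.random.default_rng(42)

RULES = {
    "flip_lr":   np.fliplr,
    "flip_ud":   np.flipud,
    "rot90":     np.rot90,
    "transpose": np.transpose,
}

def make_puzzle(n_pairs=4, n_colors=5, max_size=6):
    rule = str(rng.choice(list(RULES)))
    pairs = []
    for _ in range(n_pairs):
        h, w = rng.integers(2, max_size + 1, size=2)
        grid = rng.integers(0, n_colors, size=(h, w))
        pairs.append({"input": grid.tolist(),
                      "output": RULES[rule](grid).tolist()})
    # ARC-like layout: demonstration pairs plus one held-out test pair
    return {"rule": rule, "train": pairs[:-1], "test": pairs[-1:]}

p = make_puzzle()
print(p["rule"], "| demo:", p["train"][0]["input"], "->", p["train"][0]["output"])
```

Generate a few hundred thousand of these and a model can plausibly learn the space of possible rules rather than demonstrating the "true intelligence" the benchmark is meant to test.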
training on IQ tests does improve scores very very much lol