> For extreme performance, we discover and use an out-of-doc PTX instruction: ld.global.nc.L1::no_allocate.L2::256B. This instruction will lead to an undefined behavior: accessing volatile GPU memory with non-coherent read-only PTX modifiers .nc. But the correctness is tested to be guaranteed with .L1::no_allocate on Hopper architectures, and performance will be much better.
Practically speaking, is it possible for NVIDIA to "pull the rug" later, intentionally or otherwise, by subtly changing the behaviour of this out-of-doc instruction on new architectures?
They could. That's why there is a switch to disable it.
> If you find kernels not working on some other platforms, you may add DISABLE_AGGRESSIVE_PTX_INSTRS=1 to setup.py and disable this, or file an issue.
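For a concrete picture, the gate presumably looks something like a tiny load helper that uses the aggressive instruction by default and falls back to a plain volatile load when the macro is set. This is only a sketch: the helper name ld_nc_global and the exact fallback are assumptions; only the macro name comes from the README.

    // Sketch: macro-gated non-coherent load (helper name hypothetical).
    __device__ __forceinline__ int ld_nc_global(const int *ptr) {
        int ret;
    #ifndef DISABLE_AGGRESSIVE_PTX_INSTRS
        // Fast path with undocumented behavior: non-coherent read-only load
        // that skips L1 and hints a 256-byte L2 prefetch. Per DeepSeek,
        // tested correct on Hopper.
        asm volatile("ld.global.nc.L1::no_allocate.L2::256B.s32 %0, [%1];"
                     : "=r"(ret) : "l"(ptr));
    #else
        // Conservative fallback: coherent volatile load, defined everywhere.
        asm volatile("ld.volatile.global.s32 %0, [%1];"
                     : "=r"(ret) : "l"(ptr));
    #endif
        return ret;
    }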
I feel like a kid in a candy shop. Some of these tricks would take way too long to reverse-engineer correctly based on the papers. I hope the releases this week start a renaissance of MoEs as baseline academic models.
On that note, I don't understand the gap between actual SOTA practice and academic models. The former are all MoEs at this point, starting with GPT-4. But the open models, apart from DeepSeek V3 and Mixtral, are almost always dense models.
You gotta love these guys, they're really pushing the open source frontier for all of us, thanks for sharing
Open AI™ (with a space)
Kind of ironic that DeepSeek is more Open than ChatGPT
They do it for their own reasons, but OpenAI are straight up liars and they are neither open nor give a fuck about humanity.
It would be hilarious if this scenario played out.
OpenAI starts as a nonprofit, aiming to benefit all humanity. Eventually, they discover a path to AGI and engage in intense internal debates: Should they abandon their original mission and chase profit, knowing it could bring generational wealth? They ultimately decide, "To hell with humanity—let’s go for the money."
As they pivot to prioritizing profit, DeepSeek emerges. Staying true to OpenAI’s original vision, DeepSeek open-sources everything, benefiting humanity and earning global admiration. Unintentionally, this move tanks OpenAI’s valuation. In the end, OpenAI fails to become the hero or secure the massive profits they chased. Instead, they leave behind a legacy rebranded as "ClosedAI".
Well, they do give us a great free tool to use, but that's where it ends, and it probably has some agenda behind it.
OpenAyyyyI swear babe, I’m gonna open it up any day. Yeah, for that greater good or whatever it is you keep yappin’ about.
Now, it’s amazing to me that everyone is like "fuck OpenAI, DeepSeek is the savior," when OpenAI’s papers and code jump-started an AI revolution just a few years ago. Let’s wait the same number of years and see what DeepSeek does.
I thought the papers that jump started the revolution came from Google?
Indeed. And the papers were about doing better translation of character sequences; essentially the tech emerged as a linguistics improvement for language. Then someone realised the parrot learns enough ZIP and JPEG along the way and can spit back hazy memories of it all.
The one still-super-useful thing OpenAI ever released must be Whisper. But they could’ve been much more open, for sure.
Hinton. And if you asked him, probably Schmidhuber.
I hope you're reading this Sam Altman:
Make Open AI open.
Or else you'll lose to the ecosystem.
Too late, there is no more innovation from OpenAI; all the people who were the drivers left for Anthropic and the others. They had some of the biggest funding, had the head start... And yet they lost it.
Sam is busy with his new kiddo
That’s an impossible ask. Sam is the pinnacle of the capitalist ruling class; he’s a pure businessman. He has no interest in giving anything away for free unless there’s a business plan. He doesn’t care about humanity. He’ll pretend to change the world and tell you they’re inventing AGI, Q*, Strawberry, or whatever they’re branding it, but the reality is he knows it’s all over, and unless there’s a major breakthrough this company will be in major financial trouble. Sorry for the rant, but he doesn’t deserve much respect for turning all this science into grift. He’s actually the person the old OpenAI board warned everyone about.
- Efficient and optimized all-to-all communication
- Both intranode and internode support with NVLink and RDMA
- High-throughput kernels for training and inference prefilling
- Low-latency kernels for inference decoding
- Native FP8 dispatch support
- Flexible GPU resource control for computation-communication overlapping

X: https://x.com/deepseek_ai/status/1894211757604049133
Is the PTX that everyone was looking forward to included this time?
Yes, there's some in the csrc/kernels directory. Search for 'asm' to find uses of it.
> the PTX that everyone was looking forward to
Can someone explain to the rest of us why this is so important?
Parallel Thread Execution. Think of them as opcodes for Nvidia GPUs. They are a bit more complex than your traditional CPU opcodes (the lowest level of abstraction accessible to users), as you can specify cache parameters, memory barriers, etc.
There are documented combinations of parameters for those instructions, but if you fuzz (search for new combinations in a random or organized way, hoping some will work the way you want), you can find new ones with unexpected effects or with advantages (in various ways, like not polluting caches, speed...).
Which is the case, for example, for ld.global.nc.L1::no_allocate.L2::256B, which they use in DeepSeek: it provides significant acceleration while being reliable (although it doesn't work on all architectures, so they have a way to disable it).
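To make "fuzzing" concrete: you can put a candidate qualifier combination in inline asm and simply see whether the assembler accepts it, then stress-test the survivors for correctness and speed. A minimal probe, with hypothetical file and kernel names, might look like:

    // probe.cu -- does the assembler accept this qualifier combination?
    __global__ void probe(const int *in, int *out) {
        int v;
        // Candidate under test; rejected combinations fail at compile time.
        asm volatile("ld.global.nc.L1::no_allocate.L2::256B.s32 %0, [%1];"
                     : "=r"(v) : "l"(in));
        *out = v;
    }
    // Build with: nvcc -arch=sm_90 -cubin probe.cu
    // Assembling is only step one; you still have to verify the behavior
    // (correct values under concurrent writes, actual speedup) empirically.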
Gonna check what SASS it gets translated to and whether it makes any sense.
I wonder if they had a SASS assembler for Hopper (either by reverse-engineering nvdisasm or by fuzzing instructions + nvdisasm + staring hard) and don't want to say it out loud :p
Much of the hype around DeepSeek is due to their extraordinarily low training and inference costs. They achieved this by optimizing their training code, apparently using PTX in addition to CUDA. PTX is kind of an intermediate assembly language for NVIDIA GPUs and people are eager to see how it was used.
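To make "intermediate assembly" concrete: nvcc compiles CUDA C++ down to PTX, and the driver (or ptxas ahead of time) lowers PTX to the GPU's real machine code, SASS. For a trivial kernel the correspondence is easy to see; the PTX below is abbreviated and will vary by compiler version:

    // add.cu -- compile with: nvcc -ptx add.cu
    __global__ void add(const float *a, const float *b, float *c) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        c[i] = a[i] + b[i];
    }
    // The emitted PTX contains, roughly:
    //   ld.global.f32  %f1, [%rd5];
    //   ld.global.f32  %f2, [%rd7];
    //   add.f32        %f3, %f1, %f2;
    //   st.global.f32  [%rd9], %f3;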
The PTX instructions they talked about in the tech report should correspond to the code here?
"For extreme performance, we discover and use a behavior-out-of-doc PTX instruction: ld.global.nc.L1::no_allocate.L2::256B. This instruction will lead to an undefined behavior: accessing volatile GPU memory with non-coherent read-only PTX modifiers .nc. But the correctness is tested to be guaranteed with .L1::no_allocate on Hopper architectures, and performance will be much better. If you find kernels not working on some other platforms, you may add DISABLE_AGGRESSIVE_PTX_INSTRS=1 to setup.py and disable this, or file an issue."
So non-coherent refers to bypassing cache coherency, i.e. don't care about what other units might have written to that address? And the L1/L2 modifiers are there to avoid L1 thrashing, keeping the value in L2 only?
Or did I get that wrong?
My understanding of the L2 part is that it asks for a 256-byte prefetch (only available on some platforms, it seems), but they use vectors of at most 4 32-bit signed ints, so I'm not sure why only the 256B variant would work, or whether the fact that it also fetches the next 128 bytes helps.
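For scale: a v4.s32 load moves 16 bytes per thread, and the PTX ISA only offers ::64B, ::128B, and ::256B as prefetch sizes, so ::256B is simply the largest window you can hint. A sketch of the vectorized variant (helper name hypothetical):

    // Loads one int4 (16 bytes) per thread; the .L2::256B qualifier hints the
    // hardware to also pull a 256-byte window into L2, which can cover the
    // neighboring int4 elements the warp touches next.
    __device__ __forceinline__ int4 ld_nc_v4(const int4 *ptr) {
        int4 v;
        asm volatile("ld.global.nc.L1::no_allocate.L2::256B.v4.s32 {%0, %1, %2, %3}, [%4];"
                     : "=r"(v.x), "=r"(v.y), "=r"(v.z), "=r"(v.w)
                     : "l"(ptr));
        return v;
    }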
this might help: https://x.com/main_horse/status/1894215779521794058/photo/1
Spring showers bring May flowers!
Round 2 of open-source releases from an actual "Open AI™" company, licensed under MIT.
Once again, DeepSeek is more open than the $157B+ one that is claiming to be "Open".
Almost no one is talking about Meta's Llama, and everyone should expect them to release Llama 4 with reasoning.
The objective is to not be squeezed in the middle of the race to zero.
https://www.llama.com/events/llamacon/signup/
Now it includes the highly anticipated PTX! Of course, I don’t understand it, but I’ve already clicked the star and even the fork button, which basically means I’ve mastered it, right? I feel incredibly powerful right now...