mohsen1 4 hours ago

> For extreme performance, we discover and use an out-of-doc PTX instruction: ld.global.nc.L1::no_allocate.L2::256B. This instruction will lead to an undefined behavior: accessing volatile GPU memory with non-coherent read-only PTX modifiers .nc. But the correctness is tested to be guaranteed with .L1::no_allocate on Hopper architectures, and performance will be much better.

  • k_sze 3 hours ago

    Practically speaking, is it possible for NVIDIA to "pull the rug" later, intentionally or otherwise, by subtly changing the behaviour of this out-of-doc instruction on new architectures?

    • ammo1662 3 hours ago

      They could. That's why there is a switch to disable it.

      > If you find kernels not working on some other platforms, you may add DISABLE_AGGRESSIVE_PTX_INSTRS=1 to setup.py and disable this, or file an issue.

pama 6 hours ago

I feel like a kid in a candy shop. Some of these tricks would take way too long to reverse engineer correctly based on the papers. I hope that the releases this week start a renaissance of the use of MoE as baseline academic models.

  • antirez 2 hours ago

    From this point of view I don't understand the gap between actual SOTA model practice and academic models. The former are at this point all MoEs, starting with GPT-4. But the open models, apart from DeepSeek V3 and Mixtral, are almost always dense models.

ofou 7 hours ago

You gotta love these guys, they're really pushing the open source frontier for all of us, thanks for sharing

  • grg0 7 hours ago

    Open AI™ (with a space)

    • hackit2 6 hours ago

      Kind of ironic that DeepSeek is more Open than ChatGPT

      • gostsamo 6 hours ago

        They do it for their own reasons, but OpenAI are straight up liars; they are neither open nor do they give a fuck about humanity.

        • WiSaGaN an hour ago

          It would be hilarious if this scenario played out.

          OpenAI starts as a nonprofit, aiming to benefit all humanity. Eventually, they discover a path to AGI and engage in intense internal debates: Should they abandon their original mission and chase profit, knowing it could bring generational wealth? They ultimately decide, "To hell with humanity—let’s go for the money."

          As they pivot to prioritizing profit, DeepSeek emerges. Staying true to OpenAI’s original vision, DeepSeek open-sources everything, benefiting humanity and earning global admiration. Unintentionally, this move tanks OpenAI’s valuation. In the end, OpenAI fails to become the hero or secure the massive profits they chased. Instead, they leave behind a legacy rebranded as "ClosedAI".

        • amelius 17 minutes ago

          Well, they do give us a great free tool to use, but that's where it ends, and there's probably some agenda behind it.

        • chefandy 4 hours ago

          OpenAyyyyI swear babe I’m gonna open it up any day. Yeah for that grated good or whatever it is you keep yappin about.

      • azinman2 4 hours ago

        Now. It’s amazing to me that everyone is like "fuck OpenAI, DeepSeek is the savior," when OpenAI’s papers and code jump-started an AI revolution just a few years ago. Let’s wait the same number of years and see what DeepSeek does.

        • gertop 3 hours ago

          I thought the papers that jump started the revolution came from Google?

          • larodi 2 hours ago

            Indeed. And the papers were about doing better translation of character sequences; essentially the tech emerged as a linguistics improvement for language. Then someone realised the parrot learns enough ZIP and JPEG along the way and can spit back hazy memories of it all.

            The one still super useful thing OpenAI ever released must’ve been Whisper. But they could’ve been much more open, for sure.

          • jeffreygoesto 3 hours ago

            Hinton. And if you’d ask the man himself, probably Schmidhuber.

    • echelon 6 hours ago

      I hope you're reading this Sam Altman:

      Make Open AI open.

      Or else you'll lose to the ecosystem.

      • ta988 3 hours ago

        Too late. There is no more innovation from OpenAI; all the people who were the drivers left for Anthropic and the others. They had some of the biggest funding, had the lead... and yet they lost it.

      • sciencesama 4 hours ago

        Sam is busy with his new kiddo

      • alpb 3 hours ago

        That’s an impossible ask. Sam is the pinnacle of the capitalist ruling class; he’s a pure businessman. He has no interest in giving anything away for free unless there’s a business plan. He doesn’t care about humanity. He’ll pretend to change the world and tell you that they’re inventing AGI, Q*, strawberry or whatever they’re branding it, but the reality is he knows it’s all over, and unless there’s a major breakthrough this company will be in major financial trouble. Sorry for the rant, but he doesn’t deserve much respect for turning all this science into grift. He’s actually the person the old OpenAI board warned everyone about.

helloericsf 8 hours ago

- Efficient and optimized all-to-all communication
- Both intranode and internode support with NVLink and RDMA
- High-throughput kernels for training and inference prefilling
- Low-latency kernels for inference decoding
- Native FP8 dispatch support
- Flexible GPU resource control for computation-communication overlapping

X: https://x.com/deepseek_ai/status/1894211757604049133

deyiao 6 hours ago

Is the PTX that everyone was looking forward to included this time?

  • find0x90 6 hours ago

    Yes, there's some in the csrc/kernels directory. Search for 'asm' to find uses of it.

  • swyx 5 hours ago

    > the PTX that everyone was looking forward to

    explanation for the rest of us why this is so important?

    • ta988 3 hours ago

      Parallel Thread Execution. Think of them as opcodes for Nvidia GPUs. They are a bit more complex than your traditional CPU opcodes (the lowest level of abstraction accessible to users), as you can specify cache parameters, memory barriers, etc.

      There are documented combinations of parameters for those instructions, but if you fuzz (search new combinations in a random or organized way, hoping some will work the way you want) you can find undocumented ones with unexpected effects or with advantages (in various ways, like not polluting caches, speed...).

      That is the case, for example, for ld.global.nc.L1::no_allocate.L2::256B, which DeepSeek uses here: it provides significant acceleration while being reliable (although it doesn't work on all architectures, so they have a way to disable it).
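
      To make that concrete, here is a rough sketch (my own illustration, not DeepEP's actual code; the helper name ld_nc_no_alloc is made up) of how such a load is typically wrapped in CUDA inline asm, with a conservative documented ld.volatile.global fallback behind a compile-time switch, in the spirit of DISABLE_AGGRESSIVE_PTX_INSTRS:

        // Hedged sketch: wrap the undocumented load so it can be switched off.
        __device__ __forceinline__ int ld_nc_no_alloc(const int* ptr) {
            int v;
        #ifndef DISABLE_AGGRESSIVE_PTX_INSTRS
            // Undocumented combination: non-coherent read-only load that skips
            // L1 allocation and hints a 256-byte L2 prefetch.
            asm volatile("ld.global.nc.L1::no_allocate.L2::256B.s32 %0, [%1];"
                         : "=r"(v) : "l"(ptr));
        #else
            // Conservative, documented path: coherent volatile load from global memory.
            asm volatile("ld.volatile.global.s32 %0, [%1];"
                         : "=r"(v) : "l"(ptr));
        #endif
            return v;
        }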

      • rfoo 2 hours ago

        Gonna check what SASS it gets translated to and whether it makes any sense.

        I wonder if they had a SASS assembler for Hopper (either by reverse engineering nvdisasm or by fuzzing instructions + nvdisasm + staring hard) and don't want to say it out loud :p

    • find0x90 4 hours ago

      Much of the hype around DeepSeek is due to their extraordinarily low training and inference costs. They achieved this by optimizing their training code, apparently using PTX in addition to CUDA. PTX is kind of an intermediate assembly language for NVIDIA GPUs and people are eager to see how it was used.
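
      If you want to poke at that layer yourself, here is a generic toy example (nothing to do with DeepEP's kernels): compile any small kernel with nvcc -ptx and read the emitted .ptx file, which is the intermediate assembly that later gets lowered to architecture-specific SASS.

        // toy.cu -- generic example, not DeepEP code.
        // Emit the intermediate assembly with:
        //   nvcc -arch=sm_90 -ptx toy.cu -o toy.ptx
        // toy.ptx will contain PTX such as ld.global.f32 / st.global.f32 and
        // virtual registers, which get lowered to SASS for the target GPU.
        __global__ void scale(float* out, const float* in, float k, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) out[i] = k * in[i];
        }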

Bimos 7 hours ago

The PTX instructions they talked about in the tech report should be the ones in this code?

  • zardinality 5 hours ago

    "For extreme performance, we discover and use a behavior-out-of-doc PTX instruction: ld.global.nc.L1::no_allocate.L2::256B. This instruction will lead to an undefined behavior: accessing volatile GPU memory with non-coherent read-only PTX modifiers .nc. But the correctness is tested to be guaranteed with .L1::no_allocate on Hopper architectures, and performance will be much better. If you find kernels not working on some other platforms, you may add DISABLE_AGGRESSIVE_PTX_INSTRS=1 to setup.py and disable this, or file an issue."

    • magicalhippo 4 hours ago

      So non-coherent refers to bypassing cache coherency, ie don't care about what other units might have written to that address? And the L1/L2 modifiers are to avoid L1 thrashing, keeping the value in L2 only?

      Or did I get that wrong?

      • ta988 3 hours ago

        My understanding of the L2 part is that it asks for a 256-byte prefetch (only available on some platforms, it seems), but they use vectors of at most 4 32-bit signed ints, so I'm not sure why only the 256B variant would work, or whether the fact that it does fetch the next 128 helps.
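
        For a sense of scale, a hedged sketch of my own (not DeepEP's code; the vectorized wrapper is an assumption): one int4 load moves 16 bytes per thread, so a fully coalesced 32-thread warp touches 32 * 16 B = 512 B, and a 256 B prefetch window covers half of that, i.e. roughly the next sixteen such elements.

          // Illustration only: 128-bit (int4) non-coherent load with the
          // 256-byte L2 prefetch hint. The load itself moves 16 bytes;
          // the .L2::256B qualifier only *hints* that a wider 256-byte
          // window should be staged in L2.
          __device__ __forceinline__ int4 ld_nc_vec4(const int4* ptr) {
              int4 v;
              asm volatile("ld.global.nc.L1::no_allocate.L2::256B.v4.s32 {%0, %1, %2, %3}, [%4];"
                           : "=r"(v.x), "=r"(v.y), "=r"(v.z), "=r"(v.w)
                           : "l"(ptr));
              return v;
          }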

kennyloginz 2 hours ago

Spring showers bring May flowers!

rvz 6 hours ago

Round 2 of open source releases from an actual "Open AI™" company and licensed under MIT.

Once again, DeepSeek is more open than the $157B+ one that is claiming to be "Open".

Almost no one is talking about Meta's Llama, and everyone should expect them to release Llama 4 with reasoning.

The objective is to not be squeezed in the middle of the race to zero.

deyiao 4 hours ago

Now it includes the highly anticipated PTX! Of course, I don’t understand it, but I’ve already clicked the star and even the fork button, which basically means I’ve mastered it, right? I feel incredibly powerful right now...