Chinese AI startup DeepSeek, known for challenging major AI vendors with its innovative open source technologies, today launched a new ultra-large model: DeepSeek-V3.
Available through Hugging Face under the company’s licensing agreement, the new model comes with 671B parameters but uses a mixture-of-experts architecture that activates only a subset of those parameters for any given task, allowing it to handle tasks accurately and efficiently. According to benchmarks shared by DeepSeek, the offering is already topping the charts, surpassing leading open source models including Meta’s Llama 3.1-405B, and closely matching the performance of Anthropic’s and OpenAI’s closed models.
The release marks another important step toward closing the gap between closed and open source AI. DeepSeek, which started as an offshoot of the Chinese quantitative hedge fund High-Flyer Capital Management, hopes these developments will pave the way for artificial general intelligence (AGI), where models will have the ability to understand or learn any intellectual task that a human can perform.
What does DeepSeek-V3 provide?
Like its predecessor DeepSeek-V2, the new ultra-large model uses the same basic architecture, built around multi-head latent attention (MLA) and DeepSeekMoE. This approach keeps training and inference efficient, with shared and specialized “experts” (individual, smaller neural networks within the larger model) activating 37 billion of the 671 billion parameters for each token.
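The article does not include a reference implementation, but the routing idea behind a sparse mixture of experts can be sketched in a few lines of Python. The snippet below is a deliberately tiny illustration (the expert count, dimensions and top-k value are invented and far smaller than DeepSeek-V3’s real configuration); it only shows how a router can activate a handful of experts per token while the rest stay idle.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions -- far smaller than DeepSeek-V3's real 671B-parameter setup.
d_model, n_experts, top_k, n_tokens = 16, 8, 2, 4

experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]  # tiny "experts"
router_w = rng.standard_normal((d_model, n_experts)) * 0.1                           # routing weights
tokens = rng.standard_normal((n_tokens, d_model))

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts and mix their outputs."""
    logits = x @ router_w                                    # (n_tokens, n_experts)
    probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = np.argsort(probs[t])[-top_k:]               # only top_k experts "activate"
        weights = probs[t, chosen] / probs[t, chosen].sum()  # renormalized gate weights
        for w, e in zip(weights, chosen):
            out[t] += w * (x[t] @ experts[e])
    return out

print(moe_forward(tokens).shape)  # (4, 16): same shape out, but only 2 of 8 experts ran per token
```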
While the basic architecture ensures solid performance for DeepSeek-V3, the company has also introduced two innovations to raise the bar even further.
The first is an auxiliary-loss-free load-balancing strategy. This dynamically monitors and adjusts the load on the experts so they are used in a balanced way without degrading the model’s overall performance. The second is multi-token prediction (MTP), which allows the model to predict multiple future tokens simultaneously. This innovation not only improves training efficiency but lets the model run three times faster, generating 60 tokens per second.
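The article only describes the load-balancing strategy at a high level. One way to picture an auxiliary-loss-free approach is sketched below under assumptions (the update rule, the `gamma` step size and the synthetic router scores are illustrative, not DeepSeek’s actual code): a per-expert bias is nudged down when an expert is overloaded and up when it sits idle, so balance emerges from routing itself rather than from an extra loss term.

```python
import numpy as np

rng = np.random.default_rng(1)
n_experts, top_k = 8, 2
gamma = 0.001                 # bias update step size (assumed, for illustration only)
bias = np.zeros(n_experts)    # per-expert routing bias, adjusted online

def route(logits: np.ndarray) -> np.ndarray:
    """Select top-k experts per token using bias-adjusted scores."""
    return np.argsort(logits + bias, axis=-1)[:, -top_k:]

for _ in range(200):
    logits = rng.standard_normal((64, n_experts))           # stand-in for router scores
    load = np.bincount(route(logits).ravel(), minlength=n_experts)
    target = 64 * top_k / n_experts
    # Push the bias down for overloaded experts and up for underloaded ones,
    # instead of adding an auxiliary balancing term to the training loss.
    bias -= gamma * np.sign(load - target)

print(np.round(bias, 3))  # experts picked too often end up with a lower routing bias
```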
“During pre-training, we trained DeepSeek-V3 on diverse, high-quality 14.8T tokens… Next, we performed a two-stage context length extension for DeepSeek-V3,” the company wrote in a technical paper detailing the new model. “In the first stage, the maximum context length is extended to 32K, and in the second stage, it is further extended to 128K. After this, we carry out post-training, including supervised fine-tuning (SFT) and reinforcement learning (RL) on the DeepSeek-V3 base model, to align it with human preferences and further unlock its potential. During the post-training stage, we distill the reasoning capability from the DeepSeek-R1 series of models, and meanwhile carefully maintain the balance between model accuracy and generation length.”
Notably, during the training phase, DeepSeek used multiple hardware and algorithmic optimizations, including an FP8 mixed-precision training framework and the DualPipe algorithm for pipeline parallelism, to cut the cost of the process.
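The FP8 details are not spelled out in the article. The toy snippet below only mimics the general scale-then-round idea behind low-precision training, using a plain integer grid as a stand-in for a real FP8 bit layout (the E4M3 maximum magnitude is the only concrete FP8 fact it relies on); it is not DeepSeek’s framework.

```python
import numpy as np

FP8_E4M3_MAX = 448.0   # largest magnitude representable in the common FP8 E4M3 format

def scale_and_round(x: np.ndarray) -> np.ndarray:
    """Scale a tensor into the FP8 range, round away precision, and scale back."""
    scale = FP8_E4M3_MAX / np.abs(x).max()   # per-tensor scaling factor
    return np.round(x * scale) / scale

x = np.random.default_rng(2).standard_normal((4, 4)).astype(np.float32)
print(np.abs(x - scale_and_round(x)).max())  # small error introduced by the round trip
```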
Overall, DeepSeek claims to have completed the entire DeepSeek-V3 training run in about 2,788K H800 GPU hours, or roughly $5.57 million, assuming a rental price of $2 per GPU hour. That is far less than the hundreds of millions of dollars typically spent pre-training large language models.
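As a quick sanity check, the arithmetic behind that figure works out as reported:

```python
gpu_hours = 2_788_000        # ~2,788K H800 GPU hours reported by DeepSeek
usd_per_gpu_hour = 2.0       # rental price assumed in the company's estimate
print(f"${gpu_hours * usd_per_gpu_hour / 1e6:.3f}M")  # -> $5.576M, i.e. roughly $5.57 million
```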
Llama-3.1, for example, is estimated to have been trained with an investment of more than $500 million.
The most powerful open source model currently available
Despite the economical training, DeepSeek-V3 has emerged as the most powerful open source model on the market.
The company ran multiple benchmarks to compare AI performance and found that the model convincingly outperforms leading open models, including Llama-3.1-405B and Qwen 2.5-72B. It even beats the closed source GPT-4o on most benchmarks, except the English-focused SimpleQA and FRAMES, where the OpenAI model stayed ahead with scores of 38.2 and 80.5 (versus 24.9 and 73.3 for DeepSeek-V3), respectively.
Notably, DeepSeek-V3’s performance stood out on Chinese and math-focused benchmarks, scoring better than all of its counterparts. On the Math-500 test, it scored 90.2, with Qwen next at 80.
The only model that managed to challenge DeepSeek-V3 was Anthropic’s Claude 3.5 Sonnet, which surpassed it with higher scores on MMLU-Pro, IF-Eval, GPQA-Diamond, SWE-bench Verified and Aider-Edit.
🚀 Introducing DeepSeek-V3!

Biggest leap forward yet:
⚡ 60 tokens/second (3x faster than V2!)
💪 Enhanced capabilities
🛠 API compatibility intact
🌍 Fully open-source models & papers

🐋 1/n pic.twitter.com/p1dV9gJ2Sd

— DeepSeek (@deepseek_ai) December 26, 2024
The work shows that open source is closing in on closed source models, promising nearly equivalent performance across a range of tasks. The development of such systems is extremely good for the industry, as it potentially eliminates the chance of one big AI player dominating the game. It also gives enterprises multiple options to choose from and work with while building their stacks.
Currently, the code for DeepSeek-V3 is available via GitHub under an MIT license, while the model itself is provided under the company’s model license. Enterprises can also test the new model via DeepSeek Chat, a ChatGPT-like platform, and access the API for commercial use. DeepSeek is offering the API at the same price as DeepSeek-V2 until February 8. After that, it will charge $0.27/million input tokens ($0.07/million tokens with cache hits) and $1.10/million output tokens.
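For teams sizing up the API, that pricing translates into a simple per-request estimate. The helper below is a hypothetical sketch (the function name and the example token counts are made up; only the per-million-token prices come from the announcement), and it assumes the cache-hit discount applies to input tokens only.

```python
def estimate_cost(input_tokens: int, output_tokens: int, cache_hit_ratio: float = 0.0) -> float:
    """Rough USD cost at DeepSeek-V3's post-February-8 list prices."""
    in_price, cached_in_price, out_price = 0.27, 0.07, 1.10   # USD per million tokens
    cached = input_tokens * cache_hit_ratio
    fresh = input_tokens - cached
    return (fresh * in_price + cached * cached_in_price + output_tokens * out_price) / 1e6

# Hypothetical workload: 5M input tokens, half served from cache, plus 1M output tokens.
print(f"${estimate_cost(5_000_000, 1_000_000, cache_hit_ratio=0.5):.2f}")  # -> $1.95
```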
