A Guide to Reasoning with Qwen QwQ 32B

Written by:
Hatice Ozen

Scaling Reinforcement Learning is All You Need for the Rise of Smaller, Smarter Models

On March 5th, Alibaba Cloud’s Qwen team broke the internet with the release of QwQ-32B, less than two months after DeepSeek-R1 shocked the world.

Models are getting smaller and smarter – the intelligence and reasoning capabilities of DeepSeek-R1 (671 billion parameters) not only shocked developers, but were a huge win for the developer community because the model was fully open-sourced. Now there is a 20x smaller, mightier open-source model rivaling its performance with only 32 billion parameters.

The Qwen team is proving again that we can unlock huge gains when we scale reinforcement learning (RL). QwQ-32B shows that RL applied to a strong base model can unlock reasoning capabilities in smaller models, bringing their performance on par with giant models.

As displayed in the figure below, QwQ-32B matches or beats DeepSeek-R1 and OpenAI’s o1-mini across industry benchmarks like AIME24, LiveBench, and BFCL. Industry benchmarks aside, I think the best benchmark out there for you is you and your use case, so make sure to try for yourself.

Image source: https://qwenlm.github.io/blog/qwq-32b/

Zooming out to the bigger picture: QwQ-32B matching or outperforming DeepSeek-R1 on key benchmarks while using only ~5% of the parameters means lower inference costs without sacrificing quality or capability.

I work at Groq so I know I may be a bit biased, but check out QwQ-32B via Groq API for insanely fast inference (absolutely necessary for output-heavy reasoning models such as this) at ~400 tokens/second for only $0.29/$0.39 per million input/output tokens (see all our pricing here). Or, you can explore via our Free Tier on GroqCloud™ with 30 free requests per minute.

QwQ-32B Nuances & Best Practices

As my team and I continue to test, these are some of the nuances we’d like to highlight for further testing and exploration with QwQ-32B:

1. Tool Use & Function Calling Capabilities

The model was explicitly designed for tool use and for adapting its reasoning based on environmental feedback, which is a huge win for AI agents that need to reason, plan, and adapt based on context (it outperforms R1 and o1-mini on the Berkeley Function Calling Leaderboard 🤯).
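
To make that concrete, here’s a minimal sketch of passing an OpenAI-style tool definition to qwen-qwq-32b through the Groq Python SDK (pip install groq, with GROQ_API_KEY set in your environment). The get_weather function and its schema are hypothetical placeholders; swap in your own tools.

# Minimal sketch: OpenAI-style function calling with qwen-qwq-32b via the Groq Python SDK.
# The get_weather tool and its schema are hypothetical; swap in your own functions.
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from your environment

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="qwen-qwq-32b",
    messages=[{"role": "user", "content": "Should I pack an umbrella for Istanbul tomorrow?"}],
    tools=tools,
    tool_choice="auto",
)

# If the model decides a tool is needed, the call shows up here instead of plain text.
message = response.choices[0].message
if message.tool_calls:
    print(message.tool_calls[0].function.name, message.tool_calls[0].function.arguments)
else:
    print(message.content)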

We recently saw another example of agentic thinking being prioritized by a frontier lab when Anthropic released Claude 3.7 Sonnet alongside their Pokémon benchmark, in which Claude was given an environment and tools – memory, screen pixel input, and function calls to press buttons and navigate the screen – and tasked with playing Pokémon! The whimsy of such a non-traditional benchmark overshadowed what really matters: model makers are focusing on agentic capabilities, and the real world is full of environment variables that models need to stay focused within while accomplishing tasks. This is a huge step in the right direction and we’re all excited to see more.

2. Handling Chinese Characters in Thinking Tokens

When the model was in preview, the team mentioned “language mixing and code-switching” as a limitation, where the model might mix languages or switch between them unexpectedly. I think this limitation was largely mitigated in the version Alibaba ultimately released and that Groq offers now, but you will still notice a sprinkle of Chinese characters within the reasoning chains.

This is actually a more general phenomenon: reasoning models tend to mix languages throughout their thinking process. As a bilingual human who also sprinkles Turkish words into my English thoughts, I totally understand, but if the reasoning chains are important for your application, you can mitigate this by prompting the model not to use any Chinese characters in its response.
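
For example, here’s a minimal sketch (assuming the Groq Python SDK and a GROQ_API_KEY in your environment) that adds such an instruction as a system message; the exact wording is just an example to tune for your use case.

# Sketch: nudging qwen-qwq-32b to keep its reasoning in English.
# The system prompt wording is illustrative; adjust it for your application.
from groq import Groq

client = Groq()

completion = client.chat.completions.create(
    model="qwen-qwq-32b",
    messages=[
        {
            "role": "system",
            "content": "Reason and respond in English only. Do not use any Chinese characters.",
        },
        {"role": "user", "content": "Prove that the square root of 2 is irrational."},
    ],
    temperature=0.6,
    top_p=0.95,
)

print(completion.choices[0].message.content)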

Bonus: This is a good blog post if you’re interested in learning more about why this happens from a technical standpoint.

3. Managing Output-heavy Responses 

QwQ stands for “Qwen with Questions” and I’d argue that it should be renamed to “QwW”, or “Qwen with Waits”, because, boy, does this model think and say “wait” a lot. This is actually a good thing: the “waits” that might seem repetitive at first glance are the reason why the model performs better. My colleague, Rick Lamers, summarized it best.

I personally have a lot of empathy for QwQ. We humans always tell ourselves “think before you speak” and QwQ is simply doing the same, but with the added bonus of giving us a very intimate look into its inner workings through the full reasoning chains that we get as output in addition to our final result. 

But if you’re looking to manage the output-heavy responses, we’d recommend the following:

  • Shorten the thinking by prompting the model to be concise, and remember that reasoning models in general are sensitive to the length of their chain of thought, which affects their answer quality.
  • Make use of max_completion_tokens on Groq API and give the model enough room to think thoroughly so you avoid cutting off its chain of thought. Even then, we’ve noticed that QwQ-32B will sometimes only think without getting to a final answer, in which case prompting for conciseness can help. Or just try running your query again (see the sketch after this list).
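
Putting those tips together, here’s a minimal sketch using the Groq Python SDK: it sets a token budget, nudges the model to keep its reasoning concise, and retries once if generation stops on the token limit before reaching a final answer. The 8192-token budget and the prompt wording are illustrative assumptions, not official recommendations.

# Sketch: bounding QwQ-32B's thinking with a token budget and a conciseness nudge,
# retrying once if the model spends its whole budget thinking and never answers.
from groq import Groq

client = Groq()

def ask(question: str, retries: int = 1) -> str:
    answer = ""
    for attempt in range(retries + 1):
        completion = client.chat.completions.create(
            model="qwen-qwq-32b",
            messages=[
                {"role": "system", "content": "Think step by step, but keep your reasoning concise."},
                {"role": "user", "content": question},
            ],
            temperature=0.6,
            top_p=0.95,
            max_completion_tokens=8192,  # leave enough room to think; 8192 is an arbitrary example
        )
        answer = completion.choices[0].message.content
        # finish_reason == "length" means the output was cut off mid-thought, so try again.
        if completion.choices[0].finish_reason != "length":
            return answer
    return answer

print(ask("How many prime numbers are there between 1 and 100?"))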

4. Handling the Missing First <think> Token

QwQ-32B does not output the first <think> token, which makes it hard to parse through the reasoning programmatically. The Qwen team confirmed that this is normal behavior. 

You don’t have to worry about this when using Groq API – we’ve programmatically ensured that the model starts with “<think>\n” to prevent generating empty thinking content, which can degrade output quality.
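
If you do want to separate the reasoning from the final answer yourself, a minimal sketch with a regular expression could look like the following; it assumes a single <think>...</think> block at the start of the response, which is what Groq API returns for qwen-qwq-32b.

# Sketch: splitting a QwQ-32B response into its reasoning chain and final answer.
# Assumes one <think>...</think> block at the start of the response text.
import re

def split_reasoning(response_text: str) -> tuple[str, str]:
    match = re.search(r"<think>(.*?)</think>", response_text, re.DOTALL)
    if not match:
        return "", response_text.strip()  # no thinking block found
    thinking = match.group(1).strip()
    answer = response_text[match.end():].strip()
    return thinking, answer

thinking, answer = split_reasoning("<think>\nLet me double-check that...\n</think>\n\n42")
print("Reasoning:", thinking)
print("Answer:", answer)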

5. Optimal API Request Parameters

The Qwen team recommends using temperature=0.6 and top_p=0.95 to avoid endless repetitions in the model’s reasoning. We’re also seeing that slightly lower temperatures result in better answers.

6. Managing Conversation History

If you’re having a multi-turn conversation or building a chat application, the chat history you send to the model should only include the final output and not the thinking content (i.e. the reasoning chains or <think>...</think> blocks). Thinking is excluded via the model’s chat template, and including the thinking content can lead to degraded output.
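
Here’s a minimal sketch of that pattern with the Groq Python SDK: strip the <think>...</think> block from each response before appending it to the history you send back on the next turn.

# Sketch: multi-turn chat that keeps only final answers (no <think> blocks) in history.
import re

from groq import Groq

client = Groq()
history: list[dict] = []

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    completion = client.chat.completions.create(
        model="qwen-qwq-32b",
        messages=history,
        temperature=0.6,
        top_p=0.95,
    )
    full_output = completion.choices[0].message.content
    # Strip the reasoning chain so it never goes back to the model on the next turn.
    final_answer = re.sub(r"<think>.*?</think>", "", full_output, flags=re.DOTALL).strip()
    history.append({"role": "assistant", "content": final_answer})
    return final_answer

print(chat("What is the capital of Türkiye?"))
print(chat("And roughly how many people live there?"))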

Try QwQ-32B On Groq for Instant Reasoning

If you don’t have one already, create a free GroqCloud account and generate a Groq API key. As mentioned above, we have a generous free tier you can play on and a Developer Tier to upgrade to for more serious token consumption.

Here’s a cURL command you can use to immediately see qwen-qwq-32b in action:

curl "https://api.groq.com/openai/v1/chat/completions" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${GROQ_API_KEY}" \
  -d '{
         "messages": [
           {
             "role": "user",
             "content": "why is fast inference so important for reasoning models?"
           }
         ],
         "model": "qwen-qwq-32b",
         "temperature": 0.6,
         "max_completion_tokens": 131072,
         "top_p": 0.95,
         "stream": true,
         "stop": null
       }'

If you want to learn more about QwQ-32B and how to get started with our Python or TypeScript SDKs, see our full API documentation.
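
For reference, here’s a rough Python SDK equivalent of the cURL request above, streaming tokens as they arrive; treat it as a sketch and check the API docs for the authoritative version.

# Rough Python SDK equivalent of the cURL request above, streaming tokens as they arrive.
from groq import Groq

client = Groq()  # uses the GROQ_API_KEY environment variable

stream = client.chat.completions.create(
    model="qwen-qwq-32b",
    messages=[
        {"role": "user", "content": "why is fast inference so important for reasoning models?"}
    ],
    temperature=0.6,
    max_completion_tokens=131072,
    top_p=0.95,
    stream=True,
)

for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)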

So, What Next?

The present and future of AI is pretty exciting, especially with huge wins like QwQ-32B gifted to the open-source community. By following best practices and understanding the nuances of powerful reasoning models like this one, you can leverage their capabilities more effectively while enjoying the benefits of faster inference at lower costs with Groq.

Have you tried QwQ-32B yet? Let me know your experiences on X or in our Discord Community. As always, happy building!
