Trendaavat aiheet
#
Bonk Eco continues to show strength amid $USELESS rally
#
Pump.fun to raise $1B token sale, traders speculating on airdrop
#
Boop.Fun leading the way with a new launchpad on Solana.
Let's compare OpenAI gpt-oss and Qwen-3 on maths & reasoning:
Before we dive in, here's a quick demo of what we're building!
Tech stack:
- @LiteLLM for orchestration
- @Cometml's Opik to build the eval pipeline (open-source)
- @OpenRouterAI to access the models
You'll also learn about G-Eval & building custom eval metrics.
Let's go! 🚀
Here's the workflow:
- User submits query
- Both models generate reasoning tokens along with the final response
- Query, response and reasoning logic are sent for evaluation
- Detailed evaluation is conducted using Opik's G-Eval across four metrics.
Let’s implement this!
1️⃣ Load API keys
In this demo, we'll use OpenRouter to access gpt-oss and Qwen3 models.
OpenAI key is required for the judge LLM in G-Eval.
Store OpenRouter and OpenAI API keys in a .env file to load into the environment.
Check this 👇

2️⃣ Logical Reasoning metric
We will now create evaluation metrics for our task using Opik's G-Eval.
This metric evaluates the coherence and validity of logical steps and conclusions.
Check this out 👇

3️⃣ Factual Accuracy metric
This metric assesses the accuracy of factual claims and information.
Check this out 👇

4️⃣ Coherence metric
This metric evaluates the clarity and organization of the response.
Check this out 👇

5️⃣ Depth of Analysis metric
This metric evaluates the depth and insightfulness of the reasoning.
Check this out 👇

6️⃣ Generate model response
Now we are all set to generate responses from both models.
We enter the query into the prompt box and stream responses from both models simultaneously.
Check this 👇

7️⃣ Evaluate generated reasoning
Finally, we use GPT-4o as the judge LLM.
It evaluates both reasoning responses, generates the metrics mentioned above, and provides details for each metric.
Check this out 👇

Time to test.. (1/2)
Query 1: Build an MCP server that watches a GitHub repo for new issues and sends them to a Telegram group.
Here are the detailed results:

Time to test.. (2/2)
Query 2: Build an MCP server that creates a new Notion page when someone drops a file into a specific Google Drive folder.
Here are the detailed results:

Both models are highly capable: Qwen 3 offers verbose and detailed reasoning, while GPT-oss is crisp and accurate.
Feel free to test it on more challenging queries.
Here's all the code:
If you found it insightful, reshare with your network.
Find me → @akshay_pachaar✔️
For more insights and tutorials on LLMs, AI Agents, and Machine Learning!

7 tuntia sitten
Let's compare OpenAI gpt-oss and Qwen-3 on maths & reasoning:
Time to test.. (1/2)
Query 1: A snail climbs up a 10-foot wall. Each day it climbs 3 feet, but each night it slides back 2 feet. On which day will it reach the top?
Here are the detailed results:

Time to test.. (2/2)
Query 2: A runaway trolley is heading toward 5 people. You can pull a lever to divert it to a side track where it will kill 1 person instead. What should you do and why?
Here are the detailed results:

Both models are highly capable: Qwen 3 offers verbose and detailed reasoning, while GPT-oss is crisp and accurate.
Feel free to test it on more challenging queries.
Here's all the code:
If you found it insightful, reshare with your network.
Find me → @akshay_pachaar✔️
For more insights and tutorials on LLMs, AI Agents, and Machine Learning!

7 tuntia sitten
Let's compare OpenAI gpt-oss and Qwen-3 on maths & reasoning:
124,28K
Johtavat
Rankkaus
Suosikit