Hacker News

Hacker News RSS

Show HN: PokemonGym – 387 milestones designed to test agents and LLMs

Sat, 04/05/2025 - 12:31am

We've developed PokemonGym, an open-source benchmark that uses Pokemon gameplay to evaluate LLM capabilities in tool use, information extraction, and reasoning.

The benchmark features 387 carefully designed milestones (reaching locations, catching Pokemon, earning badges) with assigned difficulty scores to create a standardized evaluation framework.
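As a rough illustration of how difficulty-scored milestones could be rolled up into a single number, here is a minimal Python sketch. The `Milestone` class, the `weighted_score` function, the milestone names, and the difficulty values are all hypothetical and are not taken from the PokemonGym repository; the actual harness may score runs differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Milestone:
    """Hypothetical milestone record; field names are illustrative only."""
    name: str
    difficulty: float  # assumption: a larger value means a harder milestone


def weighted_score(milestones, completed):
    """Return the share of total difficulty covered by completed milestones (0.0 to 1.0)."""
    total = sum(m.difficulty for m in milestones)
    if total == 0:
        return 0.0
    earned = sum(m.difficulty for m in milestones if m.name in completed)
    return earned / total


# Made-up example values, not taken from the benchmark.
example = [
    Milestone("reach_first_town", 1.0),
    Milestone("catch_first_pokemon", 2.0),
    Milestone("earn_first_badge", 5.0),
]
```

Under these assumptions, an agent that reaches the first town and catches its first Pokemon but earns no badge would get `weighted_score(example, {"reach_first_town", "catch_first_pokemon"})`, which is 3.0 out of a possible 8.0 difficulty points. Weighting by difficulty, rather than counting milestones, keeps many easy milestones from outweighing a few hard ones.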

Our initial testing revealed an interesting performance gap: amateur human players require ~400 steps to catch their first Pokemon, while Claude 3.7 needs ~450 steps (roughly 12% more), suggesting AI models are approaching human-level performance in this domain.

The benchmark will soon be available on benchflow.ai with a simple API for testing your own agents and models.

GitHub repo: https://github.com/benchflow-ai/pokemon-gym

We're looking for collaborators interested in improving the harness or running experiments with different models.

Comments URL: https://news.ycombinator.com/item?id=43590755

Points: 1

# Comments: 0

Categories: Hacker News

In Defense of Ruthless Managers

Sat, 04/05/2025 - 12:25am

Ask HN: How would you defeat a bootkit?

Sat, 04/05/2025 - 12:17am

If your main machine, your money-making Linux computer, were infected with a very sophisticated rootkit and/or bootkit, how would you go about ridding your device of it?

Comments URL: https://news.ycombinator.com/item?id=43590690

Points: 1

# Comments: 2

Coqui TTS: Free Text-to-Speech

Fri, 04/04/2025 - 11:52pm

Article URL: https://coquitts.com

Comments URL: https://news.ycombinator.com/item?id=43590570

Points: 2

# Comments: 0

New Neural Network Slashes Sensor-Data Overload

Fri, 04/04/2025 - 11:51pm

Ephic Praxis Manifesto

Fri, 04/04/2025 - 11:43pm

Article URL: https://kohlbergs7.org/epm/

Comments URL: https://news.ycombinator.com/item?id=43590523

Points: 2

# Comments: 0

X-cmd: classic CLI tools on meth, in your POSIX shell

Fri, 04/04/2025 - 11:34pm

Article URL: https://www.x-cmd.com/

Comments URL: https://news.ycombinator.com/item?id=43590486

Points: 1

# Comments: 0
