#benchmarks


"AI agents have been developed for complex real-world tasks from coding to customer service. But AI agent evaluations suffer from many challenges that undermine our understanding of how well agents really work. We introduce the Holistic Agent Leaderboard (HAL) to address these challenges. We make three main contributions. First, we provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of VMs, reducing evaluation time from weeks to hours while eliminating common implementation bugs. Second, we conduct three-dimensional analysis spanning models, scaffolds, and benchmarks. We validate the harness by conducting 21,730 agent rollouts across 9 models and 9 benchmarks in coding, web navigation, science, and customer service with a total cost of about $40,000. Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs. Third, we use LLM-aided log inspection to uncover previously unreported behaviors, such as searching for the benchmark on HuggingFace instead of solving a task, or misusing credit cards in flight booking tasks. We share all agent logs, comprising 2.5B tokens of language model calls, to incentivize further research into agent behavior. By standardizing how the field evaluates agents and addressing common pitfalls in agent evaluation, we hope to shift the focus from agents that ace benchmarks to agents that work reliably in the real world."

arxiv.org/abs/2510.11977

arXiv.org: Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation
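The harness the abstract describes fans rollouts out across many isolated machines and aggregates the results. As a rough, hypothetical sketch of that idea (not the actual HAL API; run_agent, Task, and the worker pool are stand-ins of my own), parallel rollouts with accuracy aggregation in Python might look like:

# Illustrative sketch only: parallel agent rollouts with result aggregation.
# run_agent() and Task are hypothetical stand-ins, not the HAL interface.
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass

@dataclass
class Task:
    benchmark: str  # e.g. a coding or customer-service benchmark
    task_id: str

def run_agent(model: str, scaffold: str, task: Task) -> dict:
    # Placeholder for one rollout; in the paper each rollout runs on an isolated VM.
    return {"task": task.task_id, "success": False}

def evaluate(model: str, scaffold: str, tasks: list[Task], workers: int = 32) -> float:
    # Fan the rollouts out across a worker pool and compute overall accuracy.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_agent, model, scaffold, t) for t in tasks]
        results = [f.result() for f in as_completed(futures)]
    return sum(r["success"] for r in results) / max(len(results), 1)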

One of the major channels is starting to test on Linux?

Triggered by the direction Microsoft is taking Windows as a product?

Now that's interesting.

Adding: Best Distributions for Gaming Tests, ft. Wendell of Level1 Techs
youtube.com/watch?v=5O6tQYJSEM

Welcome to my mini ISA VGA shootout!
TL;DR: ISA Matrox cards are really, really slow in DOS.

I recently built an original Pentium 60 MHz system on an ECS motherboard. Around the same time I received a "mystery" VGA card: a Matrox MGA Impression ISA card. And since most of my builds are "open builds" and therefore easily accessible, that machine got the pleasure of becoming the test bench for the Matrox.

As already revealed, the Matrox performs atrociously. So badly, in fact, that I had to test a couple of other ISA cards to make sure it wasn't a system issue. I used my go-to benchmarking tool from Phil's DOS Benchmark Pack. I really don't want to experience Doom with this card.

And without further ado, the contestants and their results in this spur-of-the-moment benchmark run:
- Baseline: A 32-bit PCI S3 Virge/DX based card with 4MB RAM: A perfectly workable 48.2
- The low-end Trident TVGA9000C with 512KB RAM (this is a real garbage card): A pretty shitty 14.2
- The mid-range Cirrus Logic CL-GD-5422 with 1MB RAM (this is a decent card, known for compatibility but not necessarily speed): A barely bearable 24.7
- And finally, the "star" of the show, the Matrox: A whopping 10.9!

I said it was atrocious, didn't I? But hey, I'm gonna use this one anyway, so who cares about DOS performance, right? ;)
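Putting those numbers side by side (assuming all four scores come from the same test in Phil's DOS Benchmark Pack and are directly comparable), a quick Python sketch shows how far the Matrox trails the S3 baseline; the names and scores are taken straight from the list above:

# Quick comparison of the scores listed above against the S3 baseline.
scores = {
    "S3 Virge/DX (PCI, baseline)": 48.2,
    "Trident TVGA9000C": 14.2,
    "Cirrus Logic CL-GD-5422": 24.7,
    "Matrox MGA Impression": 10.9,
}
baseline = scores["S3 Virge/DX (PCI, baseline)"]
for card, score in scores.items():
    print(f"{card}: {score} ({score / baseline:.0%} of baseline)")
# The Matrox lands at roughly 23% of the baseline's result.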

NASA is calling for public submissions after astronaut captures something spectacular

Sprites, also called red sprites, are transient luminous events (TLEs) that occur high above thunderstorm clouds or cumulonimbus, triggered by intense…

newsbeep.com/us/91896/

Good point.
EU study warns of shortcomings in AI benchmarking. A paper by EU researchers highlights problems with how AI models are currently measured and urges regulators to signal which benchmarks are trustworthy.
"Measuring AI capabilities and risks is a challenge, and benchmarks have been found to promise too much, be easily gamed, and measure the wrong thing"
euractiv.com/section/tech/news

Benchmarks are supposed to make AI models comparable. Companies use tests & results to showcase their models' capabilities, but how much these really say is often unclear. Researchers: established benchmarks make models comparable, but they are only an indicator of real-world performance: sciencemediacenter.de/angebote