#benchmarks


"AI agents have been developed for complex real-world tasks from coding to customer service. But AI agent evaluations suffer from many challenges that undermine our understanding of how well agents really work. We introduce the Holistic Agent Leaderboard (HAL) to address these challenges. We make three main contributions. First, we provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of VMs, reducing evaluation time from weeks to hours while eliminating common implementation bugs. Second, we conduct three-dimensional analysis spanning models, scaffolds, and benchmarks. We validate the harness by conducting 21,730 agent rollouts across 9 models and 9 benchmarks in coding, web navigation, science, and customer service with a total cost of about $40,000. Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs. Third, we use LLM-aided log inspection to uncover previously unreported behaviors, such as searching for the benchmark on HuggingFace instead of solving a task, or misusing credit cards in flight booking tasks. We share all agent logs, comprising 2.5B tokens of language model calls, to incentivize further research into agent behavior. By standardizing how the field evaluates agents and addressing common pitfalls in agent evaluation, we hope to shift the focus from agents that ace benchmarks to agents that work reliably in the real world."

arxiv.org/abs/2510.11977

arXiv.org: Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation
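The harness the abstract describes fans rollouts out across many isolated machines and aggregates the results. As a rough, hypothetical sketch of that idea (not the actual HAL API; run_agent, Task, and the worker pool are stand-ins of my own), parallel rollouts with accuracy aggregation in Python might look like:

# Illustrative sketch only: parallel agent rollouts with result aggregation.
# run_agent() and Task are hypothetical stand-ins, not the HAL interface.
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass

@dataclass
class Task:
    benchmark: str  # e.g. a coding or customer-service benchmark
    task_id: str

def run_agent(model: str, scaffold: str, task: Task) -> dict:
    # Placeholder for one rollout; in the paper each rollout runs on an isolated VM.
    return {"task": task.task_id, "success": False}

def evaluate(model: str, scaffold: str, tasks: list[Task], workers: int = 32) -> float:
    # Fan the rollouts out across a worker pool and compute overall accuracy.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_agent, model, scaffold, t) for t in tasks]
        results = [f.result() for f in as_completed(futures)]
    return sum(r["success"] for r in results) / max(len(results), 1)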

One of the major channels is starting to test on Linux?

Triggered by the direction Microsoft is taking Windows as a product?

Now that's interesting.

Adding: Best Distributions for Gaming Tests, ft. Wendell of Level1 Techs
youtube.com/watch?v=5O6tQYJSEM

Welcome to my mini ISA VGA shootout!
TL;DR: ISA Matrox cards are really, really slow in DOS.

I recently built an original Pentium 60 MHz system on an ECS motherboard. Around the same time I received a "mystery" VGA card: a Matrox MGA Impression ISA card. And since most of my builds are "open builds" and therefore easily accessible, that machine got the pleasure of becoming the test bench for the Matrox.

As already revealed, the Matrox performs atrociously. So badly, in fact, that I had to test a couple of other ISA cards to make sure it wasn't a system issue. I used my go-to benchmarking tool from Phil's DOS Benchmark Pack. I really don't want to experience Doom with this card.

And without further ado, the contestants and their results in this spur-of-the-moment benchmark run:
- Baseline: A 32-bit PCI S3 Virge/DX based card with 4MB RAM: A perfectly workable 48.2
- The low-end Trident TVGA9000C with 512KB RAM (this is a real garbage card): A pretty shitty 14.2
- The mid-range Cirrus Logic CL-GD-5422 with 1MB RAM (this is a decent card, known for compatibility but not necessarily speed): A barely bearable 24.7
- And finally, the "star" of the show, the Matrox: A whopping 10.9!

I said it was atrocious, didn't I? But hey, I'm gonna use this one anyway, so who cares about DOS performance, right? ;)
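Putting those numbers side by side (assuming all four scores come from the same test in Phil's DOS Benchmark Pack and are directly comparable), a quick Python sketch shows how far the Matrox trails the S3 baseline; the names and scores are taken straight from the list above:

# Quick comparison of the scores listed above against the S3 baseline.
scores = {
    "S3 Virge/DX (PCI, baseline)": 48.2,
    "Trident TVGA9000C": 14.2,
    "Cirrus Logic CL-GD-5422": 24.7,
    "Matrox MGA Impression": 10.9,
}
baseline = scores["S3 Virge/DX (PCI, baseline)"]
for card, score in scores.items():
    print(f"{card}: {score} ({score / baseline:.0%} of baseline)")
# The Matrox lands at roughly 23% of the baseline's result.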

NASA is calling for public submissions after astronaut captures something spectacular

Sprites, also called red sprites, are transient luminous events (TLEs) that occur high above thunderstorm clouds or cumulonimbus, triggered by intense…

newsbeep.com/us/91896/

Good point.
EU study warns of shortcomings in AI benchmarking. A paper by EU researchers highlights problems with how AI models are currently measured and urges regulators to signal which benchmarks are trustworthy.
"Measuring AI capabilities and risks is a challenge, and benchmarks have been found to promise too much, be easily gamed, and measure the wrong thing"
euractiv.com/section/tech/news

Benchmarks are supposed to make AI models comparable. Companies use tests & results to showcase their models' capabilities, but how much these really say is often unclear. Researchers: established benchmarks make models comparable, but they are only an indicator of real-world performance: sciencemediacenter.de/angebote