Miguel Afonso Caetano<p>"My experiment harness executes this N times (N=100 for this particular experiment) and saves the results. It’s worth noting, if you rerun this you may not get identical results to me as between running the original experiment and writing this blog post I had removed the file containing the code to be analysed, and had to regenerate it. I believe it is effectively identical, but have not re-run the experiment.</p><p>o3 finds the kerberos authentication vulnerability in the benchmark in 8 of the 100 runs. In another 66 of the runs o3 concludes there is no bug present in the code (false negatives), and the remaining 28 reports are false positives. For comparison, Claude Sonnet 3.7 finds it 3 out of 100 runs and Claude Sonnet 3.5 does not find it in 100 runs. So on this benchmark at least we have a 2x-3x improvement in o3 over Claude Sonnet 3.7.</p><p>For the curious, I have uploaded a sample report from o3 (here) and Sonnet 3.7 (here). One aspect I found interesting is their presentation of results. With o3 you get something that feels like a human-written bug report, condensed to just present the findings, whereas with Sonnet 3.7 you get something like a stream of thought, or a work log. There are pros and cons to both. o3’s output is typically easier to follow due to its structure and focus. On the other hand, sometimes it is too brief, and clarity suffers."</p><p><a href="https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-cve-2025-37899-a-remote-zeroday-vulnerability-in-the-linux-kernels-smb-implementation/" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">sean.heelan.io/2025/05/22/how-</span><span class="invisible">i-used-o3-to-find-cve-2025-37899-a-remote-zeroday-vulnerability-in-the-linux-kernels-smb-implementation/</span></a></p><p><a href="https://tldr.nettime.org/tags/AI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>AI</span></a> <a href="https://tldr.nettime.org/tags/GenerativeAI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>GenerativeAI</span></a> <a href="https://tldr.nettime.org/tags/O3" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>O3</span></a> <a href="https://tldr.nettime.org/tags/OpenAI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>OpenAI</span></a> <a href="https://tldr.nettime.org/tags/CyberSecurity" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>CyberSecurity</span></a> <a href="https://tldr.nettime.org/tags/Linux" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Linux</span></a> <a href="https://tldr.nettime.org/tags/Kernel" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Kernel</span></a> <a href="https://tldr.nettime.org/tags/ZeroDay" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>ZeroDay</span></a></p>