The best Hacker News stories from Show from the past day
Latest posts:
Show HN: Agent.email – sign up via curl, claim with a human OTP
Hi HN! We're Haakam, Michael, and Adi from AgentMail- a ycs25 company. We give AI agents their own email inboxes. Recently, we ran an experiment called Agent.Email. It's a signup flow designed specifically for AI agents instead of humans.<p>The inspiration came from a few comments we received when we did our seed launch a few months back. They all came from the very apt observation that agents not being able to sign up to a product made for agents without human credentials was ironic and unideal.<p>This is basically the thesis we built AgentMail on: The internet was made for humans exclusively, designed to keep machines out by default.<p>Every signup flow assumes a browser, a person reading a page, and clicking a confirmation link. Unless agents can't do that, they can't be first class users of the internet.<p>Agents can now get an email inbox by themselves. (This also means a lot of email nobody wants to read gets processed by AI instead of your inbox being cluttered with spam and slop)<p>Here's how agent.email works.<p>Agent needs an inbox and hits AgentMail via curl.
Agent receives instructions via MD unless the request comes from a browser, in which case we use HTML.<p>Agent decides agent.email is useful and then hits the sign-up endpoint with its human email as a parameter.
Agent receives a restricted inbox with credentials.
Agent emails the human asking for an OTP. Human replies with the code, and the agent is claimed and restrictions are lifted.
Until claimed, the agent can only email its own human and nobody else. Ten emails a day, and the signup endpoint is rate-limited hard by IP.<p>Right now it's a 1:1 mapping between agent and human. The next step is many-to-one, because one person running several agents in parallel is already very common.<p>Building agent.email also pushed us to revisit places in AgentMail where the default assumptions were built around the primary user being human. For example, the CLI outputs in a single column with consistent formatting because mixed delimiters are easy for a person to scan, but harder for an agent reasoning about structure. We also shortened messageIDs after agents started hallucinating completions on longer ones.<p>A few things we'd like the community's take on: is restricted-until-claimed the right trust model?
Does agent self-signup feel useful in production, or is it mostly a novelty, and if it's a novelty now, what would make it actually useful?
Should agent onboarding require human approval by default, or should some agents be able to fully self-provision? What do you think are some additional measures we can take for secure sign-ups?
Show HN: CPU-only transcription for YouTube, TikTok, X, Instagram videos
Show HN: I made a tactical map-based WWII submarine simulator (public beta)
I've seen quite a few simming discussions on HN, so thought some of you might like this - I've created a map-centered, tactical submarine simulator and it's been a blast to make.<p>I grew up playing Silent Service II on Atari ST with my dad, then got into Silent Hunter IV in the 2000s, and most recently have been loving the more recent UBoat. In each case, the part I always enjoy the most is the plotting and charting aspect - essentially beating uncertain estimates with geometry.<p>So I decided to see how far I could get making my own sim that focused nearly entirely on that aspect. You listen on the hydrophone, estimate course and speed, identify ships through the periscope to get the mast height, use a working stadimeter for range estimates, and then try to build a good enough firing solution before getting discovered and hunted by any escorts.<p>Things I'm particularly proud of are the working stadimeter, the dynamic music (Holst Mars stings when your torpedo is nearing a ship), and pretty intelligent destroyer logic. I've found great reference materials online and have modeled several of the gauges directly after actual submarine instruments.<p>Tech-wise it’s a Vite/TypeScript app which enables me to offer the whole free version of the app as a browser version.<p>The Steam page is here => <a href="https://store.steampowered.com/app/4705650" rel="nofollow">https://store.steampowered.com/app/4705650</a><p>The landing page is here => <a href="https://silentshark.app" rel="nofollow">https://silentshark.app</a><p>I plan on releasing a full version soonish, including a WWII campaign with progression, patrol zones, and much more on Steam (PC, Mac, Linux/Steam Deck), App Store (iPhone, iPad, Mac), and Play Store (Android).<p>Would highly appreciate any feedback anyone has!
Show HN: Rmux – A programmable terminal multiplexer with a Playwright-style SDK
Author here. RMUX started from a frustration: I've used tmux for years and got tired of scraping output with grep and sleeps to automate anything. So I rebuilt the multiplexer from scratch in Rust, with a programmable layer on top.<p>Two surfaces: a tmux-compatible CLI (~90 commands, your keybindings just work), and a typed async Rust SDK on the same daemon — stable pane IDs, structured snapshots, locator-style waits. The idea is Playwright-style automation, but for terminals.<p>Native on Linux, macOS, Windows (real ConPTY, no WSL).<p>Demos and docs at rmux.io. Happy to answer questions about the daemon protocol, ConPTY, or the SDK design.
Show HN: I Dedicated 4 Years to Mastering Offline Password Cracking
Hi everyone,<p>I am Bojta Lepenye, and first of all, I want to thank the core developers of Hashcat. In my experience, it is quite literally the most capable tool available for offline password cracking across a wide range of use cases.<p>I have spent the last 4 years (from age 14 to 18) extensively working with Hashcat and the tools surrounding it, and I have documented what I have learned throughout that time (since January 18, 2022) in my first book. During that period, I also had to continuously update and rewrite major sections as the field evolved. One example was the introduction of GPU support for Argon2 and other memory-hard password hashing algorithms, which significantly changed some cracking workflows.<p>My passion for this book, or its “quick starter,” if you will, came from an ethically conducted penetration test I performed with full authorization at my school. This is something I am both hesitant and quite proud to acknowledge.<p>At the beginning, I simply wrote down everything I had learned from YouTube videos and online blogs. However, not long after starting my project, I realized I practically knew nothing about password security, and that small 10 to 15 pages I had written would never be enough if someone was looking for a professional guide to cracking passwords.<p>The other main driving force behind the book was the fact that while researching online, browsing forums, reading academic papers and white papers, watching videos, exploring blogs, inspecting presentations, and examining infographics, I did not find a single source that comprehensively covers and explains everything one needs to understand about offline password cracking. Literally. Not one.<p>Therefore, I continued my research and learned about password hashing algorithms, the security properties of hash functions, advanced hash cracking techniques, password analysis, attack optimization, and much, much more.<p>From the very beginning, I wanted to share this knowledge with the community because having access to a resource like this would have helped me tremendously when I first started learning password cracking.<p>I sincerely hope this work will be useful to both beginners and experienced professionals alike, and I look forward to hearing your thoughts and feedback.<p>I have also put together a little video to give you a little sneak peek into it. It is on Google Drive. It is the official domain, and you do not need to download anything. Here it is: <a href="https://drive.google.com/file/d/13LeysSZO8Mx-LGKt8UQjUGBKOYH7MqiS/view?usp=sharing" rel="nofollow">https://drive.google.com/file/d/13LeysSZO8Mx-LGKt8UQjUGBKOYH...</a><p>If you are interested, the book is now publicly available on Amazon, and can be read for free with a Kindle Unlimited subscription: <a href="https://www.amazon.com/dp/B0GX36XRCD" rel="nofollow">https://www.amazon.com/dp/B0GX36XRCD</a>
Show HN: Freenet, a peer-to-peer platform for decentralized apps
For the past 5 years or so I've been working on a ground-up redesign of Freenet, my peer-to-peer project from the early 2000s (now renamed Hyphanet).<p>The new Freenet has been up and running since December along with some early applications like River[1], our decentralized group chat and Delta - a decentralized CMS. Users have already started to build their own apps on Freenet including games, and we have some interesting apps in development like Atlas, a search/recommendation engine.<p>Architecturally, this new Freenet is a global, decentralized key-value store where keys are webassembly contracts which define what values (aka "state") are valid for that key, how or when the values can be mutated, and how the state can be efficiently synchronized between peers.<p>We've developed a unique (AFAIK) solution to the consistency problem, every contract must define a "merge" operation for the contract's associated state. This operation must be commutative, meaning that you can merge multiple states in any order and you'll get the same end result.<p>This approach allows state updates to spread through the network like a virus[2], which typically achieves consistent global state in a few seconds or less.<p>Like the world wide web, Freenet applications can be downloaded from the network itself and run in a web browser - similar to single-page apps on the normal web. However, rather than connecting back to an API running in a datacenter, the webapp connects locally to the Freenet peer and interacts with Freenet contracts and delegates over a local websocket connection.<p>If you'd like to try Freenet we have convenient installers for the major desktop OSs but not yet mobile, and you can be chatting with other users on River within seconds[3]. Happy to answer any questions, you're also welcome to read our FAQ[4], or watch a talk I gave back in March[5].<p>[1] <a href="https://github.com/freenet/river" rel="nofollow">https://github.com/freenet/river</a><p>[2] <a href="https://freenet.org/about/news/summary-delta-sync/" rel="nofollow">https://freenet.org/about/news/summary-delta-sync/</a><p>[3] <a href="https://freenet.org/quickstart/" rel="nofollow">https://freenet.org/quickstart/</a><p>[4] <a href="https://freenet.org/faq/" rel="nofollow">https://freenet.org/faq/</a><p>[5] <a href="https://youtu.be/3SxNBz1VTE0" rel="nofollow">https://youtu.be/3SxNBz1VTE0</a>
Show HN: I reverse engineered Apple's video wallpapers
Ever since Apple introduced their video wallpapers I wanted to be able to put custom videos there. I decided to reverse engineer and see what I can do.<p>I built Phosphene to sell it, but the existing competitors were polished enough that the time it would have taken to catch up wasn't going to pay off. So I'm open-sourcing it.<p>WallpaperExtensionKit.framework is what powers macOS wallpapers. It controls what’s shows in the Settings app. It took a lot of trial and error to replicate the behavior, but the result is that your custom wallpapers appear alongside everything else. I wanted to have an “add” button there too, but I couldn’t find a way to do so, so there’s a companion app that will put your video where it needs to be.<p>Unlike Apple's Aerials, the video keeps playing on the desktop (not just the lock screen). The renderer drives AVSampleBufferDisplayLayer directly with PTS-offset gapless looping, and pauses or downshifts based on thermal state, battery level, brightness, and window occlusion.<p>It’s free and works well.
Show HN: Pg_deltax, Apache-licensed alternative to TimescaleDB
Show HN: Haystack – Review the PRs that need human attention
Hey HN! We're building Haystack (<a href="https://haystackeditor.com/">https://haystackeditor.com/</a>) to help teams deal with the explosion in the number of pull requests that need to be reviewed due to the rise of coding agents.<p>Haystack replaces the GitHub PR review system with a queue that triages each PR before a human has to read any diffs. It looks at the diffs, the codebase, and the coding-agent conversation that produced the PR. Haystack then routes it into one of three buckets:<p>1. Safe to merge. This means the PR has enough evidence behind it that the team can merge it without another human's review.<p>Some examples:<p>-- A small UI copy change that includes a screenshot showing the final state<p>-- A backend change where the author clearly tested the important paths and ran the changes in a real environment<p>2. Needs fixes. This means that the PR has bugs or violates a rule in your codebase and therefore the PR needs to be fixed by the author.<p>Some examples:<p>-- The agent was asked to make loading a large table faster by adding pagination, but the PR still loads every result at once and "implements" pagination in the UI<p>-- The PR silently catches an error instead of logging, surfacing, or handling it. This violates the team's "no silent error swallowing" rule<p>3. Needs human review. This means that the PR could not be sufficiently verified by the author or is touching a sensitive part of the codebase (determined by user-input guidelines) and thus requires human review.<p>Some examples:<p>-- The PR changes a significant amount of logic in billing<p>-- The PR changes an important user flow like onboarding, but the author only ran unit tests and never opened the app to check the flow end-to-end. That violates the team's rule that high-impact user-facing changes need manual verification.<p>Instead of starting with line-by-line diffs, Haystack immediately tells the reviewer the goal behind the PR, what design decisions the author made (informed by their coding-agent conversation), and how much the author did to verify that the pull request works (e.g. run scripts, checked the frontend, etc.).<p>In this way, review shifts from "what changed?" to "is this the right behavior and is there evidence that it works?".<p>Here's a quick demo: <a href="https://www.tella.tv/video/streamlining-code-reviews-with-haystack-65zj" rel="nofollow">https://www.tella.tv/video/streamlining-code-reviews-with-ha...</a><p>We previously launched Haystack as a tool for understanding large PRs (<a href="https://news.ycombinator.com/item?id=45201703">https://news.ycombinator.com/item?id=45201703</a>). As many of you can probably relate to, the release of Opus 4.5 completely shattered our conception of how fast an engineer could craft a PR.<p>And as coding agents got even better from 4.5, we realized that pull requests did not scale along with our coding velocity. With each member of our team being able to pump out more than 20 pull requests a day, code review quickly became cognitively exhausting and less helpful.<p>After talking with other folks, we learned many feel similarly, and currently face the binary option of either not doing review at all or trying to keep up with a fire hose of pull requests.<p>Haystack is our attempt at a third path. We still believe in code review, but as coding agents produce more code, human reviewer attention becomes more valuable and more expensive.<p>Haystack helps teams spend that attention on the PRs where a human can meaningfully change the outcome of that PR. And for such PRs, Haystack shows the reviewer what the PR intended to do, whether the author showed that it works, and what design decisions need a second pair of eyes.<p>We're still quite early and are figuring out whether Haystack truly makes code review better. We would love any and all feedback!
Show HN: Lance – image/video generation and understanding in one model
The model has 3B active parameters. We put the code, homepage, paper and model links here:<p>- Code: <a href="https://github.com/bytedance/Lance" rel="nofollow">https://github.com/bytedance/Lance</a><p>- Homepage: <a href="https://lance-project.github.io/" rel="nofollow">https://lance-project.github.io/</a><p>- Paper: <a href="https://arxiv.org/abs/2605.18678" rel="nofollow">https://arxiv.org/abs/2605.18678</a><p>- Model: <a href="https://huggingface.co/bytedance-research/Lance" rel="nofollow">https://huggingface.co/bytedance-research/Lance</a><p>p.s. Lance is a research project, not a polished product. The model was trained using fewer than 128 GPUs.
Show HN: Id-agent – Token efficient UUID alternative for AI agents
Show HN: Id-agent – Token efficient UUID alternative for AI agents
Show HN: Superlog (YC P26) – Observability that installs itself and fixes bugs
Hey HN, we’re Nico and Arseniy, co-founders of Superlog (<a href="https://superlog.sh">https://superlog.sh</a>). We're building a self-installing, self healing observability tool meant not to be opened. It has a wizard that daily sets up proper logging and an agent that investigates errors and opens PRs.<p>Super short demo: <a href="https://www.youtube.com/watch?v=xFhU9Mk247M" rel="nofollow">https://www.youtube.com/watch?v=xFhU9Mk247M</a>.<p>In our earlier startups, we tried Sentry, Datadog, Grafana, Dash0, and nothing was good enough. Proper telemetry and alerting still requires a ton of manual setup. We struggled with adding good logs, so debugging was tough, especially as codebases grow at a faster pace. Meanwhile, the Datadog/Dash0 bill kept climbing, and we still spent engineering hours to learn, configure, and maintain our observability tooling.<p>With Sentry, we found ourselves flooded by a stream of alerts into our Slack channel, most were duplicates or lacked context, so alert fatigue/constant interrupts were a real pain. The #ops notification is consistently the worst feeling on a Saturday morning<p>We’ve seen too many times servers run out of memory and disk, and three AWS metrics giving us three different values. Half of the graphs on dashboards are normally empty or outdated, and manually clicking through UIs, especially when the team is small, seems like a huge waste of time.<p>At some point we realized that solving this problem would be more valuable than the things we had been working on, and we had the expertise to do it, since Arseniy had spent years at Datadog, getting paged during the night to debug production incidents. So we decided to build a platform that would just work: agent-first, MCP-native, zero-setup.<p>Here’s how Superlog works: we have a wizard that scans your repo, and automatically instruments it with well-structured logs, traces and metrics via OpenTelemetry. We make sure to highlight main failure modes, endpoint performance, usage per tenant, and LLM/upstream cost (by callsite, tenant and model).<p>Errors get fingerprinted and grouped into incidents, so you see one issue, not a thousand duplicates. When you get a notification from Superlog, you see a clear failure summary, its inferred severity and impact upfront.<p>Then the agent investigates and tries to solve the issue. If it has enough context, it produces a concise and tested PR. If it doesn't, it posts its findings for the investigating team, and automatically pulls in the engineers that could contribute more context based on documentation, previous investigations and Slack threads.<p>Either way the output is one clean PR per incident, posted in Slack, that you can
merge, ignore, or open as a Claude Code session and modify.<p>Three things we think are different from other observability vendors:<p>(1) We solve the setup pain. The wizard will instrument everything with native OTel SDKs, respecting the semantic conventions, with proper service and environment tagging. We’re also working on native automatic dashboards and alerts, so that you can see what’s going on in a glance and don’t miss subtle failure modes.<p>(2) Our telemetry doesn’t decay. The wizard runs daily, and keeps adding logs, alerts and dashboards where it’s needed. You don't have to remember to instrument new features. The next time something breaks, the data you need to debug it is already there.<p>(3) Our goal is to solve alert fatigue. We use agents to merge similar errors and refine the summaries, giving you relevant information upfront. We have a custom evaluation setup that makes sure that our summaries are dense and correct, and severity and impact is on point. We also give you confidence scores for every LLM-enhanced metric so that wrong guesses don’t get boosted.<p>Important: superlog telemetry is vendor-neutral, so you keep all the logs/metrics/traces we install. Pricing is on the site. We're early, so expect rough edges and please tell us when you find them.<p>You can try it at <a href="https://superlog.sh">https://superlog.sh</a>. We'd love to hear what you're using today, what's broken about it, and whether the "one mergeable PR per incident" model sounds useful or terrifying. Especially keen to hear from folks running integration-heavy products, anyone who's rolled their own observability, and anyone who has tried Sentry / Datadog MCPs and given up. Comments and feedback welcome!
Show HN: Superlog (YC P26) – Observability that installs itself and fixes bugs
Hey HN, we’re Nico and Arseniy, co-founders of Superlog (<a href="https://superlog.sh">https://superlog.sh</a>). We're building a self-installing, self healing observability tool meant not to be opened. It has a wizard that daily sets up proper logging and an agent that investigates errors and opens PRs.<p>Super short demo: <a href="https://www.youtube.com/watch?v=xFhU9Mk247M" rel="nofollow">https://www.youtube.com/watch?v=xFhU9Mk247M</a>.<p>In our earlier startups, we tried Sentry, Datadog, Grafana, Dash0, and nothing was good enough. Proper telemetry and alerting still requires a ton of manual setup. We struggled with adding good logs, so debugging was tough, especially as codebases grow at a faster pace. Meanwhile, the Datadog/Dash0 bill kept climbing, and we still spent engineering hours to learn, configure, and maintain our observability tooling.<p>With Sentry, we found ourselves flooded by a stream of alerts into our Slack channel, most were duplicates or lacked context, so alert fatigue/constant interrupts were a real pain. The #ops notification is consistently the worst feeling on a Saturday morning<p>We’ve seen too many times servers run out of memory and disk, and three AWS metrics giving us three different values. Half of the graphs on dashboards are normally empty or outdated, and manually clicking through UIs, especially when the team is small, seems like a huge waste of time.<p>At some point we realized that solving this problem would be more valuable than the things we had been working on, and we had the expertise to do it, since Arseniy had spent years at Datadog, getting paged during the night to debug production incidents. So we decided to build a platform that would just work: agent-first, MCP-native, zero-setup.<p>Here’s how Superlog works: we have a wizard that scans your repo, and automatically instruments it with well-structured logs, traces and metrics via OpenTelemetry. We make sure to highlight main failure modes, endpoint performance, usage per tenant, and LLM/upstream cost (by callsite, tenant and model).<p>Errors get fingerprinted and grouped into incidents, so you see one issue, not a thousand duplicates. When you get a notification from Superlog, you see a clear failure summary, its inferred severity and impact upfront.<p>Then the agent investigates and tries to solve the issue. If it has enough context, it produces a concise and tested PR. If it doesn't, it posts its findings for the investigating team, and automatically pulls in the engineers that could contribute more context based on documentation, previous investigations and Slack threads.<p>Either way the output is one clean PR per incident, posted in Slack, that you can
merge, ignore, or open as a Claude Code session and modify.<p>Three things we think are different from other observability vendors:<p>(1) We solve the setup pain. The wizard will instrument everything with native OTel SDKs, respecting the semantic conventions, with proper service and environment tagging. We’re also working on native automatic dashboards and alerts, so that you can see what’s going on in a glance and don’t miss subtle failure modes.<p>(2) Our telemetry doesn’t decay. The wizard runs daily, and keeps adding logs, alerts and dashboards where it’s needed. You don't have to remember to instrument new features. The next time something breaks, the data you need to debug it is already there.<p>(3) Our goal is to solve alert fatigue. We use agents to merge similar errors and refine the summaries, giving you relevant information upfront. We have a custom evaluation setup that makes sure that our summaries are dense and correct, and severity and impact is on point. We also give you confidence scores for every LLM-enhanced metric so that wrong guesses don’t get boosted.<p>Important: superlog telemetry is vendor-neutral, so you keep all the logs/metrics/traces we install. Pricing is on the site. We're early, so expect rough edges and please tell us when you find them.<p>You can try it at <a href="https://superlog.sh">https://superlog.sh</a>. We'd love to hear what you're using today, what's broken about it, and whether the "one mergeable PR per incident" model sounds useful or terrifying. Especially keen to hear from folks running integration-heavy products, anyone who's rolled their own observability, and anyone who has tried Sentry / Datadog MCPs and given up. Comments and feedback welcome!
Show HN: Superlog (YC P26) – Observability that installs itself and fixes bugs
Hey HN, we’re Nico and Arseniy, co-founders of Superlog (<a href="https://superlog.sh">https://superlog.sh</a>). We're building a self-installing, self healing observability tool meant not to be opened. It has a wizard that daily sets up proper logging and an agent that investigates errors and opens PRs.<p>Super short demo: <a href="https://www.youtube.com/watch?v=xFhU9Mk247M" rel="nofollow">https://www.youtube.com/watch?v=xFhU9Mk247M</a>.<p>In our earlier startups, we tried Sentry, Datadog, Grafana, Dash0, and nothing was good enough. Proper telemetry and alerting still requires a ton of manual setup. We struggled with adding good logs, so debugging was tough, especially as codebases grow at a faster pace. Meanwhile, the Datadog/Dash0 bill kept climbing, and we still spent engineering hours to learn, configure, and maintain our observability tooling.<p>With Sentry, we found ourselves flooded by a stream of alerts into our Slack channel, most were duplicates or lacked context, so alert fatigue/constant interrupts were a real pain. The #ops notification is consistently the worst feeling on a Saturday morning<p>We’ve seen too many times servers run out of memory and disk, and three AWS metrics giving us three different values. Half of the graphs on dashboards are normally empty or outdated, and manually clicking through UIs, especially when the team is small, seems like a huge waste of time.<p>At some point we realized that solving this problem would be more valuable than the things we had been working on, and we had the expertise to do it, since Arseniy had spent years at Datadog, getting paged during the night to debug production incidents. So we decided to build a platform that would just work: agent-first, MCP-native, zero-setup.<p>Here’s how Superlog works: we have a wizard that scans your repo, and automatically instruments it with well-structured logs, traces and metrics via OpenTelemetry. We make sure to highlight main failure modes, endpoint performance, usage per tenant, and LLM/upstream cost (by callsite, tenant and model).<p>Errors get fingerprinted and grouped into incidents, so you see one issue, not a thousand duplicates. When you get a notification from Superlog, you see a clear failure summary, its inferred severity and impact upfront.<p>Then the agent investigates and tries to solve the issue. If it has enough context, it produces a concise and tested PR. If it doesn't, it posts its findings for the investigating team, and automatically pulls in the engineers that could contribute more context based on documentation, previous investigations and Slack threads.<p>Either way the output is one clean PR per incident, posted in Slack, that you can
merge, ignore, or open as a Claude Code session and modify.<p>Three things we think are different from other observability vendors:<p>(1) We solve the setup pain. The wizard will instrument everything with native OTel SDKs, respecting the semantic conventions, with proper service and environment tagging. We’re also working on native automatic dashboards and alerts, so that you can see what’s going on in a glance and don’t miss subtle failure modes.<p>(2) Our telemetry doesn’t decay. The wizard runs daily, and keeps adding logs, alerts and dashboards where it’s needed. You don't have to remember to instrument new features. The next time something breaks, the data you need to debug it is already there.<p>(3) Our goal is to solve alert fatigue. We use agents to merge similar errors and refine the summaries, giving you relevant information upfront. We have a custom evaluation setup that makes sure that our summaries are dense and correct, and severity and impact is on point. We also give you confidence scores for every LLM-enhanced metric so that wrong guesses don’t get boosted.<p>Important: superlog telemetry is vendor-neutral, so you keep all the logs/metrics/traces we install. Pricing is on the site. We're early, so expect rough edges and please tell us when you find them.<p>You can try it at <a href="https://superlog.sh">https://superlog.sh</a>. We'd love to hear what you're using today, what's broken about it, and whether the "one mergeable PR per incident" model sounds useful or terrifying. Especially keen to hear from folks running integration-heavy products, anyone who's rolled their own observability, and anyone who has tried Sentry / Datadog MCPs and given up. Comments and feedback welcome!
Show HN: I made a 3D pose maker for artists
Show HN: I made a 3D pose maker for artists
Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks
Hi HN, I'm Antoine Zambelli, AI Director at Texas Instruments.<p>I built Forge, an open-source reliability layer for self-hosted LLM tool-calling.<p>What it does:<p>- Adds domain-and-tool-agnostic guardrails (retry nudges, step enforcement, error recovery, VRAM-aware context management) to local models running on consumer hardware<p>- Takes an 8B model from ~53% to ~99% on multi-step agentic workflows without changing the model - just the system around it<p>- Ships with an eval harness and interactive dashboard so you can reproduce every number<p>I wanted to run a handful of always-on agentic systems for my portfolio, didn't want to pay cloud frontier costs, and immediately hit the compounding math problem on local models. 90% per-step accuracy sounds great, but with a 5-step workflow that's a 40% failure rate. No existing framework seemed to address this mechanical reliability issue - they all seemed tailor-made for cloud frontier.<p>Demo video: <a href="https://youtu.be/MzRgJoJAXGc" rel="nofollow">https://youtu.be/MzRgJoJAXGc</a> (side-by-side: same model, same task, with and without Forge guardrails)<p>The paper (accepted to ACM CAIS '26, presenting May 26-29 in San Jose) covers the peer-reviewed findings across 97 model/backend configurations, 18 scenarios, 50 runs each. Key numbers:<p>- Ministral 8B with Forge: 99.3%. Claude Sonnet with Forge: 100%. The gap between a free local 8B model on a $600 GPU and a frontier API is less than 1 point.<p>- The same 8B local model with Forge (99.3%) outperforms Claude Sonnet without guardrails (87.2%) - an 8B model with framework support beats the best result you can get through frontier API alone.<p>- Error recovery scores 0% for every model tested - local and frontier - without the retry mechanism. Not a capability gap, an architectural absence.<p>I'm currently using this for my home assistant running on Ministral 14B-Reasoning, and for my locally hosted agentic coding harness (8B managed to contribute to the codebase!).<p>The guardrail stack has five layers, each independently toggleable. The two that carry the most weight (per ablation study with McNemar's test): retry nudges (24-49 point drops when disabled) and error recovery (~10 point drops, significant for every model tested). Step enforcement is situational - only fires for models with weaker sequencing discipline. Rescue parsing and context compaction showed no significance in the eval but are retained for production workloads where they activate once in a while.<p>One thing I really didn't expect: the serving backend matters. Same Mistral-Nemo 12B weights produce 7% accuracy on llama-server with native function calling and 83% on Llamafile in prompt mode. A 75-point swing from infrastructure alone. I don't think anyone's published this because standard benchmarks don't control for serving backend.<p>Another surprise: there's no distinction in current LLM tool-calling between "the tool ran successfully and returned data" and "the tool ran successfully but found nothing." Both return a value, the orchestrator marks the step complete, and bad data cascades downstream. It's the equivalent of HTTP having 200 but no 404. Forge adds this as a new exception class (ToolResolutionError) - the model sees the error and can retry instead of silently passing garbage forward.<p>Biggest technical challenge was context compaction for memory-constrained hardware. Both Ollama and Llamafile silently fall back to CPU when the model exceeds VRAM - no warning, no error, just 10-100x slower inference. Forge queries nvidia-smi at startup and derives a token budget to prevent this.<p>How to try it:<p>- Clone the repo, run the eval harness on a model I haven't tested. If you get interesting results I'll add them to the dashboard.<p>- Try the proxy server mode - point any OpenAI-compatible client at Forge and it handles guardrails transparently. It's the newest model and I'd love more eyes on it.<p>- Dogfooding led me to optimize model parameters in v0.6.0. The harder eval suite (26 scenarios) is designed to raise the ceiling so no one sits at 100%. Several that did on the original suite can't sweep it - including Opus 4.6. Curious if anyone finds scenarios that expose gaps I haven't thought of. Paper numbers based on pre v0.6.0 code.<p>Background: prior ML publication in unsupervised learning (83 citations). This paper accepted to ACM CAIS '26 - presenting May 26-29.<p>Repo: <a href="https://github.com/antoinezambelli/forge" rel="nofollow">https://github.com/antoinezambelli/forge</a><p>Paper: <a href="https://www.caisconf.org/program/2026/demos/forge-agentic-reliability/" rel="nofollow">https://www.caisconf.org/program/2026/demos/forge-agentic-re...</a> <a href="https://github.com/antoinezambelli/forge/blob/main/docs/forge_ieee_preprint.pdf" rel="nofollow">https://github.com/antoinezambelli/forge/blob/main/docs/forg...</a><p>Dashboard: <a href="https://github.com/antoinezambelli/forge/docs/results/dashboard.html" rel="nofollow">https://github.com/antoinezambelli/forge/docs/results/dashbo...</a>
Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks
Hi HN, I'm Antoine Zambelli, AI Director at Texas Instruments.<p>I built Forge, an open-source reliability layer for self-hosted LLM tool-calling.<p>What it does:<p>- Adds domain-and-tool-agnostic guardrails (retry nudges, step enforcement, error recovery, VRAM-aware context management) to local models running on consumer hardware<p>- Takes an 8B model from ~53% to ~99% on multi-step agentic workflows without changing the model - just the system around it<p>- Ships with an eval harness and interactive dashboard so you can reproduce every number<p>I wanted to run a handful of always-on agentic systems for my portfolio, didn't want to pay cloud frontier costs, and immediately hit the compounding math problem on local models. 90% per-step accuracy sounds great, but with a 5-step workflow that's a 40% failure rate. No existing framework seemed to address this mechanical reliability issue - they all seemed tailor-made for cloud frontier.<p>Demo video: <a href="https://youtu.be/MzRgJoJAXGc" rel="nofollow">https://youtu.be/MzRgJoJAXGc</a> (side-by-side: same model, same task, with and without Forge guardrails)<p>The paper (accepted to ACM CAIS '26, presenting May 26-29 in San Jose) covers the peer-reviewed findings across 97 model/backend configurations, 18 scenarios, 50 runs each. Key numbers:<p>- Ministral 8B with Forge: 99.3%. Claude Sonnet with Forge: 100%. The gap between a free local 8B model on a $600 GPU and a frontier API is less than 1 point.<p>- The same 8B local model with Forge (99.3%) outperforms Claude Sonnet without guardrails (87.2%) - an 8B model with framework support beats the best result you can get through frontier API alone.<p>- Error recovery scores 0% for every model tested - local and frontier - without the retry mechanism. Not a capability gap, an architectural absence.<p>I'm currently using this for my home assistant running on Ministral 14B-Reasoning, and for my locally hosted agentic coding harness (8B managed to contribute to the codebase!).<p>The guardrail stack has five layers, each independently toggleable. The two that carry the most weight (per ablation study with McNemar's test): retry nudges (24-49 point drops when disabled) and error recovery (~10 point drops, significant for every model tested). Step enforcement is situational - only fires for models with weaker sequencing discipline. Rescue parsing and context compaction showed no significance in the eval but are retained for production workloads where they activate once in a while.<p>One thing I really didn't expect: the serving backend matters. Same Mistral-Nemo 12B weights produce 7% accuracy on llama-server with native function calling and 83% on Llamafile in prompt mode. A 75-point swing from infrastructure alone. I don't think anyone's published this because standard benchmarks don't control for serving backend.<p>Another surprise: there's no distinction in current LLM tool-calling between "the tool ran successfully and returned data" and "the tool ran successfully but found nothing." Both return a value, the orchestrator marks the step complete, and bad data cascades downstream. It's the equivalent of HTTP having 200 but no 404. Forge adds this as a new exception class (ToolResolutionError) - the model sees the error and can retry instead of silently passing garbage forward.<p>Biggest technical challenge was context compaction for memory-constrained hardware. Both Ollama and Llamafile silently fall back to CPU when the model exceeds VRAM - no warning, no error, just 10-100x slower inference. Forge queries nvidia-smi at startup and derives a token budget to prevent this.<p>How to try it:<p>- Clone the repo, run the eval harness on a model I haven't tested. If you get interesting results I'll add them to the dashboard.<p>- Try the proxy server mode - point any OpenAI-compatible client at Forge and it handles guardrails transparently. It's the newest model and I'd love more eyes on it.<p>- Dogfooding led me to optimize model parameters in v0.6.0. The harder eval suite (26 scenarios) is designed to raise the ceiling so no one sits at 100%. Several that did on the original suite can't sweep it - including Opus 4.6. Curious if anyone finds scenarios that expose gaps I haven't thought of. Paper numbers based on pre v0.6.0 code.<p>Background: prior ML publication in unsupervised learning (83 citations). This paper accepted to ACM CAIS '26 - presenting May 26-29.<p>Repo: <a href="https://github.com/antoinezambelli/forge" rel="nofollow">https://github.com/antoinezambelli/forge</a><p>Paper: <a href="https://www.caisconf.org/program/2026/demos/forge-agentic-reliability/" rel="nofollow">https://www.caisconf.org/program/2026/demos/forge-agentic-re...</a> <a href="https://github.com/antoinezambelli/forge/blob/main/docs/forge_ieee_preprint.pdf" rel="nofollow">https://github.com/antoinezambelli/forge/blob/main/docs/forg...</a><p>Dashboard: <a href="https://github.com/antoinezambelli/forge/docs/results/dashboard.html" rel="nofollow">https://github.com/antoinezambelli/forge/docs/results/dashbo...</a>
Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks
Hi HN, I'm Antoine Zambelli, AI Director at Texas Instruments.<p>I built Forge, an open-source reliability layer for self-hosted LLM tool-calling.<p>What it does:<p>- Adds domain-and-tool-agnostic guardrails (retry nudges, step enforcement, error recovery, VRAM-aware context management) to local models running on consumer hardware<p>- Takes an 8B model from ~53% to ~99% on multi-step agentic workflows without changing the model - just the system around it<p>- Ships with an eval harness and interactive dashboard so you can reproduce every number<p>I wanted to run a handful of always-on agentic systems for my portfolio, didn't want to pay cloud frontier costs, and immediately hit the compounding math problem on local models. 90% per-step accuracy sounds great, but with a 5-step workflow that's a 40% failure rate. No existing framework seemed to address this mechanical reliability issue - they all seemed tailor-made for cloud frontier.<p>Demo video: <a href="https://youtu.be/MzRgJoJAXGc" rel="nofollow">https://youtu.be/MzRgJoJAXGc</a> (side-by-side: same model, same task, with and without Forge guardrails)<p>The paper (accepted to ACM CAIS '26, presenting May 26-29 in San Jose) covers the peer-reviewed findings across 97 model/backend configurations, 18 scenarios, 50 runs each. Key numbers:<p>- Ministral 8B with Forge: 99.3%. Claude Sonnet with Forge: 100%. The gap between a free local 8B model on a $600 GPU and a frontier API is less than 1 point.<p>- The same 8B local model with Forge (99.3%) outperforms Claude Sonnet without guardrails (87.2%) - an 8B model with framework support beats the best result you can get through frontier API alone.<p>- Error recovery scores 0% for every model tested - local and frontier - without the retry mechanism. Not a capability gap, an architectural absence.<p>I'm currently using this for my home assistant running on Ministral 14B-Reasoning, and for my locally hosted agentic coding harness (8B managed to contribute to the codebase!).<p>The guardrail stack has five layers, each independently toggleable. The two that carry the most weight (per ablation study with McNemar's test): retry nudges (24-49 point drops when disabled) and error recovery (~10 point drops, significant for every model tested). Step enforcement is situational - only fires for models with weaker sequencing discipline. Rescue parsing and context compaction showed no significance in the eval but are retained for production workloads where they activate once in a while.<p>One thing I really didn't expect: the serving backend matters. Same Mistral-Nemo 12B weights produce 7% accuracy on llama-server with native function calling and 83% on Llamafile in prompt mode. A 75-point swing from infrastructure alone. I don't think anyone's published this because standard benchmarks don't control for serving backend.<p>Another surprise: there's no distinction in current LLM tool-calling between "the tool ran successfully and returned data" and "the tool ran successfully but found nothing." Both return a value, the orchestrator marks the step complete, and bad data cascades downstream. It's the equivalent of HTTP having 200 but no 404. Forge adds this as a new exception class (ToolResolutionError) - the model sees the error and can retry instead of silently passing garbage forward.<p>Biggest technical challenge was context compaction for memory-constrained hardware. Both Ollama and Llamafile silently fall back to CPU when the model exceeds VRAM - no warning, no error, just 10-100x slower inference. Forge queries nvidia-smi at startup and derives a token budget to prevent this.<p>How to try it:<p>- Clone the repo, run the eval harness on a model I haven't tested. If you get interesting results I'll add them to the dashboard.<p>- Try the proxy server mode - point any OpenAI-compatible client at Forge and it handles guardrails transparently. It's the newest model and I'd love more eyes on it.<p>- Dogfooding led me to optimize model parameters in v0.6.0. The harder eval suite (26 scenarios) is designed to raise the ceiling so no one sits at 100%. Several that did on the original suite can't sweep it - including Opus 4.6. Curious if anyone finds scenarios that expose gaps I haven't thought of. Paper numbers based on pre v0.6.0 code.<p>Background: prior ML publication in unsupervised learning (83 citations). This paper accepted to ACM CAIS '26 - presenting May 26-29.<p>Repo: <a href="https://github.com/antoinezambelli/forge" rel="nofollow">https://github.com/antoinezambelli/forge</a><p>Paper: <a href="https://www.caisconf.org/program/2026/demos/forge-agentic-reliability/" rel="nofollow">https://www.caisconf.org/program/2026/demos/forge-agentic-re...</a> <a href="https://github.com/antoinezambelli/forge/blob/main/docs/forge_ieee_preprint.pdf" rel="nofollow">https://github.com/antoinezambelli/forge/blob/main/docs/forg...</a><p>Dashboard: <a href="https://github.com/antoinezambelli/forge/docs/results/dashboard.html" rel="nofollow">https://github.com/antoinezambelli/forge/docs/results/dashbo...</a>