The best Hacker News stories from Show from the past day
Latest posts:
Show HN: Pg_deltax, Apache-licensed alternative to TimescaleDB
Show HN: Haystack – Review the PRs that need human attention
Hey HN! We're building Haystack (<a href="https://haystackeditor.com/">https://haystackeditor.com/</a>) to help teams deal with the explosion in the number of pull requests that need to be reviewed due to the rise of coding agents.<p>Haystack replaces the GitHub PR review system with a queue that triages each PR before a human has to read any diffs. It looks at the diffs, the codebase, and the coding-agent conversation that produced the PR. Haystack then routes it into one of three buckets:<p>1. Safe to merge. This means the PR has enough evidence behind it that the team can merge it without another human's review.<p>Some examples:<p>-- A small UI copy change that includes a screenshot showing the final state<p>-- A backend change where the author clearly tested the important paths and ran the changes in a real environment<p>2. Needs fixes. This means that the PR has bugs or violates a rule in your codebase and therefore the PR needs to be fixed by the author.<p>Some examples:<p>-- The agent was asked to make loading a large table faster by adding pagination, but the PR still loads every result at once and "implements" pagination in the UI<p>-- The PR silently catches an error instead of logging, surfacing, or handling it. This violates the team's "no silent error swallowing" rule<p>3. Needs human review. This means that the PR could not be sufficiently verified by the author or is touching a sensitive part of the codebase (determined by user-input guidelines) and thus requires human review.<p>Some examples:<p>-- The PR changes a significant amount of logic in billing<p>-- The PR changes an important user flow like onboarding, but the author only ran unit tests and never opened the app to check the flow end-to-end. 
That violates the team's rule that high-impact user-facing changes need manual verification.<p>Instead of starting with line-by-line diffs, Haystack immediately tells the reviewer the goal behind the PR, what design decisions the author made (informed by their coding-agent conversation), and how much the author did to verify that the pull request works (e.g. ran scripts, checked the frontend, etc.).<p>In this way, review shifts from "what changed?" to "is this the right behavior and is there evidence that it works?".<p>Here's a quick demo: <a href="https://www.tella.tv/video/streamlining-code-reviews-with-haystack-65zj" rel="nofollow">https://www.tella.tv/video/streamlining-code-reviews-with-ha...</a><p>We previously launched Haystack as a tool for understanding large PRs (<a href="https://news.ycombinator.com/item?id=45201703">https://news.ycombinator.com/item?id=45201703</a>). As many of you can probably relate to, the release of Opus 4.5 completely shattered our conception of how fast an engineer could craft a PR.<p>And as coding agents got even better after 4.5, we realized that pull request review did not scale along with our coding velocity. With each member of our team being able to pump out more than 20 pull requests a day, code review quickly became cognitively exhausting and less helpful.<p>After talking with other folks, we learned many feel similarly, and currently face the binary choice of either not doing review at all or trying to keep up with a fire hose of pull requests.<p>Haystack is our attempt at a third path. We still believe in code review, but as coding agents produce more code, human reviewer attention becomes more valuable and more expensive.<p>Haystack helps teams spend that attention on the PRs where a human can meaningfully change the outcome of that PR. 
And for such PRs, Haystack shows the reviewer what the PR intended to do, whether the author showed that it works, and what design decisions need a second pair of eyes.<p>We're still quite early and are figuring out whether Haystack truly makes code review better. We would love any and all feedback!
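The three-bucket routing described in the Haystack post can be sketched roughly as follows. This is only an illustration of the idea, not Haystack's implementation: the `PullRequest` fields, the `triage` function, and the `sensitive_prefixes` parameter are all hypothetical names, and the real product reportedly also reads the diffs, the codebase, and the coding-agent conversation.

```python
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    SAFE_TO_MERGE = "safe to merge"
    NEEDS_FIXES = "needs fixes"
    NEEDS_HUMAN_REVIEW = "needs human review"


@dataclass
class PullRequest:
    # All fields are hypothetical; they stand in for signals a triage step might extract.
    touched_paths: list[str]
    rule_violations: list[str]   # e.g. ["silent error swallowing"]
    verified_end_to_end: bool    # did the author show evidence that the change works?


def triage(pr: PullRequest, sensitive_prefixes: tuple[str, ...]) -> Bucket:
    """Route a PR into one of the three buckets named in the post."""
    # Bugs or rule violations go back to the author first.
    if pr.rule_violations:
        return Bucket.NEEDS_FIXES
    # Sensitive areas, or changes without enough evidence, need a human.
    touches_sensitive = any(p.startswith(sensitive_prefixes) for p in pr.touched_paths)
    if touches_sensitive or not pr.verified_end_to_end:
        return Bucket.NEEDS_HUMAN_REVIEW
    return Bucket.SAFE_TO_MERGE
```

Checking rule violations before sensitivity is a design choice of this sketch: a PR that is already known to be broken is returned to its author before it costs any reviewer attention.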
Show HN: Lance – image/video generation and understanding in one model
The model has 3B active parameters. We put the code, homepage, paper and model links here:<p>- Code: <a href="https://github.com/bytedance/Lance" rel="nofollow">https://github.com/bytedance/Lance</a><p>- Homepage: <a href="https://lance-project.github.io/" rel="nofollow">https://lance-project.github.io/</a><p>- Paper: <a href="https://arxiv.org/abs/2605.18678" rel="nofollow">https://arxiv.org/abs/2605.18678</a><p>- Model: <a href="https://huggingface.co/bytedance-research/Lance" rel="nofollow">https://huggingface.co/bytedance-research/Lance</a><p>p.s. Lance is a research project, not a polished product. The model was trained using fewer than 128 GPUs.
Show HN: Id-agent – Token efficient UUID alternative for AI agents
Show HN: Superlog (YC P26) – Observability that installs itself and fixes bugs
Hey HN, we’re Nico and Arseniy, co-founders of Superlog (<a href="https://superlog.sh">https://superlog.sh</a>). We're building a self-installing, self-healing observability tool that you should never need to open. It has a wizard that runs daily to set up proper logging and an agent that investigates errors and opens PRs.<p>Super short demo: <a href="https://www.youtube.com/watch?v=xFhU9Mk247M" rel="nofollow">https://www.youtube.com/watch?v=xFhU9Mk247M</a>.<p>In our earlier startups, we tried Sentry, Datadog, Grafana, Dash0, and nothing was good enough. Proper telemetry and alerting still require a ton of manual setup. We struggled with adding good logs, so debugging was tough, especially as codebases grow at a faster pace. Meanwhile, the Datadog/Dash0 bill kept climbing, and we still spent engineering hours to learn, configure, and maintain our observability tooling.<p>With Sentry, we found ourselves flooded by a stream of alerts into our Slack channel. Most were duplicates or lacked context, so alert fatigue and constant interrupts were a real pain. The #ops notification is consistently the worst feeling on a Saturday morning.<p>Too many times we’ve seen servers run out of memory and disk, and three AWS metrics giving us three different values. Half of the graphs on dashboards are normally empty or outdated, and manually clicking through UIs, especially when the team is small, seems like a huge waste of time.<p>At some point we realized that solving this problem would be more valuable than the things we had been working on, and we had the expertise to do it, since Arseniy had spent years at Datadog, getting paged during the night to debug production incidents. So we decided to build a platform that would just work: agent-first, MCP-native, zero-setup.<p>Here’s how Superlog works: we have a wizard that scans your repo and automatically instruments it with well-structured logs, traces and metrics via OpenTelemetry. 
We make sure to highlight main failure modes, endpoint performance, usage per tenant, and LLM/upstream cost (by callsite, tenant and model).<p>Errors get fingerprinted and grouped into incidents, so you see one issue, not a thousand duplicates. When you get a notification from Superlog, you see a clear failure summary with its inferred severity and impact upfront.<p>Then the agent investigates and tries to solve the issue. If it has enough context, it produces a concise and tested PR. If it doesn't, it posts its findings for the investigating team, and automatically pulls in the engineers that could contribute more context based on documentation, previous investigations and Slack threads.<p>Either way the output is one clean PR per incident, posted in Slack, that you can merge, ignore, or open as a Claude Code session and modify.<p>Three things we think are different from other observability vendors:<p>(1) We solve the setup pain. The wizard will instrument everything with native OTel SDKs, respecting the semantic conventions, with proper service and environment tagging. We’re also working on native automatic dashboards and alerts, so that you can see what’s going on at a glance and don’t miss subtle failure modes.<p>(2) Our telemetry doesn’t decay. The wizard runs daily, and keeps adding logs, alerts and dashboards where they're needed. You don't have to remember to instrument new features. The next time something breaks, the data you need to debug it is already there.<p>(3) Our goal is to solve alert fatigue. We use agents to merge similar errors and refine the summaries, giving you relevant information upfront. We have a custom evaluation setup that makes sure that our summaries are dense and correct, and that severity and impact are on point. We also give you confidence scores for every LLM-enhanced metric so that wrong guesses don’t get boosted.<p>Important: Superlog telemetry is vendor-neutral, so you keep all the logs/metrics/traces we install. Pricing is on the site. We're early, so expect rough edges and please tell us when you find them.<p>You can try it at <a href="https://superlog.sh">https://superlog.sh</a>. We'd love to hear what you're using today, what's broken about it, and whether the "one mergeable PR per incident" model sounds useful or terrifying. Especially keen to hear from folks running integration-heavy products, anyone who's rolled their own observability, and anyone who has tried Sentry / Datadog MCPs and given up. Comments and feedback welcome!
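The "errors get fingerprinted and grouped into incidents" step in the Superlog post can be illustrated with a minimal sketch. Superlog's actual method is not described (the post mentions agents merging similar errors), so this swaps in a plain normalize-and-hash approach; `fingerprint` and `group_into_incidents` are hypothetical names.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(error_type: str, message: str) -> str:
    """Hash an error after masking its variable parts (an assumed normalization)."""
    # Mask hex addresses and numbers so "Request 123 ..." and "Request 456 ..." match.
    normalized = re.sub(r"0x[0-9a-f]+|\d+", "<n>", message.lower())
    return hashlib.sha256(f"{error_type}:{normalized}".encode()).hexdigest()[:12]


def group_into_incidents(errors):
    """Group (error_type, message) pairs so each fingerprint becomes one incident."""
    incidents = defaultdict(list)
    for error_type, message in errors:
        incidents[fingerprint(error_type, message)].append(message)
    return dict(incidents)


errors = [
    ("TimeoutError", "Request 123 timed out after 30s"),
    ("TimeoutError", "Request 456 timed out after 31s"),
    ("KeyError", "missing key user_id"),
]
incidents = group_into_incidents(errors)
print(len(incidents))  # two incidents: both timeouts collapse into one
```

A purely textual fingerprint like this cannot merge errors that are worded differently but share a root cause, which is presumably where an LLM-based merging step would add value.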
Show HN: I made a 3D pose maker for artists
Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks
Hi HN, I'm Antoine Zambelli, AI Director at Texas Instruments.<p>I built Forge, an open-source reliability layer for self-hosted LLM tool-calling.<p>What it does:<p>- Adds domain-and-tool-agnostic guardrails (retry nudges, step enforcement, error recovery, VRAM-aware context management) to local models running on consumer hardware<p>- Takes an 8B model from ~53% to ~99% on multi-step agentic workflows without changing the model - just the system around it<p>- Ships with an eval harness and interactive dashboard so you can reproduce every number<p>I wanted to run a handful of always-on agentic systems for my portfolio, didn't want to pay cloud frontier costs, and immediately hit the compounding math problem on local models. 90% per-step accuracy sounds great, but with a 5-step workflow that's roughly a 40% failure rate (0.9^5 ≈ 0.59 success). No existing framework seemed to address this mechanical reliability issue - they all seemed tailor-made for cloud frontier models.<p>Demo video: <a href="https://youtu.be/MzRgJoJAXGc" rel="nofollow">https://youtu.be/MzRgJoJAXGc</a> (side-by-side: same model, same task, with and without Forge guardrails)<p>The paper (accepted to ACM CAIS '26, presenting May 26-29 in San Jose) covers the peer-reviewed findings across 97 model/backend configurations, 18 scenarios, 50 runs each. Key numbers:<p>- Ministral 8B with Forge: 99.3%. Claude Sonnet with Forge: 100%. The gap between a free local 8B model on a $600 GPU and a frontier API is less than 1 point.<p>- The same 8B local model with Forge (99.3%) outperforms Claude Sonnet without guardrails (87.2%) - an 8B model with framework support beats the best result you can get through frontier API alone.<p>- Error recovery scores 0% for every model tested - local and frontier - without the retry mechanism. 
Not a capability gap, an architectural absence.<p>I'm currently using this for my home assistant running on Ministral 14B-Reasoning, and for my locally hosted agentic coding harness (8B managed to contribute to the codebase!).<p>The guardrail stack has five layers, each independently toggleable. The two that carry the most weight (per ablation study with McNemar's test): retry nudges (24-49 point drops when disabled) and error recovery (~10 point drops, significant for every model tested). Step enforcement is situational - only fires for models with weaker sequencing discipline. Rescue parsing and context compaction showed no significance in the eval but are retained for production workloads where they activate once in a while.<p>One thing I really didn't expect: the serving backend matters. Same Mistral-Nemo 12B weights produce 7% accuracy on llama-server with native function calling and 83% on Llamafile in prompt mode. A 75-point swing from infrastructure alone. I don't think anyone's published this because standard benchmarks don't control for serving backend.<p>Another surprise: there's no distinction in current LLM tool-calling between "the tool ran successfully and returned data" and "the tool ran successfully but found nothing." Both return a value, the orchestrator marks the step complete, and bad data cascades downstream. It's the equivalent of HTTP having 200 but no 404. Forge adds this as a new exception class (ToolResolutionError) - the model sees the error and can retry instead of silently passing garbage forward.<p>Biggest technical challenge was context compaction for memory-constrained hardware. Both Ollama and Llamafile silently fall back to CPU when the model exceeds VRAM - no warning, no error, just 10-100x slower inference. Forge queries nvidia-smi at startup and derives a token budget to prevent this.<p>How to try it:<p>- Clone the repo, run the eval harness on a model I haven't tested. 
If you get interesting results I'll add them to the dashboard.<p>- Try the proxy server mode - point any OpenAI-compatible client at Forge and it handles guardrails transparently. It's the newest mode and I'd love more eyes on it.<p>- Dogfooding led me to optimize model parameters in v0.6.0. The harder eval suite (26 scenarios) is designed to raise the ceiling so no one sits at 100%. Several models that did on the original suite can't sweep it - including Opus 4.6. Curious if anyone finds scenarios that expose gaps I haven't thought of. Paper numbers are based on pre-v0.6.0 code.<p>Background: prior ML publication in unsupervised learning (83 citations). This paper was accepted to ACM CAIS '26 - presenting May 26-29.<p>Repo: <a href="https://github.com/antoinezambelli/forge" rel="nofollow">https://github.com/antoinezambelli/forge</a><p>Paper: <a href="https://www.caisconf.org/program/2026/demos/forge-agentic-reliability/" rel="nofollow">https://www.caisconf.org/program/2026/demos/forge-agentic-re...</a> <a href="https://github.com/antoinezambelli/forge/blob/main/docs/forge_ieee_preprint.pdf" rel="nofollow">https://github.com/antoinezambelli/forge/blob/main/docs/forg...</a><p>Dashboard: <a href="https://github.com/antoinezambelli/forge/docs/results/dashboard.html" rel="nofollow">https://github.com/antoinezambelli/forge/docs/results/dashbo...</a>
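Two ideas from the Forge post lend themselves to a short sketch: the compounding per-step accuracy, and the "ran successfully but found nothing" signal. The code below is an illustration under stated assumptions, not Forge's API. Only the class name `ToolResolutionError` comes from the post; its definition here is a stand-in, and `workflow_success_rate`, `call_with_retry`, and `lookup_user` are hypothetical. The list of candidate queries stands in for a model re-issuing a tool call after it sees the error.

```python
class ToolResolutionError(Exception):
    """Stand-in definition: the tool ran without crashing but resolved nothing."""


def workflow_success_rate(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_accuracy ** steps


def lookup_user(name: str) -> dict:
    """Hypothetical tool: raises instead of returning an empty value."""
    users = {"ada": {"id": 1}}
    if name not in users:
        raise ToolResolutionError(f"no user named {name!r}")
    return users[name]


def call_with_retry(tool, queries):
    """Try each candidate query; surface 'found nothing' rather than pass it on."""
    errors = []
    for query in queries:
        try:
            return tool(query), errors
        except ToolResolutionError as exc:
            # In a real loop this message would be shown to the model as a retry nudge.
            errors.append(str(exc))
    raise ToolResolutionError(f"all {len(queries)} attempts found nothing: {errors}")


print(round(workflow_success_rate(0.9, 5), 5))  # 0.59049, i.e. about 41% of runs fail
result, errors = call_with_retry(lookup_user, ["adaa", "ada"])
print(result, errors)
```

Raising a dedicated exception, rather than returning `None` or an empty list, means the orchestrator cannot mark the step complete by accident, which matches the "HTTP 200 but no 404" analogy in the post.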
Show HN: Number Gacha, a gacha game distilled to its essence
Number Gacha is a half-parody, half-real gacha game where you roll, unwrap, and battle numbers. Play on desktop for the best experience!
Show HN: Gaussian Splat of a Strawberry
The Setup:<p><a href="https://i.imgur.com/o0hgybh.jpeg" rel="nofollow">https://i.imgur.com/o0hgybh.jpeg</a><p><a href="https://i.imgur.com/mcNiomp.jpeg" rel="nofollow">https://i.imgur.com/mcNiomp.jpeg</a><p><a href="https://i.imgur.com/vIjw6pc.jpeg" rel="nofollow">https://i.imgur.com/vIjw6pc.jpeg</a><p><a href="https://i.imgur.com/nzOwmSC.jpeg" rel="nofollow">https://i.imgur.com/nzOwmSC.jpeg</a>
Show HN: We missed Winamp, so we built an audio player for macOS
Show HN: Auto-identity-remove – Automated data broker opt-out runner for macOS
Show HN: Files.md – Open-source alternative to Obsidian