The best Show HN stories from Hacker News from the past day
Latest posts:
Show HN: Lemon Slice Live – Have a video call with a transformer model
Hey HN, this is Lina, Andrew, and Sidney from Lemon Slice. We’ve trained a custom diffusion transformer (DiT) model that streams video at 25fps and wrapped it into a demo that lets anyone turn a photo into a real-time, talking avatar. Here’s an example conversation from co-founder Andrew: https://www.youtube.com/watch?v=CeYp5xQMFZY. Try it for yourself at https://lemonslice.com/live.

(Btw, we used to be called Infinity AI and did a Show HN under that name last year: https://news.ycombinator.com/item?id=41467704.)

Unlike existing avatar video chat platforms like HeyGen, Tolan, or Apple Memoji filters, we do not require training custom models, rigging a character ahead of time, or having a human drive the avatar. Our tech lets users create and immediately video-call a custom character by uploading a single image. The character image can be any style - from photorealistic to cartoons, paintings, and more.

To achieve this demo, we had to do the following (among other things! but these were the hardest):

1. Training a fast DiT model. To make our video generation fast, we had to design a model that makes the right trade-offs between speed and quality, and use standard distillation approaches. We first trained a custom video diffusion transformer (DiT) from scratch that achieves excellent lip and facial-expression sync to audio. To further optimize the model for speed, we applied teacher-student distillation. The distilled model generates video at 25fps at 256-px resolution. Purpose-built transformer ASICs should eventually allow us to stream our video model at 4K resolution.

2. Solving the infinite-video problem. Most video DiT models (Sora, Runway, Kling) generate 5-second chunks. They can iteratively extend a clip by another 5 seconds by feeding the end of the first chunk into the start of the second in an autoregressive manner. Unfortunately, the models degrade in quality after multiple extensions because generation errors accumulate. We developed a temporal consistency preservation technique that maintains visual coherence across long sequences. It significantly reduces artifact accumulation and allows us to generate indefinitely long videos.

3. A complex streaming architecture with minimal latency. An end-to-end avatar zoom call requires several building blocks in addition to video generation: voice transcription, LLM inference, and text-to-speech. We use Deepgram as our AI voice partner, Modal as the end-to-end compute platform, and Daily.co and Pipecat to build a parallel processing pipeline that orchestrates everything via continuously streaming chunks. Our system achieves end-to-end latency of 3-6 seconds from user input to avatar response. Our target is <2 second latency.

More technical details here: https://lemonslice.com/live/technical-report.

Current limitations that we want to solve include: (1) enabling whole-body and background motions (we’re training a next-gen model for this), (2) reducing delays and improving resolution (purpose-built ASICs will help), (3) training a model on dyadic conversations so that avatars learn to listen naturally, and (4) allowing the character to “see you” and respond to what it sees to create a more natural and engaging conversation.

We believe that generative video will usher in a new media type centered around interactivity: TV shows, movies, ads, and online courses will stop and talk to us. Our entertainment will be a mixture of passive and active experiences depending on what we’re in the mood for. Well, prediction is hard, especially about the future, but that’s how we see it anyway!

We’d love for you to try out the demo and let us know what you think! Post your characters and/or conversation recordings below.
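To make the chunked streaming idea in point 3 concrete, here is a minimal Python sketch of a parallel pipeline in which transcription, LLM, TTS, and video stages pass chunks downstream as soon as they are ready. This is not Lemon Slice's actual stack; the stage names, queue layout, and simulated latencies are placeholders for illustration only.

```python
# Hypothetical sketch of a chunked streaming pipeline: each stage consumes
# chunks from an input queue and pushes results downstream immediately, so
# later stages start working before earlier ones have finished the stream.
import asyncio

async def stage(name, inbox, outbox, work_seconds):
    """Generic pipeline stage: transform each chunk and forward it."""
    while True:
        chunk = await inbox.get()
        if chunk is None:                       # sentinel: shut down and propagate
            if outbox is not None:
                await outbox.put(None)
            break
        await asyncio.sleep(work_seconds)       # stand-in for STT/LLM/TTS/video work
        result = f"{name}({chunk})"
        if outbox is not None:
            await outbox.put(result)
        else:
            print("stream to client:", result)

async def main():
    q_audio, q_text, q_reply, q_speech = (asyncio.Queue() for _ in range(4))
    stages = [
        stage("transcribe", q_audio,  q_text,   0.05),  # voice transcription
        stage("llm",        q_text,   q_reply,  0.10),  # LLM inference
        stage("tts",        q_reply,  q_speech, 0.05),  # text-to-speech
        stage("video",      q_speech, None,     0.04),  # talking-avatar frames
    ]
    tasks = [asyncio.create_task(s) for s in stages]
    for chunk in ["audio-chunk-1", "audio-chunk-2", "audio-chunk-3"]:
        await q_audio.put(chunk)
    await q_audio.put(None)
    await asyncio.gather(*tasks)

asyncio.run(main())
```

Because every stage only waits on its own queue, a second audio chunk can be transcribed while the first is still being rendered into video, which is what keeps end-to-end latency bounded by the slowest stage rather than the sum of all stages.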
Show HN: I reverse engineered top websites to build an animated UI library
Looking at websites such as Clerk, I began thinking that design engineers might be some kind of wizards. I wanted to understand how they do it, so I started reverse-engineering their components out of curiosity. One thing led to another, and I ended up building a small library of reusable, animated components based on what I found. The library is built with React and Framer Motion. I’d love to hear your feedback!
Show HN: My from-scratch OS kernel that runs DOOM
Hi there! I've been working on-and-off on TacOS for a few months. It follows some UNIX-derived concepts (exec/fork, a Unix-style VFS, etc.) and is now able to run a port of Doom, with a fairly small number of modifications, using my from-scratch libc. The performance is actually decent compared to what I expected. Very interested to hear your thoughts. Thank you!
Show HN: Moose – OSS framework to build analytical back ends with ClickHouse
Show HN: Index – New Open Source browser agent
Hey HN, Robert from Laminar (lmnr.ai) here.

We built Index - a new SOTA open-source browser agent.

It reached 92% on WebVoyager with Claude 3.7 (extended thinking). o1 was used as a judge, and we also manually double-checked the judge.

At the core is the same old idea: run a simple JS script in the browser to identify interactable elements -> draw bounding boxes around them on a screenshot of the browser window -> feed it to the LLM.

What made Index so good:

1. We essentially created browser agent observability. We patched Playwright to record the entire browser session while the agent operates, simultaneously tracing all agent steps and LLM calls. Then we synchronized everything in the UI, creating an unparalleled debugging experience. This allowed us to pinpoint exactly where the agent fails by seeing what it "sees" in session replay alongside execution traces.

2. Our detection script is simple but extremely good. It's carefully crafted via trial and error. We also employed CV and OCR.

3. The agent is very simple - literally just a while loop. All its power comes from a carefully crafted prompt and a ton of eval runs.

Index is a simple Python package. It also comes with a beautiful CLI.

pip install lmnr-index
playwright install chromium
index run

We've recently added o4-mini, Gemini 2.5 Pro and Flash. Pro is extremely good and fast. Give it a try via the CLI.

You can also use Index via the serverless API (https://docs.lmnr.ai/index-agent/api/getting-started) or via the chat UI: https://lmnr.ai/chat.

To learn more about browser agent observability and evals, check out the open-source repo (https://github.com/lmnr-ai/lmnr) and our docs (https://docs.lmnr.ai/tracing/browser-agent-observability).
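For intuition, here is a minimal, hypothetical sketch of the screenshot -> detect elements -> LLM while-loop described above, using Playwright's Python API. The detection JS, the call_llm helper, and the action format are placeholders, not Index's actual code.

```python
# Hypothetical sketch of the "agent is just a while loop" idea described above.
# DETECT_JS, call_llm(), and the action schema are placeholders, not Index's code.
from playwright.sync_api import sync_playwright

DETECT_JS = "() => [...document.querySelectorAll('a,button,input')].map(e => e.outerHTML.slice(0, 80))"

def call_llm(screenshot_png, elements, goal):
    """Placeholder: send the screenshot and element list to an LLM and get back
    an action such as {'type': 'click', 'index': 3} or {'type': 'done'}."""
    raise NotImplementedError

def run_agent(goal, start_url, max_steps=20):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(start_url)
        for _ in range(max_steps):                   # the "while loop"
            elements = page.evaluate(DETECT_JS)      # find interactable elements
            shot = page.screenshot()                 # what the agent "sees"
            action = call_llm(shot, elements, goal)  # LLM picks the next action
            if action["type"] == "done":
                break
            if action["type"] == "click":
                page.locator("a,button,input").nth(action["index"]).click()
        browser.close()
```

The interesting engineering, per the post, is not in this loop but in the detection script, the prompt, and the observability tooling wrapped around it.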
Show HN: Node.js video tutorials where you can edit and run the code
Hey HN,

I'm Sindre, CTO of Scrimba (YC S20). We originally launched Scrimba to make video learning more interactive for aspiring frontend developers. Instead of passively watching videos, you can jump in and experiment with the code directly inside the video player. Since launch, almost two million people have used Scrimba to grow their skills.

However, one limitation is that we've only supported frontend code, as our interactive videos run in the browser, whereas most of our learners want to go fullstack - building APIs, handling auth, working with databases, and so forth.

To fix this, we spent the last 6 months integrating StackBlitz WebContainers into Scrimba. This enables a full Node.js environment - including a terminal, shell, npm access, and a virtual file system - directly inside our video player. Everything runs in the browser.

Here is a 2-minute recorded demo: https://scrimba.com/s08dpq3nom

If you want to see more, feel free to enroll in any of the seven fullstack courses we've launched so far, on subjects like Node, Next, Express, SQL, Vite, and more. We've opened them up for Hacker News today so that you don't even need to create an account to watch the content:

https://scrimba.com/fullstack

Other notable highlights about our "IDE videos":

- Based on events (code edits, cursor moves, etc.) instead of pixels
- Roughly 100x smaller than traditional videos
- Recording is simple: just talk while you code
- Can be embedded in blogs, docs, or courses, like MDN does here: https://developer.mozilla.org/en-US/curriculum/core/css-fundamentals/
- Entirely built in Imba, a language I created myself: https://news.ycombinator.com/item?id=28207662

We think this format could be useful for open-source maintainers and API-focused teams looking to create interactive docs or walkthroughs. Our videos are already embedded by MDN, LangChain, and Coursera.

If you maintain a library or SDK and want an interactive video about it, let us know - happy to record one for free that you can use however you like.

Would love to answer any questions or hear people's feedback!
Show HN: Dosidicus – A digital pet with a simple neural network
Show HN: Rowboat – Open-source IDE for multi-agent systems
Hi HN! We’re Arjun, Ramnique, and Akhilesh, and we are building Rowboat (https://www.rowboatlabs.com/), an AI-assisted IDE for building and managing multi-agent systems. You start with a single agent, then scale up to teams of agents that work together, use MCP tools, and improve over time - all through a chat-based copilot.

Our repo is https://github.com/rowboatlabs/rowboat, docs are at https://docs.rowboatlabs.com/, and there’s a demo video here: https://youtu.be/YRTCw9UHRbU

It’s becoming clear that real-world agentic systems work best when multiple agents collaborate, rather than having one agent attempt to do everything. This isn’t too surprising - it’s a bit like how good code consists of multiple functions that each do one thing, rather than cramming everything into one function.

For example, a travel assistant works best when different agents handle specialized tasks: one agent finds the best flights, another optimizes hotel selections, and a third organizes the itinerary. This modular approach makes the system easier to manage, debug, and improve over time.

OpenAI’s Agents SDK provides a neat Python library to support this, but building reliable agentic systems requires constant iteration and tweaking - e.g. updating agent instructions (which can quickly get as complex as actual code), connecting tools, and testing the system and incorporating feedback. Rowboat is an AI IDE for doing all of this. Rowboat is to AI agents what Cursor is to code.

We’ve taken a code-like approach to agent instructions (prompts). There are special keywords to directly reference other agents, tools, or prompts - these are highlighted in the UI. The copilot is the best way to create and edit these instructions - each change comes with a code-style diff.

You can give agents access to tools by integrating any MCP server or connecting your own functions through a webhook. You can instruct the agents on when to use specific tools via ‘@mentions’ in the agent instruction. To enable quick testing, we added a way to mock tool responses using LLM calls.

The Rowboat playground lets you test and debug the assistants as you build them. You can see agent transfers, tool invocations, and tool responses in real time. The copilot has the context of the chat and can improve the agent instructions based on feedback. For example, you could say ‘The agent shouldn’t have done x here. Fix this’ and the copilot can go and make this fix.

You can integrate agentic systems built in Rowboat into your application via the HTTP API or the Python SDK (‘pip install rowboat’). For example, you can build user-facing chatbots, enterprise workflows, and employee assistants using Rowboat.

We’ve been working with LLMs since GPT-1 launched in 2018. Most recently, we built Coinbase’s support chatbot after our last AI startup was acquired by them.

Rowboat is Apache 2.0 licensed, giving you full freedom to self-host, modify, or extend it however you like.

We’re excited to share Rowboat with everyone here. We’d love to hear your thoughts!
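As a toy illustration of the travel-assistant idea above (a coordinator routing work to specialized agents), here is a minimal Python sketch. It is not Rowboat's implementation or SDK; the call_llm stub, the agent prompts, and the routing scheme are all hypothetical.

```python
# Toy multi-agent routing sketch for the travel-assistant example above.
# Not Rowboat's code; call_llm() is a stub standing in for any LLM chat client.
def call_llm(system_prompt, user_message):
    """Placeholder for an LLM call via whatever provider/client you use."""
    raise NotImplementedError

AGENTS = {
    "flights":   "You find the best flights for the user's dates and budget.",
    "hotels":    "You pick hotels that fit the itinerary and budget.",
    "itinerary": "You assemble flights and hotels into a day-by-day plan.",
}

def coordinator(user_request):
    # A coordinator agent decides which specialist should handle the request,
    # then the chosen specialist produces the actual answer.
    routing_prompt = (
        "You are a coordinator. Reply with exactly one of: " + ", ".join(AGENTS) + "."
    )
    choice = call_llm(routing_prompt, user_request).strip().lower()
    specialist_prompt = AGENTS.get(choice, AGENTS["itinerary"])
    return call_llm(specialist_prompt, user_request)
```

The appeal of splitting agents this way, as the post argues, is the same as splitting code into functions: each specialist prompt stays short enough to reason about, test, and revise independently.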
Show HN: Morphik – Open-source RAG that understands PDF images, runs locally
Hey HN, we’re Adi and Arnav. A few months ago, we hit a wall trying to get LLMs to answer questions over research papers and instruction manuals. Everything worked fine until the answer lived inside an image or diagram embedded in the PDF. Even GPT-4o flubbed it (we recently tried o3 with the same task, and surprisingly it flubbed it too). Naive RAG pipelines just pulled in some text chunks and ignored the rest.

We took an invention disclosure PDF (https://drive.google.com/file/d/1ySzQgbNZkC5dPLtE3pnnVL2rW_9aTeuG/view?usp=sharing) containing an IRR-vs-frequency graph and asked GPT “From the graph, at what frequency is the IRR maximized?”. We originally tried this on GPT-4o, but while writing this we used the new natively multimodal model o4-mini-high. After a 30-second thinking pause, it asked for clarifications, then churned out buggy code, pulled data from the wrong page, and still couldn’t answer the question. We wrote up the full story with screenshots here: https://docs.morphik.ai/blogs/gpt-vs-morphik-multimodal.

We got frustrated enough to try fixing it ourselves.

We built Morphik to do multimodal retrieval over documents like PDFs, where images and diagrams matter as much as the text.

To do this, we use ColPali-style embeddings, which treat each document page as an image and generate multi-vector representations. These embeddings capture layout, typography, and visual context, allowing retrieval to return a whole table or schematic, not just nearby tokens. Combined with vector search, this retrieves the exact pages with the relevant diagrams and passes them as images to the LLM to get relevant answers. It’s able to answer the question with an 8B Llama 3.1 vision model running locally!

Early pharma testers hit our system with queries like “Which EGFR inhibitors at 50 mg showed ≥ 30% tumor reduction?” We correctly returned the right tables and plots, but still hit a bottleneck: we weren’t able to join the dots across multiple reports. So we built a knowledge graph: we tag entities in both text and images, normalize synonyms (Erlotinib → EGFR inhibitor), infer relations (e.g. administered_at, yields_reduction), and stitch everything into a graph. Now a single query can traverse that graph across documents and surface a coherent, cross-document answer along with the correct pages as images.

To illustrate that, and just for fun, we built a graph of 100 of Paul Graham’s essays here: https://pggraph.streamlit.app/ You can search for various nodes (e.g. startup, Sam Altman, Paul Graham) and see the corresponding connections. In our system, we create graphs and store the relevant text chunks along with the entities, so on querying we can extract the relevant entity, do a search on the graph, and pull in the text chunks of all connected nodes, improving cross-document queries.

For longer or multi-turn queries, we added persistent KV caching, which stores intermediate key-value states from transformer attention layers. Instead of recomputing attention from scratch every time, we reuse the prior key-value states, speeding up repeated queries and letting us handle much longer context windows.

We’re open-source under the MIT Expat license: https://github.com/morphik-org/morphik-core

Would love to hear your RAG horror stories - what worked, what didn’t - and any feedback on Morphik. We’re here for it.
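To make the multi-vector retrieval idea concrete, here is a small numpy sketch of ColBERT/ColPali-style late-interaction (MaxSim) scoring: each query token embedding is matched against every page-patch embedding, and the per-token maxima are summed to score a page. The dimensions and random vectors are illustrative only; real systems use embeddings from a vision-language model.

```python
# Illustrative ColPali-style late-interaction (MaxSim) scoring with random data.
import numpy as np

def maxsim_score(query_vecs, page_vecs):
    """query_vecs: (num_query_tokens, d); page_vecs: (num_page_patches, d).
    Both are assumed L2-normalized. Score = sum over query tokens of the best
    cosine similarity against any patch embedding of the page."""
    sims = query_vecs @ page_vecs.T      # (tokens, patches) cosine similarities
    return sims.max(axis=1).sum()        # best patch per token, then sum

rng = np.random.default_rng(0)
d = 128
query = rng.normal(size=(12, d))                      # 12 query-token embeddings
query /= np.linalg.norm(query, axis=1, keepdims=True)
pages = []
for _ in range(3):                                    # 3 candidate pages
    p = rng.normal(size=(1024, d))                    # one embedding per image patch
    pages.append(p / np.linalg.norm(p, axis=1, keepdims=True))

ranked = sorted(range(len(pages)), key=lambda i: maxsim_score(query, pages[i]), reverse=True)
print("pages ranked by MaxSim:", ranked)
```

Because the page is embedded as image patches rather than extracted text, a table or schematic scores highly whenever its visual region matches the query tokens, which is what lets retrieval return the whole page image to the LLM.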
Show HN: I open-sourced my AI toy company that runs on ESP32 and OpenAI realtime
Hi HN! Last year the project I launched here got a lot of good feedback on creating speech-to-speech AI on the ESP32. Recently I revamped the whole stack, iterated on that feedback, and made our project fully open source - all of the client, hardware, and firmware code.

This GitHub repo turns an ESP32-S3 into a realtime AI speech companion using the OpenAI Realtime API, Arduino WebSockets, Deno Edge Functions, and a full-stack web interface. You can talk to your own custom AI character, and it responds instantly.

I couldn't find a resource that helped set up a reliable, secure WebSocket (WSS) AI speech-to-speech service. While there are several useful text-to-speech (TTS) and speech-to-text (STT) repos out there, I believe none gets speech-to-speech right. OpenAI launched an embedded repo late last year which sets up WebRTC with ESP-IDF. However, it's not beginner friendly and doesn't have a server-side component for business logic.

This repo is an attempt at solving the above pains and creating a great speech-to-speech experience on Arduino with secure WebSockets, using edge servers (Deno/Supabase Edge Functions) for fast global connectivity and low latency.