The best Hacker News stories from Show from the past day
Latest posts:
Show HN: HelixDB – Open-source vector-graph database for AI applications (Rust)
Hey HN, we want to share HelixDB (<a href="https://github.com/HelixDB/helix-db/">https://github.com/HelixDB/helix-db/</a>), a project a college friend and I are working on. It’s a new database that natively intertwines graph and vector types, without sacrificing performance. It’s written in Rust and our initial focus is on supporting RAG. Here’s a video runthrough: <a href="https://screen.studio/share/szgQu3yq" rel="nofollow">https://screen.studio/share/szgQu3yq</a>.<p>Why a hybrid? Vector databases are useful for similarity queries, while graph databases are useful for relationship queries. Each stores data in a way that’s best for its main type of query (e.g. key-value stores vs. node-and-edge tables). However, many AI-driven applications need <i>both</i> similarity and relationship queries. For example, you might use vector-based semantic search to retrieve relevant legal documents, and then use graph traversal to identify relationships between cases.<p>Developers of such apps have the quandary of needing to build on top of two different databases—a vector one and a graph one—plus you have to link them together and sync the data. Even then, your two databases aren't designed to work together—for example, there’s no native way to perform joins or queries that span both systems. You’ll need to handle that logic at the application level.<p>Helix started when we realized that there are ways to integrate vector and graph data that are both fast and suitable for AI applications, especially RAG-based ones. See this cool research paper: <a href="https://arxiv.org/html/2408.04948v1" rel="nofollow">https://arxiv.org/html/2408.04948v1</a>. After reading that and some other papers on graph and hybrid RAG, we decided to build a hybrid DB. 
Our aim was to make something better to use from a developer standpoint, while also making it fast as hell.<p>After a few months of working on this as a side project, our benchmarking shows that we are on par with Pinecone and Qdrant for vectors, and our graph is up to three orders of magnitude faster than Neo4j.<p>Problems where a hybrid approach works particularly well include:<p>- Indexing codebases: you can vectorize code-snippets within a function (connected by edges) based on context and then create an AST (in a graph) from function calls, imports, dependencies, etc. Agents can look up code by similarity or keyword and then traverse the AST to get only the relevant code, which reduces hallucinations and prevents the LLM from guessing object shapes or variable/function names.<p>- Molecule discovery: model biological interactions (e.g., proteins → genes → diseases) using graph types and then embed molecule structures to find similar compounds or case studies.<p>- Enterprise knowledge management: you can represent organisational structure, projects, and people (e.g., employee → team → project) in graph form, then index internal documents, emails, or notes as vectors for semantic search and link them directly to employees/teams/projects in the graph.<p>I naively assumed when learning about databases for the first time that queries would be compiled and executed like functions in traditional programming. Turns out I was wrong: most databases interpret queries at run time, which creates unnecessary latency by sending extra data (the whole written query), compiling it at run time, and then executing it. With Helix, you write the queries in our query language (HelixQL), which is then transpiled into Rust code and built directly into the database server, where you can call a generated API endpoint.<p>Many people have a thing against “yet another query language” (doubtless for good reason!) 
but we went ahead and did it anyway, because we think it makes working with our database so much easier that it’s worth a bit of a learning curve. HelixQL takes inspiration from other query languages such as Gremlin, Cypher and SQL, with some extra ideas added in. It is declarative, while the traversals themselves are functional. This allows complete control over the traversal flow while also having a cleaner syntax. HelixQL returns JSON to make things easy for clients. Also, it uses a schema, so the queries are type-checked.<p>We took a crude approach to building the original graph engine as a way to get an MVP out, so we are now working on improving the graph engine by making traversals massively parallel and pipelined. This means data is only ever decoded from disk when it is needed, and the parts of a read are processed in parallel.<p>If you’d like to try it out in a simple RAG demo, you can follow this guide and run our Jupyter notebook: <a href="https://github.com/HelixDB/helix-db/tree/main/examples/rag_demo">https://github.com/HelixDB/helix-db/tree/main/examples/rag_d...</a><p>Many thanks! Comments and feedback welcome!
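The post doesn't show HelixQL itself, but the hybrid pattern it describes (vector search to find entry points, then graph traversal from them) can be sketched in plain Python. Everything here is a toy illustration: the in-memory `docs` store, `cosine`, and `hybrid_query` are all hypothetical stand-ins, not HelixDB's API.

```python
import math

# Toy hybrid store: each document carries an embedding (vector side)
# and citation edges to other documents (graph side).
docs = {
    "case_a": {"vec": [1.0, 0.0], "cites": ["case_b"]},
    "case_b": {"vec": [0.9, 0.1], "cites": ["case_c"]},
    "case_c": {"vec": [0.0, 1.0], "cites": []},
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def hybrid_query(query_vec, k=1, hops=1):
    # Step 1 (vector): top-k documents by cosine similarity.
    ranked = sorted(docs, key=lambda d: cosine(docs[d]["vec"], query_vec), reverse=True)
    frontier = set(ranked[:k])
    # Step 2 (graph): expand along citation edges from the hits.
    result = set(frontier)
    for _ in range(hops):
        frontier = {c for d in frontier for c in docs[d]["cites"]}
        result |= frontier
    return result

print(sorted(hybrid_query([1.0, 0.0], k=1, hops=1)))  # ['case_a', 'case_b']
```

The point of a hybrid engine is that both steps run inside one system, rather than shuttling IDs between a vector database and a graph database at the application level.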
Show HN: Codigo – The Programming Language Repository
Codigo is a site I've built for discovering, exploring and comparing programming languages, including language news, trends and code examples.<p>I couldn't find any definitive resource for finding new languages and comparing them, so decided to make one.<p>It combines dynamic data from sources like PyPL Index, TIOBE Index and official feeds as well as static data about each language defined in a structured format. The language data is open-contribution and can be updated by anyone on the GitHub repository: <a href="https://github.com/codigo-langs/codigo">https://github.com/codigo-langs/codigo</a>.<p>I styled it specifically for coders - using a monospaced font and terminal-esque styling, along with many common IDE themes to choose from.<p>Codigo is built using Rust, Axum, HTMX and Alpine.js.<p>Keen to hear any feedback!
Show HN: Lumoar – Free SOC 2 tool for SaaS startups
We built Lumoar to help small SaaS teams get SOC 2-ready without paying thousands for Big 4 consultants or dealing with bloated compliance platforms.<p>As a startup ourselves, we faced the usual issues: long security questionnaires, confusing audit requirements, and expensive tools that felt overkill.<p>Lumoar is a simpler alternative:
- Generate compliant SOC 2 policies automatically
- Track your controls and progress in a clean dashboard
- Upload evidence and get plain-language recommendations
- Designed for engineers and founders, not compliance pros<p>It's free to start — you can generate policies and explore the dashboard without a sales call or demo.<p>Would love to hear what blockers you’ve faced with SOC 2 and what other frameworks you’re thinking about (e.g., ISO 27001, GDPR). All feedback is welcome.
Show HN: CLI that spots fake GitHub stars, risky dependencies and licence traps
When I came across a study that traced 4.5 million fake GitHub stars, it confirmed a suspicion I’d had for a while: stars are noisy. The issue is they’re visible, they’re persuasive, and they still shape hiring decisions, VC term sheets, and dependency choices—but they say very little about actual quality.<p>I wrote StarGuard to put that number in perspective, using my own methodology inspired by theirs, and to fold a broader supply-chain check into one command-line run.<p>It starts with the simplest raw input: every starred_at timestamp GitHub will give you. It applies a median-absolute-deviation test to locate sudden bursts. For each spike, StarGuard pulls a random sample of the accounts behind it and asks: how old is the user? Any followers? Any contribution history? Still using the default avatar? From that, it computes a Fake Star Index, between 0 (organic) and 1 (fully synthetic).<p>But inflated stars are just one issue. In parallel, StarGuard parses dependency manifests or SBOMs and flags common risk signs: unpinned versions, direct Git URLs, lookalike package names. It also scans licences—AGPL sneaking into a repo claiming MIT, or other inconsistencies that can turn into compliance headaches.<p>It checks contributor patterns too. If 90% of commits come from one person who hasn’t pushed in months, that’s flagged. It skims for obvious code red flags: eval calls, minified blobs, sketchy install scripts—because sometimes the problem is hiding in plain sight.<p>All of this feeds into a weighted scoring model. The final Trust Score (0–100) reflects repo health at a glance, with direct penalties for fake-star behaviour, so a pretty README badge can’t hide inorganic hype.<p>Just for fun, I also made it generate a cool little badge for the trust score.<p>Under the hood, it’s all heuristics and a lot of GitHub API paging. Run it on any public repo with:<p>python starguard.py owner/repo --format markdown
It works without a token, but you’ll hit rate limits sooner.<p>Please provide any feedback you can.
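The burst detection described above can be sketched compactly. This is a minimal standalone version of a median-absolute-deviation test on daily star counts, not StarGuard's actual code; the `mad_spikes` function and the 3.5 threshold are illustrative choices (3.5 is the conventional cutoff for the modified z-score).

```python
def mad_spikes(daily_counts, threshold=3.5):
    """Flag days whose star count is an outlier by the modified z-score
    (based on the median absolute deviation, robust to the burst itself)."""
    n = len(daily_counts)
    srt = sorted(daily_counts)
    median = srt[n // 2] if n % 2 else (srt[n // 2 - 1] + srt[n // 2]) / 2
    devs = sorted(abs(c - median) for c in daily_counts)
    mad = devs[n // 2] if n % 2 else (devs[n // 2 - 1] + devs[n // 2]) / 2
    if mad == 0:
        return []  # perfectly flat history: nothing to flag
    return [i for i, c in enumerate(daily_counts)
            if 0.6745 * (c - median) / mad > threshold]

# Steady ~5 stars/day, then a suspicious burst on day 5
print(mad_spikes([4, 5, 6, 5, 4, 120, 5]))  # -> [5]
```

Unlike a mean-and-standard-deviation test, the median and MAD barely move when a single huge spike is present, which is exactly why this statistic works for spotting purchased-star bursts.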
Show HN: Airweave – Let agents search any app
Hey HN, we're Lennert and Rauf. We’re building Airweave (<a href="https://github.com/airweave-ai/airweave">https://github.com/airweave-ai/airweave</a>), an open-source tool that lets agents search and retrieve data from any app or database. Here’s a general intro: <a href="https://www.youtube.com/watch?v=EFI-7SYGQ48" rel="nofollow">https://www.youtube.com/watch?v=EFI-7SYGQ48</a>, and here’s a longer one that shows more real-world use cases, examples of how Airweave is used by Cursor (0:33) and Claude desktop (2:04), etc.: <a href="https://youtu.be/p2dl-39HwQo" rel="nofollow">https://youtu.be/p2dl-39HwQo</a><p>A couple of months ago we were building agents that interacted with different apps and were frustrated when they struggled to handle vague natural language requests like "resolve that one Linear issue about missing auth configs", "if you get an email from an unsatisfied customer, reimburse their payment in Stripe", or "what were the returns for Q1 based on the financials sheet in gdrive?", only to have the agent inefficiently chain together loads of function calls to find the data or not find it at all and hallucinate.<p>We also noticed that despite the rise of MCP creating more desire for agents to interact with external resources, the majority of agent dev tooling focused on function calling and actions instead of search. We were annoyed by the lack of tooling that enabled agents to semantically search workspace or database contents, so we started building Airweave first as an internal solution. Then we decided to open-source it and pursue it full time after we got positive reactions from coworkers and other agent builders.<p>Airweave connects to productivity tools, databases, or document stores via their APIs and transforms their contents into searchable knowledge bases, accessible through a standardized interface for the agent. The search interface is exposed via REST or MCP. 
When using MCP, Airweave essentially builds a semantically searchable MCP server on top of the resource. The platform handles the entire data pipeline from connection and extraction to chunking, embedding, and serving. To ensure knowledge is current, it has automated sync capabilities, with configurable schedules and change detection through content hashing.<p>We built it with support for white-labeled multi-tenancy to provide OAuth2-based integration across multiple user accounts while maintaining privacy and security boundaries. We're also actively working on permission-awareness (i.e., RBAC on the data) for the platform.<p>Happy to share learnings and get insights from your experiences. Looking forward to comments!
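The "change detection through content hashing" mentioned above is a simple but useful trick: hash each record's content on every sync and re-embed only records whose hash changed. The sketch below is a hypothetical illustration of that idea (the `content_hash` and `detect_changes` names are mine, not Airweave's API), assuming an in-memory `seen` map standing in for persistent sync state.

```python
import hashlib

def content_hash(fields: dict) -> str:
    # Deterministic digest over sorted fields, so key order doesn't matter.
    canonical = "|".join(f"{k}={fields[k]}" for k in sorted(fields))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def detect_changes(seen: dict, records):
    """Return only records that are new or whose content changed since last sync."""
    changed = []
    for rec in records:
        h = content_hash({k: v for k, v in rec.items() if k != "id"})
        if seen.get(rec["id"]) != h:
            seen[rec["id"]] = h   # remember the new content hash
            changed.append(rec)   # only these get re-chunked and re-embedded
    return changed

seen = {}
print(len(detect_changes(seen, [{"id": 1, "title": "doc", "body": "v1"}])))  # 1: new
print(len(detect_changes(seen, [{"id": 1, "title": "doc", "body": "v1"}])))  # 0: unchanged
print(len(detect_changes(seen, [{"id": 1, "title": "doc", "body": "v2"}])))  # 1: body changed
```

Skipping unchanged records keeps scheduled syncs cheap even when the upstream API only supports full listing rather than incremental change feeds.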
Show HN: GlassFlow – OSS streaming dedup and joins from Kafka to ClickHouse
Hi HN! We are Ashish and Armend, founders of GlassFlow. We just launched our open-source streaming ETL that deduplicates and joins Kafka streams before ingesting them to ClickHouse <a href="https://github.com/glassflow/clickhouse-etl">https://github.com/glassflow/clickhouse-etl</a><p>Why we built this:
Dedup with batch data is straightforward. You load the data into a temporary table. Then, find only the latest versions of the record through hashes or keys and keep them. After that, move the clean data into your main table. But have you tried this with streaming data?
Users of our previous product were running real-time analytics pipelines from Kafka to ClickHouse and noticed that the analyses were wrong due to duplicates. The source systems produced duplicates as they ingested similar user data from CRMs, shop systems and click streams.<p>We wanted to solve this issue for them with the existing ClickHouse options, but ClickHouse ReplacingMergeTree has an uncontrollable background merging process. This means the new data is in the system, but you never know when the merge will finish, and until then, your queries return incorrect results.<p>We looked into using FINAL but haven't been happy with the speed for real-time workloads.<p>We tried Flink, but there is too much overhead to manage Java Flink jobs, and a self-built solution would have put us in a position to set up and maintain state storage, possibly a very large one (number of unique keys), to keep track of whether we have already encountered a record. And if your dedupe service fails, you need to rehydrate that state before processing new records. That would have been too much maintenance for us.<p>We decided to solve it by building a new product and are excited to share it with you.<p>The key difference is that the streams are deduplicated before ingesting to ClickHouse. So, ClickHouse always has clean data and less load, eliminating the risk of wrong results. We want more people to benefit from it and decided to open-source it (Apache-2.0).<p>Main components:<p>- Streaming deduplication:
You define the deduplication key and a time window (up to 7 days), and it handles the checks in real time to avoid duplicates before hitting ClickHouse. The state store is built in.<p>- Temporal Stream Joins:
You can join two Kafka streams on the fly with a few config inputs. You set the join key, choose a time window (up to 7 days), and you're good.<p>- Built-in Kafka source connector:
There is no need to build custom consumers or manage polling logic. Just point it at your Kafka cluster, and it auto-subscribes to the topics you define. Payloads are parsed as JSON by default, so you get structured data immediately. As underlying tech, we decided on NATS to make it lightweight and low-latency.<p>- ClickHouse sink:
Data gets pushed into ClickHouse through a native connector optimized for performance. You can tweak batch sizes and flush intervals to match your throughput needs. It handles retries automatically, so you don't lose data on transient failures.<p>We'd love to hear your feedback and know if you solved it nicely with existing tools. Thanks for reading!
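The core of the streaming deduplication described above (a key plus a time window, with bounded state) can be sketched in a few lines. This is an illustrative simplification, not GlassFlow's implementation: the `WindowedDeduper` class is hypothetical, timestamps are plain seconds, events are assumed roughly time-ordered, and a duplicate does not refresh its key's window.

```python
from collections import OrderedDict

class WindowedDeduper:
    """Forward only the first occurrence of each key within a sliding time window."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.seen = OrderedDict()  # key -> first-seen timestamp, oldest first

    def accept(self, key, ts):
        # Evict keys older than the window so state stays bounded.
        while self.seen:
            oldest_key, oldest_ts = next(iter(self.seen.items()))
            if ts - oldest_ts > self.window:
                self.seen.popitem(last=False)
            else:
                break
        if key in self.seen:
            return False      # duplicate inside the window: drop it
        self.seen[key] = ts
        return True           # first occurrence: forward to ClickHouse

d = WindowedDeduper(window_seconds=60)
print(d.accept("order-1", ts=0))    # True  (new key)
print(d.accept("order-1", ts=30))   # False (duplicate inside window)
print(d.accept("order-1", ts=100))  # True  (window expired, treated as new)
```

A production version additionally needs the state persisted across restarts (otherwise a crash forgets which records were already seen), which is the "built-in state store" the post refers to.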
Show HN: GlassFlow – OSS streaming dedup and joins from Kafka to ClickHouse
Hi HN! We are Ashish and Armend, founders of GlassFlow. We just launched our open-source streaming ETL that deduplicates and joins Kafka streams before ingesting them to ClickHouse <a href="https://github.com/glassflow/clickhouse-etl">https://github.com/glassflow/clickhouse-etl</a><p>Why we built this:
Dedup with batch data is straightforward. You load the data into a temporary table. Then, find only the latest versions of the record through hashes or keys and keep them. After that, move the clean data into your main table. But have you tried this with streaming data?
Users of our prev product were running real-time analytics pipelines from Kafka to ClickHouse and noticed that the analyses were wrong due to duplicates. The source systems produced duplicates as they ingested similar user data from CRMs, shop systems and click streams.<p>We wanted to solve this issue for them with the existing ClickHouse options, but ClickHouse ReplacingMergeTree has an uncontrollable background merging process. This means the new data is in the system, but you never know when they’ll finish the merging, and until then, your queries return incorrect results.<p>We looked into using FINAL but haven't been happy with the speed for real-time workloads.<p>We tried Flink, but there is too much overhead to manage Java Flink jobs, and a self-built solution would have put us in a position to set up and maintain state storage, possibly a very large one (number of unique keys), to keep track of whether we have already encountered a record. And if your dedupe service fails, you need to rehydrate that state before processing new records. That would have been too much maintenance for us.<p>We decided to solve it by building a new product and are excited to share it with you.<p>The key difference is that the streams are deduplicated before ingesting to ClickHouse. So, ClickHouse always has clean data and less load, eliminating the risk of wrong results. We want more people to benefit from it and decided to open-source it (Apache-2.0).<p>Main components:<p>- Streaming deduplication:
You define the deduplication key and a time window (up to 7 days), and it handles the checks in real time to avoid duplicates before hitting ClickHouse. The state store is built in.<p>- Temporal Stream Joins:
You can join two Kafka streams on the fly with a few config inputs. You set the join key, choose a time window (up to 7 days), and you're good.<p>- Built-in Kafka source connector:
There is no need to build custom consumers or manage polling logic. Just point it at your Kafka cluster, and it auto-subscribes to the topics you define. Payloads are parsed as JSON by default, so you get structured data immediately. Under the hood, we chose NATS to keep it lightweight and low-latency.<p>- ClickHouse sink:
Data gets pushed into ClickHouse through a native connector optimized for performance. You can tweak batch sizes and flush intervals to match your throughput needs. It handles retries automatically, so you don't lose data on transient failures.<p>We'd love to hear your feedback, and whether you've solved this nicely with existing tools. Thanks for reading!
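To make the streaming deduplication above concrete, here is a minimal Python sketch of a windowed dedup filter (a toy model of the idea only; the real state store is built into GlassFlow):

```python
import time
from collections import OrderedDict

class WindowedDeduper:
    """Drop records whose key was already seen within `window_s` seconds.
    Toy sketch: assumes timestamps arrive in roughly increasing order."""

    def __init__(self, window_s):
        self.window_s = window_s
        self.seen = OrderedDict()  # key -> timestamp when first seen

    def admit(self, key, now=None):
        now = time.monotonic() if now is None else now
        # Evict keys that have fallen out of the window (oldest first).
        while self.seen:
            k, ts = next(iter(self.seen.items()))
            if now - ts > self.window_s:
                self.seen.pop(k)
            else:
                break
        if key in self.seen:
            return False  # duplicate within the window: drop it
        self.seen[key] = now
        return True       # first sighting in the window: let it through

d = WindowedDeduper(window_s=60)
assert d.admit("order-1", now=0.0) is True
assert d.admit("order-1", now=10.0) is False   # duplicate, dropped
assert d.admit("order-1", now=100.0) is True   # window expired, admitted again
```

A production version needs persistence and crash recovery (the rehydration problem mentioned above), which is exactly the maintenance burden the built-in state store is meant to absorb.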
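The temporal join can be sketched the same way: buffer one stream and match the other against it when keys agree within the window (again a toy that buffers everything in memory, not GlassFlow's implementation):

```python
from collections import defaultdict

def temporal_join(left, right, window):
    """Join (ts, key, payload) events from two streams whose keys match
    within `window` time units. Toy sketch: buffers the whole right stream."""
    buf = defaultdict(list)  # key -> [(ts, payload), ...] from the right stream
    for ts, k, payload in right:
        buf[k].append((ts, payload))
    out = []
    for ts, k, payload in left:
        for rts, rpayload in buf.get(k, []):
            if abs(ts - rts) <= window:
                out.append((k, payload, rpayload))
    return out

orders = [(0, "u1", "order placed"), (30, "u2", "order placed")]
clicks = [(5, "u1", "ad click"), (400, "u2", "ad click")]
print(temporal_join(orders, clicks, window=60))
# u1's events are 5s apart and join; u2's are 370s apart, outside the window
```

A real streaming join evicts buffered events as the window slides instead of holding both streams whole, which is where the bounded 7-day window comes in.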
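The sink's batch-size/flush-interval/retry behaviour follows a familiar pattern; a hypothetical sketch of that pattern (all names invented here, not the actual connector API):

```python
import time

class BatchingSink:
    """Buffer rows and flush when the batch fills or the interval elapses.
    Retries a failed flush with backoff. Toy sketch of the general pattern."""

    def __init__(self, write_fn, batch_size=1000, flush_interval_s=1.0, max_retries=3):
        self.write_fn = write_fn          # e.g. an INSERT over a native client
        self.batch_size = batch_size
        self.flush_interval_s = flush_interval_s
        self.max_retries = max_retries
        self.buf = []
        self.last_flush = time.monotonic()

    def push(self, row):
        self.buf.append(row)
        if (len(self.buf) >= self.batch_size
                or time.monotonic() - self.last_flush >= self.flush_interval_s):
            self.flush()

    def flush(self):
        if not self.buf:
            return
        for attempt in range(self.max_retries):
            try:
                self.write_fn(self.buf)
                self.buf = []
                self.last_flush = time.monotonic()
                return
            except OSError:                # transient failure: back off and retry
                time.sleep(2 ** attempt)
        raise RuntimeError("flush failed after retries")
```

Tuning `batch_size` up trades ingest latency for fewer, larger inserts, which is generally what ClickHouse prefers.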
Show HN: I’m 16 years old and working on my first startup, a study app
As a student with a lot of notes, I had a problem studying quickly for tests. So I created Notiv, an AI study app that analyzes your notes and prepares you for tests.
Show HN: LoopMix128 – Fast C PRNG (.46ns), 2^128 Period, BigCrush/PractRand Pass
LoopMix128 is a fast C PRNG I wrote for non-cryptographic tasks.<p>GitHub (MIT): <a href="https://github.com/danielcota/LoopMix128">https://github.com/danielcota/LoopMix128</a><p>Highlights:<p>* ~0.37 ns/value (GCC 11.4, -O3 -march=native), 98% faster than xoroshiro128++ and PCG64.<p>* Passes TestU01 BigCrush & PractRand (32TB).<p>* Guaranteed 2^128 period.<p>* Proven injective (192-bit state) via Z3 SMT solver; allows parallel streams.<p>* Core requires only stdint.h.<p>Seeking feedback on design, use cases, or further testing.