The best Hacker News stories from Show from the past day
Latest posts:
Show HN: Data Diff – compare tables of any size across databases
Gleb, Alex, Erez and Simon here – we are building an open-source tool for comparing data within and across databases at any scale. The repo is at <a href="https://github.com/datafold/data-diff" rel="nofollow">https://github.com/datafold/data-diff</a>, and our home page is <a href="https://datafold.com/" rel="nofollow">https://datafold.com/</a>.<p>As a company, Datafold builds tools for data engineers to automate the most tedious and error-prone tasks falling through the cracks of the modern data stack, such as data testing and lineage. We launched two years ago with a tool for regression-testing changes to ETL code <a href="https://news.ycombinator.com/item?id=24071955" rel="nofollow">https://news.ycombinator.com/item?id=24071955</a>. It compares the produced data before and after the code change and shows the impact on values, aggregate metrics, and downstream data applications.<p>While working with many customers on improving their data engineering experience, we kept hearing that they needed to diff their data across databases to validate data replication between systems.<p>There were 3 main use cases for such replication:<p>(1) To perform analytics on transactional data in an OLAP engine (e.g. PostgreSQL > Snowflake)
(2) To migrate between transactional stores (e.g. MySQL > PostgreSQL)
(3) To leverage data in a specialized engine (e.g. PostgreSQL > Elasticsearch).<p>Despite multiple vendors (e.g., Fivetran, Stitch) and open-source products (Airbyte, Debezium) solving data replication, there was no tooling for validating the correctness of such replication. When we researched how teams were going about this, we found that most were either:<p>Running manual checks: e.g., starting with COUNT(*) and then digging into the discrepancies, which often took hours to pinpoint.
Using distributed MPP engines such as Spark or Trino to download the complete datasets from both databases and then comparing them in memory – an expensive process requiring complex infrastructure.<p>Our users wanted a tool that could:<p>(1) Compare datasets quickly (seconds/minutes) at a large scale (millions/billions of rows) across different databases. (2) Have minimal network IO and database workload overhead. (3) Provide straightforward output: basic stats and which rows differ. (4) Be embedded into a data orchestrator such as Airflow to run right after the replication process.<p>So we built Data Diff as an open-source package available through pip. Data Diff can be run from the CLI or wrapped into any data orchestrator such as Airflow, Dagster, etc.<p>To solve for speed at scale with minimal overhead, Data Diff relies on checksumming the data in both databases and uses binary search to identify diverging records. That way, it can compare arbitrarily large datasets in logarithmic time and IO – only transferring a tiny fraction of the data over the network. For example, it can diff tables with 25M rows in ~10 seconds and 1B+ rows in ~5 minutes across two physically separate PostgreSQL databases while running on a typical laptop.<p>We've launched this tool under the MIT license so that any developer can use it, and to encourage contributions of other database connectors. We didn't want to charge engineers for such a fundamental use case. We make money by charging a license fee for advanced solutions such as column-level data lineage, CI workflow automation, and ML-powered alerts.
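The checksum-and-bisect idea described in the post can be sketched roughly as follows. This is not Data Diff's actual code or API: the two "tables" are plain Python dicts standing in for remote databases, and `segment_checksum`, `diff_tables`, and `threshold` are hypothetical names chosen for illustration. The sketch only assumes integer primary keys and shows why identical key ranges never need to be downloaded.

```python
import hashlib
from typing import Dict, List, Tuple

# Stand-in for a real table: primary key -> row contents.
Table = Dict[int, str]


def segment_checksum(table: Table, lo: int, hi: int) -> Tuple[int, str]:
    """Return (row count, checksum) for keys in the half-open range [lo, hi).

    In a real setup this would be a single aggregate query executed inside
    the database, so only the count and the checksum cross the network.
    """
    digest = hashlib.md5()
    count = 0
    for key in sorted(k for k in table if lo <= k < hi):
        digest.update(f"{key}:{table[key]};".encode())
        count += 1
    return count, digest.hexdigest()


def diff_tables(a: Table, b: Table, lo: int, hi: int,
                threshold: int = 4) -> List[Tuple[str, int, str]]:
    """Return ('-', key, row) for rows only in a and ('+', key, row) for rows only in b."""
    count_a, sum_a = segment_checksum(a, lo, hi)
    count_b, sum_b = segment_checksum(b, lo, hi)
    if (count_a, sum_a) == (count_b, sum_b):
        return []  # identical segment: no rows are fetched at all

    if hi - lo <= 1 or max(count_a, count_b) <= threshold:
        # Segment is small enough: fetch its rows and compare them directly.
        rows_a = {(k, v) for k, v in a.items() if lo <= k < hi}
        rows_b = {(k, v) for k, v in b.items() if lo <= k < hi}
        found = [("-", k, v) for k, v in rows_a - rows_b]
        found += [("+", k, v) for k, v in rows_b - rows_a]
        # Order by key, with the '-' entry before the '+' entry for a changed row.
        return sorted(found, key=lambda d: (d[1], d[0] == "+"))

    # Checksums differ: split the key range in half and recurse into each half.
    mid = (lo + hi) // 2
    return diff_tables(a, b, lo, mid, threshold) + diff_tables(a, b, mid, hi, threshold)


# Example: one changed row and one missing row among 1000.
source = {k: f"row-{k}" for k in range(1000)}
target = dict(source)
target[137] = "changed"
del target[842]
print(diff_tables(source, target, 0, 1000))
# → [('-', 137, 'row-137'), ('+', 137, 'changed'), ('-', 842, 'row-842')]
```

Only the segments on the path to a differing row are ever split, which is where the small network footprint comes from; each database still has to scan its own rows to compute the checksums.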
Show HN: Crocodile - Better code review for GitHub
Hi HN!<p>I've been working on a code review app for GitHub called Crocodile for about a year. I used to work at Microsoft, where we used a tool called CodeFlow for reviewing code, and I missed it after I left. I know many other ex-Microsoft engineers feel the same. Here are some of the distinguishing features of Crocodile that are inspired by CodeFlow:<p>* Comments float above the code instead of being inline. Long discussions that are displayed inline make it really hard to review the code.<p>* Comment on any text selection in the file, even a single character.<p>* Comments don't get lost when code changes. I hate it when comments become "outdated" because I rebase or the line is edited.<p>I also implemented lots of features that I wish CodeFlow had, which you can read more about on the blog. [1]<p>For those curious about the tech stack: it's mostly written in Go, with Alpine.js, HTMX, and Tailwind CSS for the frontend. For storage I use PostgreSQL and S3-compatible object storage, with Redis for caching. I use Pulumi for infrastructure provisioning and Kubernetes deployments. Everything is hosted on DigitalOcean.<p>Feedback is welcome!<p>[1] <a href="https://www.crocodile.dev/blog/why-crocodile" rel="nofollow">https://www.crocodile.dev/blog/why-crocodile</a>
Show HN: I built a fun video meeting app with 2D physics and proximity chat
Hi HN!<p><a href="https://flat.social" rel="nofollow">https://flat.social</a> is a web video meeting app for organising fun online meetings & social events. Each participant can move around and talk with others in their proximity.<p>Here is a quick demo if you wanna see it in action: <a href="https://youtu.be/Y2yH3twjrx4" rel="nofollow">https://youtu.be/Y2yH3twjrx4</a><p>I’ve been working on it solo for around a year now. The tech used is Next.js, TypeScript and PIXI.js on the front-end, and Node, Mediasoup, Socket.io and Matter.js (physics engine) on the backend.<p>Feel free to jump into the demo room (<a href="https://flat.social/f/Flat.Social-Demo" rel="nofollow">https://flat.social/f/Flat.Social-Demo</a>) to say hi; I’ll be hanging out there throughout today.<p>Would love to hear your thoughts on it!
How to Get a Job as a Software Developer in the UK After Brexit (Guide)
Show HN: Akedo – Retro gaming and coding platform
An experiment to test if Bionic Reading helps you read faster
Show HN: Avo – Build Ruby on Rails apps faster