The best Hacker News stories from Show HN from the past day
Latest posts:
Show HN: Interesting companies that are running on-prem
Show HN: SymbolicAI
The SymbolicAI project started toward the end of last year and had its first commit in mid-January this year. If I had to briefly summarize why we think it's a project worth working on, it comes down to the following idea: we're slowly marching towards software 3.0, and we need to grow frameworks to a maturity point that allows people not only to PoC their own ideas, but also to gain access to strong community support that nurtures a mutual exchange of ideas. I personally believe this is the secret behind many successful open-source projects (e.g. Neovim, LazyGit, PyTorch, Jax, just to name a few).

FAQ

Q: What does the project do?

A: A lot. You can build your own chatbot, interact with as many as 13 tools (search, wolfram, dall-e, blip, clip, ocr, pinecone, whisper, selenium, local files, etc.) -- pretty much most of the things you've already seen hyped on social media.

Q: Sounds close to… LangChain…?

A: Briefly, I think LangChain grew too fast and became a jack of all trades but a master of none. I'm sure they had their reasons for approaching things the way they did, and I don't want to make this post about them more than I already have. Others have investigated this topic more thoroughly and ranted better than I could.

Q: OK, then tell me why I would want to be part of it?

A: We're two core developers. Sometimes less is more; it gives us time to think more deeply about designing the framework and making it accessible to others. Some principles:

- Ease of use and flexibility: we were heavily inspired by PyTorch, and we aimed to follow the same code structure one uses with torch. Our original intuition was that when you're introducing something new, tying it to something people are already familiar with makes it more accessible (in terms of reading and writing code). Not only that, but the initial recipe proved quite successful, and replacing it with something else without concrete reasons is not worth doing IMHO. Moreover, one of our long-term visions is smooth integration with torch; we aim to give SymbolicAI differentiable features. Imagine your chatbot learning to better use its memory (e.g. how to update its memory with relevant information).

- Just as in torch everything is a tensor, in our framework everything is a Symbol. A Symbol, once defined, gets access to a set of primitives (as an analogy, think of PrimTorch) that let you easily compose complex expressions or manipulate Symbol variables. This unlocks very fast manipulation (i.e. object.method dot notation).

- The hard work is done by decorators. We use them for the following reasons: (1) modularity, (2) composition, (3) flexibility, and (4) readability.

- We want to make a cohesive dev environment. I'm a script kiddo and I don't like to leave my terminal. I dislike web interfaces. I want to use my local env with my own setup. We have an experimental feature built on top of git that enables package management. It's similar to pip, but for extensions built with our framework. Another long-term vision is to make it easy for anyone using our framework to quickly share their work with the community. See https://github.com/ExtensityAI/symscribe for a showcase of how to do transcription and create YouTube chapters with Whisper via our package manager.

There's much more to say, but I will stop here.
Please check our GitHub README (https://github.com/Xpitfire/symbolicai) for a deeper dive, or our latest tutorial video that highlights some relevant use cases from a higher-level POV (https://www.youtube.com/watch?v=0AqB6SEvRqo).

I really do hope that at least some of you reading this will get interested. We have so many goals we want to reach, so many ideas we want to test, and probably just as many bugs (we call them maggots, just for fun) we need to fix.

We need you.
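To make the Symbol idea above a bit more concrete, here is a minimal sketch in Python. It assumes the package is installed (pip install symbolicai) and an OpenAI key is configured; the operations shown (semantic composition and comparison) are meant to illustrate the style described in the post, not to be an authoritative reference for the primitive set, which the README documents.

```python
# A minimal sketch of working with Symbols, assuming `pip install symbolicai`
# and an OPENAI_API_KEY in the environment. See the README for the real
# primitive set; the calls below illustrate the object.method style.
from symai import Symbol

greeting = Symbol("Hello, World!")

# Composition: operators on Symbols are dispatched to the neuro-symbolic
# engine, so `+` combines the two pieces of text semantically.
combined = greeting + Symbol("Welcome to SymbolicAI.")
print(combined)

# Comparison: `==` checks semantic equivalence rather than string equality,
# so a paraphrase can still evaluate to True.
print(greeting == "Hi there, World!")
```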
Show HN: I made Grammarly for accessibility code violations
Show HN: YC idea matcher – Submit an idea and get a list of similar YC companies
This project uses semantic search, an advanced search technique that aims to understand the intent and context behind a search query instead of just matching keywords.

It's built using Neon Postgres + pg_embedding and OpenAI for generating embeddings. More details can be found in the repo:
<a href="https://github.com/neondatabase/yc-idea-matcher">https://github.com/neondatabase/yc-idea-matcher</a>
Show HN: Gdańsk AI – full stack AI voice chatbot
Hi!

It's a complete product with integrations for Auth0, OpenAI, Google Cloud, and Stripe, and it consists of a Next.js web app, a Node.js + Express web API, and a Python + FastAPI AI API.

I built this software because I wanted to make money by selling tokens that let users talk with the chatbot. But I think Google / Apple will include such an AI-powered assistant in their products soon, so nobody will pay me to use it.

So I'm open-sourcing the product today and sharing it as GNU GPL-2 licensed software.

I'm happy to assist if something is unclear or requires additional docs, and to answer any questions about Gdańsk AI :)

Thanks
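For readers curious how the Python + FastAPI AI layer of such a voice chatbot might be shaped, here is a hypothetical minimal sketch: it accepts an audio upload, transcribes it with Whisper, and answers with a chat completion. The endpoint name, models, and flow are assumptions for illustration only; the actual repo also wires in Auth0, Stripe, and speech synthesis, which are omitted here.

```python
# Hypothetical sketch of a voice-chat AI endpoint (FastAPI + OpenAI).
# Endpoint name, models, and flow are illustrative assumptions; the real
# Gdansk AI service also handles auth, billing, and text-to-speech.
import tempfile

import openai
from fastapi import FastAPI, UploadFile

app = FastAPI()

@app.post("/chat")
async def chat(audio: UploadFile):
    # Persist the upload to a named file so Whisper can detect the format.
    with tempfile.NamedTemporaryFile(suffix=".mp3") as tmp:
        tmp.write(await audio.read())
        tmp.flush()
        with open(tmp.name, "rb") as f:
            transcript = openai.Audio.transcribe("whisper-1", f)["text"]

    # Answer the transcribed question with a chat completion.
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": transcript}],
    )
    reply = completion["choices"][0]["message"]["content"]
    return {"transcript": transcript, "reply": reply}
```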
Show HN: We built swup+fragment-plugin to visually enhance classic websites
## TL;DR

- Progressively enhance your classic website / MPA into a single-page app.
- Support for fragment visits, comparable to nested routes in React or Vue.
- Keep your site crawlable and indexable without any of the overhead of SSR.
- No tight coupling of back- and frontend. Use the CMS / framework / SSG of your choice.
- Strong focus on interoperability with DOM-altering JS tools (think Alpine.js, jQuery, ...).
- Strong focus on accessibility, even for fragment visits.

## Long Version: Best of three worlds

Hi, I'm Rasso Hilber. I have been a web designer and developer since around 2004. From the beginning of my career, I always had to make tradeoffs between 3 goals when building websites:

1. The websites I build should be visually impressive, original, and snappy.
2. The websites I build should be crawlable, accessible, and standards compliant.
3. The websites I build should have low technical complexity and be easy to maintain in the long run.

In the beginning, I was able to achieve goals 1 (impressive!) and 3 (easy to maintain!) by using Macromedia/Adobe Flash, but due to the nature of the technology I horribly failed to deliver crawlable and accessible websites. Later, I found a way to run two sites in parallel for each website I built: one using CMS-generated XHTML for crawlability, one in Flash for the visitors, fetching its data from its XHTML twin. Now I had solved goals 1 and 2, but my setup was awfully complex and brittle.

Around 2010, I was relieved to see Flash finally coming to its end. I switched to building websites using PHP, HTML, and jQuery. I could now tick goals 2 (accessibility) and 3 (low complexity), but the websites I was able to build with these technologies weren't as impressive anymore. Hard page loads on every link click were one of the biggest UX regressions from the days of Flash, IMO.

Around 2014/15, I first heard about the new frameworks: Angular, React, Vue. These frameworks were not intended to be used for classic websites. They were made for single-page apps! But it felt to me like no one cared. Even when building classic websites, many developers sacrificed SEO and accessibility for a snappy experience, serving an empty `<div id="app"></div>` to the browser. I couldn't blame them; I had done the same in my early days as a Flash developer. They ticked goal 1 (impressive) and goal 3 (low complexity). But the lack of accessibility kept me from joining the movement. I was still building classic websites, after all.

After some time, many started realizing that serving an empty div had downsides – SSR, hydration, and whatnot were born, now ticking goal 1 (impressive) and goal 2 (accessibility), with the trade-off of awful complexity. It reminded me a lot of my little Frankenstein's monster "Flash+XHTML", and I still didn't want to join the hype.

Still, because the noise was so loud, I felt like I might be becoming obsolete, an "old man yelling at clouds".

Very interesting new tools like HTMX or Unpoly popped up that looked promising at first, but on closer inspection weren't optimized for my use case either. They were primarily built for real interfaces/single-page apps (HTML snippets instead of full pages, UI state independent of URLs, an altered DOM saved in history snapshots, ...).
I wanted to find a *tiny* tool, optimized for building *presentational*, *content-driven* websites with a strong focus on *accessibility*.

Instead, after a few years of rolling my own home-grown solutions, I started using swup[0], a "Versatile and extensible page transition library for server-rendered websites". Swup consists of a tiny core and a rich ecosystem of official plugins[1] for additional functionality. It hit the sweet spot between simplicity and complexity and felt perfect for my use cases. Shortly after I started using it, I became a core contributor and maintainer of swup.

The only thing I was still missing to be a happy developer was the ability to create really complex navigation paths where selected fragments are updated as you navigate a site, much like nested routes allow in React or Vue.

Over the last two months, I teamed up with @daun[2] to finally solve this hard problem. The result is two things:

1. A new major release of swup (v4) that allows customizing the complete page transition process through a powerful hook system and a mutable visit object.
2. The newly released fragment-plugin[3], which provides a declarative API for dynamically replacing containers based on rules.

Use cases for the fragment plugin are:

- a filter UI that live-updates its list of results on every interaction
- a detail overlay that shows on top of the currently open content
- a tab group that updates only itself when selecting one of the tabs
- a form that updates only itself upon submission

I can now finally build websites that tick all three boxes:

1. Visually impressive, fun, and snappy, by using swup's first-class support for animations[4], cache[5], and preload capabilities[6], enhanced with fragment visits as seen on the demo site.

2. Accessible, by serving server-rendered semantic markup that fully works even with JavaScript disabled (try it out on the demo site!). On top of that, swup's a11y plugin[7] will automatically announce page visits to assistive technologies and focus the new `<main>` element after each visit.

3. Because all I now need for my fancy frontend is a bit of progressive JavaScript, I can choose whatever tool I like on the server, keeping complexity low and maintainability high. I can use SSGs like Eleventy or Astro (the demo site is built using Astro!), any CMS like WordPress or ProcessWire, or a framework like Laravel. And I don't have to maintain an additional Node server for SSR!

The plugin is still in its early stages, but I have a good feeling that this finally is the right path for me as a web developer. All it took was 20 years! ;)

[0] https://github.com/swup/swup
[1] https://swup.js.org/plugins
[2] https://github.com/daun
[3] https://github.com/swup/fragment-plugin
[4] https://swup.js.org/getting-started/how-it-works/
[5] https://swup.js.org/api/cache/
[6] https://swup.js.org/plugins/preload-plugin/
[7] https://swup.js.org/plugins/a11y-plugin/
Show HN: I wrote a compendium of software design
Show HN: Hydra 1.0 – open-source column-oriented Postgres
Hi HN, Hydra CEO here.

Hydra is an open-source, column-oriented Postgres. You can set up remarkably fast aggregates on your project in minutes to query billions of rows instantly.

Postgres is great, but aggregates can take minutes to hours to return results on large data sets. Long-running analytical queries hog database resources and degrade performance. Use Hydra to run much faster analytics on Postgres without making code changes. Data is automatically loaded into columnar format and compressed. Connect to Hydra with your preferred Postgres client (psql, DBeaver, etc.).

Following 4 months of development on Hydra v0.3.0-alpha, our team is proud to share our first major version release. Hydra 1.0 is under active development, but ready for use and feedback. We're aiming to release 1.0 into general availability (GA) soon.

For testing, try the Hydra free tier to create a column-oriented Postgres instance in the cloud: https://dashboard.hydra.so/signup
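Since Hydra presents itself as plain Postgres with a columnar storage layer, one way to picture the workflow is the sketch below: connect with an ordinary Postgres driver, declare a columnar table, and run an aggregate over it. The `USING columnar` clause and the table layout are assumptions based on how columnar access methods are typically exposed; check Hydra's docs for the exact syntax it supports.

```python
# Rough sketch: talking to a Hydra instance with a standard Postgres driver.
# The `USING columnar` clause and schema are assumptions for illustration;
# consult Hydra's documentation for the exact DDL it supports.
import psycopg2

conn = psycopg2.connect("postgres://...")  # your Hydra connection string
conn.autocommit = True

with conn.cursor() as cur:
    # Columnar tables are declared like normal tables, with a storage hint.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            user_id    bigint,
            event_type text,
            amount     numeric,
            created_at timestamptz
        ) USING columnar;
    """)

    # Analytical aggregates are where columnar storage pays off.
    cur.execute("""
        SELECT event_type, count(*), sum(amount)
        FROM events
        WHERE created_at >= now() - interval '30 days'
        GROUP BY event_type
        ORDER BY 2 DESC;
    """)
    for row in cur.fetchall():
        print(row)
```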
Show HN: Using Llama2 to Correct OCR Errors
I've been disappointed by the very poor quality of results that I generally get when trying to run OCR on older scanned documents, especially ones that are typewritten or otherwise have unusual or irregular typography. I recently had the idea of using Llama2 to apply common-sense reasoning and subject-level expertise to correct transcription errors in a "smart" way -- basically doing what a human proofreader who is familiar with the topic might do.

I came up with the linked script that takes a PDF as input, runs Tesseract on it to get an initial text extraction, and then feeds this sentence-by-sentence to Llama2, first to correct mistakes, and then again on the corrected text to format it as markdown where possible.
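A stripped-down sketch of that first stage is shown below, assuming the llama-cpp-python, pytesseract, and pdf2image packages plus a local Llama2 model file; file paths, the prompt wording, and the generation settings are placeholders rather than the linked script's actual values.

```python
# Minimal sketch of the OCR-then-correct stage, under stated assumptions:
# llama-cpp-python, pytesseract, pdf2image, and a local Llama2 model file.
# Paths, prompt wording, and parameters are illustrative placeholders.
import pytesseract
from pdf2image import convert_from_path
from llama_cpp import Llama

llm = Llama(model_path="llama-2-13b-chat.ggmlv3.q4_0.bin", n_ctx=2048)

def ocr_pdf(pdf_path: str) -> str:
    # Render each page to an image and let Tesseract extract the raw text.
    pages = convert_from_path(pdf_path)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)

def correct_sentence(sentence: str) -> str:
    # Ask the model to fix OCR artifacts without adding new content.
    prompt = (
        "Correct any OCR errors in the following sentence. "
        "Return only the corrected sentence.\n\n" + sentence
    )
    out = llm(prompt, max_tokens=256, temperature=0.1)
    return out["choices"][0]["text"].strip()

raw_text = ocr_pdf("old_scan.pdf")
corrected = [correct_sentence(s) for s in raw_text.split(". ") if s.strip()]
print(" ".join(corrected))
```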
This was surprisingly easier than I initially expected, thanks to the very nice tooling now available in libraries such as llama-cpp-python, langchain, and pytesseract. But the big issue I ran into was that Llama2 wasn't just correcting the text it was given -- it was also hallucinating a LOT of totally new sentences that didn't appear in the original text at all (some of these new sentences used words that never appeared anywhere else in the original text).

I figured this would be pretty simple to filter out using fuzzy string matching -- basically, check all the sentences in the LLM-corrected text and filter out sentences that are very different from every sentence in the original OCRed text. To my surprise, this approach worked very poorly, as did lots of other similar tweaks, including using bag-of-words and the spaCy NLP library in various ways (spaCy worked very poorly in everything I tried).

Finally I realized that I had a good solution staring me in the face: Llama2 itself. I could get sentence-level vector embeddings straight from Llama2 using langchain. So I did that, getting embeddings for each sentence in the raw OCRed text and the LLM-corrected text, and then computed the cosine similarity of each sentence in the LLM-corrected text against all sentences in the raw OCRed text. If no sentences in the raw OCRed text match, then that sentence has a good chance of being hallucinated.

To save the user from having to experiment with various thresholds, I saved the computed embeddings to an SQLite database so they only have to be computed once, and then tried several thresholds, comparing the length of the filtered LLM-corrected text to that of the raw OCRed text; if things worked right, these texts should be roughly the same length. As soon as the filtered length dips below the raw OCRed text length, the script backtracks and uses the previous threshold as the final selected threshold.

Anyway, if you have some very old scanned documents lying around, you might try it out and see how well it works for you. Do note that it's extremely slow, but you can leave it running overnight and maybe the next day you'll have your finished text, which is better than nothing! I feel like this could be useful for sites like the Internet Archive -- I've found their OCR results to be extremely poor for older documents.

I'm very open to any ideas or suggestions you might have. I threw this together in a couple of days and know it can certainly be improved in various ways. One idea that I thought might be fun would be to make this work with a Ray cluster, sending a different page of the document to each of the workers in the cluster to process them all at the same time.
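Here is a small sketch of that hallucination filter, assuming langchain's LlamaCppEmbeddings wrapper and a local Llama2 model; variable names, the fixed threshold, and the sentence handling are simplified placeholders rather than the script's actual code (which also caches embeddings in SQLite and sweeps thresholds as described above).

```python
# Sketch of the embedding-based hallucination filter described above.
# Assumes langchain's LlamaCppEmbeddings and a local Llama2 model file;
# names, the threshold, and sentence handling are simplified placeholders.
import numpy as np
from langchain.embeddings import LlamaCppEmbeddings

embedder = LlamaCppEmbeddings(model_path="llama-2-13b-chat.ggmlv3.q4_0.bin")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_hallucinations(original_sents, corrected_sents, threshold=0.7):
    # Embed every sentence once; the real script caches these in SQLite.
    orig_vecs = [np.array(v) for v in embedder.embed_documents(original_sents)]
    corr_vecs = [np.array(v) for v in embedder.embed_documents(corrected_sents)]

    kept = []
    for sent, vec in zip(corrected_sents, corr_vecs):
        # Keep a corrected sentence only if it is close to *some* original
        # sentence; otherwise it is likely hallucinated.
        best = max(cosine(vec, o) for o in orig_vecs)
        if best >= threshold:
            kept.append(sent)
    return kept
```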