The best Hacker News stories from All from the past day
Latest posts:
Software Engineering at Google (2020)
Backward Compatibility, Go 1.21, and Go 2
A video game where you are an operating system
Bypassing YouTube video download throttling
Writing about what you learn pushes you to understand topics better
Show HN: LLMs can generate valid JSON 100% of the time
Outlines is a Python library that focuses on text generation with large language models. Brandon and I are not LLM experts; we started the project a few months ago because we wanted to better understand how the generation process works. Our original background is in probabilistic, relational and symbolic programming.

Recently we came up with a fast way to generate text that matches a regex (https://blog.normalcomputing.ai/posts/2023-07-27-regex-guided-generation/regex-guided-generation.html). The basic idea is simple: every regular expression has an equivalent deterministic finite automaton (DFA) representation. We can transform this DFA into a generative model: in each state we get the list of symbols that correspond to completions partially matching the regular expression. We mask the other symbols in the logits returned by a large language model, sample a new symbol and move to the next state. The subtlety is that language models work with tokens, not symbols, so we derive a new FSM whose alphabet is the model's vocabulary. We can do this in only one pass over the vocabulary.

Generating the token masks thus only requires a dictionary lookup at each state. Our method blows other libraries like Microsoft's guidance out of the water.

From there it was only a small leap to be able to generate text that follows a JSON schema (https://json-schema.org/), or is parseable into a Pydantic model (https://docs.pydantic.dev/latest/usage/models/). The method works with union types, optional types, nested schemas, arrays, everything. The output is guaranteed to be parseable.

I think it's cool, and I spent a lot of time over the weekend watching even tiny models output valid JSON. Hope you will too.

I look forward to feedback, bug reports, feature requests and discussions!

Edit: Link to our pre-print explaining the method and how it can be extended to generate text that follows a context-free grammar: https://arxiv.org/abs/2307.09702
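To make the idea in the post concrete, here is a minimal sketch of DFA-guided token masking. It is not the Outlines implementation or API: the hand-written DFA for the regex `[0-9]+\.[0-9]+`, the toy vocabulary, and the names `walk`, `build_index`, `fake_logits` and `generate` are all assumptions made up for illustration, and random numbers stand in for a real model's logits.

```python
import math
import random

# Hypothetical hand-written DFA for the regex [0-9]+\.[0-9]+
# States: 0 = start, 1 = integer digits, 2 = just saw ".", 3 = fraction digits.
DIGITS = "0123456789"
TRANSITIONS = {(1, "."): 2}
for d in DIGITS:
    TRANSITIONS[(0, d)] = 1
    TRANSITIONS[(1, d)] = 1
    TRANSITIONS[(2, d)] = 3
    TRANSITIONS[(3, d)] = 3
ACCEPTING = {3}

# Toy vocabulary (an assumption); real tokenizers have far more tokens.
EOS = "<eos>"
VOCAB = ["1", "2", "42", ".", ".5", "3.1", "abc", "7a", EOS]


def walk(state, token):
    """Feed a token's characters through the DFA; None means rejected."""
    for ch in token:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return None
    return state


def build_index(vocab, states):
    """One pass over the vocabulary: map state -> {token: next state}."""
    index = {s: {} for s in states}
    for token in vocab:
        if token == EOS:
            continue
        for s in states:
            nxt = walk(s, token)
            if nxt is not None:
                index[s][token] = nxt
    return index


INDEX = build_index(VOCAB, states=range(4))


def fake_logits(prefix):
    """Stand-in for a language model: random scores for every token."""
    return {tok: random.uniform(-1.0, 1.0) for tok in VOCAB}


def generate(logits_fn, max_tokens=8, rng=random):
    """Sample tokens, masking everything the DFA would reject.

    Returns (text, complete), where complete is True if text fully
    matches the regex (the budget may run out before that happens).
    """
    state, pieces = 0, []
    for _ in range(max_tokens):
        allowed = dict(INDEX[state])      # the dictionary lookup
        if state in ACCEPTING:
            allowed[EOS] = state          # stopping is only legal here
        logits = logits_fn(pieces)
        masked = [logits[t] if t in allowed else float("-inf") for t in VOCAB]
        weights = [math.exp(x) for x in masked]  # exp(-inf) == 0.0
        token = rng.choices(VOCAB, weights=weights, k=1)[0]
        if token == EOS:
            return "".join(pieces), True
        pieces.append(token)
        state = allowed[token]
    return "".join(pieces), state in ACCEPTING


if __name__ == "__main__":
    print(INDEX[0])  # tokens that may start a match, with their next states
    print(generate(fake_logits))
```

The index is built once, so each generation step only looks up `INDEX[state]` rather than re-scanning the vocabulary. Multi-character tokens such as `3.1` are handled by walking all of their characters through the DFA, which is the token-versus-symbol subtlety the post mentions.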
Today I realized I now trust Microsoft more than Google. What is happening?
‘I've got nothing to hide’ and other misunderstandings of privacy (2007)
Toki Pona: an attempted universal language with only ~120 words
PDF Tool – Modify PDFs in the browser without uploading
Downloading a video should be “fair use” as recording a song from the radio
Azure ChatGPT: Private and secure ChatGPT for internal enterprise use
Tailscale vs. Narrowlink
Exploring the Internals of Linux v0.01
Record labels hit Internet Archive with new copyright lawsuit