The best Hacker News stories from the past day

Latest posts:

Slack’s migration to a cellular architecture

For many home-schoolers, parents are no longer doing the teaching

E-ink is so Retropunk

AI isn’t good enough

How to drill your own water well

The EU's war on behavioral advertising

Web scraping for me, but not for thee

The complete sequence of a human Y chromosome

Giving up the iPad-only travel dream

Beating GPT-4 on HumanEval with a fine-tuned CodeLlama-34B

Hi HN,

We have fine-tuned CodeLlama-34B and CodeLlama-34B-Python on an internal Phind dataset; the resulting models achieve 67.6% and 69.5% pass@1 on HumanEval, respectively, versus 67% for GPT-4. To ensure the results are valid, we applied OpenAI's decontamination methodology to our dataset.

The CodeLlama models released yesterday already demonstrate impressive performance on HumanEval:

- CodeLlama-34B: 48.8% pass@1 on HumanEval
- CodeLlama-34B-Python: 53.7% pass@1 on HumanEval
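
Pass@1 throughout refers to the standard HumanEval metric from Chen et al. (2021): generate n samples per problem, count the c that pass the unit tests, and average the unbiased estimator over problems; at k = 1 it reduces to c/n. A minimal reference sketch (not part of our evaluation code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021):
    1 - C(n - c, k) / C(n, k); for k = 1 this reduces to c / n."""
    if n - c < k:
        return 1.0  # fewer failing samples than draws: a passing sample is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 4 of 10 samples pass the tests -> pass@1 = 0.4
assert abs(pass_at_k(10, 4, 1) - 0.4) < 1e-12
```
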
We fine-tuned both models on a proprietary dataset of ~80k high-quality programming problems and solutions. Rather than code-completion examples, this dataset consists of instruction-answer pairs, setting it apart structurally from HumanEval. We trained the Phind models for two epochs, ~160k examples in total. LoRA was not used; both models were natively fine-tuned. Using DeepSpeed ZeRO 3 and Flash Attention 2, training took three hours on 32 A100-80GB GPUs with a sequence length of 4096 tokens.
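
For illustration, here is a minimal sketch of this kind of setup using the Hugging Face Trainer. This is not our actual training code: the checkpoint name, prompt format, precision, dataset loading, and every hyperparameter beyond the facts stated above (two epochs, no LoRA, DeepSpeed ZeRO 3, Flash Attention 2, 4096-token sequences) are assumptions.

```python
# Hypothetical sketch of the described setup; all unstated details are assumptions.
import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

BASE = "codellama/CodeLlama-34b-hf"  # base model; the -Python variant is analogous

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(
    BASE,
    torch_dtype=torch.bfloat16,               # assumed precision
    attn_implementation="flash_attention_2",  # Flash Attention 2
)

# Stand-in for the proprietary ~80k instruction-answer pairs (format assumed).
pairs = [{"text": "### Instruction:\nReverse a string in Python.\n"
                  "### Answer:\nprint('hello'[::-1])"}]

def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True, max_length=4096)  # 4096-token sequences
    enc["labels"] = [ids.copy() for ids in enc["input_ids"]]          # causal-LM labels
    return enc

train_ds = Dataset.from_list(pairs).map(tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="finetune-out",      # assumed
    num_train_epochs=2,             # two epochs, as stated
    bf16=True,                      # assumed precision
    deepspeed="ds_zero3.json",      # ZeRO stage 3 config file (assumed name)
    per_device_train_batch_size=1,  # assumed; the original run used 32 A100-80GB GPUs
)
Trainer(model=model, args=args, train_dataset=train_ds, tokenizer=tokenizer).train()
```
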
We also ran OpenAI's decontamination methodology against our dataset and found no contaminated examples. The methodology is:

- For each evaluation example, randomly sample three substrings of 50 characters, or use the entire example if it is shorter than 50 characters.
- A match is flagged if any sampled substring is a substring of the processed training example.

For further detail on the decontamination methodology, please refer to Appendix C of OpenAI's GPT-4 technical report.
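
Concretely, the check can be sketched as follows. The "processing" applied to training examples (e.g. whitespace or case normalization) is not specified here, so it is left abstract:

```python
import random

def is_contaminated(eval_example: str, processed_train_examples: list[str]) -> bool:
    """Substring-sampling check described above (cf. Appendix C of the GPT-4 report)."""
    if len(eval_example) < 50:
        probes = [eval_example]  # use the whole example if shorter than 50 characters
    else:
        # three random 50-character substrings of the evaluation example
        starts = [random.randrange(len(eval_example) - 49) for _ in range(3)]
        probes = [eval_example[s:s + 50] for s in starts]
    # a match: any sampled substring occurs inside a processed training example
    return any(p in t for p in probes for t in processed_train_examples)
```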

The pass@1 scores our fine-tuned models achieve:

- Phind-CodeLlama-34B-v1: 67.6% pass@1 on HumanEval
- Phind-CodeLlama-34B-Python-v1: 69.5% pass@1 on HumanEval

Note on GPT-4

In its March technical report, OpenAI reported a pass@1 score of 67% for GPT-4 on HumanEval. Higher figures have been claimed since, but there is no concrete evidence that the model's coding ability has improved, and those numbers lack the rigorous contamination analysis behind the official statistic, making them a less reliable comparison. We therefore use 67% as GPT-4's pass@1 score.

Download

We are releasing both models on Hugging Face for verifiability and to support the open-source community, and we welcome independent verification of our results.

Phind-CodeLlama-34B-v1: https://huggingface.co/Phind/Phind-CodeLlama-34B-v1

Phind-CodeLlama-34B-Python-v1: https://huggingface.co/Phind/Phind-CodeLlama-34B-Python-v1

We'd love to hear your thoughts!

Best,
The Phind Team

Factorio: Space Age

OpenTF announces fork of Terraform

Add dir="auto" to your inputs and textareas
