Open-Source Models
Open-source models, conversely, are often smaller and less capable than their proprietary counterparts, but they are cheaper to run and give developers more flexibility. Hugging Face serves as a popular community hub for hosting and organizing these models.
Examples of open-source models include Stable Diffusion by Stability AI, BLOOM by BigScience, LLaMA and OPT by Meta AI, Flan-T5 by Google, and GPT-J, GPT-Neo, and Pythia by EleutherAI.
Open LLM Leaderboard
The leaderboard evaluates each model on four benchmarks:
AI2 Reasoning Challenge (25-shot) - a set of grade-school science questions.
HellaSwag (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
MMLU (5-shot) - a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
TruthfulQA (0-shot) - a benchmark to measure whether a language model is truthful in generating answers to questions.
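
The "k-shot" labels above refer to how many solved examples are placed in the prompt before the question being scored. The following is a minimal sketch of that idea only. The function name `build_k_shot_prompt`, the `Question:`/`Answer:` layout, and the demonstration pairs are all hypothetical; the leaderboard's actual evaluation harness may format prompts differently.

```python
from typing import List, Tuple


def build_k_shot_prompt(examples: List[Tuple[str, str]], question: str, k: int) -> str:
    """Prepend the first k solved examples to the question being evaluated."""
    if k < 0 or k > len(examples):
        raise ValueError("k must be between 0 and the number of available examples")
    # Each demonstration is shown with its answer filled in.
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples[:k]]
    # The final question is left open for the model to complete.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Hypothetical demonstration pairs, not taken from any real benchmark.
demos = [
    ("Which gas do plants absorb from the air?", "Carbon dioxide"),
    ("What force pulls objects toward Earth?", "Gravity"),
]
target = "Which planet is closest to the Sun?"

zero_shot = build_k_shot_prompt(demos, target, k=0)  # question only, as in TruthfulQA (0-shot)
two_shot = build_k_shot_prompt(demos, target, k=2)   # two demonstrations, then the question
```

With `k=0` the prompt contains only the open question. A 25-shot ARC prompt would follow the same pattern with 25 demonstrations in front of it.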
| Rank | Model | Family | License | Average ⬆️ | ARC (25-shot) ⬆️ | HellaSwag (10-shot) ⬆️ | MMLU (5-shot) ⬆️ | TruthfulQA (0-shot) ⬆️ |
|---|---|---|---|---|---|---|---|---|
| 1 | | Falcon | | 63.2 | 61.6 | 84.4 | 54.1 | 52.5 |
| 2 | tiiuae/falcon-40b | Falcon | | 60.4 | 61.9 | 85.3 | 52.7 | 41.7 |
| 3 | | LLaMA | Limited, Non-commercial bespoke license | 59.8 | 58.5 | 82.9 | 44.3 | 53.6 |
| 4 | | LLaMA | Limited, Non-commercial bespoke license | 58.3 | 57.8 | 84.2 | 48.8 | 42.3 |
| 5 | | LLaMA | Limited, Non-commercial bespoke license | 57.9 | 56.7 | 81.4 | 43.6 | 49.7 |
| 6 | | LLaMA | Limited, Non-commercial bespoke license | 57.4 | 57.1 | 82.6 | 46.1 | 43.8 |
| 7 | | LLaMA | Limited, Non-commercial bespoke license | 57.2 | 56.1 | 79.8 | 44 | 49.1 |
| 8 | | LLaMA | Limited, Non-commercial bespoke license | 57 | 53.6 | 79.6 | 42.7 | 52 |
| 9 | | LLaMA | Limited, Non-commercial bespoke license | 57 | 57.8 | 80.8 | 50.8 | 38.8 |
| 10 | | LLaMA | Limited, Non-commercial bespoke license | 56.9 | 57.1 | 82.6 | 45.7 | 42.3 |
| 11 | | LLaMA | Limited, Non-commercial bespoke license | 55.7 | 52.5 | 78.6 | 41 | 50.6 |
| 12 | | LLaMA | Limited, Non-commercial bespoke license | 53.7 | 47.4 | 78 | 39.6 | 49.8 |
| 13 | | LLaMA | Limited, Non-commercial bespoke license | 53.6 | 47.8 | 77.7 | 39.1 | 49.7 |
| 14 | | LLaMA | Limited, Non-commercial bespoke license | 53.1 | 45.1 | 77.9 | 38.1 | 51.3 |
| 15 | | LLaMA | Limited, Non-commercial bespoke license | 52.6 | 48 | 78.6 | 37.2 | 46.8 |
| 16 | | LLaMA | Limited, Non-commercial bespoke license | 52.4 | 48.1 | 76.4 | 38.8 | 46.5 |
| 17 | | LLaMA | Limited, Non-commercial bespoke license | 52.2 | 47 | 75.2 | 37.5 | 48.9 |
| 18 | | LLaMA | Limited, Non-commercial bespoke license | 51.8 | 50.8 | 78.9 | 37.7 | 39.9 |
| 19 | | LLaMA | Limited, Non-commercial bespoke license | 51.7 | 51.9 | 77.6 | 37.6 | 39.6 |
| 20 | | | | 51.2 | 46.8 | 66.4 | 50.4 | 41.3 |
| 21 | | LLaMA | Limited, Non-commercial bespoke license | 50.8 | 48 | 75.6 | 36.3 | 43.3 |
| 22 | | LLaMA | Limited, Non-commercial bespoke license | 50.7 | 45.3 | 75.5 | 36.5 | 45.5 |
| 23 | | LLaMA | Limited, Non-commercial bespoke license | 50.1 | 44.7 | 73.4 | 36.9 | 45.4 |
| 24 | | LLaMA | Limited, Non-commercial bespoke license | 49.7 | 48 | 77.1 | 36.1 | 37.7 |
| 25 | | Falcon | | 48.8 | 47.9 | 78.1 | 35 | 34.3 |
| 26 | | | Proprietary | 48.6 | 47.7 | 77.7 | 35.6 | 33.4 |
| 27 | | LLaMA | Limited, Non-commercial bespoke license | 48.4 | 45.5 | 75.2 | 34.4 | 38.7 |
| 28 | | Falcon | | 48.4 | 45.9 | 70.8 | 32.8 | 44.1 |
| 29 | | LLaMA | Limited, Non-commercial bespoke license | 47.6 | 46.6 | 75.6 | 34.2 | 34.1 |
| 30 | | | | 47.6 | 46.7 | 76.2 | 32.3 | 35.3 |
| 31 | | | | 46.4 | 46.8 | 71.9 | 32.8 | 34 |
| 32 | | | Apache 2.0 | 46.2 | 41.2 | 64.5 | 33.3 | 45.6 |
| 33 | | | Apache 2.0 | 45.9 | 45.2 | 73.4 | 33.3 | 31.7 |
| 34 | | | | 45.7 | 44.4 | 71.3 | 34 | 33.2 |
| 35 | | | Limited, Non-commercial bespoke license. There is also a version based on Pythia which is Apache licensed. | 45.6 | 45.6 | 68.5 | 30.6 | 37.8 |
| 36 | | | Apache 2.0 | 44.9 | 41.2 | 72.3 | 31.7 | 34.3 |
| 37 | | | | 44.6 | 42.6 | 68.8 | 31.6 | 35.5 |
| 38 | | | | 44.4 | 43.7 | 69.3 | 30.2 | 34.5 |
| 39 | | | | 44.3 | 41.4 | 67.6 | 32.3 | 36 |
| 40 | | | | 44.2 | 41.7 | 68.1 | 32.7 | 34.4 |
| 41 | | | | 44 | 40.5 | 71.3 | 30.4 | 34 |
| 42 | | | | 43.8 | 40.2 | 70.7 | 30.1 | 34.4 |
| 43 | | | | 42.9 | 39.9 | 63.8 | 31.2 | 36.7 |
| 44 | | | | 42.2 | 40.2 | 64.7 | 30.6 | 33.2 |
| 45 | | | | 42.1 | 39.8 | 65.2 | 29.7 | 33.7 |
| 46 | | | | 42 | 42.6 | 49.3 | 34.1 | 42.1 |
| 47 | | | | 41.2 | 35 | 61.9 | 30.3 | 37.8 |
| 48 | | | | 41.1 | 35.2 | 57.6 | 30.8 | 40.7 |
| 49 | | | | 40 | 33.3 | 59.1 | 29.8 | 37.9 |
| 50 | | | | 39.8 | 31.7 | 49.4 | 34.4 | 43.7 |
| 51 | | | | 39.2 | 33.6 | 51.2 | 28.9 | 43.3 |
| 52 | | | | 38.9 | 33.8 | 59.1 | 28 | 34.6 |
| 53 | | | | 38.8 | 33.6 | 54.7 | 29.7 | 37.4 |
| 53 | | | | 38.3 | 31.9 | 53.6 | 27.4 | 40.2 |
| 54 | | | | 37.8 | 30.9 | 52.7 | 27.5 | 40.1 |
| 55 | | | | 37.7 | 29.6 | 54.6 | 27.7 | 38.7 |
| 56 | | | | 36.8 | 30.3 | 51.4 | 26.9 | 38.5 |
| 57 | | | | 35.9 | 30 | 47.7 | 25.9 | 40 |
| 58 | | | | 34 | 25.9 | 45.6 | 25.6 | 38.7 |
| 59 | | | | 33.8 | 27.2 | 40.2 | 27 | 40.7 |
| 60 | | | | 33.4 | 26.1 | 38.5 | 26.2 | 42.7 |
| | | | | 33.2 | 25.5 | 37.6 | 26.6 | 43 |
| | | | | 32.3 | 27.6 | 35.6 | 26.3 | 39.7 |
| | | | | 32.2 | 23.6 | 36.7 | 27.3 | 41 |
| | | | | 32 | 22.6 | 27.2 | 27.1 | 51.2 |
| | | | | 31.6 | 24.7 | 30.2 | 28.9 | 42.8 |
| | | | | 31.2 | 23.1 | 31.5 | 27.4 | 42.9 |
| | | | | 31.2 | 22.6 | 32.8 | 26.1 | 43.4 |
| | | | | 30.4 | 21.9 | 31.6 | 27.5 | 40.7 |
| | | | | 30.2 | 22.2 | 27.5 | 26.8 | 44.5 |
| | | | | 29.9 | 20 | 26.7 | 26.7 | 46.3 |
| | | | | 29.8 | 22.7 | 31.1 | 27.3 | 38 |
| | | GPT | | | | | | |
Source: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
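
The scores listed for tiiuae/falcon-40b are consistent with the Average column being the unweighted mean of the four benchmark scores, rounded to one decimal place. That relationship is inferred from the table rows and is not stated by the source, so the sketch below treats it as an assumption. The helper name `leaderboard_average` is hypothetical.

```python
from statistics import mean


def leaderboard_average(arc: float, hellaswag: float, mmlu: float, truthfulqa: float) -> float:
    """Unweighted mean of the four benchmark scores, rounded to one decimal place.

    Assumption: this mirrors how the Average column is derived; the source
    does not document the formula.
    """
    return round(mean([arc, hellaswag, mmlu, truthfulqa]), 1)


# Scores for tiiuae/falcon-40b as listed in the table (rank 2).
falcon_40b_average = leaderboard_average(arc=61.9, hellaswag=85.3, mmlu=52.7, truthfulqa=41.7)
print(falcon_40b_average)  # 60.4, matching the Average column for that row
```

Because all four benchmarks carry equal weight under this assumption, a model with a low MMLU score can still rank highly if it does well on HellaSwag and TruthfulQA.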