Awesome Open Source
Search results for benchmark llm
25 search results found
Opencompass (⭐ 2,758): OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama 3, Mistral, InternLM2, GPT-4, Llama 2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Llm Eval Survey (⭐ 1,066): The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".
Fastrag (⭐ 591): Efficient Retrieval Augmentation and Generation Framework.
Longbench (⭐ 303): LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding (a loading sketch follows the results list).
Awesome Llm Eval (⭐ 183): Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for the evaluation of LLMs.
Trustllm (⭐ 164): TrustLLM: Trustworthiness in Large Language Models.
Uhgeval (⭐ 140): Benchmarking the Hallucination of Chinese Large Language Models via Unconstrained Generation.
Vlmevalkit (⭐ 137): Open-source evaluation toolkit for large vision-language models (LVLMs); supports GPT-4V, Gemini, QwenVLPlus, 30+ HF models, and 15+ benchmarks.
Hallusionbench (⭐ 128): HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models.
Awesome Llm Long Context Modeling (⭐ 115): 📰 Must-read papers and blogs on LLM-based Long Context Modeling 🔥
Fasteval (⭐ 102): Fast & more realistic evaluation of chat language models. Includes a leaderboard.
Deepmark (⭐ 74): Deepmark AI provides a testing environment for assessing language models (LLMs) on task-specific metrics and on your own data, so your GenAI-powered solution performs predictably and reliably.
Lawbench (⭐ 69): Benchmarking Legal Knowledge of Large Language Models.
Llm Rgb (⭐ 66): LLM Reasoning and Generation Benchmark. Systematically evaluates LLMs in complex scenarios.
Arb (⭐ 35): Advanced Reasoning Benchmark Dataset for LLMs.
Mac Ml Speed Test (⭐ 27): A few quick scripts focused on testing TensorFlow/PyTorch/Llama 2 on macOS.
Llm Benchmark (⭐ 24): A list of LLM benchmark frameworks (a minimal evaluation-loop sketch appears after the results list).
M3dbench (⭐ 23): M3DBench introduces a comprehensive 3D instruction-following dataset with support for interleaved multi-modal prompts, and provides a new benchmark to assess large models across 3D vision-centric tasks.
Cmexam (⭐ 21): A Chinese National Medical Licensing Examination dataset and large language model benchmarks.
Reform Eval (⭐ 19): A benchmark for evaluating the capabilities of large vision-language models (LVLMs).
Vllm Safety Benchmark (⭐ 15): Official PyTorch implementation of "How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs".
C Vqa (⭐ 14): Counterfactual Reasoning VQA Dataset.
Cloudeval Yaml (⭐ 12): ☁️ Benchmarking LLMs for Cloud Config Generation (large-model benchmarking for cloud scenarios).
Dpl (⭐ 10): [NeurIPS 2023] Multi-fidelity hyperparameter optimization with deep power laws that achieves state-of-the-art results across diverse benchmarks.
Language Model Recommendation (⭐ 8): Resources accompanying the paper "Zero-Shot Recommendation as Language Modeling" (ECIR 2022).
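LongBench above is distributed through the Hugging Face Hub, so the quickest way to inspect it is with the `datasets` library. The snippet below is a minimal sketch, assuming the `THUDM/LongBench` repository, a `hotpotqa` subset, and `context`/`input`/`answers` record fields as described in the project's README; none of these identifiers come from the search listing itself.

```python
# Minimal sketch: pull one LongBench task from the Hugging Face Hub.
# Assumptions (not from the listing above): the dataset lives at
# "THUDM/LongBench", subsets such as "hotpotqa" exist, and each record
# has "context", "input", and "answers" fields; consult the LongBench
# README if your version differs.
from datasets import load_dataset

data = load_dataset(
    "THUDM/LongBench", "hotpotqa", split="test",
    trust_remote_code=True,  # LongBench ships a dataset loading script
)

sample = data[0]
prompt = f"{sample['context']}\n\nQuestion: {sample['input']}\nAnswer:"
print(prompt[:500])       # long-context input, truncated for display
print(sample["answers"])  # gold answer(s) to score a model against
```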
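Most of the benchmarks above, whatever the domain, reduce to the pattern that frameworks such as those catalogued in Llm Benchmark automate: render a prompt per example, generate a completion, and score it against a reference. Below is a minimal, framework-free sketch of that loop using Hugging Face `transformers`; the model name, toy examples, and exact-match scoring are illustrative placeholders, not the method of any listed project.

```python
# Minimal benchmark-style evaluation loop: prompt -> generate -> score.
# The model choice, toy examples, and exact-match metric are illustrative
# assumptions, not the API of any project in the list above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; swap in the model under test
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

examples = [  # stand-in for a real benchmark's test split
    {"prompt": "Q: What is 2 + 2?\nA:", "answer": "4"},
    {"prompt": "Q: What is the capital of France?\nA:", "answer": "Paris"},
]

correct = 0
for ex in examples:
    inputs = tokenizer(ex["prompt"], return_tensors="pt")
    out = model.generate(
        **inputs, max_new_tokens=8,
        pad_token_id=tokenizer.eos_token_id,  # gpt2 has no pad token
    )
    # Decode only the newly generated tokens, not the echoed prompt.
    completion = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    correct += int(ex["answer"].lower() in completion.lower())

print(f"exact-match accuracy: {correct}/{len(examples)}")
```

Real harnesses such as OpenCompass or FastEval layer batching, few-shot prompt templates, and task-specific metrics on top of this basic loop.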