Awesome Open Source
Search results for benchmark llm
25 search results found
Opencompass (⭐ 2,758): OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama 3, Mistral, InternLM2, GPT-4, Llama 2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Llm Eval Survey (⭐ 1,066): The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".
Fastrag (⭐ 591): Efficient Retrieval Augmentation and Generation Framework.
Longbench (⭐ 303): LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding (a loading sketch follows the results list).
Awesome Llm Eval (⭐ 183): Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for the evaluation of LLMs.
Trustllm (⭐ 164): TrustLLM: Trustworthiness in Large Language Models.
Uhgeval (⭐ 140): Benchmarking the Hallucination of Chinese Large Language Models via Unconstrained Generation.
Vlmevalkit (⭐ 137): Open-source evaluation toolkit for large vision-language models (LVLMs); supports GPT-4V, Gemini, QwenVLPlus, 30+ HF models, and 15+ benchmarks.
Hallusionbench (⭐ 128): HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models.
Awesome Llm Long Context Modeling (⭐ 115): 📰 Must-read papers and blogs on LLM-based Long Context Modeling 🔥
Fasteval (⭐ 102): Fast & more realistic evaluation of chat language models. Includes a leaderboard.
Deepmark (⭐ 74): Deepmark AI provides a testing environment for assessing language models (LLMs) on task-specific metrics and on your own data, so your GenAI-powered solution performs predictably and reliably.
Lawbench (⭐ 69): Benchmarking Legal Knowledge of Large Language Models.
Llm Rgb (⭐ 66): LLM Reasoning and Generation Benchmark. Systematically evaluates LLMs in complex scenarios.
Arb (⭐ 35): Advanced Reasoning Benchmark Dataset for LLMs.
Mac Ml Speed Test (⭐ 27): A few quick scripts focused on testing TensorFlow/PyTorch/Llama 2 on macOS.
Llm Benchmark (⭐ 24): A list of LLM benchmark frameworks (a minimal evaluation-loop sketch appears after the results list).
M3dbench (⭐ 23): M3DBench introduces a comprehensive 3D instruction-following dataset with support for interleaved multi-modal prompts, and provides a new benchmark to assess large models across 3D vision-centric tasks.
Cmexam (⭐ 21): A Chinese National Medical Licensing Examination dataset and large language model benchmarks.
Reform Eval (⭐ 19): A benchmark for evaluating the capabilities of large vision-language models (LVLMs).
Vllm Safety Benchmark (⭐ 15): Official PyTorch implementation of "How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs".
C Vqa (⭐ 14): Counterfactual Reasoning VQA Dataset.
Cloudeval Yaml (⭐ 12): ☁️ Benchmarking LLMs for Cloud Config Generation (large-model benchmarking for cloud scenarios).
Dpl (⭐ 10): [NeurIPS 2023] Multi-fidelity hyperparameter optimization with deep power laws that achieves state-of-the-art results across diverse benchmarks.
Language Model Recommendation (⭐ 8): Resources accompanying the paper "Zero-Shot Recommendation as Language Modeling" (ECIR 2022).
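LongBench above is distributed through the Hugging Face Hub, so the quickest way to inspect it is with the `datasets` library. The snippet below is a minimal sketch, assuming the `THUDM/LongBench` repository, a `hotpotqa` subset, and `context`/`input`/`answers` record fields as described in the project's README; none of these identifiers come from the search listing itself.

```python
# Minimal sketch: pull one LongBench task from the Hugging Face Hub.
# Assumptions (not from the listing above): the dataset lives at
# "THUDM/LongBench", subsets such as "hotpotqa" exist, and each record
# has "context", "input", and "answers" fields; consult the LongBench
# README if your version differs.
from datasets import load_dataset

data = load_dataset(
    "THUDM/LongBench", "hotpotqa", split="test",
    trust_remote_code=True,  # LongBench ships a dataset loading script
)

sample = data[0]
prompt = f"{sample['context']}\n\nQuestion: {sample['input']}\nAnswer:"
print(prompt[:500])       # long-context input, truncated for display
print(sample["answers"])  # gold answer(s) to score a model against
```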
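Most of the benchmarks above, whatever the domain, reduce to the pattern that frameworks such as those catalogued in Llm Benchmark automate: render a prompt per example, generate a completion, and score it against a reference. Below is a minimal, framework-free sketch of that loop using Hugging Face `transformers`; the model name, toy examples, and exact-match scoring are illustrative placeholders, not the method of any listed project.

```python
# Minimal benchmark-style evaluation loop: prompt -> generate -> score.
# The model choice, toy examples, and exact-match metric are illustrative
# assumptions, not the API of any project in the list above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; swap in the model under test
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

examples = [  # stand-in for a real benchmark's test split
    {"prompt": "Q: What is 2 + 2?\nA:", "answer": "4"},
    {"prompt": "Q: What is the capital of France?\nA:", "answer": "Paris"},
]

correct = 0
for ex in examples:
    inputs = tokenizer(ex["prompt"], return_tensors="pt")
    out = model.generate(
        **inputs, max_new_tokens=8,
        pad_token_id=tokenizer.eos_token_id,  # gpt2 has no pad token
    )
    # Decode only the newly generated tokens, not the echoed prompt.
    completion = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    correct += int(ex["answer"].lower() in completion.lower())

print(f"exact-match accuracy: {correct}/{len(examples)}")
```

Real harnesses such as OpenCompass or FastEval layer batching, few-shot prompt templates, and task-specific metrics on top of this basic loop.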