AlpacaEval 1.1k
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
BigCode Eval 638
BigCode Evaluation Harness is a framework for the evaluation of autoregressive code generation language models.
FastChat 34.3k
FastChat is an open platform for training, serving, and evaluating large language models (LLMs). It powers Chatbot Arena, which benchmarks models through anonymous, randomized, crowdsourced battles.