LLM evaluation

AlpacaEval 1.1k

An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.

BigCode Eval 638

BigCode Evaluation Harness is a framework for evaluating autoregressive code generation language models.

FastChat 34.3k

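FastChat is an open platform for training, serving, and evaluating LLM-based chatbots. It powers Chatbot Arena, a benchmark platform for large language models (LLMs) that features anonymous, randomized battles in a crowdsourced manner.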