Evaluating Evaluations

About EvalEval


We’re building a coalition on evaluating evaluations (EvalEval)!

We are a researcher community developing scientifically grounded research outputs and robust deployment infrastructure for broader impact evaluations.

Hosted by Hugging Face, the University of Edinburgh, and EleutherAI, this cross-sector, interdisciplinary coalition operates across the following working groups:

Research

  • Evaluation Cards
  • Benchmark Saturation
  • Evaluation Science

Infrastructure

  • Single Data Format and Caching Evals
  • Evaluation Harness and Tutorials

Organization

  • Community Engagement
  • Research Workshops

The flawed state of evaluations, the lack of consensus on how to document an evaluation’s applicability and utility, and the uneven coverage of broader impact categories all hinder the adoption, scientific study, and policy analysis of generative AI systems. Coalition members are working collectively to improve the state of evaluations.

Beyond the working groups’ direct outputs, we hope this work will shape how research papers approach their Broader Impact sections, how evaluations are released, and public policy more broadly.

Our NeurIPS 2024 “Evaluating Evaluations” workshop kicked off the latest work streams, highlighting tiny paper submissions that provoked and advanced the state of evaluations. Our original framework for establishing categories of social impact is now published as a chapter in the Oxford Handbook on Generative AI, and we hope it will guide standards for Broader Impact analyses. We previously hosted a series of workshops to hone the framework.

Interested in joining? Submit this form to join our Slack!
