Evaluating Evaluations

About EvalEval


We’re building a coalition on evaluating evaluations (EvalEval)!

We are a researcher community developing scientifically grounded research outputs and robust deployment infrastructure for broader impact evaluations.

Hosted by Hugging Face, the University of Edinburgh, and EleutherAI, this cross-sector, interdisciplinary coalition operates across the following working groups:

Research

  • Evaluation Cards
  • Benchmark Saturation
  • Evaluation Science

Infrastructure

  • Single Data Format and Caching Evals
  • Evaluation Harness and Tutorials

Organization

  • Community Engagement
  • Research Workshops

The flawed state of evaluations, the lack of consensus on how to document an evaluation’s applicability and utility, and the uneven coverage of broader impact categories all hinder the adoption, scientific study, and policy analysis of generative AI systems. Coalition members are working collectively to improve the state of evaluations.

Beyond the working groups’ direct outputs, we hope this work will shape how research papers approach their Broader Impact sections, how evaluations are released, and public policy more broadly.

Our NeurIPS 2024 “Evaluating Evaluations” workshop kicked off the latest work streams, highlighting tiny paper submissions that provoked and advanced the state of evaluations. Our original framework for establishing categories of social impact is now published as a chapter in the Oxford Handbook on Generative AI, and we hope it will guide standards for Broader Impact analyses. We previously hosted a series of workshops to hone the framework.

Interested in joining? Submit this form to join our Slack!
