CoEval

CoEval ranks language models for a custom task or domain when no task-specific labeled data exists and standard public benchmarks cannot be trusted. From a task description alone, it generates a fresh, contamination-free benchmark and ranks candidate models with a cross-family judge ensemble, without human labels or raters.
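
To make the ranking step concrete, here is a minimal sketch of how scores from several judges might be combined into one ranking. It is not CoEval's actual code or API: the function `rank_models`, the tuple layout, the model and judge names, and the aggregation rule (average each judge's mean score so every judge counts equally) are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def rank_models(scores):
    """Rank candidate models from judge scores.

    `scores` is a list of (model, judge, item_id, score) tuples.
    This layout is hypothetical, not CoEval's data format.
    """
    # Group scores by candidate model, then by judge.
    per_judge = defaultdict(lambda: defaultdict(list))
    for model, judge, _item_id, score in scores:
        per_judge[model][judge].append(score)

    # Assumed aggregation: average within each judge first, then across
    # judges, so a judge that scored more items does not dominate.
    ensemble = {
        model: mean(mean(vals) for vals in judges.values())
        for model, judges in per_judge.items()
    }

    # Highest ensemble score first.
    return sorted(ensemble.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: two candidates, two judges (imagined to come from different
# model families), two benchmark items.
toy_scores = [
    ("model_a", "judge_x", 1, 4.0),
    ("model_a", "judge_y", 1, 5.0),
    ("model_a", "judge_x", 2, 3.0),
    ("model_a", "judge_y", 2, 4.0),
    ("model_b", "judge_x", 1, 2.0),
    ("model_b", "judge_y", 1, 3.0),
    ("model_b", "judge_x", 2, 3.0),
    ("model_b", "judge_y", 2, 2.0),
]

ranking = rank_models(toy_scores)
print(ranking)
```

Averaging per judge before averaging across judges is only one plausible choice; the paper and repository are the authority on how CoEval actually aggregates judge outputs.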


Source: github.com/ApartsinProjects/CoEval