Evaluate text with LLM-as-a-Judge
A Python library for evaluating text outputs against weighted criteria. Define rubrics, run evaluations, and measure quality at scale.
from autorubric import Rubric, LLMConfig
from autorubric.graders import CriterionGrader

# Each criterion pairs a weight with a plain-language requirement.
rubric = Rubric.from_dict([
    {"name": "accuracy", "weight": 10, "requirement": "Response is factually correct"},
    {"name": "clarity", "weight": 8, "requirement": "Explanation is clear and concise"},
])

grader = CriterionGrader(llm_config=LLMConfig(model="openai/gpt-4.1-mini"))

# `submission` is the text output you want to evaluate.
result = await rubric.grade(submission, grader=grader)
print(f"Score: {result.score:.0%}")
Weighted Criteria
Define rubrics with positive and negative weights. Penalize errors, reward quality, and compute normalized scores.
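To make the weighting concrete, here is a minimal sketch of one common normalization (an assumption for illustration, not necessarily autorubric's exact formula): met positive criteria add their weight, triggered penalties subtract theirs, and the total is divided by the sum of positive weights.

# Illustration only: one plausible way to normalize weighted verdicts.
criteria = [
    {"name": "accuracy", "weight": 10, "met": True},
    {"name": "clarity", "weight": 8, "met": True},
    {"name": "hallucination", "weight": -5, "met": True},  # negative criterion triggered
]

positive_total = sum(c["weight"] for c in criteria if c["weight"] > 0)
raw = sum(c["weight"] for c in criteria if c["met"])
score = max(0.0, raw / positive_total)  # clamp so heavy penalties don't go negative
print(f"Score: {score:.0%}")  # 13/18 -> 72%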
Ensemble Judging
Combine multiple LLM judges with voting strategies to improve reliability on high-stakes evaluations.
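For instance, a simple majority vote over independent judge verdicts looks like the sketch below; this is generic Python for illustration, not autorubric's built-in ensemble API.

from collections import Counter

def majority_vote(verdicts: list[str]) -> str:
    """Pick the most common verdict from several independent judges."""
    return Counter(verdicts).most_common(1)[0][0]

print(majority_vote(["MET", "MET", "UNMET"]))  # MET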
Few-Shot Calibration
Calibrate judges with labeled examples. Balance verdict distributions and improve agreement with your ground truth.
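A minimal sketch of what calibration data might look like, assuming labeled submission/verdict pairs are injected into the judge prompt; the field names and prompt layout here are illustrative, not autorubric's API.

# Hypothetical structure for labeled calibration examples.
few_shot_examples = [
    {
        "submission": "Water boils at 100 °C at sea level.",
        "criterion": "Response is factually correct",
        "verdict": "MET",
    },
    {
        "submission": "The Great Wall of China is visible from the Moon.",
        "criterion": "Response is factually correct",
        "verdict": "UNMET",
    },
]

# Labeled examples like these are typically prepended to the judge prompt so
# the model sees how borderline cases were graded before judging new text.
prompt_header = "\n\n".join(
    f"Submission: {ex['submission']}\nCriterion: {ex['criterion']}\nVerdict: {ex['verdict']}"
    for ex in few_shot_examples
)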
Comprehensive Metrics
Compute accuracy, Cohen's kappa, precision, recall, and correlations against human ground truth.
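These are standard agreement metrics; as a sketch, you can reproduce them against a small set of human labels with scikit-learn (shown here for illustration rather than autorubric's own reporting).

from sklearn.metrics import accuracy_score, cohen_kappa_score, precision_score, recall_score

human = [1, 1, 0, 1, 0, 0, 1, 0]  # human ground truth (1 = MET, 0 = UNMET)
judge = [1, 1, 0, 0, 0, 1, 1, 0]  # LLM judge verdicts on the same items

print(f"Accuracy:  {accuracy_score(human, judge):.2f}")
print(f"Kappa:     {cohen_kappa_score(human, judge):.2f}")
print(f"Precision: {precision_score(human, judge):.2f}")
print(f"Recall:    {recall_score(human, judge):.2f}")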
Multi-Choice Scales
Support ordinal and nominal scales with Likert-style ratings, not just binary MET/UNMET verdicts.
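For example, an ordinal Likert-style scale maps each label to a numeric score; the labels and values below are an assumed illustration, not autorubric's built-in scale.

LIKERT = {
    "strongly_disagree": 0.0,
    "disagree": 0.25,
    "neutral": 0.5,
    "agree": 0.75,
    "strongly_agree": 1.0,
}

def to_score(label: str) -> float:
    """Map an ordinal verdict onto a score in [0, 1]."""
    return LIKERT[label]

print(to_score("agree"))  # 0.75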
Multi-Provider
Works with OpenAI, Anthropic, Google, and any OpenAI-compatible API out of the box.
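Switching providers is a matter of pointing LLMConfig at a different model string; the provider prefixes and model names below are assumptions extrapolated from the example above, so check the identifiers your provider exposes.

from autorubric import LLMConfig

# Assumed provider/model prefix format, following the "openai/gpt-4.1-mini"
# example above; exact identifiers depend on your provider.
openai_judge = LLMConfig(model="openai/gpt-4.1-mini")
anthropic_judge = LLMConfig(model="anthropic/claude-sonnet-4-20250514")
gemini_judge = LLMConfig(model="gemini/gemini-2.0-flash")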