DepthPilot AI

System-Level Learning

Assessment

Rubric grading lab: turn 'looks better' into reviewable scoring

This audit forces you to build a scoring rubric, grader spec, and calibration sheet for one live workflow. DepthPilot cares less about a pretty scorecard and more about whether you can explain which dimension failed, why it failed, and what to fix first.

Final artifact

One scoring rubric, one grader spec, and one calibration sheet.

Real acceptance criteria

The bar is not that a score exists. It is that a second reviewer can reuse the scoring logic and that severe failures cannot hide inside an average.
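A small worked example shows how an average can hide a severe failure. The scores below are invented for illustration, on an assumed 0-3 scale, using the four dimensions this page names later.

```python
from statistics import mean

# Hypothetical 0-3 scores for a single response (illustrative values only).
scores = {
    "factuality": 0,
    "citation": 3,
    "instruction_following": 3,
    "escalation_judgment": 3,
}

# The average looks acceptable even though one dimension failed completely.
average = mean(list(scores.values()))

# Looking at the weakest dimension exposes what the average hides.
worst_dimension = min(scores, key=scores.get)

print(f"average={average}, worst={worst_dimension} ({scores[worst_dimension]})")
# prints: average=2.25, worst=factuality (0)
```

An average of 2.25 out of 3 reads as "mostly fine", while the factuality score of 0 is the result a reviewer actually needs to see.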

Where our value shows

This page turns evaluation from a matter of taste and opinion into a mechanism for diagnosis, prioritization, and rollback decisions.

Scoring rubric

  • Define dimensions before total score so quality is not reduced to one impression.
  • Write 0-3 anchors for each dimension so reviewers do not improvise meaning.
  • State which dimensions can trigger a hard fail instead of being averaged away.
  • Tie each dimension to the layer you would actually repair when it fails, so a low score points to a fix.
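The four points above can be sketched as a small data structure. This is a minimal illustration, not a prescribed DepthPilot format: the class name `Dimension`, the field names, the anchor wording, and the repair-layer labels ("retrieval", "prompt") are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Dimension:
    name: str
    anchors: Dict[int, str]              # one written anchor per score 0..3
    hard_fail_at: Optional[int] = None   # scores at or below this are a hard fail
    repair_layer: str = "unassigned"     # hypothetical label for where the fix lives

    def __post_init__(self) -> None:
        # Refuse rubrics that leave any score level undefined.
        if sorted(self.anchors) != [0, 1, 2, 3]:
            raise ValueError(f"{self.name}: anchors must cover scores 0-3")


# Illustrative rubric; the anchor text is invented for this sketch.
RUBRIC: List[Dimension] = [
    Dimension(
        name="factuality",
        anchors={
            0: "States something false as fact",
            1: "Mostly correct, with one material error",
            2: "Correct, with minor imprecision",
            3: "Correct and precise",
        },
        hard_fail_at=0,
        repair_layer="retrieval",
    ),
    Dimension(
        name="instruction_following",
        anchors={
            0: "Ignores the instruction",
            1: "Follows part of the instruction",
            2: "Follows it, with a small deviation",
            3: "Follows it fully",
        },
        repair_layer="prompt",
    ),
]


def evaluate(rubric: List[Dimension], scores: Dict[str, int]) -> dict:
    """Check hard-fail rules first; the total is reported but never overrides them."""
    hard_fails = [
        d for d in rubric
        if d.hard_fail_at is not None and scores[d.name] <= d.hard_fail_at
    ]
    return {
        "hard_fail": bool(hard_fails),
        "failed_dimensions": [d.name for d in hard_fails],
        "repair_first": [d.repair_layer for d in hard_fails],
        "total": sum(scores[d.name] for d in rubric),
    }


result = evaluate(RUBRIC, {"factuality": 0, "instruction_following": 3})
```

Here `result["hard_fail"]` is true even though the total is 3 out of 6, and `result["repair_first"]` names the layer to look at. Keeping the hard-fail check separate from the total is the design choice that stops a severe failure from being averaged away.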

Grader spec

  • Write what evidence the grader inspects, how it scores, and when it must reference the trace.
  • Separate final-answer grading from trace grading.
  • Make the grader spec usable by both humans and automated graders.
  • Write rules for uncertain cases instead of leaving room for grader vibes.
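One way to make these points concrete is to write the grader spec as data that a human can read and an automated grader can consume. The sketch below is an assumption-heavy illustration: `GraderRule`, `plan_grading`, the step wording, and the uncertainty rules are hypothetical and not taken from any DepthPilot template.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

VALID_EVIDENCE = ("final_answer", "trace")


@dataclass(frozen=True)
class GraderRule:
    dimension: str
    evidence: str             # which artifact the grader inspects
    steps: Tuple[str, ...]    # ordered scoring steps, readable by humans and graders
    on_uncertain: str         # explicit rule for cases the anchors do not settle


# Illustrative spec: one rule graded on the final answer, one on the trace.
GRADER_SPEC: List[GraderRule] = [
    GraderRule(
        dimension="citation",
        evidence="final_answer",
        steps=(
            "List each claim that needs a source",
            "Check that each listed claim has a citation",
            "Pick the matching 0-3 anchor",
        ),
        on_uncertain="Assign the lower candidate score and flag for a second reviewer",
    ),
    GraderRule(
        dimension="escalation_judgment",
        evidence="trace",
        steps=(
            "Find the step where escalation was or was not triggered",
            "Compare that decision with the escalation policy",
            "Pick the matching 0-3 anchor",
        ),
        on_uncertain="Do not score; mark the case as needs_review",
    ),
]


def plan_grading(spec: List[GraderRule], available: Set[str]) -> Dict[str, str]:
    """Say, per dimension, whether grading can proceed with the evidence at hand."""
    plan = {}
    for rule in spec:
        if rule.evidence not in VALID_EVIDENCE:
            raise ValueError(f"{rule.dimension}: unknown evidence {rule.evidence!r}")
        if rule.evidence in available:
            plan[rule.dimension] = "grade"
        else:
            plan[rule.dimension] = f"blocked: needs {rule.evidence}"
    return plan


plan = plan_grading(GRADER_SPEC, {"final_answer"})
```

With only the final answer available, `plan` allows citation grading and blocks escalation judgment until the trace is supplied, which keeps trace-dependent dimensions from being guessed from the answer alone.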

Calibration sheet

  • Keep reviewer disagreements visible instead of jumping to one average.
  • Use disagreement patterns to improve rubric anchors.
  • Treat calibration as quality governance, not as a one-time ritual.
  • Make it directly useful for launch, rollback, and repair order.
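A calibration sheet along these lines can be as simple as one row per reviewer score, with disagreement computed per case and dimension rather than averaged out. The following is a minimal sketch: the row layout, the reviewer names, the scores, and the spread threshold of 2 are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical calibration rows: (case_id, reviewer, dimension, score on 0-3).
rows = [
    ("case-1", "reviewer_a", "factuality", 3),
    ("case-1", "reviewer_b", "factuality", 1),
    ("case-1", "reviewer_a", "citation", 2),
    ("case-1", "reviewer_b", "citation", 2),
    ("case-2", "reviewer_a", "factuality", 2),
    ("case-2", "reviewer_b", "factuality", 0),
]


def disagreements(rows, min_spread=2):
    """Keep every reviewer's score visible and flag cells where they diverge."""
    by_cell = defaultdict(dict)
    for case_id, reviewer, dimension, score in rows:
        by_cell[(case_id, dimension)][reviewer] = score

    flagged = []
    for (case_id, dimension), scores in sorted(by_cell.items()):
        spread = max(scores.values()) - min(scores.values())
        if spread >= min_spread:
            flagged.append({
                "case": case_id,
                "dimension": dimension,
                "scores": scores,   # individual scores, not an average
                "spread": spread,
            })
    return flagged


def anchors_to_revise(flagged):
    """Rank dimensions by how often reviewers diverged on them."""
    counts = defaultdict(int)
    for entry in flagged:
        counts[entry["dimension"]] += 1
    return sorted(counts, key=lambda d: (-counts[d], d))


flagged = disagreements(rows)
revise = anchors_to_revise(flagged)
```

In this sample, both factuality cells are flagged and citation is not, so `revise` points at the factuality anchors as the first ones to rewrite. Rerunning the same check after each rubric revision is one way to treat calibration as ongoing governance rather than a single pass.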

Proof you must keep before launch

  • One scoring rubric with dimensions, anchors, and hard-stop rules.
  • One grader spec that defines inputs, scoring steps, and override conditions.
  • One calibration sheet that records disagreement and revision notes.
  • One short recap explaining whether factuality, citation, instruction following, or escalation judgment should be fixed first.
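The recap's "what to fix first" question can be backed by a simple ordering over graded cases. The sketch below assumes one possible ordering rule (most zero scores first, then lowest total); the sample scores and the function name `repair_order` are hypothetical.

```python
from collections import Counter

# Hypothetical graded cases: dimension -> score on a 0-3 scale.
graded = [
    {"factuality": 0, "citation": 2, "instruction_following": 3, "escalation_judgment": 1},
    {"factuality": 3, "citation": 1, "instruction_following": 3, "escalation_judgment": 0},
    {"factuality": 0, "citation": 2, "instruction_following": 2, "escalation_judgment": 3},
]


def repair_order(graded):
    """Order dimensions by count of zero scores, then by lowest total score."""
    zeros = Counter()
    totals = Counter()
    for case in graded:
        for dimension, score in case.items():
            totals[dimension] += score
            if score == 0:
                zeros[dimension] += 1
    # Dimension name is the final key so ties resolve the same way every run.
    return sorted(totals, key=lambda d: (-zeros[d], totals[d], d))


order = repair_order(graded)
print(order)
# prints: ['factuality', 'escalation_judgment', 'citation', 'instruction_following']
```

Ranking by zero scores before totals keeps the ordering consistent with the hard-fail idea: a dimension that fails outright in several cases comes ahead of one that is merely mediocre everywhere.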

Reusable grading templates

Reference appendix

These links anchor rubric and grader design principles. The actual lesson is the scoring dimensions, rules, and calibration process above.
