How it works

The engineering behind the evaluation.

You submit a design. Seven evidence-backed scores and a review plan come back — every judgment cites a verbatim quote from your answer. The path between is a consistent pipeline: auth, routing, business logic, and an LLM reasoning core. Here’s the shape.

What the diagram doesn’t show

The diagram below is infrastructure. The actual craft — the part that makes Deltaframe’s feedback useful instead of generic — lives in three places the diagram can’t draw:

01
The rubric.
200+ lines of scoring instructions, per-difficulty expectations (fundamentals on beginner problems, consensus-protocol reasoning on advanced problems), and feedback-depth rules that make the model teach, not just grade.
02
Per-problem calibration.
Every problem ships with score anchors (“a 0.4 looks like X; a 0.8 looks like Y”) so the evaluator stays consistent across thousands of attempts and doesn’t drift over time.
03
The retention loop.
Evaluation without follow-through is a scoreboard. Spaced-repetition review of concepts you got wrong is what turns one session into durable judgment.

Four-stage evaluation pipeline. The dashed arrow shows the re-attempt loop.

1. Request path

Your browser hits an edge API gateway that handles authenticated sessions, rate limiting (Redis-backed counters), and routing. Claude-dependent routes are guarded here so a missing API key returns 503, not a cryptic error at the handler level.
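
A minimal sketch of that guard, written in Python for illustration only (the production gateway is not shown here). The function name `guard_llm_route`, the environment variable name `ANTHROPIC_API_KEY`, and the response shape are all assumptions.

```python
import os
from typing import Callable, Mapping, Optional, Tuple

Response = Tuple[int, str]  # (HTTP status code, body); hypothetical shape


def guard_llm_route(
    handler: Callable[[], Response],
    env: Optional[Mapping[str, str]] = None,
    key_name: str = "ANTHROPIC_API_KEY",  # assumed variable name
) -> Response:
    """Run the handler only when the API key is configured; otherwise return 503."""
    env = os.environ if env is None else env
    if not env.get(key_name):
        # Fail early with a clear status instead of erroring inside the handler.
        return 503, "Evaluation service unavailable: API key not configured"
    return handler()
```

Checking at the gateway means every Claude-dependent route fails the same way, and the handler never has to reason about a missing key.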

2. Application services

The core business logic lives in a Next.js App Router service layer. Sessions, evaluations, hints, and follow-ups are orchestrated here. PostgreSQL is the source of truth; Redis handles ephemeral state (rate limits, streaming buffers). Server-Sent Events stream partial evaluator output back to your browser as it generates.
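
For readers unfamiliar with Server-Sent Events, here is a rough sketch of what one streamed frame looks like on the wire. The `event:` and `data:` fields and the blank-line terminator come from the SSE format itself. The event name `partial` and the payload fields are hypothetical, not Deltaframe's actual schema.

```python
import json


def sse_frame(event: str, payload: dict) -> str:
    """Serialize one Server-Sent Events frame.

    json.dumps emits a single line by default, so one `data:` field is enough.
    The trailing blank line tells the browser the frame is complete.
    """
    data = json.dumps(payload)
    return f"event: {event}\ndata: {data}\n\n"


# Hypothetical partial-output frame sent while the evaluator is still generating.
frame = sse_frame("partial", {"text": "Throughput target stated"})
```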

3. AI evaluation engine

The grader sees the same rubric every time. Your structured answer and diagram are assembled into a calibrated prompt (versioned in the database for auditability) and sent to Claude. The evaluator returns valid JSON across 7 skill dimensions (listed below). The rubric is anchored per-problem with calibration samples so scoring stays consistent across attempts.

4. Insights & the loop

Your weakest dimensions decide what you study next. Scores update your per-dimension skill signal via an exponentially-weighted average. Your two weakest dimensions enqueue their linked concepts into the retention loop — reviews surface at 1, 3, 7, and 14-day intervals. The retention loop closes when you re-attempt the problem and the same dimensions score higher.
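
The selection and scheduling step can be sketched roughly as follows. This assumes the 1, 3, 7, and 14-day intervals are each counted from the attempt date (they could also be cumulative gaps between reviews) and that ties between dimensions are broken by name. The function names are illustrative.

```python
from datetime import date, timedelta
from typing import Dict, List

REVIEW_INTERVALS_DAYS = (1, 3, 7, 14)  # intervals stated in the text


def weakest_dimensions(scores: Dict[str, float], count: int = 2) -> List[str]:
    """Return the `count` lowest-scoring dimensions (ties broken by name; an assumption)."""
    return sorted(scores, key=lambda name: (scores[name], name))[:count]


def review_dates(attempt_day: date) -> List[date]:
    """Dates on which reviews surface, counted from the attempt date (an assumption)."""
    return [attempt_day + timedelta(days=d) for d in REVIEW_INTERVALS_DAYS]


scores = {"Trade-offs": 2.0, "Scale estimation": 1.5, "High-level design": 3.5}
to_review = weakest_dimensions(scores)   # the two dimensions whose concepts get enqueued
schedule = review_dates(date(2024, 1, 1))
```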

How we evaluate

This page describes evaluation in interview-prep mode. Coaching mode uses the same rubric but explains gaps before scoring.

We publish our rubric. Hello Interview, ByteByteGo, and Design Gurus don’t.

Every design is scored against this rubric. For each criterion it judges met or partial, the AI must quote a verbatim substring from your answer as evidence: no paraphrasing, no inference.

A worked example

Here’s how scoring works on a single dimension. The dimension and criteria below are illustrative — the real, full rubric (with the seven actual dimensions and their criteria) is shown after the example.

Suppose a candidate is asked to design a URL shortener and writes:

“The system needs to handle 100M requests per day. I’ll use a hash function on the long URL and store the mapping in Redis with a 30-day TTL. For redirect, lookup is O(1) by short key. We’ll need to handle hash collisions by re-hashing with a salt. I haven’t addressed analytics or rate limiting.”

On this dimension, the rubric checks three criteria:

Throughput target stated explicitly
Evidence: “100M requests per day”
Read/write ratio considered in storage choice
Evidence: “lookup is O(1) by short key” — partial: read path addressed, write-side concurrency not.
Hot-key handling addressed
No evidence found — viral short links would concentrate load on one key.

That’s 1 met + 0.5 partial + 0 not-met = 1.5/3 → 0.50 normalized → 2.5/5 on the example dimension.

Requirements clarity

10% of score

Did you scope the problem and name what is in and out of scope?

Lists at least three concrete functional requirements
A bullet list, numbered list, or sentence enumerating user-facing capabilities
States at least one non-functional requirement (latency, availability, durability, throughput target)
A target like "p99 < 200 ms", "99.9% availability", "<1% data loss tolerable"
Names at least one capability explicitly out of scope
Phrasing like "out of scope:", "we will not handle X", "skipping Y"
Asks at least one clarifying question or states a working assumption before designing
A question to the interviewer, or "I will assume X for this design"
Names at least one constraint that shapes the design (regulatory, cost, latency, geography)
A constraint statement like "must be GDPR compliant", "cost-sensitive", "global low-latency reads"

Scale estimation

5% of score

Did you produce numeric estimates with explicit math?

States a numeric DAU, MAU, or request-rate estimate
A number followed by units like "100M DAU", "10K req/s", "50 QPS"
Estimates read-to-write ratio or traffic shape (peak vs average)
Phrasing like "10:1 read-heavy", "3x peak over average"
Projects storage growth with explicit math (per-record size × count)
A multiplication like "200 bytes × 1B records = 200 GB"
States a bandwidth, latency percentile, or capacity target
A number with units like "p99 < 200 ms", "10 Gbps egress", "1M concurrent connections"

API & data model

15% of score

Did you define endpoints, request/response shapes, and a data model?

Names at least two HTTP endpoints (or RPC methods) by path/method
Statements like "POST /shorten", "GET /links/:id", or "createShortLink(url) → shortId"
Describes a request or response body shape (fields, types)
Listing fields like "{ url: string, ttl?: number }" or response body shape
Defines at least one data model entity with fields
Schema-like description: "Link table: id, original_url, created_at, owner_id"
Addresses pagination, versioning, or idempotency for at least one endpoint
Phrasing like "cursor-based pagination", "/v1/", "idempotency key"
Considers real-time or async patterns when the problem warrants it (WebSocket, SSE, webhook, queue)
Mentions of WebSocket, SSE, webhooks, message queue, or async job

High-level design

20% of score

Did you name the components and how requests flow through them?

Names at least four distinct components (e.g. load balancer, app tier, cache, database)
A list or diagram description naming components by role
Describes the path a typical request takes through the named components
A sequence like "client → LB → API → cache → DB" with arrows or numbered steps
Distinguishes synchronous from asynchronous paths where it matters
Phrasing like "synchronously serve reads, async write the analytics event"
Names a specific storage technology and justifies the choice (not just "a database")
A concrete choice like "Postgres for relational", "Cassandra for write-heavy", "Redis for cache"

Bottleneck analysis

15% of score

Did you identify the primary bottleneck and quantify it at the stated scale?

Names a specific component as the primary bottleneck
A statement like "the read replica is the bottleneck" or "the cache becomes the hot spot"
States a numeric load value (QPS, RPS, MB/s, GB, ms) the bottleneck must handle
A number with units like "100K QPS hits the cache", "p99 latency exceeds 500 ms here"
Explains why this component is the bottleneck at the stated scale
A causal sentence like "because writes serialize through a single primary"
Proposes at least one mitigation specific to the named bottleneck
A specific intervention like "shard by user_id", "add a write-behind cache", "introduce a queue"
Considers read or write amplification where it applies
Phrasing like "fan-out on write", "10× read amplification from joins"

Scaling & reliability

15% of score

Did you address horizontal scaling, failure modes, and SPOF elimination?

Names a horizontal scaling strategy with a specific mechanism
A concrete strategy like "shard by user_id", "read replicas behind a load balancer", "stateless app tier"
Specifies a replication topology or caching layer with a placement decision
Phrasing like "primary-replica with async replication", "Redis cache in front of reads"
Identifies at least one single point of failure and proposes its elimination
A SPOF callout and fix like "replicate the LB across AZs", "consensus group of 3"
Names a specific failure mode and how the system degrades or recovers
A failure scenario like "if the cache fails, fall back to DB with degraded latency"
Addresses geo-distribution or consensus when the problem warrants it
Mentions of multi-region, leader election, Paxos/Raft, geo-routing

Trade-offs

20% of score

Did you name explicit trade-offs with the rejected option and the cost?

Names a first explicit trade-off with the rejected alternative and its cost
A statement like "I chose A over B because X, accepting Y as the cost"
Names a second explicit trade-off, distinct from the first
A second "A vs B because X" statement on a different axis
Reasons explicitly about consistency vs availability or strong vs eventual
Phrasing like "eventual consistency is acceptable for X because Y"
Names a trade-off specific to this problem, not a generic pattern
A trade-off that references the problem domain (not "consistency vs availability" by itself)

Scoring

Each criterion is judged met, partial, or not met. A dimension score starts as (met + 0.5 × partial) ÷ criteria count in normalized 0-1 space. Deltaframe then displays that result on a 0-5 scale so the score reads like a skill level instead of a percentage.
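
As a sketch, the formula above fits in a few lines of Python. The labels `met`, `partial`, and `not_met` and the function names are illustrative, not the production identifiers.

```python
from typing import Iterable

JUDGMENT_POINTS = {"met": 1.0, "partial": 0.5, "not_met": 0.0}


def normalized_score(judgments: Iterable[str]) -> float:
    """(met + 0.5 * partial) / criteria count, in 0-1 space."""
    judgments = list(judgments)
    if not judgments:
        raise ValueError("a dimension needs at least one criterion")
    return sum(JUDGMENT_POINTS[j] for j in judgments) / len(judgments)


def display_score(normalized: float) -> float:
    """Map a normalized 0-1 score onto the displayed 0-5 scale."""
    return round(normalized * 5, 2)


# The worked example: one met, one partial, one not met.
example = display_score(normalized_score(["met", "partial", "not_met"]))
```

With the worked example's judgments, `normalized_score` gives 0.5 and `display_score` gives 2.5, matching the 2.5/5 shown earlier.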

Skill Posture is reached when every dimension is at 3.0 / 5 or above and the overall average is at least 3.5 / 5.

Dimension weights (shown above) apply to your per-attempt overall score. The longitudinal Skill Posture check uses rubric weights to compute your overall score, emphasising the highest-signal dimensions.
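
A rough sketch of the weighted overall score and the Skill Posture check, using the percentages listed in the rubric. It assumes the "overall average" in the Skill Posture rule is this weighted average. The function names are illustrative.

```python
from typing import Dict

# Percent weights as listed in the rubric above; they sum to 100.
WEIGHTS_PERCENT = {
    "Requirements clarity": 10,
    "Scale estimation": 5,
    "API & data model": 15,
    "High-level design": 20,
    "Bottleneck analysis": 15,
    "Scaling & reliability": 15,
    "Trade-offs": 20,
}


def overall_score(scores: Dict[str, float]) -> float:
    """Weighted average of the seven dimension scores, on the 0-5 scale."""
    return sum(w * scores[name] for name, w in WEIGHTS_PERCENT.items()) / 100


def has_skill_posture(scores: Dict[str, float],
                      floor: float = 3.0,
                      overall_min: float = 3.5) -> bool:
    """Every dimension at or above `floor` and the overall at or above `overall_min`."""
    every_dimension_ok = all(scores[name] >= floor for name in WEIGHTS_PERCENT)
    return every_dimension_ok and overall_score(scores) >= overall_min
```

Note that one weak dimension blocks Skill Posture even when the weighted overall is high: a 2.5 on Scale estimation fails the floor despite carrying only 5% of the score.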

How scores update over time

One bad attempt should nudge a dimension, not crater it. Deltaframe tracks an exponentially-weighted average per dimension across attempts:

dimensionScore[t] = α · attemptDimensionScore + (1 − α) · dimensionScore[t−1]
where α = 0.4 (default)

The weight is adaptive: if your last three attempts on a dimension are all at 4.0 / 5 or higher, α temporarily rises to 0.65 so genuine sustained improvement isn’t smoothed out by older scores.
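
The update rule and the adaptive weight can be sketched as below. One assumption is labeled in the code: the three-attempt window is taken to include the attempt being applied. The constant and function names are illustrative.

```python
from typing import Sequence

DEFAULT_ALPHA = 0.4
BOOSTED_ALPHA = 0.65
STREAK_THRESHOLD = 4.0  # on the 0-5 scale
STREAK_LENGTH = 3


def choose_alpha(attempts: Sequence[float]) -> float:
    """Pick the smoothing weight from this dimension's attempt scores (oldest first).

    Assumption: the window of three includes the newest attempt.
    """
    last = attempts[-STREAK_LENGTH:]
    if len(last) == STREAK_LENGTH and all(s >= STREAK_THRESHOLD for s in last):
        return BOOSTED_ALPHA
    return DEFAULT_ALPHA


def update_dimension(previous: float, attempts: Sequence[float]) -> float:
    """Apply one exponentially-weighted update using the newest attempt score."""
    alpha = choose_alpha(attempts)
    return alpha * attempts[-1] + (1 - alpha) * previous
```

With a tracked score of 3.0, a single weak attempt of 2.0 moves it to about 2.6 rather than dropping it to 2.0. After three attempts at 4.0 or higher, the same starting point moves faster because α is 0.65.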

Over four to six attempts, your dimension scores stabilize around your real skill level — this is the longitudinal signal, not the headline from one session.

Evidence guarantee

For every criterion judged met or partial, the score is backed by a verbatim substring copied directly from your answer. If the evidence cannot be verified as a substring of your actual answer, the judgment is automatically coerced to not met before the score is saved.
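
The check can be sketched as a plain substring test. This assumes an exact, case-sensitive match after trimming surrounding whitespace from the quote (the production matching rules may differ). The `Judgment` type and its field names are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Judgment:
    criterion: str
    verdict: str   # "met", "partial", or "not_met"
    evidence: str  # quote the grader claims to have copied from the answer


def enforce_evidence(judgment: Judgment, answer: str) -> Judgment:
    """Coerce met/partial to not_met unless the quote is a substring of the answer."""
    if judgment.verdict == "not_met":
        return judgment
    quote = judgment.evidence.strip()
    if quote and quote in answer:
        return judgment
    # Unverifiable (or empty) evidence: the judgment does not survive.
    return replace(judgment, verdict="not_met", evidence="")


answer = "The system needs to handle 100M requests per day."
kept = enforce_evidence(
    Judgment("Throughput target stated explicitly", "met", "100M requests per day"), answer)
coerced = enforce_evidence(
    Judgment("Hot-key handling addressed", "met", "we shard hot keys"), answer)
```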

Calibration

Every problem ships with calibration anchors — example answers labelled passing and borderline — that the grader sees alongside the rubric. The active grading prompt is versioned in a database table; every evaluation records which prompt version produced it. There is no silent re-scoring.

Acknowledged limits

Three things keep our evaluation consistent at scale:

Criterion-based, not holistic. The grader judges each named criterion independently, with no invented overall vibe.
Verbatim evidence required. A judgment without a quotable substring auto-coerces to not-met. Hallucinated praise can’t survive.
Scaffolded structured input. You answer in labeled sections (requirements, API, HLD, scaling) so the grader doesn’t have to guess what part of the text addresses what.

If a score feels wrong, the results page has a button to flag the dimension you dispute — that feedback feeds the next prompt version’s calibration anchors.

Submit a design. Get seven evidence-backed scores.

Every score cites a verbatim quote from your answer. Concepts you missed land in your retention loop. First evaluation is free; no credit card.