
Coaching at scale: a batting feedback VLM

Building a vision-language model that gives every batsman the eye their coach cannot be everywhere to give.

FILED May 2026 · IMPRINT Foundry · READ 10 min read · ENTRY 04

One expert eye serves about a dozen batsmen at best. Most of the coaching attention in a typical academy session falls to whichever player is in the highest-stakes net at the moment. The other eleven net for fifteen minutes and walk away with the same general note they walked in with. The academy partner asked us to close that gap. We built the model that does.

§01 The brief

Cricket coaching is genuinely expert work. Reading a delivery, classifying what shot it should have produced, and grading what the batsman actually did is a skill that takes years to develop. The academy had a handful of full-time coaches for several hundred enrolled batsmen across multiple training sessions a day. The math does not work. They asked us to give every batsman an expert eye after every ball.

[Figure: pipeline diagram. Video input (academy camera, per ball) feeds a ball tracker (small-object detection, trajectory smoothing) and a pose estimator (batsman, bat-as-object). These yield delivery features (type, speed, line/length, angle) and pose features (stance, trigger, shot, follow-through), which drive the optimal-shot model (learned from partner labels), the 24-class shot classifier, and the execution score (timing, footwork, weight, balance). The VLM layer turns the structured output plus video into a coaching paragraph in the coach's voice, at most three sentences, delivered per ball at T+10s.]
FIG. i · delivery video to coaching paragraph in under ten seconds

§02 What the model needs to know about the delivery

Before we can recommend the optimal shot, we need to characterize the ball. The features the pipeline computes per delivery:

  • Type: pace, swing, seam, off-spin, leg-spin, googly
  • Speed: derived from frame timing across the ball's trajectory
  • Line: relative to the stumps and to the batsman's stance
  • Length: yorker, full, good, short, bouncer
  • Angle of arrival: incoming or going away, with magnitude

Tracking the ball is the hardest part of this. It is a small object moving fast in a frame where the model also has to keep track of the bowler and the batsman. The pipeline uses a ball-specific detector plus trajectory smoothing across frames. The trajectory feeds a small physics model that estimates speed and the predicted bounce point, which together classify length more reliably than visual length estimation alone.
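As a rough illustration of how speed and length can fall out of a tracked trajectory, here is a minimal sketch. It assumes the tracker already yields calibrated positions in metres, and it uses a drag-free projectile as a stand-in for the physics model. The names (`TrackPoint`, `smooth`, `speed_kmh`, `predict_bounce_x`, `classify_length`) and the length thresholds are hypothetical, not the production values.

```python
from dataclasses import dataclass
from math import hypot, sqrt


@dataclass
class TrackPoint:
    frame: int    # camera frame index
    x_m: float    # distance travelled down the pitch, metres (assumes a calibrated camera)
    y_m: float    # height above the pitch, metres


def smooth(points, window=3):
    """Centred moving average over positions; a simple stand-in for trajectory smoothing."""
    half = window // 2
    out = []
    for i, p in enumerate(points):
        nbrs = points[max(0, i - half): i + half + 1]
        out.append(TrackPoint(p.frame,
                              sum(q.x_m for q in nbrs) / len(nbrs),
                              sum(q.y_m for q in nbrs) / len(nbrs)))
    return out


def speed_kmh(points, fps):
    """Average speed: path length divided by elapsed time derived from frame indices."""
    dist = sum(hypot(b.x_m - a.x_m, b.y_m - a.y_m) for a, b in zip(points, points[1:]))
    elapsed_s = (points[-1].frame - points[0].frame) / fps
    return dist / elapsed_s * 3.6


def predict_bounce_x(x0, y0, vx, vy, g=9.81):
    """Drag-free projectile: x position where the ball first reaches pitch height (y = 0)."""
    t = (vy + sqrt(vy * vy + 2.0 * g * y0)) / g
    return x0 + vx * t


# Illustrative thresholds only: bounce distance from the batsman's stumps, in metres.
LENGTH_BANDS = [(2.0, "yorker"), (6.0, "full"), (8.0, "good"), (10.0, "short")]


def classify_length(bounce_from_stumps_m):
    """Map a bounce distance to one of the five length labels."""
    for upper, label in LENGTH_BANDS:
        if bounce_from_stumps_m < upper:
            return label
    return "bouncer"
```

For example, a ball that moves 0.8 m between consecutive frames at 50 fps is travelling at 40 m/s, or 144 km/h. Classifying length from the predicted bounce point rather than from appearance is the design choice the paragraph above describes.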

§03 What the model needs to know about the batsman

Pose estimation and bat tracking handle the human side. The pipeline tracks the batsman across four phases:

  • 01 · Stance (pre-release): weight balanced, bat back, eyes on the bowler
  • 02 · Trigger (pre-shot): weight shifts, bat begins to lift
  • 03 · Shot (contact): bat path through the ball, contact frame logged
  • 04 · Follow-through (recovery): balance held, weight forward, head over the ball

FIG. ii · four pose phases the model tracks per ball

Each phase produces a pose vector. The shot classifier sees the full sequence and labels what shot was actually played: forward defensive, cover drive, pull, cut, hook, leave, anything else in the taxonomy.

§04 The shot taxonomy and training data

The taxonomy has twenty-four shot classes covering the realistic strokes, plus a “no shot” leave class. Training data came from two sources. A large set of broadcast footage, weakly labeled from public commentary, built the base classifier. A smaller set of academy footage, hand-labeled by the partner's coaching staff, fine-tuned the model to the partner's specific coaching framework.
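The article does not say how commentary was turned into weak labels. One common approach, shown here purely as an assumption, is keyword matching that abstains when the text is ambiguous. The keyword map and the `weak_label` function are hypothetical and cover only a handful of the classes.

```python
import re

# Hypothetical keyword map for illustration; the real taxonomy is much larger.
SHOT_KEYWORDS = {
    "cover drive": ["cover drive", "drives through the covers"],
    "pull": ["pull", "pulls", "pulled"],
    "cut": ["cut", "cuts"],
    "forward defensive": ["forward defensive", "defends", "defended"],
    "leave": ["leaves", "left alone", "shoulders arms"],
}


def weak_label(commentary):
    """Return a shot label if exactly one class matches the commentary, else None."""
    text = commentary.lower()
    hits = [
        shot
        for shot, keywords in SHOT_KEYWORDS.items()
        if any(re.search(r"\b" + re.escape(k) + r"\b", text) for k in keywords)
    ]
    # Abstain on zero or multiple matches: weak labels are noisy, so ambiguous
    # clips are better dropped than guessed.
    return hits[0] if len(hits) == 1 else None
```

Abstaining keeps the base classifier's training set smaller but cleaner, which leaves the expert hand-labels to do the fine-grained correction.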

§05 The decision

For every delivery the model produces two answers:

  • The optimal shot, conditional on the delivery features
  • The actual shot played, from the pose pipeline

If those match, the feedback focuses on execution quality. If they do not match, the feedback explains the gap.
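The branching above is small enough to sketch directly. This assumes the optimal-shot model may return more than one acceptable shot (the sample output later in this entry names two), and the `Decision` type and `feedback_mode` function are illustrative names, not the production interface.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    optimal_shots: tuple   # one or more acceptable shots for this delivery (assumption)
    actual_shot: str       # label from the shot classifier


def feedback_mode(decision):
    """'execution' when the shot played was an acceptable choice, otherwise 'gap'."""
    if decision.actual_shot in decision.optimal_shots:
        return "execution"
    return "gap"


# A drive played to a ball that called for a leave or a defensive shot.
mismatch = Decision(optimal_shots=("leave", "forward defensive"), actual_shot="cover drive")
mode = feedback_mode(mismatch)  # "gap": the feedback should explain the shot choice
```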

§06 The scoring

Execution quality is scored along four axes.

  • Timing: when did the bat make contact relative to the optimal contact frame for that delivery
  • Footwork: did the batsman get to the right position for the shot
  • Weight transfer: did body weight move with the shot
  • Balance: did the batsman recover or fall away

Each axis produces a 0 to 10 score. The aggregate goes into the session log alongside the shot label and the optimal-shot recommendation.
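A minimal sketch of how the four axes and the timing measurement could be represented. The article does not specify how the aggregate is formed or how a contact offset maps to a score, so the unweighted mean, the linear falloff, and the 100 ms tolerance below are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class ExecutionScore:
    timing: float            # each axis is on a 0 to 10 scale
    footwork: float
    weight_transfer: float
    balance: float

    def aggregate(self):
        """Unweighted mean of the four axes (assumed; the real weighting is not stated)."""
        return (self.timing + self.footwork + self.weight_transfer + self.balance) / 4.0


def contact_offset_ms(actual_frame, optimal_frame, fps):
    """Positive means the bat arrived late relative to the optimal contact frame."""
    return (actual_frame - optimal_frame) * 1000.0 / fps


def timing_score(offset_ms, tolerance_ms=100.0):
    """Linear falloff: 10 at zero offset, 0 at or beyond the (hypothetical) tolerance."""
    return max(0.0, 10.0 * (1.0 - abs(offset_ms) / tolerance_ms))
```

At 50 frames per second, contact two frames after the optimal frame is an offset of 40 ms, which is the kind of figure that ends up in the coaching paragraph.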

§07 The VLM layer

The earliest version of the system output a JSON object. Coaches did not engage with it. Their feedback to the player happens in language, not in numbers. So we added a vision-language model layer on top of the classification pipeline.

The VLM ingests the structured output plus the original video, and produces a short coaching paragraph in natural language. The prompt is heavily templated; the language we want is direct, technical, in the partner academy's voice, and never longer than three sentences. The output reads more like a coach's note than a system message.

“Good length, off stump, going away at pace. The shot you wanted was a leave or a defensive prod with soft hands. You played a drive and the bat came down forty milliseconds late, the edge would have carried.”
Sample output from production

Coaches accepted this output where they had rejected the JSON.
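To make the templating and the three-sentence limit concrete, here is a hedged sketch of the text side of the VLM call. The template wording, the field names in `analysis`, and the helper functions are invented for illustration, and the video input and the model call itself are omitted.

```python
import json
import re

# Hypothetical template; the production prompt is not published.
PROMPT_TEMPLATE = """You are a batting coach at the academy.
Write feedback for the batsman on the ball just played.
Rules: direct, technical, at most {max_sentences} sentences.

Structured analysis:
{analysis}
"""


def build_prompt(analysis, max_sentences=3):
    """Render the structured pipeline output into the text part of the prompt."""
    return PROMPT_TEMPLATE.format(
        max_sentences=max_sentences,
        analysis=json.dumps(analysis, indent=2, sort_keys=True),
    )


def count_sentences(text):
    """Count sentences by splitting on whitespace that follows ., ! or ?."""
    return len([s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s])


def within_limit(text, max_sentences=3):
    """Guard to run on the model's reply before it is shown to the batsman."""
    return count_sentences(text) <= max_sentences


prompt = build_prompt({
    "delivery": {"length": "good", "line": "off stump", "angle": "going away"},
    "optimal_shots": ["leave", "forward defensive"],
    "actual_shot": "drive",
    "contact_offset_ms": 40,
})
```

A check like `within_limit` is worth keeping outside the prompt: a template can ask for three sentences, but only a post-hoc guard confirms the reply respected it.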

§08 Latency

The whole pipeline runs within ten seconds of the ball. The constraint is operational: a batsman should see feedback before the next ball is bowled, otherwise the loop is broken. Ten seconds is the budget we settled on with the partner after iterating from a longer initial spec. The pipeline meets it on standard academy hardware.
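One way to hold a pipeline to a budget like this is to time each stage and compare the total against the ten-second limit. The sketch below is an assumption about how that could be instrumented; `StageTimer` and the stage names are hypothetical.

```python
import time
from contextlib import contextmanager

BUDGET_S = 10.0  # per-ball budget stated in the text


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock        # injectable so the timer can be tested deterministically
        self.durations = {}

    @contextmanager
    def stage(self, name):
        start = self.clock()
        try:
            yield
        finally:
            self.durations[name] = self.clock() - start

    def total(self):
        return sum(self.durations.values())

    def within_budget(self, budget_s=BUDGET_S):
        return self.total() <= budget_s

    def slowest(self):
        """Name of the stage that consumed the most time."""
        return max(self.durations, key=self.durations.get)
```

In use, each stage would be wrapped as `with timer.stage("ball_tracking"): ...`, and `timer.slowest()` points at where to look first when a ball goes over budget.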

§09 What did not work

The early prototype tried to compute optimal-shot recommendations from first principles using a rules engine. The rules engine could not handle the conditional structure of cricket coaching, where the optimal shot depends on the batsman's strengths as much as on the delivery features. We replaced it with a learned model trained on labels from the partner's coaching staff. The learned model captures the partner's framework directly, which is what they wanted.

The early pose pipeline also lost track of the bat through the shot's follow-through. We fixed that by tracking the bat as a separate object rather than relying on the wrist-joint trajectory. Bat-as-object is now standard in the pipeline.

§10 The handoff and maintenance

The system is fully handed off to the academy. We do quarterly maintenance: model retraining on new session footage, evaluation against a held-out set, and any pipeline updates needed for new hardware. The partner runs the day-to-day.

§11 Where else this pattern works

The pattern is an expert eye at scale for a coached physical skill, with three preconditions:

  1. The skill has a clear taxonomy of correct moves
  2. An expert can grade execution by watching a single repetition
  3. The repetitions are dense enough that human attention is the bottleneck

That describes more than cricket. Golf swing analysis. Tennis stroke production. Gymnastics routine grading. Martial arts form correction. Physiotherapy gait analysis. The hard work is in the taxonomy and the training data per domain. The pipeline architecture transfers.

We are open to building the next one.
