Coaching at scale: a batting feedback VLM
Building a vision-language model that gives every batsman the expert eye a coach cannot be everywhere to provide.
One expert eye serves about a dozen batsmen at best. Most of the coaching attention in a typical academy session falls to whichever player is in the highest-stakes net at the moment. The other eleven net for fifteen minutes and walk away with the same general note they walked in with. The academy partner asked us to close that gap. We built the model that does.
§01The brief
Cricket coaching is genuinely expert work. Reading a delivery, classifying what shot it should have produced, and grading what the batsman actually did is a skill that takes years to develop. The academy had a handful of full-time coaches for several hundred enrolled batsmen across multiple training sessions a day. The math does not work. They asked us to give every batsman an expert eye after every ball.
§02What the model needs to know about the delivery
Before we can recommend the optimal shot, we need to characterize the ball. The features the pipeline computes per delivery:
- Type: pace, swing, seam, off-spin, leg-spin, googly
- Speed: derived from frame timing across the ball's trajectory
- Line: relative to the stumps and to the batsman's stance
- Length: yorker, full, good, short, bouncer
- Angle of arrival: incoming or going away, with magnitude
Tracking the ball is the hardest of these. It is a small object moving fast in a frame where the model also has to keep track of bowler and batsman. The pipeline uses a ball-specific detector plus trajectory smoothing across frames. The trajectory feeds a small physics model that estimates speed and predicted bounce, which together give us length classification more reliably than visual length estimation alone.
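As a rough illustration of how frame timing and a predicted bounce point could turn into speed and length features, here is a minimal sketch. The frame rate, the length thresholds, and every name in it (TrackPoint, estimate_speed_kmh, classify_length) are assumptions made for the example, not the pipeline's actual values or code.

```python
from dataclasses import dataclass

# Illustrative assumptions only: the real frame rate and length bands are not given in the text.
FPS = 120.0

# Upper bound on bounce distance from the batsman's stumps (metres) for each length label.
LENGTH_BANDS = [
    (2.0, "yorker"),
    (4.0, "full"),
    (6.0, "good"),
    (8.0, "short"),
]


@dataclass(frozen=True)
class TrackPoint:
    frame: int   # frame index from the ball detector
    x_m: float   # smoothed position along the pitch, in metres


def estimate_speed_kmh(track, fps=FPS):
    """Average speed between the first and last tracked points."""
    if len(track) < 2:
        raise ValueError("need at least two tracked points")
    first, last = track[0], track[-1]
    elapsed_s = (last.frame - first.frame) / fps
    if elapsed_s <= 0:
        raise ValueError("track must span a positive time interval")
    distance_m = abs(last.x_m - first.x_m)
    return distance_m / elapsed_s * 3.6  # m/s to km/h


def classify_length(bounce_from_stumps_m):
    """Map a predicted bounce distance to a length label."""
    for upper_m, label in LENGTH_BANDS:
        if bounce_from_stumps_m < upper_m:
            return label
    return "bouncer"
```

A ball tracked over 18 metres in 60 frames at the assumed 120 frames per second works out to roughly 130 km/h. Classifying length from the predicted bounce point, rather than from how the delivery looks in a single frame, is the design choice the paragraph above describes.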
§03What the model needs to know about the batsman
Pose estimation and bat tracking handle the human side. The pipeline tracks the batsman across four phases.
Each phase produces a pose vector. The shot classifier sees the full sequence and labels what shot was actually played: forward defensive, cover drive, pull, cut, hook, leave, anything else in the taxonomy.
§04The shot taxonomy and training data
The taxonomy has twenty-four shot classes covering the realistic strokes, plus a “no shot” leave class. Training data came from two sources. A large set of broadcast footage, with public commentary providing weak labels, built the base classifier. A smaller set of academy footage, hand-labelled by the partner's coaching staff, fine-tuned the model to the partner's specific coaching framework.
§05The decision
For every delivery the model produces two answers:
- The optimal shot, conditional on the delivery features
- The actual shot played, from the pose pipeline
If those match, the feedback focuses on execution quality. If they do not match, the feedback explains the gap.
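The branch between the two kinds of feedback can be sketched in a few lines. The Decision type and the mode names are hypothetical, chosen only to make the comparison concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    optimal_shot: str  # recommended shot, conditional on the delivery features
    actual_shot: str   # shot label from the pose pipeline

    @property
    def matched(self):
        return self.optimal_shot == self.actual_shot


def feedback_mode(decision):
    """Pick which kind of feedback to generate (mode names are illustrative)."""
    return "execution_quality" if decision.matched else "shot_selection_gap"
```

For example, feedback_mode(Decision("leave", "cover drive")) selects the gap explanation, while a matching pair selects execution feedback.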
§06The scoring
Execution quality is scored along four axes.
- Timing: when did the bat make contact relative to the optimal contact frame for that delivery
- Footwork: did the batsman get to the right position for the shot
- Weight transfer: did body weight move with the shot
- Balance: did the batsman recover or fall away
Each axis produces a 0 to 10 score. The aggregate goes into the session log alongside the shot label and the optimal-shot recommendation.
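One way to represent the four axis scores and their aggregate is sketched below. The text does not say how the aggregate is computed, so the unweighted mean here is an assumption, as are the class and field names.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ExecutionScores:
    timing: float
    footwork: float
    weight_transfer: float
    balance: float

    def __post_init__(self):
        # Each axis is scored on a 0 to 10 scale.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 10.0:
                raise ValueError(f"{f.name} must be between 0 and 10, got {value}")

    def aggregate(self):
        """Unweighted mean of the four axes (an assumed aggregation rule)."""
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)
```

With scores of 8, 6, 7 and 7, the aggregate under this assumption is 7.0. A weighted mean would be a small change if a coaching framework valued one axis over the others.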
§07The VLM layer
The earliest version of the system output a JSON object. Coaches did not engage with it: they give feedback to players in language, not in numbers. So we added a vision-language model layer on top of the classification pipeline.
The VLM ingests the structured output plus the original video, and produces a short coaching paragraph in natural language. The prompt is heavily templated; the language we want is direct, technical, in the partner academy's voice, and never longer than three sentences. The output reads more like a coach's note than a system message.
“Good length, off stump, going away at pace. The shot you wanted was a leave or a defensive prod with soft hands. You played a drive and the bat came down forty milliseconds late, the edge would have carried.”
Coaches accepted this output where they had rejected the JSON.
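A minimal sketch of what templating the prompt and enforcing the three-sentence limit might look like follows. The template wording, the function names, and the naive sentence splitter are all assumptions for illustration; the actual prompt and the call to the VLM are not shown in this write-up.

```python
import json
import re

# Hypothetical template; the production prompt is not reproduced here.
PROMPT_TEMPLATE = (
    "You are a batting coach. Using the video and the analysis below, "
    "write direct, technical feedback to the batsman in at most "
    "{max_sentences} sentences.\n\nAnalysis:\n{analysis}"
)


def build_prompt(analysis, max_sentences=3):
    """Render the structured pipeline output into the text part of the prompt."""
    return PROMPT_TEMPLATE.format(
        max_sentences=max_sentences,
        analysis=json.dumps(analysis, indent=2, sort_keys=True),
    )


def count_sentences(text):
    """Naive count: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])


def within_length_limit(text, max_sentences=3):
    """Check a generated note against the sentence limit before showing it."""
    return count_sentences(text) <= max_sentences
```

Run against the sample note above, count_sentences returns 3, so the note passes the limit. A splitter this simple would miscount text containing abbreviations or decimals, so a real check would need something more careful.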
§08Latency
The whole pipeline runs within ten seconds of the ball. The constraint is operational: a batsman should see feedback before the next ball is bowled, otherwise the loop is broken. Ten seconds is the budget we settled on with the partner after iterating from a longer initial spec. The pipeline meets it on standard academy hardware.
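To keep an eye on a budget like this, per-stage timing can be collected with a small helper. This is a sketch under assumptions: the StageTimer class and the stage names in the usage note are hypothetical, and only the ten-second figure comes from the text.

```python
import time
from contextlib import contextmanager

BUDGET_S = 10.0  # the ten-second budget described above


class StageTimer:
    """Records wall-clock time per pipeline stage and checks the total."""

    def __init__(self, budget_s=BUDGET_S):
        self.budget_s = budget_s
        self.durations = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raises.
            self.durations[name] = time.perf_counter() - start

    @property
    def total_s(self):
        return sum(self.durations.values())

    @property
    def within_budget(self):
        return self.total_s <= self.budget_s
```

Usage would wrap each step, for example `with timer.stage("ball_tracking"): ...`, then log timer.durations and timer.within_budget after each delivery to see which stage consumes the budget.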
§09What did not work
The early prototype tried to compute optimal-shot recommendations from first principles using a rules engine. The rules engine could not handle the conditional structure of cricket coaching, where the optimal shot depends on the batsman's strengths as much as the delivery features. We replaced it with a learned model trained on labels from the partner's coaching staff. The learned model captures the partner's framework directly, which is what they wanted.
The early pose pipeline also lost track of the bat through the shot's follow-through. We fixed that by tracking the bat as a separate object rather than relying on the wrist-joint trajectory. Bat-as-object is now standard in the pipeline.
§10The handoff and maintenance
The system is fully handed off to the academy. We do quarterly maintenance: model retraining on new session footage, evaluation against a held-out set, and any pipeline updates needed for new hardware. The partner runs the day-to-day.
§11Where else this pattern works
The pattern is an expert eye at scale for a coached physical skill, with three preconditions:
- The skill has a clear taxonomy of correct moves
- An expert can grade execution by watching a single repetition
- The repetitions are dense enough that human attention is the bottleneck
That describes more than cricket. Golf swing analysis. Tennis stroke production. Gymnastics routine grading. Martial arts form correction. Physiotherapy gait analysis. The hard work is in the taxonomy and the training data per domain. The pipeline architecture transfers.
We are open to building the next one.