AI Vision System
Real-time skeletal tracking powered by MediaPipe BlazePose. Track 33 body landmarks, validate exercise form, and count reps automatically, in real time at up to 30 FPS.

Source: Google Research — BlazePose
- 33 body landmarks: full skeletal tracking
- 22+ exercises: supported movements
- 30+ FPS: real-time processing
- <16 ms latency: frame processing time
State-machine-based rep detection that identifies exercise phases (up/down, contracted/extended) and counts reps with high accuracy.
Continuous analysis of joint angles and body positioning to ensure proper exercise form and prevent injury.
Visual feedback system that shows joint angles in real time with colour coding (green/yellow/red) for form quality.
Advanced signal processing to eliminate jitter while maintaining responsiveness for smooth landmark tracking.
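The colour-coded angle feedback described above can be sketched as a simple threshold mapping. The target angle and tolerance values here are hypothetical, not the app's actual tuning:

```python
def angle_quality(angle_deg: float, target: float, tolerance: float = 15.0) -> str:
    """Map a joint angle to a traffic-light quality colour.

    Illustrative thresholds: within `tolerance` degrees of the target
    is green, within twice the tolerance is yellow, otherwise red.
    """
    deviation = abs(angle_deg - target)
    if deviation <= tolerance:
        return "green"
    if deviation <= 2 * tolerance:
        return "yellow"
    return "red"
```

In practice each exercise would define its own target angle and tolerance per joint.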
Our pose detection system supports a wide range of exercises across different muscle groups and movement patterns.
A deep look at the ML model, inference pipeline, and signal processing that power Aerovit's real-time exercise tracking.
BlazePose is a lightweight, on-device pose estimation model developed by Google Research. It uses a two-step detector-tracker architecture: a fast person detector localises the body in the first frame, then a landmark regression network tracks 33 keypoints across subsequent frames without re-running detection — keeping latency under 16 ms on modern mobile GPUs.
The model outputs 3D coordinates (x, y, z) plus a per-landmark visibility score, enabling depth-aware angle calculations even from a single monocular camera. BlazePose Heavy (the variant Aerovit uses) maximises landmark accuracy at the cost of slightly higher compute, which is an acceptable trade-off on modern phones.
Because inference runs entirely on-device via GPU delegates (TFLite on Android, CoreML on iOS), no camera frames ever leave the user's phone — ensuring full privacy and zero-latency operation regardless of network conditions.
33 body landmarks per frame
Full-body skeleton including face, hands, and feet
3D coordinates (x, y, z) + visibility
Depth estimation from monocular camera input
Two-step detector → tracker pipeline
Detect once, track continuously for speed
BlazePose Heavy variant
Higher accuracy model optimised for fitness use
On-device GPU-accelerated inference
TFLite GPU delegate / CoreML — no cloud needed
Front & back camera support
Automatic coordinate mirroring for selfie mode
Frame Capture
Camera streams NV21/BGRA frames at 30 FPS via CameraX / AVFoundation
Pose Estimation
BlazePose Heavy extracts 33 landmarks with 3D coordinates per frame
EMA Smoothing
Exponential moving average (α = 0.45) stabilises landmark positions across frames
Angle Calculation
Joint angles computed via 3-point inverse tangent on key landmark triplets
State Machine
Finite state machine detects exercise phases (up ↔ down) and increments reps
UI Feedback
Skeleton overlay, colour-coded angles, audio cues, and rep counter update
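The angle-calculation and state-machine steps above can be sketched as follows. Landmarks are assumed to be (x, y) tuples, and the phase thresholds are illustrative, not the app's actual values:

```python
import math

def joint_angle(a, b, c):
    """Angle at vertex b (degrees) formed by points a-b-c,
    computed with atan2 on the two limb vectors."""
    ang = math.degrees(
        math.atan2(c[1] - b[1], c[0] - b[0])
        - math.atan2(a[1] - b[1], a[0] - b[0])
    )
    ang = abs(ang)
    return 360.0 - ang if ang > 180.0 else ang

class RepCounter:
    """Two-state machine: one rep is counted per
    extended -> contracted -> extended cycle."""
    def __init__(self, down_thresh=90.0, up_thresh=160.0):
        self.down_thresh = down_thresh
        self.up_thresh = up_thresh
        self.state = "up"
        self.reps = 0

    def update(self, angle: float) -> int:
        if self.state == "up" and angle < self.down_thresh:
            self.state = "down"     # contracted phase reached
        elif self.state == "down" and angle > self.up_thresh:
            self.state = "up"       # back to extended: count the rep
            self.reps += 1
        return self.reps
```

The hysteresis gap between the two thresholds prevents a single noisy angle reading near the boundary from toggling the state and double-counting a rep.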
Raw ML Kit landmarks jitter frame-to-frame. We apply an exponential moving average (α = 0.45) independently to each landmark's x and y coordinates, producing a visually smooth skeleton while keeping responsiveness high enough for fast exercises.
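A minimal sketch of this per-landmark EMA, assuming landmarks arrive each frame as a list of (x, y) tuples:

```python
class LandmarkSmoother:
    """Per-landmark exponential moving average.

    alpha = 0.45 matches the coefficient described above: a higher
    alpha tracks fast motion more closely, a lower one smooths more.
    """
    def __init__(self, alpha: float = 0.45):
        self.alpha = alpha
        self._prev = None  # last smoothed frame

    def smooth(self, landmarks):
        if self._prev is None:
            self._prev = list(landmarks)  # first frame passes through
        else:
            self._prev = [
                (self.alpha * x + (1 - self.alpha) * px,
                 self.alpha * y + (1 - self.alpha) * py)
                for (x, y), (px, py) in zip(landmarks, self._prev)
            ]
        return self._prev
```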
The camera streams at 30 FPS, but running BlazePose Heavy on every frame is unnecessary. A frame throttle limits inference to ~15 FPS, halving GPU load with no perceptible difference in tracking quality — extending battery life during long workout sessions.
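The throttle amounts to a minimum-interval check on frame timestamps; a sketch, with the target rate as a parameter:

```python
class FrameThrottle:
    """Drops frames so inference runs at ~target_fps even though the
    camera delivers them faster (e.g. 30 FPS in, ~15 FPS processed)."""
    def __init__(self, target_fps: float = 15.0):
        self.min_interval = 1.0 / target_fps
        self._last = -float("inf")

    def should_process(self, timestamp_s: float) -> bool:
        if timestamp_s - self._last >= self.min_interval:
            self._last = timestamp_s
            return True
        return False  # skip this frame; reuse the last pose result
```

Skipped frames still render the previous skeleton overlay, so the user sees a continuous 30 FPS preview while inference runs at half that rate.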
ML Kit returns landmarks in the raw camera sensor coordinate space. We compute the correct scale factor and crop offset for the BoxFit.cover display mode, accounting for sensor rotation and front-camera mirroring, so the skeleton aligns pixel-perfectly with the user's body on screen.
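The cover-fit mapping can be sketched as a uniform scale plus a centre-crop offset. This simplified helper assumes normalised landmark coordinates, applies optional selfie mirroring, and omits the sensor-rotation handling for brevity:

```python
def map_landmark_to_screen(nx, ny, img_w, img_h, view_w, view_h, mirror=False):
    """Map a normalised landmark (nx, ny in [0, 1], image space) to view
    pixels under cover-fit: scale uniformly so the image fills the view,
    then centre-crop the overflow. Illustrative, not the app's code."""
    scale = max(view_w / img_w, view_h / img_h)  # cover: fill both axes
    offset_x = (img_w * scale - view_w) / 2.0    # overflow split evenly
    offset_y = (img_h * scale - view_h) / 2.0
    x = nx * img_w * scale - offset_x
    y = ny * img_h * scale - offset_y
    if mirror:                                   # front-camera selfie mode
        x = view_w - x
    return x, y
```

For example, with a 100x100 image shown in a 100x200 view, the image is scaled 2x and 50 px is cropped from each horizontal side, so the image centre still lands at the view centre.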
Fully Implemented
This feature is live and functional in the app