← Back to projects

Ish Boxing Mobile Game

Based on the viral shadow boxing game, played over WebRTC peer-to-peer connections.

Code

Demo

Overview


Growing up, on a campout with some friends, we started playing a fun shadow boxing game. It involves swiping or throwing punches in certain directions, trying to get the other person to look in the same direction you threw your punch. As fun as the game is, there’s never a clear point at which someone wins. You just take turns, quit whenever you feel like it, and declare whoever felt like the more dominant player the winner.


I forgot about the game until recently when I saw it pop up on TikTok. I was reminded of how fun it was to play with friends, and I figured that if I built a mobile app version of the game, I could solve the issue of keeping track of scoring and who wins.


You may be wondering where the name “ish” comes from. The same way tennis players grunt during play, players of this shadow boxing game typically make an “ish” sound with each punch they throw. In any case, it was a really fun project to build, and the details are shared below.

Tech Stack


Front End
  • Swift, SwiftUI
Backend
  • Supabase, Supabase Edge Functions
Other
  • WebRTC, Roboflow

Videos


Game Demo

Pose Detection

Implementation


The main engineering challenge was minimizing latency at every step of gameplay. From player communication to punch dodge detection, everything needed to be as fast as possible for smooth gameplay.


WebRTC was the natural choice for real-time communication. This protocol enables peer-to-peer communication between devices without requiring a server to route traffic. While WebRTC is ideal for video calls, implementation requires careful attention as it leaves many details to the developer.


For pose detection, I trained a YOLO pose-estimation model. I compiled a dataset of approximately 300 images and used Roboflow’s platform for labeling. The model was trained locally on my MacBook M1 Pro using Roboflow’s open-source tools.


For the backend glue, Supabase came to the rescue. If you haven’t used Supabase yet, I highly recommend it.


WebRTC

This article on WebRTC was an excellent resource for understanding implementation requirements and the protocol’s inner workings. WebRTC builds upon several established technologies and protocols, each worth studying individually for a deeper understanding.


Most of the development work goes into deciding the details of the signaling stage; after the initial connection is established, the WebRTC library handles almost everything. Many solutions use WebSockets for signaling, but I decided to use Supabase’s real-time broadcast feature (which, technically, also uses WebSockets under the hood). The feature creates a channel across which Supabase clients can send messages.

Broadcasting Diagram

Over this signaling channel, the WebRTC clients exchange SDP offers/answers and ICE candidates (candidate network paths, including IP address and port, for the direct connection).
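As a rough sketch of what travels over the broadcast channel, each signaling message can be a small JSON envelope. This is illustrative only; the field names and message kinds here are assumptions, not the app’s actual wire format (which lives in the SignalClient):

```python
import json

# Hypothetical signaling envelope: "type"/"payload" field names are
# assumptions for illustration, not the app's real message schema.
def make_signal(kind: str, payload: dict) -> str:
    """Serialize a signaling message (offer, answer, or ICE candidate)."""
    assert kind in ("offer", "answer", "candidate")
    return json.dumps({"type": kind, "payload": payload})

def parse_signal(raw: str) -> tuple[str, dict]:
    """Decode a signaling message received over the broadcast channel."""
    msg = json.loads(raw)
    return msg["type"], msg["payload"]

# Round-trip an SDP offer through the envelope
offer = make_signal("offer", {"sdp": "v=0 ..."})
kind, payload = parse_signal(offer)
```

Both peers run the same encode/decode path; only who sends the offer versus the answer differs.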


The relevant code is in the MatchView, SignalClient, MatchViewModel, and WebRTCClient files.


Pose Detection


Roboflow provides excellent tools for computer vision features, covering the entire ML pipeline from labeling to training and deployment. Here’s the implementation process:


  1. Create a key-point detection project in Roboflow
  2. Create a new class and key-point skeleton under the “Classes & Tags” tab
  3. Upload your dataset under the “Upload Dataset” tab. For my dataset, I recorded videos of head movements in different directions and used ffmpeg to split them into frames: ffmpeg -i IMG_0697.MOV frames2/frame_%04d.png
  4. Label all dataset items under the “Annotate” tab
  5. Train the model locally (Roboflow’s servers require a paid plan for weight downloads)
import os
from pathlib import Path

from roboflow import Roboflow
from ultralytics import YOLO

ROBOFLOW_API_KEY = os.getenv("ROBOFLOW_API_KEY")
ROBOFLOW_MODEL_ID = os.getenv("ROBOFLOW_MODEL_ID")
ROBOFLOW_MODEL_VERSION = os.getenv("ROBOFLOW_MODEL_VERSION")
ROBOFLOW_WORKSPACE = "north-of-60-labs-xxxx"
ROBOFLOW_PROJECT_ID = os.getenv("ROBOFLOW_PROJECT_ID")

# Download the labeled dataset from Roboflow in YOLOv8 format
rf = Roboflow(api_key=ROBOFLOW_API_KEY)
project = rf.workspace(ROBOFLOW_WORKSPACE).project(ROBOFLOW_PROJECT_ID)
dataset = project.version(ROBOFLOW_MODEL_VERSION).download("yolov8")

# Train the pose model on the downloaded dataset
model = YOLO('runs/pose/train/weights/best.pt')
data = Path() / "Ish-1" / "data.yaml"
results = model.train(data=data, epochs=50, imgsz=640)

## Export the trained model to CoreML for on-device inference
model = YOLO('runs/pose/train/weights/best.pt')
model.export(format="coreml")
  6. Run inference in the app. The code below is from the HeadPoseDetectionService. The HeadPoseDetectionRenderer is added to the WebRTCClient’s local renderers to process the stream.
func detectHeadPose(in image: UIImage) async throws -> HeadPoseObservation {
    guard let visionModel = visionModel else {
        throw RoboflowError.modelNotLoaded
    }

    // Avoid force-unwrapping: a UIImage is not guaranteed to have a CGImage backing
    guard let cgImage = image.cgImage else {
        throw RoboflowError.invalidResponseFormat
    }

    let requestHandler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    let request = VNCoreMLRequest(model: visionModel)

    do {
        try requestHandler.perform([request])
    } catch {
        throw RoboflowError.invalidResponseFormat
    }

    guard let firstResult = request.results?.first as? VNCoreMLFeatureValueObservation else {
        throw RoboflowError.invalidResponseFormat
    }

    return processCoreMLFeatureValue(firstResult)
}
private func processCoreMLFeatureValue(_ observation: VNCoreMLFeatureValueObservation)
        -> HeadPoseObservation
    {
        // The observation result is shaped 1 x 23 x 8400
        // 6 keypoints x 3 values (x, y, confidence) = 18, plus 5 for the bounding box (x, y, width, height, confidence) = 23
        // Extract keypoints from the observation to the proper shape

        let featureValue = observation.featureValue
        guard let multiArray = featureValue.multiArrayValue else {
            return HeadPoseObservation(keypoints: [], confidence: 0.0, boundingBox: .zero)
        }

        let shape = multiArray.shape.map { $0.intValue }  // [1, 23, 8400]
        let strides = multiArray.strides.map { $0.intValue }  // [193200, 8400, 1]
        let pointer = UnsafeMutablePointer<Float32>(OpaquePointer(multiArray.dataPointer))

        let anchorCount = shape[2]  // 8400
        let keypointNames = ["eye-1", "eye-2", "forehead", "mouth-center", "mouth-1", "mouth-2"]

        func sigmoid(_ x: Float32) -> Float32 {
            return 1.0 / (1.0 + exp(-x))
        }

        var bestConfidence: Float = 0
        var bestAnchorIndex: Int? = nil

        // Find the anchor with the highest objectness confidence
        for anchor in 0..<anchorCount {
            let confidenceIndex = 4 * strides[1] + anchor * strides[2]
            let rawConf = pointer[confidenceIndex]
            let conf = sigmoid(rawConf)

            if conf > bestConfidence {
                bestConfidence = conf
                bestAnchorIndex = anchor
            }
        }

        // If no good anchor found, return empty
        guard let bestAnchor = bestAnchorIndex else {
            return HeadPoseObservation(keypoints: [], confidence: 0.0, boundingBox: .zero)
        }

        // Extract bounding box
        let xCenter = pointer[0 * strides[1] + bestAnchor * strides[2]]
        let yCenter = pointer[1 * strides[1] + bestAnchor * strides[2]]
        let width = pointer[2 * strides[1] + bestAnchor * strides[2]]
        let height = pointer[3 * strides[1] + bestAnchor * strides[2]]

        let boundingBox = CGRect(
            x: CGFloat(xCenter - width / 2),
            y: CGFloat(yCenter - height / 2),
            width: CGFloat(width),
            height: CGFloat(height)
        )

        // Extract 6 keypoints
        var keypoints: [Keypoint] = []

        for i in 0..<6 {
            let x = pointer[(5 + i * 3) * strides[1] + bestAnchor * strides[2]]
            let y = pointer[(6 + i * 3) * strides[1] + bestAnchor * strides[2]]
            let conf = pointer[(7 + i * 3) * strides[1] + bestAnchor * strides[2]]
            keypoints.append(
                Keypoint(
                    name: keypointNames[i],
                    x: CGFloat(x),
                    y: CGFloat(y),
                    confidence: conf
                ))
        }

        return HeadPoseObservation(
            keypoints: keypoints,
            confidence: bestConfidence,
            boundingBox: boundingBox
        )
    }
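The decoding logic above, stripped to its essentials, can be sketched in Python. This is an illustrative re-implementation, not the app’s code, and it assumes the prediction array is laid out as [23][anchors] with the bounding box in rows 0–4 and keypoint triples from row 5 onward:

```python
import math

def decode_pose(preds, num_keypoints=6):
    """Pick the anchor with the highest objectness and decode its box/keypoints.

    `preds` is a [rows][anchors] array: rows 0-3 hold box center/size, row 4
    holds raw objectness, and rows 5.. hold (x, y, confidence) per keypoint.
    """
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    num_anchors = len(preds[0])
    # Find the anchor with the highest objectness confidence
    best = max(range(num_anchors), key=lambda a: sigmoid(preds[4][a]))
    confidence = sigmoid(preds[4][best])

    # Convert center/size to a top-left-origin bounding box
    box = (preds[0][best] - preds[2][best] / 2,
           preds[1][best] - preds[3][best] / 2,
           preds[2][best], preds[3][best])

    # Extract the (x, y, confidence) triple for each keypoint
    keypoints = [(preds[5 + 3 * i][best], preds[6 + 3 * i][best],
                  preds[7 + 3 * i][best]) for i in range(num_keypoints)]
    return confidence, box, keypoints

# Toy example: 2 anchors, 1 keypoint (8 rows); anchor 1 has higher objectness
toy = [[10, 20], [10, 20], [4, 4], [4, 4],
       [-1.0, 2.0],                 # raw objectness per anchor
       [1, 2], [1, 2], [0.9, 0.8]]  # keypoint x, y, confidence
conf, box, kps = decode_pose(toy, num_keypoints=1)
# box = (18.0, 18.0, 4, 4); one keypoint (2, 2, 0.8)
```

The Swift version does the same work with pointer arithmetic over the MLMultiArray strides instead of nested lists.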

Game Logic

How is a successful dodge determined? When a user swipes to throw a punch, their finger coordinates are recorded. After the swipe, a vector is calculated from the last half of the swipe. A similar vector is calculated from the head pose buffer. The dot product of these vectors determines if the user looked away from the punch. A dot product over 0.7 indicates a landed punch, while a value under 0.7 indicates a successful dodge.
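That check can be sketched as follows (illustrative Python, not the app’s Swift code; the vector names and normalization step are assumptions consistent with using a cosine-style 0.7 threshold):

```python
import math

def dodged(swipe, head, threshold=0.7):
    """Return True when the defender's head moved away from the punch.

    Both vectors are normalized first, so the dot product is the cosine of
    the angle between them: close to 1 means the defender looked the same
    way the punch was thrown (a landed punch).
    """
    def normalize(v):
        length = math.hypot(v[0], v[1])
        return (v[0] / length, v[1] / length)

    sx, sy = normalize(swipe)
    hx, hy = normalize(head)
    dot = sx * hx + sy * hy
    return dot <= threshold  # over 0.7 -> punch landed; otherwise a dodge

# Punch thrown right, head also moved right -> punch lands
hit = dodged((1.0, 0.0), (0.9, 0.1))    # False
# Punch thrown right, head moved up -> successful dodge
escape = dodged((1.0, 0.0), (0.0, -1.0))  # True
```

Raising the threshold makes dodging easier; lowering it demands a sharper reaction, which ties into the adjustable-sensitivity idea below.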

Future Exploration


There’s more work that could be done if I ever decide to get the app to a state where it could be published on the App Store. These are the improvements I would make:

  1. Multiple punch throwing modes. Currently, users only swipe to throw punches. The original game allows swiping, punching, or pointing in different directions. This could be implemented by training a new Roboflow model with a dataset of different actions.

  2. Rock-paper-scissors for turn order. The original game starts with rock-paper-scissors to determine who goes first. This could be implemented by training a Roboflow model to detect these gestures, with each round starting with a rock-paper-scissors match.

  3. Onboarding rules page. Since most people haven’t played the game, a clear rules explanation during onboarding would be essential before distribution.

  4. TURN server implementation. WebRTC isn’t perfect - some network conditions prevent peer-to-peer connections. In these cases, a TURN server would relay traffic between devices. The impact on latency and gameplay experience needs to be evaluated.

  5. Adjustable reaction sensitivity. The original game lacks an objective way to determine if a player reacted quickly enough to dodge. The app currently uses a fixed reaction time threshold. Making this adjustable would allow players to customize the game’s difficulty level.