AI Object Detection API on Mobile: A Practical, End-to-End Tutorial

Build an Android and iOS app that streams camera frames to an AI object detection API, draws real-time boxes, and ships with production-ready patterns.

ASOasis

Overview

Object detection brings your camera feed to life—highlighting people, products, and scenes with labeled boxes. In this hands‑on tutorial, you’ll build a mobile app that streams frames to a hosted AI object detection API, renders bounding boxes in real time, and ships with production‑grade patterns for performance, privacy, and reliability.

We’ll cover:

  • How detection APIs work (request/response contracts)
  • Android (Kotlin + CameraX + OkHttp) implementation
  • iOS (Swift + AVFoundation + URLSession) implementation
  • Drawing overlays, throttling frames, and reducing latency
  • Testing, evaluation, and deployment tips

All code targets a generic HTTPS API so you can adapt to any provider.

Prerequisites

  • API key for an object detection service (e.g., your team’s inference endpoint).
  • Basic Android Studio or Xcode setup.
  • A modern device (Android 8+ or iOS 15+) with a rear camera. The Swift code below uses the async/await URLSession API, which requires iOS 15.

What the API Looks Like

We’ll assume a simple REST endpoint:

  • Method: POST /v1/detect
  • Auth: Authorization: Bearer YOUR_API_KEY
  • Body: multipart/form-data with image (JPEG/PNG) or base64 JSON
  • Response: JSON with normalized boxes in [0,1] coordinates

Example request:

curl -X POST https://api.example.com/v1/detect \
  -H "Authorization: Bearer $API_KEY" \
  -F "image=@frame.jpg" \
  -F "threshold=0.35"

Example response:

{
  "model":"yolovX-640",
  "time_ms": 42,
  "objects":[
    {"label":"person","confidence":0.94,"box":{"x":0.12,"y":0.18,"w":0.33,"h":0.64}},
    {"label":"bicycle","confidence":0.87,"box":{"x":0.51,"y":0.35,"w":0.42,"h":0.36}}
  ]
}

Notes:

  • x,y give the top‑left corner; w,h are width and height. x and w are normalized by the input image width, y and h by its height.
  • Providers may return absolute pixels; convert as needed.
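
If your provider returns absolute pixels, a conversion along these lines brings boxes into the normalized form used in the rest of this tutorial. This is a minimal sketch: the `PixelBox` and `NormBox` types and the `toNormalized` name are illustrative assumptions, not part of any provider's SDK.

```kotlin
// Hypothetical types for illustration; adapt the fields to your provider's schema.
data class PixelBox(val left: Int, val top: Int, val width: Int, val height: Int)
data class NormBox(val x: Float, val y: Float, val w: Float, val h: Float)

// Divide by the dimensions of the image that was actually uploaded
// (after any downscaling), then clamp so the box stays inside [0, 1].
fun PixelBox.toNormalized(imageWidth: Int, imageHeight: Int): NormBox {
    require(imageWidth > 0 && imageHeight > 0) { "image dimensions must be positive" }
    val x = (left.toFloat() / imageWidth).coerceIn(0f, 1f)
    val y = (top.toFloat() / imageHeight).coerceIn(0f, 1f)
    val w = (width.toFloat() / imageWidth).coerceIn(0f, 1f - x)
    val h = (height.toFloat() / imageHeight).coerceIn(0f, 1f - y)
    return NormBox(x, y, w, h)
}
```

For example, `PixelBox(64, 48, 320, 240).toNormalized(640, 480)` describes a box that starts 10% in from the top-left corner and covers half of each dimension.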

Architecture at a Glance

  • Camera pipeline delivers frames (YUV_420_888 ImageProxy on Android with CameraX, CMSampleBuffer on iOS).
  • We downscale and JPEG‑encode a frame periodically (e.g., every 200–300 ms).
  • Send frame to API with threshold & optional categories filter.
  • Parse JSON, map normalized boxes to the displayed preview size.
  • Draw overlays on a transparent view above the preview.
  • Throttle requests and allow at most one in‑flight call to avoid overload.
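
The throttling rule in the last bullet can be sketched as a small gate. `FrameGate` is a hypothetical helper, not something the platform code below requires; it assumes frames arrive from a single producer thread while `release()` may be called from a network callback thread.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical helper: admits a frame only if no call is in flight and at
// least minIntervalMs has elapsed since the last admitted frame.
class FrameGate(private val minIntervalMs: Long) {
    private val inFlight = AtomicBoolean(false)
    @Volatile private var lastAdmittedAt: Long? = null

    // Call from the camera/analyzer thread with a monotonic timestamp.
    fun tryAcquire(nowMs: Long): Boolean {
        val last = lastAdmittedAt
        if (last != null && nowMs - last < minIntervalMs) return false // too soon
        if (!inFlight.compareAndSet(false, true)) return false         // call pending
        lastAdmittedAt = nowMs
        return true
    }

    // Call when the request finishes, whether it succeeded or failed.
    fun release() = inFlight.set(false)
}
```

With `FrameGate(250)`, a frame at 1000 ms is admitted, a frame at 1100 ms is dropped even after `release()`, and a frame at 1300 ms is admitted again. Every frame that fails `tryAcquire` should be closed or discarded immediately.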

Security & Privacy Essentials

  • Never hardcode API keys in source control. Use secure keystores (Android) or Keychain (iOS) and remote config.
  • Always use HTTPS, preferably over HTTP/2 or HTTP/3; pin TLS certificates if policy requires.
  • Minimize PII in frames. Consider on‑device blurring of faces/license plates if policy mandates.
  • Offer an opt‑in toggle and explain data use in your privacy notice.

Android Implementation (Kotlin + CameraX)

1) Dependencies

Add CameraX and OkHttp (or Retrofit) in app/build.gradle:

implementation "androidx.camera:camera-camera2:1.3.3"
implementation "androidx.camera:camera-lifecycle:1.3.3"
implementation "androidx.camera:camera-view:1.3.3"
implementation "com.squareup.okhttp3:okhttp:4.12.0"

2) Permissions

Request CAMERA at runtime (Android 6.0+). In AndroidManifest.xml:

<uses-permission android:name="android.permission.CAMERA" />

3) Layout

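<!-- activity_main.xml -->
<!-- A layout needs a single root; FrameLayout stacks the overlay on top of the preview. -->
<FrameLayout xmlns:android="http://schemas.android.com/apk/res/android"
    android:layout_width="match_parent"
    android:layout_height="match_parent">

    <androidx.camera.view.PreviewView
        android:id="@+id/previewView"
        android:layout_width="match_parent"
        android:layout_height="match_parent" />

    <com.example.vision.OverlayView
        android:id="@+id/overlay"
        android:layout_width="match_parent"
        android:layout_height="match_parent" />

</FrameLayout>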

4) Start CameraX

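import android.Manifest
import android.content.pm.PackageManager
import android.os.Bundle
import androidx.appcompat.app.AppCompatActivity
import androidx.camera.core.CameraSelector
import androidx.camera.core.ImageAnalysis
import androidx.camera.core.Preview
import androidx.camera.lifecycle.ProcessCameraProvider
import androidx.camera.view.PreviewView
import androidx.core.content.ContextCompat
import java.util.concurrent.Executors

class MainActivity : AppCompatActivity() {
  private lateinit var previewView: PreviewView
  private lateinit var overlay: OverlayView
  private val analysisExecutor = Executors.newSingleThreadExecutor()
  private var lastSentAt = 0L            // touched only on analysisExecutor
  @Volatile private var sending = false  // written from analyzer and network threads

  override fun onCreate(savedInstanceState: Bundle?) {
    super.onCreate(savedInstanceState)
    setContentView(R.layout.activity_main)
    previewView = findViewById(R.id.previewView)
    overlay = findViewById(R.id.overlay)

    if (ContextCompat.checkSelfPermission(this, Manifest.permission.CAMERA) == PackageManager.PERMISSION_GRANTED) {
      startCamera()
    } else {
      requestPermissions(arrayOf(Manifest.permission.CAMERA), REQUEST_CAMERA)
    }
  }

  // Without this callback the camera never starts after the user grants access.
  override fun onRequestPermissionsResult(requestCode: Int, permissions: Array<out String>, grantResults: IntArray) {
    super.onRequestPermissionsResult(requestCode, permissions, grantResults)
    if (requestCode == REQUEST_CAMERA && grantResults.firstOrNull() == PackageManager.PERMISSION_GRANTED) {
      startCamera()
    }
  }

  override fun onDestroy() {
    super.onDestroy()
    analysisExecutor.shutdown()
  }

  private fun startCamera() {
    val cameraProviderFuture = ProcessCameraProvider.getInstance(this)
    cameraProviderFuture.addListener({
      val cameraProvider = cameraProviderFuture.get()
      val preview = Preview.Builder().build().also {
        it.setSurfaceProvider(previewView.surfaceProvider)
      }
      val analyzer = ImageAnalysis.Builder()
        .setBackpressureStrategy(ImageAnalysis.STRATEGY_KEEP_ONLY_LATEST)
        .build()

      analyzer.setAnalyzer(analysisExecutor) { imageProxy ->
        val now = System.currentTimeMillis()
        // Send at most one frame every 250 ms, and only when no call is in flight.
        if (!sending && now - lastSentAt > 250) {
          sending = true
          lastSentAt = now
          processFrame(imageProxy) // takes ownership of imageProxy and closes it
        } else {
          imageProxy.close()
        }
      }

      cameraProvider.unbindAll()
      cameraProvider.bindToLifecycle(this, CameraSelector.DEFAULT_BACK_CAMERA, preview, analyzer)
    }, ContextCompat.getMainExecutor(this))
  }

  companion object {
    private const val REQUEST_CAMERA = 100
  }
}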

5) Frame Encoding and Network Call

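// These members live inside MainActivity. secureApiKey() and parseDetections()
// are app-specific helpers that this tutorial does not define.
// Extension imports needed: okhttp3.MediaType.Companion.toMediaType and
// okhttp3.RequestBody.Companion.toRequestBody.
private val client = OkHttpClient.Builder()
  .callTimeout(Duration.ofSeconds(10)) // java.time.Duration requires API 26+
  .build()

private fun processFrame(imageProxy: ImageProxy) {
  // Encode, then release the camera buffer right away so CameraX is not
  // stalled for the duration of the network call.
  val jpgBytes = try {
    YuvToJpeg.encode(imageProxy, maxWidth = 640) // custom util
  } catch (e: Exception) {
    sending = false
    return
  } finally {
    imageProxy.close()
  }

  val reqBody = MultipartBody.Builder().setType(MultipartBody.FORM)
    .addFormDataPart("image", "frame.jpg",
      jpgBytes.toRequestBody("image/jpeg".toMediaType()))
    .addFormDataPart("threshold", "0.35")
    .build()

  val request = Request.Builder()
    .url("https://api.example.com/v1/detect")
    .header("Authorization", "Bearer ${secureApiKey()}")
    .post(reqBody)
    .build()

  client.newCall(request).enqueue(object : Callback {
    override fun onFailure(call: Call, e: IOException) {
      sending = false
    }
    override fun onResponse(call: Call, response: Response) {
      try {
        response.use {
          if (!it.isSuccessful) return // e.g. 429 or 5xx; keep the previous overlay
          val body = it.body?.string() ?: "{}"
          val result = parseDetections(body) // returns list of boxes
          runOnUiThread { overlay.updateDetections(result) }
        }
      } catch (e: Exception) {
        // Malformed body or read failure: log it and keep the previous overlay.
      } finally {
        sending = false // always reopen the gate, even on errors
      }
    }
  })
}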
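
The `parseDetections` helper is left undefined above. One possible shape, assuming the response format shown earlier and the `org.json` classes bundled with Android, is sketched below. The `Detection` class here mirrors the one defined in the overlay step so the snippet stands alone; omit the duplicate declaration if both live in the same module.

```kotlin
import org.json.JSONException
import org.json.JSONObject

// Mirrors the tutorial's Detection class; drop this line if it is already declared.
data class Detection(val label: String, val confidence: Float, val x: Float, val y: Float, val w: Float, val h: Float)

// Sketch: turns the JSON response body into a flat list of detections.
// Returns an empty list when the body is malformed or has no "objects" array.
fun parseDetections(body: String): List<Detection> {
    return try {
        val objects = JSONObject(body).optJSONArray("objects") ?: return emptyList()
        (0 until objects.length()).map { i ->
            val obj = objects.getJSONObject(i)
            val box = obj.getJSONObject("box")
            Detection(
                label = obj.getString("label"),
                confidence = obj.getDouble("confidence").toFloat(),
                x = box.getDouble("x").toFloat(),
                y = box.getDouble("y").toFloat(),
                w = box.getDouble("w").toFloat(),
                h = box.getDouble("h").toFloat()
            )
        }
    } catch (e: JSONException) {
        emptyList()
    }
}
```

Swallowing the exception keeps the camera loop running; in a real app you would probably also log the failure.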

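A simple YUV → JPEG helper (a sketch that relies on ImageProxy.toBitmap(), available in CameraX 1.3+):

import android.graphics.Bitmap
import android.graphics.Matrix
import androidx.camera.core.ImageProxy
import java.io.ByteArrayOutputStream

object YuvToJpeg {
  // Downscales the frame, rotates it upright, and JPEG-compresses it (80% by default).
  // maxWidth bounds the width of the sensor-oriented frame, before rotation.
  fun encode(image: ImageProxy, maxWidth: Int, quality: Int = 80): ByteArray {
    val src = image.toBitmap() // converts YUV_420_888 to an ARGB bitmap
    val scale = minOf(1f, maxWidth.toFloat() / src.width)
    val m = Matrix().apply { postScale(scale, scale); postRotate(image.imageInfo.rotationDegrees.toFloat()) }
    val upright = Bitmap.createBitmap(src, 0, 0, src.width, src.height, m, true)
    return ByteArrayOutputStream().use { upright.compress(Bitmap.CompressFormat.JPEG, quality, it); it.toByteArray() }
  }
}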

6) Drawing the Overlay

import android.content.Context
import android.graphics.Canvas
import android.graphics.Color
import android.graphics.Paint
import android.util.AttributeSet
import android.view.View
import kotlin.math.max

class OverlayView(context: Context, attrs: AttributeSet): View(context, attrs) {
  private val boxes = mutableListOf<Detection>()
  private val paint = Paint().apply {
    color = Color.GREEN; style = Paint.Style.STROKE; strokeWidth = 4f; isAntiAlias = true
  }
  private val textPaint = Paint().apply {
    color = Color.WHITE; textSize = 36f; isAntiAlias = true
  }

  // Call on the main thread; the list is read in onDraw.
  fun updateDetections(newBoxes: List<Detection>) {
    boxes.clear(); boxes.addAll(newBoxes); invalidate()
  }

  override fun onDraw(canvas: Canvas) {
    super.onDraw(canvas)
    for (d in boxes) {
      // Normalized [0,1] coordinates scaled to the view size.
      val left = d.x * width
      val top = d.y * height
      val right = left + d.w * width
      val bottom = top + d.h * height
      canvas.drawRect(left, top, right, bottom, paint)
      // drawText's y is the baseline, so keep it at least one text height from the top edge.
      val labelY = max(textPaint.textSize, top - 8f)
      canvas.drawText("${d.label} ${(d.confidence*100).toInt()}%", left, labelY, textPaint)
    }
  }
}

data class Detection(val label:String, val confidence:Float, val x:Float, val y:Float, val w:Float, val h:Float)

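Tip: Account for previewView’s scale type. PreviewView defaults to FILL_CENTER, which crops the frame to fill the view, so multiplying by the view’s width and height aligns boxes only when the uploaded image and the view share an aspect ratio. If the preview is cropped or letterboxed, compute the scale and offsets so boxes align.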
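
The scale-and-offset math can be sketched as a pure function. It assumes the uploaded image is upright and covers the same field of view as the preview; `ViewRect`, `ScaleMode`, and `mapToView` are illustrative names, not CameraX APIs.

```kotlin
import kotlin.math.max
import kotlin.math.min

// Hypothetical types for illustration only.
data class ViewRect(val left: Float, val top: Float, val right: Float, val bottom: Float)
enum class ScaleMode { FILL_CENTER, FIT_CENTER }

// Maps a normalized box (relative to the uploaded image) to view pixels.
fun mapToView(
    x: Float, y: Float, w: Float, h: Float,  // normalized box from the API
    imageW: Int, imageH: Int,                // size of the upright uploaded image
    viewW: Int, viewH: Int,                  // size of the overlay view
    mode: ScaleMode = ScaleMode.FILL_CENTER
): ViewRect {
    val sx = viewW.toFloat() / imageW
    val sy = viewH.toFloat() / imageH
    // Fill uses the larger factor (and crops); fit uses the smaller one (and letterboxes).
    val scale = if (mode == ScaleMode.FILL_CENTER) max(sx, sy) else min(sx, sy)
    // Centering offsets: negative when cropped, positive when letterboxed.
    val dx = (viewW - imageW * scale) / 2f
    val dy = (viewH - imageH * scale) / 2f
    val left = x * imageW * scale + dx
    val top = y * imageH * scale + dy
    return ViewRect(left, top, left + w * imageW * scale, top + h * imageH * scale)
}
```

For a 480×640 image shown in a 1080×2160 view with FILL_CENTER, the image is scaled by 3.375 and 270 px are cropped from each side, so boxes near the left and right image edges fall outside the view.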

iOS Implementation (Swift + AVFoundation)

1) Permissions & Setup

Add NSCameraUsageDescription to Info.plist. Create a capture session with AVCaptureVideoDataOutput.

import AVFoundation
import UIKit

// JPEGEncoder and secureApiKey() are app-specific helpers that this tutorial does not define.
final class CameraViewController: UIViewController, AVCaptureVideoDataOutputSampleBufferDelegate {
  private let session = AVCaptureSession()
  private let queue = DispatchQueue(label: "camera.queue")
  private var lastSent = Date(timeIntervalSince1970: 0)
  private var sending = false // read and written only on `queue`
  private let overlay = OverlayView()

  override func viewDidLoad() {
    super.viewDidLoad()
    setupCamera()
    view.addSubview(overlay)
    overlay.frame = view.bounds
    overlay.autoresizingMask = [.flexibleWidth, .flexibleHeight]
  }

  private func setupCamera() {
    session.beginConfiguration()
    session.sessionPreset = .high
    guard
      let device = AVCaptureDevice.default(.builtInWideAngleCamera, for: .video, position: .back),
      let input = try? AVCaptureDeviceInput(device: device),
      session.canAddInput(input)
    else { session.commitConfiguration(); return }
    session.addInput(input)

    let output = AVCaptureVideoDataOutput()
    output.setSampleBufferDelegate(self, queue: queue)
    output.alwaysDiscardsLateVideoFrames = true
    if session.canAddOutput(output) { session.addOutput(output) }
    session.commitConfiguration()

    let previewLayer = AVCaptureVideoPreviewLayer(session: session)
    previewLayer.videoGravity = .resizeAspectFill
    previewLayer.frame = view.bounds
    view.layer.insertSublayer(previewLayer, at: 0)

    // startRunning() blocks until the session starts, so keep it off the main thread.
    queue.async { [session] in session.startRunning() }
  }

  func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
    // Send at most one frame every 250 ms, and only when no call is in flight.
    guard !sending, Date().timeIntervalSince(lastSent) > 0.25 else { return }
    sending = true; lastSent = Date()
    guard let jpegData = JPEGEncoder.encode(sampleBuffer: sampleBuffer, maxWidth: 640) else { sending = false; return }
    Task { await callAPI(jpegData: jpegData) }
  }

  private func callAPI(jpegData: Data) async {
    // Reopen the gate on the camera queue so `sending` is never touched from two threads.
    defer { queue.async { self.sending = false } }

    var req = URLRequest(url: URL(string: "https://api.example.com/v1/detect")!)
    req.httpMethod = "POST"
    req.setValue("Bearer \(secureApiKey())", forHTTPHeaderField: "Authorization")

    let boundary = UUID().uuidString
    req.setValue("multipart/form-data; boundary=\(boundary)", forHTTPHeaderField: "Content-Type")

    var body = Data()
    body.append("--\(boundary)\r\n".data(using: .utf8)!)
    body.append("Content-Disposition: form-data; name=\"image\"; filename=\"frame.jpg\"\r\n".data(using: .utf8)!)
    body.append("Content-Type: image/jpeg\r\n\r\n".data(using: .utf8)!)
    body.append(jpegData)
    body.append("\r\n--\(boundary)\r\n".data(using: .utf8)!)
    body.append("Content-Disposition: form-data; name=\"threshold\"\r\n\r\n0.35\r\n".data(using: .utf8)!)
    body.append("--\(boundary)--\r\n".data(using: .utf8)!)

    req.httpBody = body

    do {
      let (data, response) = try await URLSession.shared.data(for: req)
      // Skip error responses (e.g. 429 or 5xx) instead of trying to decode them.
      guard let http = response as? HTTPURLResponse, (200..<300).contains(http.statusCode) else { return }
      let result = try JSONDecoder().decode(Detections.self, from: data)
      await MainActor.run { self.overlay.update(detections: result.objects) }
    } catch {
      // Handle error/log
    }
  }
}

struct Detections: Decodable { let objects: [Obj] }
struct Obj: Decodable { let label: String; let confidence: Double; let box: Box }
struct Box: Decodable { let x: Double; let y: Double; let w: Double; let h: Double }

A simple overlay view:

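import UIKit

final class OverlayView: UIView {
  private var objs: [Obj] = []

  override init(frame: CGRect) {
    super.init(frame: frame)
    // A view that overrides draw(_:) is opaque by default and can render a
    // black background that hides the preview, so make it transparent.
    backgroundColor = .clear
    isOpaque = false
    isUserInteractionEnabled = false
  }

  required init?(coder: NSCoder) { fatalError("init(coder:) is not supported") }

  func update(detections: [Obj]) { self.objs = detections; setNeedsDisplay() }

  override func draw(_ rect: CGRect) {
    guard let ctx = UIGraphicsGetCurrentContext() else { return }
    ctx.setLineWidth(3); UIColor.systemGreen.setStroke()
    for o in objs {
      // Scale by the full view bounds; `rect` may cover only the dirty region.
      let r = CGRect(x: o.box.x * bounds.width,
                     y: o.box.y * bounds.height,
                     width: o.box.w * bounds.width,
                     height: o.box.h * bounds.height)
      ctx.stroke(r)
      let text = "\(o.label) \(Int(o.confidence*100))%"
      text.draw(at: CGPoint(x: r.minX, y: max(0, r.minY - 14)), withAttributes:[.font:UIFont.systemFont(ofSize: 12), .foregroundColor: UIColor.white])
    }
  }
}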

Note: Align the preview gravity (.resizeAspectFill) with your normalization math; if you crop/letterbox, apply the same transform to boxes.

Cross‑Platform Options (Quick Glance)

  • React Native: Use react-native-vision-camera for frames; send blobs via fetch or axios with FormData; draw using SVG overlays.
  • Flutter: camera + http packages; draw using CustomPainter on a Stack.

Minimal React Native example for upload:

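const form = new FormData();
form.append('image', { uri, name: 'frame.jpg', type: 'image/jpeg' });
form.append('threshold', '0.35');
// Leave Content-Type unset so fetch can add the multipart boundary itself.
const res = await fetch('https://api.example.com/v1/detect', {
  method: 'POST',
  headers: { Authorization: `Bearer ${apiKey}` },
  body: form,
});
if (!res.ok) throw new Error(`Detection request failed: ${res.status}`);
const { objects } = await res.json(); // normalized boxes, as in the response format above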

Performance Playbook

  • Downscale frames: 480–640 px on the long side is often enough; preserves speed with minimal accuracy loss.
  • Throttle intelligently: Sample 3–5 fps for API calls while preview runs at 30 fps.
  • Compress at ~75–85% JPEG quality; measure the latency vs. size curve.
  • Reuse HTTP connections: Keep‑alive, HTTP/2, a single OkHttp/URLSession instance.
  • Queue control: Allow at most one in‑flight call; drop older frames.
  • ROI (Region of Interest): Crop center or last‑known object region to reduce bytes when appropriate.
  • Cache labels/colors by class for stable UI; avoid allocating in draw loops.
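
As a small illustration of the first bullet, the target size for a downscale that caps the long side can be computed like this. `fitLongSide` is a hypothetical helper name; the 640 px default simply follows the range suggested above.

```kotlin
import kotlin.math.roundToInt

// Returns the (width, height) to resize to so the long side is at most
// maxLongSide while the aspect ratio is preserved. Never upscales.
fun fitLongSide(width: Int, height: Int, maxLongSide: Int = 640): Pair<Int, Int> {
    require(width > 0 && height > 0 && maxLongSide > 0) { "sizes must be positive" }
    val longSide = maxOf(width, height)
    if (longSide <= maxLongSide) return width to height
    val scale = maxLongSide.toDouble() / longSide
    // Guard against a zero-sized short side on extreme aspect ratios.
    return maxOf(1, (width * scale).roundToInt()) to maxOf(1, (height * scale).roundToInt())
}
```

A 1920×1080 frame maps to 640×360, while a 480×640 frame is left unchanged. Because boxes come back normalized, the overlay math does not depend on which size was uploaded.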

Reliability & Error Handling

  • Timeouts: 8–12 s network timeout; back off with jitter on 429/5xx.
  • Model versions: Read the response’s model field and include it in logs to aid debugging.
  • Threshold tuning: Start at 0.35–0.5, then A/B for precision/recall needs.
  • Offline mode: Detect connectivity; pause uploads and show UI hint.
  • Observability: Log time_ms from responses; chart p50/p95 latency and success rates.
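
The back-off rule in the first bullet can be sketched with exponential growth and "full jitter" (a uniformly random delay up to the current ceiling). The function names and the 500 ms base and 8 s cap are assumptions chosen for illustration; tune them against your provider's rate limits.

```kotlin
import kotlin.random.Random

// Retry only on rate limiting and server-side errors.
fun shouldRetry(statusCode: Int): Boolean = statusCode == 429 || statusCode in 500..599

// Delay before retry number `attempt` (0-based): a random value in
// [0, min(capMs, baseMs * 2^attempt)], so clients do not retry in lockstep.
fun backoffDelayMs(
    attempt: Int,
    baseMs: Long = 500,
    capMs: Long = 8_000,
    random: Random = Random.Default
): Long {
    require(attempt >= 0 && baseMs > 0 && capMs > 0) { "arguments must be positive" }
    val exponential = baseMs shl minOf(attempt, 20) // cap the shift to avoid overflow
    val ceiling = minOf(capMs, exponential)
    return random.nextLong(ceiling + 1) // nextLong's upper bound is exclusive
}
```

For a live camera overlay it is usually better to skip the failed frame and apply the delay before sending the next one than to re-send a stale frame.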

Testing & Evaluation

  • Golden images: Keep a folder of labeled test frames and expected boxes; run a local script that computes IoU (intersection over union) between returned and expected boxes.
  • Lighting and motion: Test low light, backlight, and motion blur.
  • Edge cases: Tiny objects, occlusions, crowded scenes.
  • Metrics to watch:
    • Latency (camera → boxes on screen)
    • Detection quality (precision/recall against your ground truth)
    • Uptime (error rates and retries)
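
A golden-image check along these lines could compute IoU and derive precision and recall from it. This is a minimal sketch: the `Box` and `Labeled` types, the greedy matching strategy, and the 0.5 IoU threshold are assumptions, not a standard your provider necessarily uses.

```kotlin
// Hypothetical types; boxes use the same normalized x, y, w, h as the API response.
data class Box(val x: Float, val y: Float, val w: Float, val h: Float)
data class Labeled(val label: String, val box: Box)

// Intersection over union of two axis-aligned boxes; 0 when they do not overlap.
fun iou(a: Box, b: Box): Float {
    val interW = maxOf(0f, minOf(a.x + a.w, b.x + b.w) - maxOf(a.x, b.x))
    val interH = maxOf(0f, minOf(a.y + a.h, b.y + b.h) - maxOf(a.y, b.y))
    val inter = interW * interH
    val union = a.w * a.h + b.w * b.h - inter
    return if (union > 0f) inter / union else 0f
}

// Greedy matching: each expected box can be claimed by at most one prediction
// with the same label and IoU >= minIou. Returns (precision, recall).
fun precisionRecall(
    expected: List<Labeled>,
    predicted: List<Labeled>,
    minIou: Float = 0.5f
): Pair<Float, Float> {
    val unmatched = expected.toMutableList()
    var truePositives = 0
    for (p in predicted) {
        val best = unmatched.filter { it.label == p.label }.maxByOrNull { iou(it.box, p.box) }
        if (best != null && iou(best.box, p.box) >= minIou) {
            unmatched.remove(best)
            truePositives++
        }
    }
    val precision = if (predicted.isEmpty()) 1f else truePositives.toFloat() / predicted.size
    val recall = if (expected.isEmpty()) 1f else truePositives.toFloat() / expected.size
    return precision to recall
}
```

Running this over the golden set after each model or threshold change gives a quick regression signal before you look at on-device behavior.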

Production Checklist

  • Secure key storage (Android Keystore, iOS Keychain); rotate keys.
  • Privacy notice and opt‑in for data upload; provide a clear toggle.
  • Rate limits respected; exponential backoff on 429.
  • Graceful degradation when API unavailable; UI state synced.
  • Analytics on confidence thresholds and user engagement.
  • Battery impact audit: throttle when device is hot or on low power mode.

Troubleshooting

  • Boxes misaligned: Check preview aspect transform; apply the same scale/crop to response boxes.
  • High latency: Downscale more aggressively; ensure HTTP/2; warm up DNS (preconnect) on app launch.
  • 415/400 errors: Verify multipart boundaries, field names, and MIME types.
  • Dim or blurry frames: Increase exposure or enable video stabilization; don’t over‑compress the JPEG.
  • Flickering labels: Apply temporal smoothing (e.g., EMA) over a small window of frames.
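
The temporal smoothing in the last bullet can be sketched as an exponential moving average over box coordinates. `BoxSmoother` and `SmoothBox` are hypothetical names, and the sketch assumes you keep one smoother per on-screen object (for example keyed by label or by a track ID, if your provider returns one).

```kotlin
// Hypothetical box type with the same normalized fields as the API response.
data class SmoothBox(val x: Float, val y: Float, val w: Float, val h: Float)

// Exponential moving average: alpha near 1 follows new detections quickly,
// alpha near 0 smooths more but lags behind fast motion.
class BoxSmoother(private val alpha: Float = 0.5f) {
    private var state: SmoothBox? = null

    init {
        require(alpha > 0f && alpha <= 1f) { "alpha must be in (0, 1]" }
    }

    fun update(observed: SmoothBox): SmoothBox {
        val prev = state
        val next = if (prev == null) observed else SmoothBox(
            blend(prev.x, observed.x),
            blend(prev.y, observed.y),
            blend(prev.w, observed.w),
            blend(prev.h, observed.h)
        )
        state = next
        return next
    }

    // Call when the object disappears so a later detection does not glide in from a stale position.
    fun reset() { state = null }

    private fun blend(old: Float, new: Float): Float = old + alpha * (new - old)
}
```

The same blend can be applied to confidence values so a label does not blink on and off when a score hovers around the threshold.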

Next Steps

  • Add class filters (only detect people/vehicles) for faster inference.
  • Implement tap‑to‑track: Persist an ID from the API or run a lightweight on‑device tracker.
  • Batch uploads for burst photos; or switch to a streaming endpoint if your provider supports it.
  • Ship a debug screen: show last payload size, time_ms, and recent errors.

With these patterns, you’ve got a robust baseline: a responsive camera preview, efficient uploads, accurate overlays, and the operational guardrails needed for real‑world apps. Swap in any compatible endpoint and iterate on thresholds, sampling rates, and UI polish to meet your product goals.
