MiSensorKit

Sensory Data Collection for iOS Devices

Efe Tarhan, Kunal Pratap Singh, Amir Zamir

VILAB - EPFL

Capture multi-sensory data from iOS devices — all in one tap.

Abstract

MiSensorKit is an iOS application designed for synchronized multimodal data collection from the native sensor stack of modern smartphones. It enables the capture of aligned streams including RGB, depth, pose, IMU, and environmental signals directly from the device in real-world settings. By providing an easy-to-use and scalable interface for recording sensor data, MiSensorKit lowers the barrier to studying multimodal learning on embodied devices. The collected data can support research in areas such as multimodal representation learning, sensor fusion, world modeling, and mobile perception, making it a practical tool for building and evaluating models that operate directly over hardware sensor streams.

Wide Range of Modalities!

RGB Capture

High-resolution RGB frames saved as JPEGs at up to 1024px resolution.

LiDAR Depth

Scene-depth maps from the LiDAR sensor stored as UInt16 millimeter buffers.
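
For example, a single depth frame can be decoded with a few lines of NumPy. This is an illustrative sketch rather than an official loader: the 256×192 shape matches ARKit's native scene-depth resolution but is an assumption here, so verify the dimensions of your own recordings before reshaping.

import numpy as np

# Decode one raw LiDAR depth frame: UInt16 millimeters -> float32 meters.
# Assumptions: little-endian byte order and a 256x192 depth map
# (ARKit's native scene-depth size); adjust to your recordings.
DEPTH_SHAPE = (192, 256)  # (height, width)

def load_depth(path, shape=DEPTH_SHAPE):
    depth_mm = np.fromfile(path, dtype=np.uint16).reshape(shape)
    return depth_mm.astype(np.float32) / 1000.0  # meters

depth_m = load_depth("depth/frame_000001.bin")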

Camera Pose

Full camera intrinsics and 6-DoF extrinsics logged every frame as JSON.
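
As an illustration, the per-frame intrinsics can be used to back-project a depth map into a camera-space point cloud. The field names below (fx, fy, cx, cy) are hypothetical placeholders rather than the app's actual JSON schema; adapt them to the keys found in the camera JSON files, and rescale the intrinsics if they are expressed at the RGB resolution rather than the depth resolution.

import json
import numpy as np

# Back-project a depth map (meters) into a camera-space point cloud using the
# per-frame intrinsics. The JSON keys "fx", "fy", "cx", "cy" are hypothetical
# and may not match the actual schema; adapt them to your files.
def backproject(depth_m, camera_json_path):
    with open(camera_json_path) as f:
        cam = json.load(f)
    fx, fy = cam["fx"], cam["fy"]  # focal lengths in pixels (assumed keys)
    cx, cy = cam["cx"], cam["cy"]  # principal point in pixels (assumed keys)
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth_m
    y = (v - cy) / fy * depth_m
    return np.stack([x, y, depth_m], axis=-1)  # (H, W, 3) points in the camera frame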

Skeleton 3D

Real-time body skeleton tracking with 3D joint positions via ARKit.

Ambient Light

Light intensity and color temperature estimates from the AR session.

Barometer

Atmospheric pressure readings synchronized with every captured frame.

GPS

Latitude, longitude, altitude, speed, and course logged per frame via CoreLocation.

IMU

Accelerometer, gyroscope, attitude, and gravity vectors at full sensor rate.

Gaze Estimation

Eye-tracking and gaze direction via ARKit face tracking on supported devices.

Output Data Format

Each recording session creates a structured folder under the app's Documents directory:

misensorkit_<timestamp>/
  rgb/          frame_000001.jpg       # RGB images at selected resolution
  depth/        frame_000001.bin       # LiDAR depth (UInt16 millimeters)
  camera/       frame_000001.json      # Intrinsics + 6-DoF pose
  metadata/     frame_000001.json      # Ambient light + pressure
  skeleton/     frame_000001.json      # 3D body joint positions
  gps/          frame_000001.json      # Latitude, longitude, altitude, speed, course
  imu/          frame_000001.json      # Accel (g), gyro (rad/s), attitude, gravity
  selfie/       frame_000001.jpg       # Front camera frames (if enabled)
  gaze/         frame_000001.json      # Gaze and eye-tracking data (if enabled)
  session_summary.json                 # Summary of recording session and enabled modalities
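
A session can then be consumed by pairing files across the modality folders by frame index. The sketch below is illustrative rather than an official loader: the depth shape is assumed as above, and the JSON contents are passed through untouched.

import json
from pathlib import Path

import numpy as np
from PIL import Image

# Walk one misensorkit_<timestamp>/ session and yield a dictionary per frame
# with whichever modalities were enabled. Depth shape (192x256) is assumed.
def load_session(session_dir):
    session = Path(session_dir)
    for rgb_path in sorted((session / "rgb").glob("frame_*.jpg")):
        idx = rgb_path.stem.split("_")[1]                 # e.g. "000001"
        frame = {"rgb": np.asarray(Image.open(rgb_path))}

        depth_path = session / "depth" / f"frame_{idx}.bin"
        if depth_path.exists():
            depth_mm = np.fromfile(depth_path, dtype=np.uint16)
            frame["depth_m"] = depth_mm.reshape(192, 256).astype(np.float32) / 1000.0

        for mod in ("camera", "metadata", "skeleton", "gps", "imu", "gaze"):
            json_path = session / mod / f"frame_{idx}.json"
            if json_path.exists():
                frame[mod] = json.loads(json_path.read_text())

        yield frame

Iterating load_session on a misensorkit_<timestamp>/ folder then yields one dictionary per captured frame.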

Data Inspector

Interactive viewer with example frames from a recording: RGB, LiDAR depth, camera pose, GPS, IMU, and ambient light & pressure.

ICLR 2026

Multimodality as Supervision: Self-Supervised Specialization to the Test Environment via Multimodality

Kunal Pratap Singh*, Ali Garjani*, Rishubh Singh, Muhammad Uzair Khattak, Efe Tarhan, Jason Toskov, Andrei Atanov, Oğuzhan Fatih Kar, Amir Zamir

* Equal Contribution

EPFL — VILAB

Requirements

  • Device: Any iPhone or iPad. A LiDAR sensor (iPhone 12 Pro and later, iPad Pro 2020 and later) is required only for depth capture; all other modalities work on any compatible device.
  • OS: iOS 26 or later
  • Permissions: Camera, Motion & Fitness, Location access

Ready to Capture?

Download MiSensorKit and start building multimodal datasets today.

Download on the App Store