MiSensorKit

Sensory Data Collection for iOS Devices

Efe Tarhan, Kunal Pratap Singh, Amir Zamir

VILAB - EPFL

Capture multi-sensory data from iOS devices — all in one tap.

Abstract

MiSensorKit is an iOS application designed for synchronized multimodal data collection from the native sensor stack of modern smartphones. It enables the capture of aligned streams including RGB, depth, pose, IMU, and environmental signals directly from the device in real-world settings. By providing an easy-to-use and scalable interface for recording sensor data, MiSensorKit lowers the barrier to studying multimodal learning on embodied devices. The collected data can support research in areas such as multimodal representation learning, sensor fusion, world modeling, and mobile perception, making it a practical tool for building and evaluating models that operate directly over hardware sensor streams.

Wide Range of Modalities!

RGB Capture

High-resolution RGB frames saved as JPEGs at up to 1024px resolution.

LiDAR Depth

Scene-depth maps from the LiDAR sensor stored as UInt16 millimeter buffers.
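
For example, a single depth frame can be decoded with a few lines of NumPy. This is an illustrative sketch rather than an official loader: the 256×192 shape matches ARKit's native scene-depth resolution but is an assumption here, so verify the dimensions of your own recordings before reshaping.

import numpy as np

# Decode one raw LiDAR depth frame: UInt16 millimeters -> float32 meters.
# Assumptions: little-endian byte order and a 256x192 depth map
# (ARKit's native scene-depth size); adjust to your recordings.
DEPTH_SHAPE = (192, 256)  # (height, width)

def load_depth(path, shape=DEPTH_SHAPE):
    depth_mm = np.fromfile(path, dtype=np.uint16).reshape(shape)
    return depth_mm.astype(np.float32) / 1000.0  # meters

depth_m = load_depth("depth/frame_000001.bin")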

Camera Pose

Full camera intrinsics and 6-DoF extrinsics logged every frame as JSON.
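
As an illustration, the per-frame intrinsics can be used to back-project a depth map into a camera-space point cloud. The field names below (fx, fy, cx, cy) are hypothetical placeholders rather than the app's actual JSON schema; adapt them to the keys found in the camera JSON files, and rescale the intrinsics if they are expressed at the RGB resolution rather than the depth resolution.

import json
import numpy as np

# Back-project a depth map (meters) into a camera-space point cloud using the
# per-frame intrinsics. The JSON keys "fx", "fy", "cx", "cy" are hypothetical
# and may not match the actual schema; adapt them to your files.
def backproject(depth_m, camera_json_path):
    with open(camera_json_path) as f:
        cam = json.load(f)
    fx, fy = cam["fx"], cam["fy"]  # focal lengths in pixels (assumed keys)
    cx, cy = cam["cx"], cam["cy"]  # principal point in pixels (assumed keys)
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth_m
    y = (v - cy) / fy * depth_m
    return np.stack([x, y, depth_m], axis=-1)  # (H, W, 3) points in the camera frame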

Skeleton 3D

Real-time body skeleton tracking with 3D joint positions via ARKit.

Ambient Light

Light intensity and color temperature estimates from the AR session.

Barometer

Atmospheric pressure readings synchronized with every captured frame.

GPS

Latitude, longitude, altitude, speed, and course logged per frame via CoreLocation.

IMU

Accelerometer, gyroscope, attitude, and gravity vectors at full sensor rate.

Gaze Estimation

Eye-tracking and gaze direction via ARKit face tracking on supported devices.

Output Data Format

Each recording session creates a structured folder under the app's Documents directory:

misensorkit_<timestamp>/
  rgb/          frame_000001.jpg       # RGB images at selected resolution
  depth/        frame_000001.bin       # LiDAR depth (UInt16 millimeters)
  camera/       frame_000001.json      # Intrinsics + 6-DoF pose
  metadata/     frame_000001.json      # Ambient light + pressure
  skeleton/     frame_000001.json      # 3D body joint positions
  gps/          frame_000001.json      # Latitude, longitude, altitude, speed, course
  imu/          frame_000001.json      # Accel (g), gyro (rad/s), attitude, gravity
  selfie/       frame_000001.jpg       # Front camera frames (if enabled)
  gaze/         frame_000001.json      # Gaze and eye-tracking data (if enabled)
  session_summary.json                 # Summary of recording session and enabled modalities
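
A session can then be consumed by pairing files across the modality folders by frame index. The sketch below is illustrative rather than an official loader: the depth shape is assumed as above, and the JSON contents are passed through untouched.

import json
from pathlib import Path

import numpy as np
from PIL import Image

# Walk one misensorkit_<timestamp>/ session and yield a dictionary per frame
# with whichever modalities were enabled. Depth shape (192x256) is assumed.
def load_session(session_dir):
    session = Path(session_dir)
    for rgb_path in sorted((session / "rgb").glob("frame_*.jpg")):
        idx = rgb_path.stem.split("_")[1]                 # e.g. "000001"
        frame = {"rgb": np.asarray(Image.open(rgb_path))}

        depth_path = session / "depth" / f"frame_{idx}.bin"
        if depth_path.exists():
            depth_mm = np.fromfile(depth_path, dtype=np.uint16)
            frame["depth_m"] = depth_mm.reshape(192, 256).astype(np.float32) / 1000.0

        for mod in ("camera", "metadata", "skeleton", "gps", "imu", "gaze"):
            json_path = session / mod / f"frame_{idx}.json"
            if json_path.exists():
                frame[mod] = json.loads(json_path.read_text())

        yield frame

Iterating load_session on a misensorkit_<timestamp>/ folder then yields one dictionary per captured frame.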

Data Inspector

Interactive viewer with example frames from a recording: RGB, LiDAR depth, camera pose, GPS, IMU, and ambient light & pressure.

ICLR 2026

Multimodality as Supervision: Self-Supervised Specialization to the Test Environment via Multimodality

Kunal Pratap Singh*, Ali Garjani*, Rishubh Singh, Muhammad Uzair Khattak, Efe Tarhan, Jason Toskov, Andrei Atanov, Oğuzhan Fatih Kar, Amir Zamir

* Equal Contribution

EPFL — VILAB

Requirements

  • Device: Any iPhone or iPad. A LiDAR sensor (iPhone 12 Pro and later, iPad Pro 2020 and later) is required only for depth capture; all other modalities work on any compatible device.
  • OS: iOS 26 or later
  • Permissions: Camera, Motion & Fitness, Location access

Ready to Capture?

Download MiSensorKit and start building multimodal datasets today.

Download on the App Store