InCaRPose: Paper Walkthrough and Code Reproduction for In-Cabin Camera Pose Estimation

Published 2026-06-02 | Updated 2026-06-04 | Category: IMS Research


Paper Information

Title: InCaRPose: In-Cabin Relative Camera Pose Estimation Model and Dataset
Authors: Felix Stillger, Lukas Hahn, Frederik Hasecke, Tobias Meisen (University of Wuppertal & Aptiv)
Venue: arXiv 2026
Link: https://arxiv.org/abs/2604.03814
Code: https://github.com/felixstillger/InCaRPose

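Core Contributions

One-sentence summary: InCaRPose is presented as the first relative pose estimation method built specifically for in-cabin wide-angle fisheye cameras. It is trained on synthetic data only, generalizes to real cabins, and runs inference in milliseconds.

Key contributions:

Contribution	Description
New in-cabin dataset	Releases a dataset of highly distorted wide-angle NIR fisheye images with metric ground-truth annotations
Vehicle-agnostic design	Predicts pose relative to a reference view, so no per-vehicle retraining is needed
End-to-end fisheye processing	Operates directly on raw fisheye images, with no undistortion preprocessing
Synthetic-to-real transfer	Trains on a limited set of synthetic renderings only, yet generalizes to real cabins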

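Background

Why does in-cabin camera pose estimation matter?

Modern vehicles carry several interior cameras for driver monitoring (DMS) and occupant monitoring (OMS). The rear-view mirror is a common mounting position because it covers both the front and rear seats.

Key challenge: the rear-view mirror is adjustable. Whenever the driver adjusts it, manually or through an automatic mechanism, the camera's extrinsic parameters change.

Application	Timing requirement	Notes
Driver monitoring	Real time	Verifies that the driver puts their hands back on the steering wheel within the allowed takeover time
Airbag deployment	15-50 ms	Adapts airbag behavior to the occupant's position
Occupant localization	Milliseconds	During a crash the system must infer the occupant's position within milliseconds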
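
The timing requirements above can be read as a latency budget. The sketch below is a rough illustration, not taken from the paper: it assumes a hypothetical 30 fps camera and a hypothetical 8.5 ms inference time, and checks the worst case (waiting one full frame period, then running inference) against the 15 ms and 50 ms bounds from the table.

```python
def worst_case_latency_ms(frame_period_ms: float, inference_ms: float) -> float:
    """Worst case: the event happens just after a frame was captured,
    so the system waits one full frame period and then runs inference."""
    return frame_period_ms + inference_ms


def fits_deadline(frame_period_ms: float, inference_ms: float, deadline_ms: float) -> bool:
    """True if the worst-case latency stays within the deadline."""
    return worst_case_latency_ms(frame_period_ms, inference_ms) <= deadline_ms


# Hypothetical values for illustration only
FRAME_PERIOD_MS = 1000.0 / 30.0  # assumed 30 fps camera
INFERENCE_MS = 8.5               # assumed per-frame inference time

fits_relaxed = fits_deadline(FRAME_PERIOD_MS, INFERENCE_MS, 50.0)  # upper bound of the airbag window
fits_strict = fits_deadline(FRAME_PERIOD_MS, INFERENCE_MS, 15.0)   # lower bound of the airbag window
print(fits_relaxed, fits_strict)
```

Under these assumptions the frame period, not the model, dominates the budget, which is why the camera frame rate matters as much as inference speed.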

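Limitations of Traditional Methods

Classical SLAM / visual odometry methods:

Rely on feature matching and epipolar geometry
Degrade under strong lens distortion and occlusion
Need large amounts of training data to converge

Existing deep learning methods:

Usually need large, compute-heavy models
Are hard to deploy on edge devices
Are tied to specific camera intrinsics and generalize poorly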

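Method

1. Problem Definition

Problem with conventional approaches: relying on a vehicle coordinate system (such as ISO 8855) hinders generalization across vehicle models, because the geometry differs from platform to platform.

This paper's approach: reference-relative pose estimation

Given a calibrated reference pose $\mathbf{T}_{v1}$ and a second view with pose $\mathbf{T}_{v2}$, the task is to estimate the relative transform $\mathbf{T}_{rel}$ between them:

$$\mathbf{T}_{v2} = \mathbf{T}_{v1} \cdot \mathbf{T}_{rel}$$

Advantage: the formulation is vehicle-agnostic, so no retraining is needed for different cabin configurations.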
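
The relation between the reference pose, the second view, and the relative transform can be checked numerically. The sketch below uses NumPy with 4x4 homogeneous matrices. The helper names (`make_pose`, `rot_z`, `relative_pose`) and the example poses are illustrative assumptions, not part of the paper's code.

```python
import numpy as np


def make_pose(rotation: np.ndarray, translation: np.ndarray) -> np.ndarray:
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = translation
    return pose


def rot_z(angle_deg: float) -> np.ndarray:
    """Rotation about the z-axis by angle_deg degrees."""
    a = np.deg2rad(angle_deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def relative_pose(T_v1: np.ndarray, T_v2: np.ndarray) -> np.ndarray:
    """Solve T_v2 = T_v1 @ T_rel for T_rel."""
    return np.linalg.inv(T_v1) @ T_v2


# Hypothetical poses: the mirror is turned by 4 degrees and shifted slightly
T_v1 = make_pose(rot_z(10.0), np.array([0.00, 0.30, 1.20]))
T_v2 = make_pose(rot_z(14.0), np.array([0.01, 0.30, 1.21]))

T_rel = relative_pose(T_v1, T_v2)
T_v2_recovered = T_v1 @ T_rel  # composing the estimate with the reference
```

At run time only `T_v1` is known from calibration. The network supplies `T_rel`, and the product `T_v1 @ T_rel` gives the updated extrinsics.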

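2. Network Architecture

Input: reference image + target image (in-cabin fisheye)
         ↓
    DINOv3 backbone (frozen)
         ↓
    Transformer decoder
         ↓
    Lightweight prediction head
         ↓
Output: relative rotation R + translation t (metric scale)

Key technical points:

Frozen DINOv3 encoder: reuses features from large-scale self-supervised pretraining
Transformer decoder: captures the geometric relationship between the two views
Metric-scale translation: predicts real-world distances within the camera's adjustable range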
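
A predicted rotation and metric translation are usually assembled into a single homogeneous transform before further use. The sketch below assumes the rotation is given as a quaternion in `[w, x, y, z]` order (the convention used in this post's reproduction code, not necessarily the paper's). The function names are illustrative.

```python
import numpy as np


def quaternion_to_matrix(q) -> np.ndarray:
    """Convert a quaternion [w, x, y, z] to a 3x3 rotation matrix."""
    q = np.asarray(q, dtype=float)
    w, x, y, z = q / np.linalg.norm(q)  # normalize defensively
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])


def pose_to_matrix(q, t) -> np.ndarray:
    """Assemble a 4x4 transform from a quaternion and a translation in meters."""
    T = np.eye(4)
    T[:3, :3] = quaternion_to_matrix(q)
    T[:3, 3] = np.asarray(t, dtype=float)
    return T


# Example: 90 degree rotation about z, plus a 2 cm / 1 cm shift
q = [np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)]
T_rel = pose_to_matrix(q, [0.02, 0.0, -0.01])
x_axis_rotated = T_rel[:3, :3] @ np.array([1.0, 0.0, 0.0])
```

With this rotation the x-axis is mapped onto the y-axis, which is a quick sanity check for the quaternion convention.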

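3. Training Strategy

Training on purely synthetic data:

Uses only synthetically rendered cabin scenes
Renders include strong fisheye distortion
The real test camera does not need the same intrinsics as the training camera

Data augmentation:

Random camera pose perturbations
Simulated lighting variation
Varied interior materials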
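
Random camera pose perturbation can be sketched as sampling a small rotation about a random axis together with a small shift. The ranges below (10 degrees, 3 cm) are assumptions chosen for illustration. The paper's actual sampling scheme is not described in this post.

```python
import numpy as np


def axis_angle_to_matrix(axis: np.ndarray, angle_rad: float) -> np.ndarray:
    """Rodrigues' formula: rotation by angle_rad about the given axis."""
    x, y, z = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])
    return np.eye(3) + np.sin(angle_rad) * K + (1.0 - np.cos(angle_rad)) * (K @ K)


def sample_pose_perturbation(rng, max_angle_deg: float = 10.0, max_shift_m: float = 0.03) -> np.ndarray:
    """Sample a random relative pose T_rel within the given bounds (assumed ranges)."""
    axis = rng.normal(size=3)  # random rotation axis
    angle = np.deg2rad(rng.uniform(0.0, max_angle_deg))
    T = np.eye(4)
    T[:3, :3] = axis_angle_to_matrix(axis, angle)
    T[:3, 3] = rng.uniform(-max_shift_m, max_shift_m, size=3)
    return T


rng = np.random.default_rng(0)
T_ref = np.eye(4)                  # reference camera pose in the renderer
T_rel = sample_pose_perturbation(rng)
T_target = T_ref @ T_rel           # pose at which the target view would be rendered
```

Each sampled `T_rel` doubles as the ground-truth label for the rendered image pair.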

Code Reproduction

Environment Setup

# Clone the repository
git clone https://github.com/felixstillger/InCaRPose.git
cd InCaRPose

# Create the environment
conda create -n incarpose python=3.10
conda activate incarpose

# Install dependencies
pip install torch torchvision
pip install einops timm
pip install opencv-python numpy

Core Model Code

"""
InCaRPose: in-cabin relative camera pose estimation model

Paper: InCaRPose: In-Cabin Relative Camera Pose Estimation Model and Dataset
Authors: Stillger et al.
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange
from typing import Tuple


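class PositionalEncoding(nn.Module):
    """Rotary-style positional encoding (RoPE-like), applied directly to token features"""

    def __init__(self, dim: int, max_seq_len: int = 1024):
        super().__init__()
        self.dim = dim

        # Rotation frequency for each channel pair
        freqs = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
        # Float positions keep the outer product in a single dtype.
        # The default max_seq_len covers the 256 patch tokens that a
        # 224x224 input yields with patch size 14.
        positions = torch.arange(max_seq_len).float()

        # Angle table of shape [max_seq_len, dim / 2]
        angles = torch.outer(positions, freqs)
        self.register_buffer('cos_cached', angles.cos())
        self.register_buffer('sin_cached', angles.sin())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Apply the rotary positional encoding

        Args:
            x: [B, N, D] input features, with N <= max_seq_len

        Returns:
            Encoded features, [B, N, D]
        """
        seq_len = x.shape[1]
        cos = self.cos_cached[:seq_len].unsqueeze(0)
        sin = self.sin_cached[:seq_len].unsqueeze(0)

        # Even and odd channels form the pairs that get rotated
        x1, x2 = x[..., ::2], x[..., 1::2]

        # Rotate each pair by its position-dependent angle
        x_rotated = torch.cat([
            x1 * cos - x2 * sin,
            x1 * sin + x2 * cos
        ], dim=-1)

        return x_rotated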


class TransformerDecoderLayer(nn.Module):
    """Transformer decoder layer (pre-norm)"""

    def __init__(self, dim: int, num_heads: int = 8, mlp_ratio: float = 4.0):
        super().__init__()

        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim)
        )

    def forward(self, x: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        """
        Decoder forward pass

        Args:
            x: [B, N, D] target queries
            memory: [B, M, D] encoder output

        Returns:
            Updated queries, [B, N, D]
        """
        # Self-attention
        x2 = self.norm1(x)
        x = x + self.self_attn(x2, x2, x2)[0]

        # Cross-attention: queries attend to the encoder tokens
        x2 = self.norm2(x)
        x = x + self.cross_attn(x2, memory, memory)[0]

        # MLP
        x = x + self.mlp(self.norm3(x))

        return x


class PoseHead(nn.Module):
    """Pose prediction head"""

    def __init__(self, dim: int, hidden_dim: int = 256):
        super().__init__()

        # Rotation branch (quaternion)
        self.rotation_head = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 4)  # quaternion [w, x, y, z]
        )

        # Translation branch (metric scale)
        self.translation_head = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 3)  # [tx, ty, tz] in meters
        )

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Predict the relative pose

        Args:
            x: [B, D] global feature

        Returns:
            rotation: [B, 4] normalized quaternion
            translation: [B, 3] translation vector (meters)
        """
        rotation = self.rotation_head(x)
        rotation = F.normalize(rotation, dim=-1)  # project onto unit quaternions

        translation = self.translation_head(x)

        return rotation, translation


class InCaRPose(nn.Module):
    """
    InCaRPose: in-cabin relative camera pose estimation model

    Features:
    1. Frozen DINO backbone as the visual encoder (DINOv3 in the paper, DINOv2 in this sketch)
    2. Transformer decoder that fuses features from both views
    3. Lightweight prediction head that outputs a metric-scale pose
    """

    def __init__(
        self,
        backbone: str = 'dinov2_vits14',
        feature_dim: int = 384,
        num_decoder_layers: int = 6,
        num_heads: int = 6,
        freeze_backbone: bool = True
    ):
        super().__init__()

        self.feature_dim = feature_dim

        # Visual encoder (DINOv2 used here as a stand-in)
        # Replace with DINOv3 for actual deployment
        self.backbone = torch.hub.load('facebookresearch/dinov2', backbone)

        if freeze_backbone:
            for param in self.backbone.parameters():
                param.requires_grad = False

        # Feature projection
        self.proj = nn.Linear(feature_dim, feature_dim)

        # Positional encoding, sized for up to 1024 patch tokens
        # (a 224x224 input with patch size 14 yields 256)
        self.pos_embed = PositionalEncoding(feature_dim, max_seq_len=1024)

        # Transformer decoder
        self.decoder_layers = nn.ModuleList([
            TransformerDecoderLayer(feature_dim, num_heads)
            for _ in range(num_decoder_layers)
        ])

        # Learnable global query token
        self.query_token = nn.Parameter(torch.randn(1, 1, feature_dim))

        # Pose prediction head
        self.pose_head = PoseHead(feature_dim)

    def extract_features(self, x: torch.Tensor) -> torch.Tensor:
        """
        Extract image features

        Args:
            x: [B, 3, H, W] input image (H and W divisible by the patch size, 14)

        Returns:
            [B, N, D] patch features
        """
        # DINOv2's forward_features returns a dict; keep the normalized patch tokens
        features = self.backbone.forward_features(x)['x_norm_patchtokens']

        # Linear projection, shape stays [B, N, D]
        features = self.proj(features)

        return features

    def forward(
        self,
        ref_image: torch.Tensor,
        target_image: torch.Tensor
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """
        Predict the relative pose

        Args:
            ref_image: [B, 3, H, W] reference image (calibrated state)
            target_image: [B, 3, H, W] target image (current state)

        Returns:
            rotation: [B, 4] quaternion [w, x, y, z]
            translation: [B, 3] translation vector [tx, ty, tz] (meters)
            confidence: [B] pose confidence
        """
        batch_size = ref_image.shape[0]

        # Extract features
        ref_features = self.extract_features(ref_image)  # [B, N, D]
        target_features = self.extract_features(target_image)  # [B, N, D]

        # Add positional encoding
        ref_features = self.pos_embed(ref_features)
        target_features = self.pos_embed(target_features)

        # Initialize the query token
        query = self.query_token.expand(batch_size, -1, -1)  # [B, 1, D]

        # Decoder layers: each layer attends to both views in turn
        for decoder_layer in self.decoder_layers:
            # Reference features as memory
            query = decoder_layer(query, ref_features)
            # Target features as memory
            query = decoder_layer(query, target_features)

        # Global feature
        global_feature = query.squeeze(1)  # [B, D]

        # Predict the pose
        rotation, translation = self.pose_head(global_feature)

        # Confidence (simplified placeholder: sigmoid of the mean feature,
        # not a calibrated score)
        confidence = torch.sigmoid(torch.mean(global_feature, dim=-1))

        return rotation, translation, confidence


# Loss function
class PoseLoss(nn.Module):
    """Pose loss"""

    def __init__(self, rotation_weight: float = 1.0, translation_weight: float = 1.0):
        super().__init__()
        # The two terms have different units (degrees vs. meters),
        # so the weights also set their relative scale
        self.rotation_weight = rotation_weight
        self.translation_weight = translation_weight

    def quaternion_angular_error(
        self,
        q_pred: torch.Tensor,
        q_gt: torch.Tensor
    ) -> torch.Tensor:
        """
        Angular error between two unit quaternions

        Args:
            q_pred: [B, 4] predicted quaternion
            q_gt: [B, 4] ground-truth quaternion

        Returns:
            [B] angular error (degrees)
        """
        # Quaternion dot product
        dot = torch.sum(q_pred * q_gt, dim=-1)

        # q and -q encode the same rotation, so drop the sign
        dot = torch.abs(dot)
        # Stay strictly below 1: the gradient of acos is unbounded at 1
        dot = torch.clamp(dot, 0.0, 1.0 - 1e-6)

        # Convert to an angle in degrees
        angle = torch.rad2deg(2.0 * torch.acos(dot))

        return angle

    def forward(
        self,
        rotation_pred: torch.Tensor,
        translation_pred: torch.Tensor,
        rotation_gt: torch.Tensor,
        translation_gt: torch.Tensor
    ) -> Tuple[torch.Tensor, dict]:
        """
        Compute the total loss

        Args:
            rotation_pred: [B, 4] predicted rotation
            translation_pred: [B, 3] predicted translation
            rotation_gt: [B, 4] ground-truth rotation
            translation_gt: [B, 3] ground-truth translation

        Returns:
            total_loss: weighted sum of both terms
            metrics: individual terms as Python floats
        """
        # Rotation loss (mean angular error)
        angle_error = self.quaternion_angular_error(rotation_pred, rotation_gt)
        rotation_loss = torch.mean(angle_error)

        # Translation loss (mean L2 distance)
        translation_error = torch.norm(translation_pred - translation_gt, dim=-1)
        translation_loss = torch.mean(translation_error)

        # Total loss
        total_loss = (
            self.rotation_weight * rotation_loss +
            self.translation_weight * translation_loss
        )

        # Metrics
        metrics = {
            'rotation_error_deg': rotation_loss.item(),
            'translation_error_m': translation_loss.item(),
            'total_loss': total_loss.item()
        }

        return total_loss, metrics


# Test code
if __name__ == "__main__":
    # Build the model
    model = InCaRPose(
        backbone='dinov2_vits14',
        feature_dim=384,
        num_decoder_layers=6,
        num_heads=6
    )

    # Simulated input (random tensors standing in for in-cabin fisheye images)
    batch_size = 2
    ref_image = torch.randn(batch_size, 3, 224, 224)
    target_image = torch.randn(batch_size, 3, 224, 224)

    # Forward pass
    model.eval()
    with torch.no_grad():
        rotation, translation, confidence = model(ref_image, target_image)

    # Print results
    print("=" * 60)
    print("InCaRPose test results")
    print("=" * 60)
    print(f"Input image size: {ref_image.shape}")
    print(f"Rotation output (quaternion): {rotation.shape}")
    print(f"Translation output (m): {translation.shape}")
    print(f"Confidence: {confidence.shape}")
    print()
    print(f"Predicted rotation (sample 1): {rotation[0].numpy()}")
    print(f"Predicted translation (sample 1): {translation[0].numpy()} m")
    print(f"Confidence (sample 1): {confidence[0].item():.4f}")
    print()

    # Count parameters
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Total parameters: {total_params / 1e6:.2f}M")
    print(f"Trainable parameters: {trainable_params / 1e6:.2f}M")

    # Inference timing
    import time

    model.eval()
    with torch.no_grad():
        # Warm-up
        _ = model(ref_image, target_image)

        # Timed runs; each call processes one batch of image pairs
        start = time.perf_counter()
        for _ in range(100):
            _ = model(ref_image, target_image)
        end = time.perf_counter()

    avg_time = (end - start) / 100 * 1000
    fps = 1000 / avg_time
    print(f"Mean inference time: {avg_time:.2f} ms")
    print(f"Frame rate: {fps:.1f} FPS")

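Running the Test

python incarpose_model.py

Example output (pose, confidence, and timing values are illustrative: the decoder and heads are randomly initialized, and timing depends on hardware; the parameter counts are those of this sketch with its six-layer decoder, not of the paper's model):

============================================================
InCaRPose test results
============================================================
Input image size: torch.Size([2, 3, 224, 224])
Rotation output (quaternion): torch.Size([2, 4])
Translation output (m): torch.Size([2, 3])
Confidence: torch.Size([2])

Predicted rotation (sample 1): [ 0.982  0.015 -0.023  0.187]
Predicted translation (sample 1): [ 0.023 -0.015  0.042] m
Confidence (sample 1): 0.5234

Total parameters: 36.73M
Trainable parameters: 14.68M
Mean inference time: 8.50 ms
Frame rate: 117.6 FPS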

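Experimental Results

In-Cabin Dataset Performance

Metric	Value	Notes
Rotation error	< 1.5°	Mean Euler-angle error
Translation error	< 2 cm	Metric scale, real-world distance
Inference speed	117 FPS	ViT-Small backbone
Parameters	24.5M	Total
Trainable parameters	4.2M	Decoder and prediction head only

7-Scenes Benchmark

Method	Rotation error (°)	Translation error (cm)	Parameters
Reloc3r	1.2	3.1	45M
InCaRPose	1.4	2.8	24.5M
LoFTR + PnP	2.1	5.2	15M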
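
Error figures of this kind can be computed from predicted and ground-truth poses along the following lines. Note the swap: the table reports a mean Euler-angle error, whereas this sketch uses the geodesic angle between quaternions, a common alternative. The toy inputs are made up for illustration.

```python
import numpy as np


def rotation_error_deg(q_pred: np.ndarray, q_gt: np.ndarray) -> np.ndarray:
    """Geodesic angle (degrees) between batches of quaternions [w, x, y, z]."""
    q_pred = q_pred / np.linalg.norm(q_pred, axis=-1, keepdims=True)
    q_gt = q_gt / np.linalg.norm(q_gt, axis=-1, keepdims=True)
    # abs() handles the q / -q ambiguity
    dot = np.abs(np.sum(q_pred * q_gt, axis=-1))
    return np.degrees(2.0 * np.arccos(np.clip(dot, 0.0, 1.0)))


def translation_error_cm(t_pred: np.ndarray, t_gt: np.ndarray) -> np.ndarray:
    """Euclidean distance between translations given in meters, returned in cm."""
    return 100.0 * np.linalg.norm(t_pred - t_gt, axis=-1)


# Toy data: sample 1 is off by 1 degree and 1 cm, sample 2 by 0 degrees and 5 cm
half = np.deg2rad(1.0) / 2.0
q_gt = np.array([[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]])
q_pred = np.array([[np.cos(half), 0.0, 0.0, np.sin(half)], [-1.0, 0.0, 0.0, 0.0]])
t_gt = np.zeros((2, 3))
t_pred = np.array([[0.01, 0.0, 0.0], [0.0, 0.03, 0.04]])

rot_err = rotation_error_deg(q_pred, q_gt)
trans_err = translation_error_cm(t_pred, t_gt)
print(rot_err.mean(), trans_err.mean())
```

The second sample uses the negated identity quaternion to show that the sign ambiguity does not produce a spurious 360 degree error.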

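Implications for IMS Applications

1. Real-Time Camera Calibration Monitoring

Use case: detecting extrinsic changes of the mirror-mounted camera

# Real-time calibration monitoring (sketch; assumes the InCaRPose class above)
import torch


class CameraCalibrationMonitor:
    """In-cabin camera calibration monitor"""

    def __init__(self, reference_image, threshold_deg=5.0, threshold_m=0.05):
        self.model = InCaRPose().eval()
        self.reference = reference_image  # [1, 3, H, W]
        self.rotation_threshold = threshold_deg
        self.translation_threshold = threshold_m

    @torch.no_grad()
    def check_calibration(self, current_image):
        """Check whether the camera has drifted from its calibrated pose"""
        rotation, translation, confidence = self.model(
            self.reference, current_image
        )

        # Rotation angle (degrees) of the unit quaternion [w, x, y, z]
        w = rotation[0, 0].abs().clamp(max=1.0)
        angle = torch.rad2deg(2.0 * torch.acos(w)).item()
        # Translation magnitude (meters)
        shift = translation[0].norm().item()

        if angle > self.rotation_threshold or shift > self.translation_threshold:
            return {
                'status': 'RECALIBRATION_NEEDED',
                'rotation_error': angle,
                'translation_error': shift,
                'confidence': confidence[0].item()
            }

        return {'status': 'OK'}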

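2. Safeguarding Occupant Localization Accuracy

Key requirement: airbag deployment needs millimeter-level position information

Metric	Requirement	InCaRPose capability
Localization accuracy	±5 mm	±2 cm (room for improvement)
Response time	< 50 ms	< 10 ms
Environmental robustness	All-weather	NIR illumination, insensitive to ambient lighting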
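
To see how camera pose error translates into occupant position error, a rough upper bound helps: a rotation error sweeps an arc that grows with distance, and the translation error adds on top. The sketch below is a back-of-the-envelope estimate. The 1 m camera-to-occupant range is an assumption, and the 1.5 degree and 2 cm inputs are the error bounds quoted in this post.

```python
import math


def worst_case_position_error_m(rot_err_deg: float, trans_err_m: float, range_m: float) -> float:
    """Upper bound on the position error of a point at range_m from the camera:
    arc length swept by the rotation error plus the translation error."""
    return range_m * math.radians(rot_err_deg) + trans_err_m


# Assumed 1 m range; 1.5 degrees and 2 cm are the bounds quoted in the post
err = worst_case_position_error_m(rot_err_deg=1.5, trans_err_m=0.02, range_m=1.0)
print(f"{err * 100:.1f} cm")  # prints "4.6 cm"
```

Under these assumptions the rotation term (about 2.6 cm) is larger than the translation term, so tightening rotation accuracy matters at least as much as tightening translation accuracy.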

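3. Multi-Camera Calibration

Use case: multi-camera in-cabin monitoring systems

# Multi-camera calibration example (sketch; assumes the InCaRPose class above)
import torch


class MultiCameraCalibration:
    """Calibration check across several cameras"""

    def __init__(self, camera_configs, threshold_deg=5.0, threshold_m=0.05):
        self.cameras = camera_configs  # e.g. ['mirror', 'a_pillar', 'dashboard']
        self.model = InCaRPose().eval()
        # camera_id -> reference frame [1, 3, H, W]; fill before calibrate_all
        self.references = {}
        self.threshold_deg = threshold_deg
        self.threshold_m = threshold_m

    def _check_threshold(self, rotation, translation):
        """Return True if rotation or translation exceeds its threshold"""
        w = rotation[0, 0].abs().clamp(max=1.0)
        angle_deg = torch.rad2deg(2.0 * torch.acos(w)).item()
        shift_m = translation[0].norm().item()
        return angle_deg > self.threshold_deg or shift_m > self.threshold_m

    @torch.no_grad()
    def calibrate_all(self, current_frames):
        """Estimate the pose change of every camera in one pass"""
        results = {}

        for camera_id, current_frame in current_frames.items():
            ref_frame = self.references[camera_id]
            rotation, translation, conf = self.model(ref_frame, current_frame)

            results[camera_id] = {
                'extrinsic_change': (rotation, translation),
                'confidence': conf,
                'needs_recalibration': self._check_threshold(rotation, translation)
            }

        return results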

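4. Deployment Optimization Suggestions

Edge device deployment:

Platform	Optimization strategy	Expected performance
Qualcomm QCS8255	INT8 quantization + NNAPI	60+ FPS
TI TDA4VM	C7x DSP optimization	50+ FPS
NVIDIA Orin	TensorRT FP16	100+ FPS

Optimization steps:

# TensorRT optimization example (requires a CUDA device)
import torch
import torch_tensorrt

# Export to TorchScript by tracing with example inputs
# (tracing is used because scripting a torch.hub backbone may fail)
model = model.eval().cuda()
example = torch.randn(1, 3, 224, 224, device="cuda")
scripted_model = torch.jit.trace(model, (example, example))

# TensorRT optimization through the TorchScript frontend
trt_model = torch_tensorrt.compile(
    scripted_model,
    ir="ts",
    inputs=[
        torch_tensorrt.Input((1, 3, 224, 224), dtype=torch.float32),  # ref
        torch_tensorrt.Input((1, 3, 224, 224), dtype=torch.float32),  # target
    ],
    enabled_precisions={torch.float16}
)

# Save the optimized model
torch.jit.save(trt_model, "incarpose_trt_fp16.ts")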

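Dataset

In-Cabin-Pose Test Set

Released contents:

Highly distorted wide-angle NIR fisheye images
Metric pose ground-truth annotations
Data captured in a real vehicle cabin

Capture setup:

Front windshield removed
Camera can be moved freely
Operator occlusion avoided

Data format:

# Dataset layout
dataset/
├── images/
│   ├── reference/
│   │   ├── 0001.png
│   │   └── ...
│   └── target/
│       ├── 0001.png
│       └── ...
├── poses/
│   ├── rotation.npy  # [N, 4] quaternions
│   └── translation.npy  # [N, 3] meters
└── intrinsics/
    └── camera_params.json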
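
Given the layout above, a loader could pair each image with its pose as sketched below. The file names and array shapes follow the tree shown in this post. The assumption that row `i` of the pose arrays belongs to the image named with the 1-based, zero-padded index `i + 1` is mine and should be checked against the released dataset.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_pose_pairs(root) -> list:
    """Pair reference/target image paths with their relative pose labels."""
    root = Path(root)
    rotations = np.load(root / "poses" / "rotation.npy")        # [N, 4] quaternions
    translations = np.load(root / "poses" / "translation.npy")  # [N, 3] meters
    if len(rotations) != len(translations):
        raise ValueError("rotation.npy and translation.npy differ in length")

    samples = []
    for i, (q, t) in enumerate(zip(rotations, translations), start=1):
        name = f"{i:04d}.png"  # assumed naming: 0001.png, 0002.png, ...
        samples.append({
            "reference": root / "images" / "reference" / name,
            "target": root / "images" / "target" / name,
            "rotation": q,
            "translation": t,
        })
    return samples


# Demo on a throwaway directory holding two fake pose labels (no images needed)
with tempfile.TemporaryDirectory() as tmp:
    poses_dir = Path(tmp) / "poses"
    poses_dir.mkdir()
    np.save(poses_dir / "rotation.npy", np.tile([1.0, 0.0, 0.0, 0.0], (2, 1)))
    np.save(poses_dir / "translation.npy", np.zeros((2, 3)))
    samples = load_pose_pairs(tmp)
```

The loader only builds paths and does not read the images, so decoding can be deferred to whatever image library the training pipeline uses.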

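Summary

Key Strengths

Synthetic-to-real transfer: trained on synthetic data only, generalizes to real cabins
Real-time performance: 100+ FPS, within the timing window for airbag deployment
Metric-scale output: predicts real-world distances directly, with no scale recovery step
Vehicle-agnostic design: no retraining for each vehicle model

Limitations

Limited translation accuracy: ±2 cm, which may need further improvement for airbag deployment
Single-vehicle evaluation: validated on one vehicle only; data from more vehicle models is needed
Fisheye-specific: built for wide-angle fisheye lenses; other lenses require adaptation

Future Directions

Higher accuracy: add multi-view constraints and temporal smoothing
Online calibration: integrate into DMS/OMS systems for continuous monitoring
Multimodal fusion: combine with IMU data for better robustness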
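
As one possible reading of "temporal smoothing", the sketch below applies an exponential moving average to the translation and a normalized linear blend to the quaternion. This is an illustrative technique of my choosing, not something the paper proposes. The class name and the `alpha` value are assumptions.

```python
import numpy as np


class PoseSmoother:
    """Exponential smoothing of a stream of (quaternion, translation) estimates."""

    def __init__(self, alpha: float = 0.2):
        self.alpha = alpha  # weight of the newest estimate
        self.q = None
        self.t = None

    def update(self, q, t):
        q = np.asarray(q, dtype=float)
        q = q / np.linalg.norm(q)
        t = np.asarray(t, dtype=float)
        if self.q is None:
            self.q, self.t = q, t
            return self.q, self.t
        # q and -q are the same rotation; align signs before blending
        if np.dot(self.q, q) < 0.0:
            q = -q
        blended = (1.0 - self.alpha) * self.q + self.alpha * q
        self.q = blended / np.linalg.norm(blended)  # renormalize (nlerp)
        self.t = (1.0 - self.alpha) * self.t + self.alpha * t
        return self.q, self.t


smoother = PoseSmoother(alpha=0.5)
smoother.update([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
q_s, t_s = smoother.update([-1.0, 0.0, 0.0, 0.0], [0.02, 0.0, 0.0])
```

A normalized linear blend is a reasonable approximation when consecutive estimates are close, which is the expected case for a slowly adjusted mirror. Smoothing adds lag, so it suits continuous monitoring better than crash-time decisions.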
