Radar-Camera Fusion Survey: Paper Walkthrough and Code Reproduction

Published: 2026-06-15
Tags: paper walkthrough, radar fusion, camera fusion, CPD, OMS
Source: arXiv 2410.19872, IEEE TIV 2024


Paper information

  • Title: Radar and Camera Fusion for Object Detection and Tracking: A Comprehensive Survey
  • Authors: Kun Shi et al.
  • Published: arXiv:2410.19872 (October 2024)
  • Link: https://arxiv.org/abs/2410.19872

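Core contributions

The paper presents itself as the first systematic survey of radar-camera fusion for object detection and tracking:

  1. Complete taxonomy: data-level, feature-level, and decision-level fusion
  2. Dataset summary: radar-camera fusion datasets from 2019 to 2024
  3. Algorithm evolution: from early cascaded pipelines to recent Transformer architectures
  4. Challenge analysis: calibration, alignment, representation, and fusion strategy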

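Fusion architecture taxonomy

1. Data-level fusion

graph LR
    A[Radar point cloud] --> C[Data fusion]
    B[Camera image] --> C
    C --> D[Fused representation]
    D --> E[Detection network]

Characteristics:

  • Early fusion with the least information loss
  • Demands high calibration accuracy
  • High computational cost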
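
Data-level fusion usually starts by projecting radar points into the image plane so they can be stacked with the pixels, which is also why calibration errors hurt so much at this level. Below is a minimal NumPy sketch under a pinhole camera assumption. The helper `project_radar_to_image` and the calibration values are hypothetical and are not taken from the survey.

```python
import numpy as np

def project_radar_to_image(points_xyz, K, T_cam_from_radar):
    """Project radar points (N, 3) into pixel coordinates.

    K: (3, 3) camera intrinsics.
    T_cam_from_radar: (4, 4) extrinsics mapping radar coordinates into the
    camera frame (z pointing forward).
    Returns pixels (N, 2), depths (N,), and a mask of points with z > 0.
    """
    n = points_xyz.shape[0]
    homo = np.hstack([points_xyz, np.ones((n, 1))])     # (N, 4) homogeneous
    cam = (T_cam_from_radar @ homo.T).T[:, :3]          # (N, 3) camera frame
    depth = cam[:, 2]
    in_front = depth > 1e-6
    safe_depth = np.where(in_front, depth, 1.0)         # avoid dividing by <= 0
    pix = (K @ cam.T).T                                 # (N, 3)
    uv = pix[:, :2] / safe_depth[:, None]
    return uv, depth, in_front

# Assumed example calibration: 800 px focal length, 1280x720 image center
K = np.array([[800.0, 0.0, 640.0],
              [0.0, 800.0, 360.0],
              [0.0, 0.0, 1.0]])
T = np.eye(4)  # radar and camera frames taken as identical for the demo
points = np.array([[0.0, 0.0, 10.0],    # straight ahead
                   [1.0, 0.0, 10.0],    # 1 m to the right
                   [0.0, 0.0, -5.0]])   # behind the camera
uv, depth, in_front = project_radar_to_image(points, K, T)
print(uv[in_front])
```

With these assumed numbers, a 1 m lateral offset at 10 m range moves the projection by 80 pixels, so even a small extrinsic error shows up as visible misalignment between radar points and image content.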

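2. Feature-level fusion

graph LR
    A[Radar point cloud] --> D1[Radar feature extraction]
    B[Camera image] --> D2[Visual feature extraction]
    D1 --> E[Feature fusion]
    D2 --> E
    E --> F[Fused features]
    F --> G[Detection head]

Characteristics:

  • The mainstream approach
  • Balances accuracy and efficiency
  • Requires a feature alignment module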

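3. Decision-level fusion

graph LR
    A[Radar point cloud] --> D1[Radar detector]
    B[Camera image] --> D2[Visual detector]
    D1 --> E[Result fusion]
    D2 --> E
    E --> F[Final result]

Characteristics:

  • Modular design
  • Easy to deploy
  • Larger information loss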
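
To make the decision-level variant concrete, here is a small sketch that fuses two independent detection lists. It assumes detections are `(x, y, score)` tuples in a shared BEV frame, uses greedy nearest-neighbor association with a distance gate, and combines matched scores with a noisy-OR rule. The function `late_fuse`, the 2 m gate, and the combination rule are illustrative choices, not something the survey prescribes.

```python
import math

def late_fuse(cam_dets, radar_dets, max_dist=2.0):
    """Fuse two lists of (x, y, score) detections.

    Matched pairs average their positions and combine scores with noisy-OR;
    unmatched detections from either sensor pass through unchanged.
    """
    fused, used = [], set()
    for cx, cy, cs in cam_dets:
        best_j, best_d = None, max_dist
        for j, (rx, ry, _) in enumerate(radar_dets):
            d = math.hypot(cx - rx, cy - ry)
            if j not in used and d <= best_d:
                best_j, best_d = j, d
        if best_j is None:
            fused.append((cx, cy, cs))
        else:
            rx, ry, rs = radar_dets[best_j]
            used.add(best_j)
            score = 1.0 - (1.0 - cs) * (1.0 - rs)   # noisy-OR of the two scores
            fused.append(((cx + rx) / 2, (cy + ry) / 2, score))
    # Radar-only detections that no camera detection claimed
    fused += [d for j, d in enumerate(radar_dets) if j not in used]
    return fused

cam = [(10.0, 0.0, 0.6), (30.0, 5.0, 0.5)]
radar = [(10.5, 0.2, 0.5), (-20.0, 0.0, 0.7)]
result = late_fuse(cam, radar)
for det in result:
    print(det)
```

Because each detector only hands over boxes and scores, the two branches stay swappable, which is the modularity noted above. Anything a detector discarded before this step is gone, which is the information loss.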

Core algorithm reproduction

RCBEVDet architecture implementation (simplified)

"""
RCBEVDet: Radar-Camera Fusion in Bird's Eye View
Paper: CVPR 2024
Reproduction: simplified sketch based on the description in the arXiv 2410.19872 survey
"""

from typing import Dict

import torch
import torch.nn as nn
import torchvision.models as models


class RadarFeatureEncoder(nn.Module):
    """
    Radar feature encoder.

    Converts a sparse radar point cloud into a BEV feature map.
    """

    def __init__(self, config: Dict):
        super().__init__()

        # Input channels per point: xyz (3) + velocity (3) + RCS (1) = 7
        self.in_channels = config.get('radar_channels', 7)
        self.bev_h = config.get('bev_h', 200)
        self.bev_w = config.get('bev_w', 200)
        self.bev_range = config.get('bev_range', [-50, 50, -50, 50])  # [x_min, x_max, y_min, y_max]

        # Per-point feature MLP
        self.pillar_encoder = nn.Sequential(
            nn.Linear(self.in_channels, 64),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Linear(64, 128),
            nn.BatchNorm1d(128),
            nn.ReLU()
        )

        # BEV convolutions
        self.bev_conv = nn.Sequential(
            nn.Conv2d(128, 256, 3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU()
        )

    def forward(self, radar_points: Dict[str, torch.Tensor]) -> torch.Tensor:
        """
        Forward pass.

        Args:
            radar_points: radar point cloud dictionary
                - 'xyz': (B, N, 3) positions
                - 'velocity': (B, N, 3) velocities
                - 'rcs': (B, N, 1) RCS values

        Returns:
            bev_features: (B, C, H, W) BEV feature map
        """
        B = radar_points['xyz'].shape[0]

        # Build the input features
        xyz = radar_points['xyz']            # (B, N, 3)
        velocity = radar_points['velocity']  # (B, N, 3)
        rcs = radar_points['rcs']            # (B, N, 1)

        # Concatenate per-point features
        features = torch.cat([xyz, velocity, rcs], dim=-1)  # (B, N, 7)

        # PointNet-style encoding
        features_flat = features.reshape(-1, features.shape[-1])
        encoded = self.pillar_encoder(features_flat)  # (B*N, 128)
        encoded = encoded.view(B, -1, 128)            # (B, N, 128)

        # Project onto the BEV grid
        bev_features = self._scatter_to_bev(encoded, xyz)

        # BEV convolutions
        bev_features = self.bev_conv(bev_features)

        return bev_features

    def _scatter_to_bev(self, features: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
        """
        Scatter point features onto the BEV grid (mean over points per cell).

        Args:
            features: (B, N, C) point features
            xyz: (B, N, 3) point coordinates

        Returns:
            bev: (B, C, H, W) BEV feature map
        """
        B, N, C = features.shape
        H, W = self.bev_h, self.bev_w

        # Initialize the BEV grid and the per-cell point counter
        bev = torch.zeros(B, C, H, W, device=features.device, dtype=features.dtype)
        count = torch.zeros(B, 1, H, W, device=features.device, dtype=features.dtype)

        x = xyz[..., 0]  # (B, N)
        y = xyz[..., 1]  # (B, N)

        # Normalize to grid indices; floor() before long() so that points just
        # below the range do not get truncated into cell 0
        x_min, x_max, y_min, y_max = self.bev_range
        grid_x = torch.floor((x - x_min) / (x_max - x_min) * W).long()
        grid_y = torch.floor((y - y_min) / (y_max - y_min) * H).long()

        # Bounds check
        valid = (grid_x >= 0) & (grid_x < W) & (grid_y >= 0) & (grid_y < H)

        for b in range(B):
            valid_b = valid[b]
            flat_idx = grid_y[b, valid_b] * W + grid_x[b, valid_b]  # (M,)
            feat = features[b, valid_b]                             # (M, C)
            ones = torch.ones(1, flat_idx.numel(), device=features.device, dtype=features.dtype)

            # index_add_ accumulates correctly when several points share a cell;
            # a plain `bev[b, :, gy, gx] += feat.T` would keep only one of them
            bev[b].view(C, H * W).index_add_(1, flat_idx, feat.T)
            count[b].view(1, H * W).index_add_(1, flat_idx, ones)

        # Mean pooling per cell
        count = count.clamp(min=1)
        bev = bev / count

        return bev


class CameraBEVEncoder(nn.Module):
    """
    Camera BEV encoder.

    Converts multi-view images into BEV features.
    """

    def __init__(self, config: Dict):
        super().__init__()

        self.bev_h = config.get('bev_h', 200)
        self.bev_w = config.get('bev_w', 200)

        # Image backbone
        self.backbone = self._build_backbone(config.get('backbone', 'resnet50'))

        # Channel reduction: the truncated ResNet-50 outputs 2048 channels
        self.bev_proj = nn.Sequential(
            nn.Conv2d(2048, 256, 1),
            nn.BatchNorm2d(256),
            nn.ReLU()
        )

        # Depth estimation (for LSS); not used by the placeholder projection below
        self.depth_net = nn.Sequential(
            nn.Conv2d(256, 256, 3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.Conv2d(256, config.get('depth_bins', 64), 1)
        )

        self.depth_bins = config.get('depth_bins', 64)
        self.depth_max = config.get('depth_max', 50.0)

    def _build_backbone(self, name: str) -> nn.Module:
        """Build the backbone network."""
        if name == 'resnet50':
            # Downloads ImageNet weights on first use
            model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
            # Drop the average pooling and classification layers
            return nn.Sequential(*list(model.children())[:-2])
        raise ValueError(f"Unsupported: {name}")

    def forward(self, images: torch.Tensor, intrinsics: torch.Tensor, extrinsics: torch.Tensor) -> torch.Tensor:
        """
        Forward pass.

        Args:
            images: (B, N, 3, H, W) multi-view images
            intrinsics: (B, N, 3, 3) intrinsic matrices
            extrinsics: (B, N, 4, 4) extrinsic matrices

        Returns:
            bev_features: (B, C, H, W) BEV feature map
        """
        B, N, C, H, W = images.shape

        # Flatten the views into the batch dimension
        images_flat = images.reshape(B * N, C, H, W)

        # Backbone features
        features = self.backbone(images_flat)  # (B*N, 2048, H/32, W/32)

        # Channel reduction
        features = self.bev_proj(features)  # (B*N, 256, H/32, W/32)

        # LSS BEV projection (simplified)
        bev_features = self._lift_splat_shoot(features, intrinsics, extrinsics, B, N)

        return bev_features

    def _lift_splat_shoot(self, features, intrinsics, extrinsics, B, N):
        """LSS BEV projection (placeholder)."""
        # A real implementation needs the full LSS algorithm;
        # this placeholder returns an all-zero map of the right shape
        return torch.zeros(B, 256, self.bev_h, self.bev_w, device=features.device)


class RadarCameraFusion(nn.Module):
    """
    Radar-camera fusion module.

    Attention-based fusion following the description in the paper.
    """

    def __init__(self, config: Dict):
        super().__init__()

        self.channels = config.get('channels', 256)

        # Cross-modal attention
        self.cross_attention = nn.MultiheadAttention(
            embed_dim=self.channels,
            num_heads=8,
            batch_first=True
        )

        # Fusion convolutions
        self.fusion_conv = nn.Sequential(
            nn.Conv2d(self.channels * 2, self.channels, 3, padding=1),
            nn.BatchNorm2d(self.channels),
            nn.ReLU(),
            nn.Conv2d(self.channels, self.channels, 3, padding=1),
            nn.BatchNorm2d(self.channels),
            nn.ReLU()
        )

    def forward(self, radar_bev: torch.Tensor, camera_bev: torch.Tensor) -> torch.Tensor:
        """
        Fuse radar and camera BEV features.

        Args:
            radar_bev: (B, C, H, W) radar BEV features
            camera_bev: (B, C, H, W) camera BEV features

        Returns:
            fused_bev: (B, C, H, W) fused BEV features
        """
        B, C, H, W = radar_bev.shape

        # Flatten to sequences. Global attention needs O((H*W)^2) memory,
        # so this only works for small BEV grids.
        radar_seq = radar_bev.flatten(2).transpose(1, 2)    # (B, H*W, C)
        camera_seq = camera_bev.flatten(2).transpose(1, 2)  # (B, H*W, C)

        # Cross-modal attention: radar queries attend to camera keys/values
        attended, _ = self.cross_attention(radar_seq, camera_seq, camera_seq)
        attended = attended.transpose(1, 2).reshape(B, C, H, W)

        # Concatenate and fuse
        concat = torch.cat([radar_bev, attended], dim=1)
        fused = self.fusion_conv(concat)

        return fused


class RCBEVDet(nn.Module):
    """
    RCBEVDet: radar-camera BEV detector.

    Paper: CVPR 2024
    """

    def __init__(self, config: Dict):
        super().__init__()

        # Encoders
        self.radar_encoder = RadarFeatureEncoder(config)
        self.camera_encoder = CameraBEVEncoder(config)

        # Fusion module
        self.fusion = RadarCameraFusion(config)

        # Regression head
        self.det_head = nn.Sequential(
            nn.Conv2d(256, 256, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(256, 10, 1)  # 10 = 3 (center) + 3 (size) + 2 (heading) + 2 (velocity)
        )

        # Classification head
        self.cls_head = nn.Sequential(
            nn.Conv2d(256, 128, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(128, 1, 1),
            nn.Sigmoid()
        )

    def forward(self, radar_data: Dict, camera_data: Dict) -> Dict:
        """
        Forward pass.

        Args:
            radar_data: radar data
            camera_data: camera data

        Returns:
            outputs: detection results
        """
        # Encode
        radar_bev = self.radar_encoder(radar_data)
        camera_bev = self.camera_encoder(
            camera_data['images'],
            camera_data['intrinsics'],
            camera_data['extrinsics']
        )

        # Fuse
        fused_bev = self.fusion(radar_bev, camera_bev)

        # Detect
        det_output = self.det_head(fused_bev)
        cls_output = self.cls_head(fused_bev)

        return {
            'detection': det_output,
            'classification': cls_output,
            'bev_features': fused_bev
        }


# Test example
if __name__ == "__main__":
    config = {
        'radar_channels': 7,  # xyz (3) + velocity (3) + RCS (1)
        # Small grid for the demo: global attention over a 200x200 grid
        # (40,000 tokens) would exhaust memory
        'bev_h': 32,
        'bev_w': 32,
        'bev_range': [-50, 50, -50, 50],
        'channels': 256,
        'backbone': 'resnet50',
        'depth_bins': 64,
        'depth_max': 50.0
    }

    model = RCBEVDet(config)
    model.eval()

    # Simulated inputs
    B = 2
    N_radar = 500
    N_cam = 6

    radar_data = {
        'xyz': torch.randn(B, N_radar, 3),
        'velocity': torch.randn(B, N_radar, 3),
        'rcs': torch.randn(B, N_radar, 1)
    }

    camera_data = {
        'images': torch.randn(B, N_cam, 3, 224, 224),
        'intrinsics': torch.eye(3).unsqueeze(0).unsqueeze(0).expand(B, N_cam, -1, -1),
        'extrinsics': torch.eye(4).unsqueeze(0).unsqueeze(0).expand(B, N_cam, -1, -1)
    }

    # Forward pass
    with torch.no_grad():
        outputs = model(radar_data, camera_data)

    print(f"Detection output shape: {outputs['detection'].shape}")
    print(f"Classification output shape: {outputs['classification'].shape}")
    print(f"BEV feature shape: {outputs['bev_features'].shape}")
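
The LSS projection in the listing is only a stub. As a rough illustration of its first stage, the "lift" step, the sketch below turns per-pixel depth logits into a distribution over depth bins and spreads each pixel's feature along its camera ray. It is written in NumPy for brevity, the function name `lift_features` is hypothetical, and the later "splat" stage (pooling the frustum into BEV cells using intrinsics and extrinsics) is left out.

```python
import numpy as np

def lift_features(feat, depth_logits):
    """Lift image features into a camera frustum.

    feat: (C, H, W) image features.
    depth_logits: (D, H, W) per-pixel scores over D depth bins.
    Returns (C, D, H, W): each pixel's feature weighted by its depth distribution.
    """
    z = depth_logits - depth_logits.max(axis=0, keepdims=True)  # stable softmax
    prob = np.exp(z)
    prob /= prob.sum(axis=0, keepdims=True)                     # (D, H, W)
    return feat[:, None] * prob[None]                           # outer product per pixel

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 4, 6))
logits = rng.normal(size=(16, 4, 6))
frustum = lift_features(feat, logits)
print(frustum.shape)
```

Because the depth weights of each pixel sum to one, summing the frustum over the depth axis recovers the original feature map. That makes a convenient sanity check when wiring this into a real `_lift_splat_shoot`.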

CPD application

Radar-camera fusion for child presence detection

"""
Radar-camera fusion for CPD (child presence detection)
Application driven by Euro NCAP 2026 requirements
"""

class CPD_RadarCameraFusion(nn.Module):
    """
    CPD radar-camera fusion detector.

    Detects children left behind in a vehicle. Reuses RadarFeatureEncoder,
    CameraBEVEncoder, and RadarCameraFusion from the RCBEVDet listing.
    """

    def __init__(self, config: Dict):
        super().__init__()

        # Cabin radar encoder
        self.radar_encoder = RadarFeatureEncoder(config)

        # Cabin camera encoder
        self.camera_encoder = CameraBEVEncoder(config)

        # Fusion module
        self.fusion = RadarCameraFusion(config)

        # Child detection head
        self.child_detector = nn.Sequential(
            nn.Conv2d(256, 128, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(128, 1, 1),
            nn.Sigmoid()
        )

        # Vital-sign head
        self.vital_signs = nn.Sequential(
            nn.Conv2d(256, 64, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 2, 1)  # breathing rate, heart rate
        )

    def forward(self, radar_data: Dict, camera_data: Dict) -> Dict:
        """
        Forward pass.

        Args:
            radar_data: cabin radar data (60 GHz mmWave)
            camera_data: cabin camera images

        Returns:
            outputs: CPD detection results
        """
        # Encode
        radar_bev = self.radar_encoder(radar_data)
        camera_bev = self.camera_encoder(
            camera_data['images'],
            camera_data['intrinsics'],
            camera_data['extrinsics']
        )

        # Fuse
        fused_bev = self.fusion(radar_bev, camera_bev)

        # Child detection
        child_prob = self.child_detector(fused_bev)

        # Vital signs
        vitals = self.vital_signs(fused_bev)

        return {
            'child_presence_probability': child_prob,
            'vital_signs': vitals,
            'breathing_rate': vitals[:, 0],
            'heart_rate': vitals[:, 1]
        }


# Euro NCAP CPD test scenarios
class EuroNCAP_CPDTest:
    """Euro NCAP CPD test scenarios."""

    def __init__(self):
        # 7 radar channels = xyz (3) + velocity (3) + RCS (1); the small BEV grid
        # keeps the global cross-attention affordable (assumed values)
        self.model = CPD_RadarCameraFusion({
            'channels': 256,
            'radar_channels': 7,
            'bev_h': 32,
            'bev_w': 32
        })
        self.model.eval()

        # Test scenarios
        self.test_scenarios = [
            {
                'id': 'CPD-01',
                'description': '6-month-old infant left alone in the rear seat',
                'expected_detection': True,
                'time_limit': 60  # seconds
            },
            {
                'id': 'CPD-02',
                'description': '3-year-old child left alone in the rear seat',
                'expected_detection': True,
                'time_limit': 60
            },
            {
                'id': 'CPD-03',
                'description': 'Pet left in the vehicle',
                'expected_detection': True,  # optional
                'time_limit': 60
            },
            {
                'id': 'CPD-04',
                'description': 'Empty vehicle',
                'expected_detection': False,
                'time_limit': 60
            }
        ]

    def run_test(self, scenario_id: str, radar_data: Dict, camera_data: Dict) -> Dict:
        """Run one scenario and compare the detection with the expected result."""
        scenario = next(s for s in self.test_scenarios if s['id'] == scenario_id)

        # Model inference
        with torch.no_grad():
            output = self.model(radar_data, camera_data)

        # Decision: convert to plain Python values so the comparison below
        # yields a bool rather than a tensor
        confidence = output['child_presence_probability'].max().item()
        detected = confidence > 0.5

        return {
            'scenario_id': scenario_id,
            'expected': scenario['expected_detection'],
            'detected': detected,
            'passed': detected == scenario['expected_detection'],
            'confidence': confidence
        }
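
`run_test` judges a single forward pass, while each scenario carries a `time_limit` that the harness never uses. One possible way to connect the two is to require several consecutive positive frames before declaring presence and then check whether that happened inside the time budget. The helper `confirm_presence` and its default thresholds below are assumptions for illustration, not part of any Euro NCAP protocol.

```python
def confirm_presence(probs, threshold=0.5, min_consecutive=3):
    """Return the index of the frame at which presence is confirmed, or None.

    probs: per-frame presence probabilities in time order.
    Presence counts as confirmed once min_consecutive frames in a row
    exceed threshold; a single low frame resets the run.
    """
    run = 0
    for i, p in enumerate(probs):
        run = run + 1 if p > threshold else 0
        if run >= min_consecutive:
            return i
    return None

steady = confirm_presence([0.2, 0.9, 0.3, 0.7, 0.8, 0.9, 0.4])
flicker = confirm_presence([0.9, 0.1, 0.9, 0.1])
print(steady, flicker)
```

Multiplying the returned frame index by the frame period gives a detection latency that can be compared against `time_limit`.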

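Dataset summary

Dataset      Year  Sensors                 Scenario            Size
nuScenes     2019  Radar + LiDAR + camera  Autonomous driving  1.4M camera images
Waymo Open   2019  LiDAR + camera          Autonomous driving  200K frames
VoD          2022  mmWave radar + camera   Autonomous driving  8.5K frames
TJ4DRadSet   2022  4D radar + camera       Autonomous driving  7K frames
K-Radar      2023  4D radar + camera       Autonomous driving  35K frames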

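Takeaways for IMS development

1. Sensor selection

Sensor           Use                       Recommended part
60 GHz radar     CPD, vital signs          TI IWR6843AOP
In-cabin camera  OMS, pose detection       OV2311 RGB-IR
External camera  DMS, gaze tracking        OV2311 global shutter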

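2. Fusion strategy

Strategy              Strength                          Weakness
Feature-level fusion  Balances accuracy and efficiency  Requires feature alignment
BEV fusion            Unified representation            High computational cost
Attention fusion      Adaptive weights                  Complex to train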
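
The "unified representation" of BEV fusion boils down to rasterizing every sensor's data onto one top-down grid. The NumPy sketch below averages point features per grid cell. The function `scatter_mean_bev` and the toy grid are illustrative assumptions.

```python
import numpy as np

def scatter_mean_bev(xy, feats, bev_range, grid_hw):
    """Average point features (N, C) into a (C, H, W) BEV grid.

    xy: (N, 2) point positions. bev_range: [x_min, x_max, y_min, y_max].
    Points outside the range are dropped.
    """
    x_min, x_max, y_min, y_max = bev_range
    h, w = grid_hw
    gx = np.floor((xy[:, 0] - x_min) / (x_max - x_min) * w).astype(int)
    gy = np.floor((xy[:, 1] - y_min) / (y_max - y_min) * h).astype(int)
    keep = (gx >= 0) & (gx < w) & (gy >= 0) & (gy < h)
    flat = gy[keep] * w + gx[keep]                 # flattened cell index

    c = feats.shape[1]
    acc = np.zeros((h * w, c))
    cnt = np.zeros(h * w)
    # np.add.at accumulates repeated indices; `acc[flat] += ...` would not
    np.add.at(acc, flat, feats[keep])
    np.add.at(cnt, flat, 1.0)
    acc /= np.maximum(cnt, 1.0)[:, None]           # mean per occupied cell
    return acc.T.reshape(c, h, w)

xy = np.array([[-0.5, -0.5], [-0.4, -0.6], [0.5, 0.5], [9.0, 9.0]])
feats = np.array([[1.0], [3.0], [5.0], [100.0]])
bev = scatter_mean_bev(xy, feats, [-1, 1, -1, 1], (2, 2))
print(bev[0])
```

The first two points share a cell and are averaged, and the last point lies outside the range and is ignored. Once radar and camera features live on the same grid, fusing them can be as simple as a channel-wise concatenation.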

3. Deployment optimization

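# Quantized deployment example
def quantize_model(model):
    """Quantize a model for deployment on edge devices."""
    model.eval()

    # Dynamic quantization converts Linear (and recurrent) layers only;
    # Conv2d layers stay in float and would need static quantization
    quantized = torch.quantization.quantize_dynamic(
        model,
        {nn.Linear},
        dtype=torch.qint8
    )

    return quantized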
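
To show what int8 quantization does to a weight tensor, here is a simplified symmetric per-tensor scheme in NumPy. It illustrates the arithmetic only and is not necessarily the scheme PyTorch applies internally. `quantize_int8` and `dequantize` are hypothetical helpers.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization, so that w is roughly scale * q."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float values."""
    return q.astype(np.float32) * scale

w = np.array([-1.0, -0.5, 0.0, 0.25, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
err = float(np.abs(dequantize(q, scale) - w).max())
print(q, scale, err)
```

Each weight shrinks from 32 bits to 8, and the rounding error per weight stays within half a quantization step (`scale / 2`). That is the accuracy-for-size trade behind edge deployment.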

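Summary

Radar-camera fusion is a key technical path for CPD and OMS:

  1. Radar strengths: penetrates occlusions and can detect vital signs
  2. Camera strengths: high resolution and precise pose estimation
  3. Fusion strengths: complementary, robust, all-weather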
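
As an illustration of the vital-sign point, the sketch below estimates breathing and heart rate from a chest-displacement time series by picking the strongest FFT peak in each frequency band. It assumes the radar front end already delivers such a displacement signal (phase extraction and unwrapping are omitted). The signal here is synthetic, and `dominant_rate_bpm`, the sampling rate, and the band limits are illustrative assumptions.

```python
import numpy as np

def dominant_rate_bpm(signal, fs, band_hz):
    """Return the strongest spectral peak inside band_hz, in cycles per minute."""
    x = signal - signal.mean()                              # remove DC offset
    spectrum = np.abs(np.fft.rfft(x * np.hanning(len(x))))  # windowed magnitude spectrum
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    lo, hi = band_hz
    mask = (freqs >= lo) & (freqs <= hi)
    return 60.0 * freqs[mask][np.argmax(spectrum[mask])]

fs = 20.0                          # samples per second (assumed)
t = np.arange(1200) / fs           # 60 s observation window
# Synthetic chest motion: 0.5 Hz breathing plus a weaker 2 Hz heartbeat
chest = 1.0 * np.sin(2 * np.pi * 0.5 * t) + 0.1 * np.sin(2 * np.pi * 2.0 * t)
breathing = dominant_rate_bpm(chest, fs, (0.1, 1.0))
heart = dominant_rate_bpm(chest, fs, (1.0, 3.0))
print(breathing, heart)
```

A camera cannot observe this sub-millimeter motion through a blanket, which is where the radar branch complements the image branch.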

Original post: https://dapalm.com/2026/06/15/2026-06-15-Radar-Camera-Fusion-Survey-Paper-CPD-OMS/
Author: Mars
Published: June 15, 2026