MediaPipe Series 30: Object Detection, a Complete Guide to General-Purpose Object Detection

Preface: Why Do We Need General-Purpose Object Detection?

30.1 Why Object Detection Matters

Where general-purpose object detection is used in IMS/OMS:

┌─────────────────────────────────────────────────────────────────────────┐
│                    Object Detection Applications in IMS/OMS             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   IMS/OMS scenario requirements:                                        │
│   ┌─────────────────────────────────────────────────────────┐           │
│   │                                                         │           │
│   │   • In-cabin objects (left-behind items, valuables)     │           │
│   │   • Child seat detection (supports CPD)                 │           │
│   │   • Pet detection (left-behind pet warning)             │           │
│   │   • Driver handheld objects (phone, smoking, drinking)  │           │
│   │   • Exterior: pedestrians, vehicles, traffic signs      │           │
│   │                                                         │           │
│   └─────────────────────────────────────────────────────────┘           │
│                                                                         │
│   MediaPipe Object Detection features:                                  │
│   ┌─────────────────────────────────────────────────────────┐           │
│   │                                                         │           │
│   │   • Multiple pretrained models (SSD, EfficientDet)      │           │
│   │   • Custom models supported                             │           │
│   │   • Real-time (~10 ms on GPU, device-dependent)         │           │
│   │   • Multi-class detection                               │           │
│   │                                                         │           │
│   └─────────────────────────────────────────────────────────┘           │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

30.2 支持的模型

Model	Params	Compute	FPS (GPU)	Accuracy (COCO mAP)
SSD MobileNet V2	3.4M	0.9B	30+	21%
SSD MobileNet V2 FPN	4.5M	1.2B	25+	22%
EfficientDet-Lite0	3.9M	0.8B	25+	25%
EfficientDet-Lite1	4.6M	1.4B	20+	30%
EfficientDet-Lite2	5.4M	2.0B	15+	33%

30.3 COCO Classes

┌─────────────────────────────────────────────────────────────────────────┐
│                    The 80 COCO Classes                                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   IMS-relevant classes (0-based contiguous indices):                    │
│   ┌─────────────────────────────────────────────────────────┐           │
│   │                                                         │           │
│   │    0: person (driver/occupant detection)                │           │
│   │    1: bicycle                                           │           │
│   │    2: car                                               │           │
│   │    3: motorcycle                                        │           │
│   │    5: bus                                               │           │
│   │    7: truck                                             │           │
│   │    9: traffic light                                     │           │
│   │   11: stop sign                                         │           │
│   │   13: bench                                             │           │
│   │   63: laptop                                            │           │
│   │   67: cell phone                                        │           │
│   │   73: book                                              │           │
│   │   77: teddy bear                                        │           │
│   │                                                         │           │
│   └─────────────────────────────────────────────────────────┘           │
│                                                                         │
│   Other classes: vehicles, animals, food, furniture, electronics, etc.  │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

31. SSD Architecture in Detail

31.1 SSD (Single Shot MultiBox Detector)

┌─────────────────────────────────────────────────────────────────────────┐
│                    SSD Architecture                                     │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   SSD characteristics:                                                  │
│   ┌─────────────────────────────────────────────────────────┐           │
│   │                                                         │           │
│   │   • One-stage detector                                  │           │
│   │   • Predictions from multi-scale feature maps           │           │
│   │   • Default box (anchor) mechanism                      │           │
│   │   • End-to-end training                                 │           │
│   │                                                         │           │
│   └─────────────────────────────────────────────────────────┘           │
│                                                                         │
│   SSD MobileNet V2 architecture (320×320 config):                       │
│   ┌─────────────────────────────────────────────────────────┐           │
│   │                                                         │           │
│   │   Input: 320×320×3 RGB                                  │           │
│   │         │                                               │           │
│   │         ▼                                               │           │
│   │   Backbone: MobileNet V2                                │           │
│   │         │                                               │           │
│   │         ├─► Feature map 1: 20×20×576                    │           │
│   │         │    └─► Prediction head: class + box           │           │
│   │         │                                               │           │
│   │         ├─► Feature map 2: 10×10×1280                   │           │
│   │         │    └─► Prediction head: class + box           │           │
│   │         │                                               │           │
│   │         ├─► Feature map 3: 5×5×512                      │           │
│   │         │    └─► Prediction head: class + box           │           │
│   │         │                                               │           │
│   │         ├─► Feature map 4: 3×3×256                      │           │
│   │         │    └─► Prediction head: class + box           │           │
│   │         │                                               │           │
│   │         ├─► Feature map 5: 2×2×256                      │           │
│   │         │    └─► Prediction head: class + box           │           │
│   │         │                                               │           │
│   │         └─► Feature map 6: 1×1×128                      │           │
│   │              └─► Prediction head: class + box           │           │
│   │                                                         │           │
│   │   Postprocessing: NMS (non-maximum suppression)         │           │
│   │         │                                               │           │
│   │         ▼                                               │           │
│   │   Output: boxes + classes + scores                      │           │
│   │                                                         │           │
│   └─────────────────────────────────────────────────────────┘           │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
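The number of default boxes follows directly from the feature-map sizes. The sketch below assumes the common TF Object Detection API anchor settings for a 320×320 input (six maps with strides 16 to 512, min_scale 0.2, max_scale 0.95, 3 boxes per cell on the first map and 6 on the others); treat these numbers as assumptions and check them against your own pipeline config. The scale rule is the linear one from the SSD paper, written with a 0-based layer index.

```python
import math


def feature_map_sizes(input_size, strides):
    """Side length of each prediction map, using SAME-style ceil rounding."""
    return [math.ceil(input_size / s) for s in strides]


def anchor_scales(num_layers, min_scale, max_scale):
    """SSD linear scale rule: s_k = s_min + (s_max - s_min) * k / (m - 1), k = 0..m-1."""
    return [min_scale + (max_scale - min_scale) * k / (num_layers - 1)
            for k in range(num_layers)]


def count_default_boxes(sizes, boxes_per_cell):
    """Total default boxes = sum over maps of H * W * boxes per cell."""
    return sum(n * n * b for n, b in zip(sizes, boxes_per_cell))


if __name__ == "__main__":
    strides = [16, 32, 64, 128, 256, 512]          # assumed strides
    boxes_per_cell = [3, 6, 6, 6, 6, 6]            # assumed boxes per cell
    sizes = feature_map_sizes(320, strides)
    print(sizes)                                   # [20, 10, 5, 3, 2, 1]
    print(count_default_boxes(sizes, boxes_per_cell))  # 2034
    print([round(s, 2) for s in anchor_scales(6, 0.2, 0.95)])
```

With a 300×300 input the same rule gives a 19×19 first map and 1917 boxes, which is why 19×19 appears in many SSD diagrams.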

31.2 EfficientDet Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                    EfficientDet Architecture                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   EfficientDet characteristics:                                         │
│   ┌─────────────────────────────────────────────────────────┐           │
│   │                                                         │           │
│   │   • EfficientNet backbone                               │           │
│   │   • BiFPN (bidirectional feature pyramid network)       │           │
│   │   • Compound scaling                                    │           │
│   │   • Better accuracy/efficiency trade-off                │           │
│   │                                                         │           │
│   └─────────────────────────────────────────────────────────┘           │
│                                                                         │
│   EfficientDet-Lite0 architecture:                                      │
│   ┌─────────────────────────────────────────────────────────┐           │
│   │                                                         │           │
│   │   Input: 320×320×3 RGB (Lite1: 384, Lite2: 448)         │           │
│   │         │                                               │           │
│   │         ▼                                               │           │
│   │   Backbone: EfficientNet-Lite0                          │           │
│   │         │                                               │           │
│   │         ▼                                               │           │
│   │   BiFPN (Bidirectional Feature Pyramid Network)         │           │
│   │         │                                               │           │
│   │         ├─► P3: 40×40×64                                │           │
│   │         ├─► P4: 20×20×64                                │           │
│   │         ├─► P5: 10×10×64                                │           │
│   │         ├─► P6: 5×5×64                                  │           │
│   │         └─► P7: 3×3×64                                  │           │
│   │                                                         │           │
│   │   (BiFPN keeps one channel width, 64, at all levels)    │           │
│   │                                                         │           │
│   │   Class head + box head                                 │           │
│   │         │                                               │           │
│   │         ▼                                               │           │
│   │   Output: boxes + classes + scores                      │           │
│   │                                                         │           │
│   └─────────────────────────────────────────────────────────┘           │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
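Two details of the EfficientDet pyramid can be checked with a few lines of code: the level sizes (level Pl has stride 2^l, rounded up, which is why P7 is 3×3 rather than 2.5×2.5 for a 320 input), and BiFPN's fast normalized fusion, defined in the EfficientDet paper as O = Σ w_i·I_i / (ε + Σ w_j) with ReLU keeping each w_i ≥ 0. This is a sketch on flat Python lists instead of tensors, using ε = 1e-4 as in the paper; it illustrates the formula and is not the TFLite model's code.

```python
import math


def pyramid_sizes(input_size, levels=range(3, 8)):
    """Feature-map side length for P3..P7: ceil(input / 2**level)."""
    return {f"P{l}": math.ceil(input_size / 2 ** l) for l in levels}


def fast_normalized_fusion(inputs, weights, eps=1e-4):
    """BiFPN fast normalized fusion over equally shaped flat feature lists.

    O = sum_i(w_i * I_i) / (eps + sum_j w_j), with w_i clamped by ReLU.
    """
    w = [max(0.0, x) for x in weights]          # ReLU keeps weights non-negative
    denom = eps + sum(w)
    length = len(inputs[0])
    return [sum(wi * feat[k] for wi, feat in zip(w, inputs)) / denom
            for k in range(length)]


if __name__ == "__main__":
    print(pyramid_sizes(320))  # {'P3': 40, 'P4': 20, 'P5': 10, 'P6': 5, 'P7': 3}
    print(pyramid_sizes(384))
    # Fuse a top-down feature with a lateral feature, weighted 1:3
    print(fast_normalized_fusion([[2.0, 4.0], [6.0, 8.0]], [1.0, 3.0]))
```

Normalizing by the weight sum (instead of a softmax) is the paper's cheaper alternative; the result stays a convex-like blend of the inputs.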

32. Graph Configuration

32.1 Complete Graph Configuration

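# ========== Object Detection Graph Configuration ==========

# Simplified sketch modeled on mediapipe/graphs/object_detection/object_detection_desktop_live.pbtxt.
# ObjectDetectionCalculator and DetectionToAnnotationCalculator are custom calculators;
# the stock graph splits this work across TFLite inference, NMS, and rendering calculators.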

input_stream: "INPUT:Input"

output_stream: "DETECTIONS:detections"
output_stream: "ANNOTATIONS:annotations"

# ========== 1. Image format conversion ==========
node {
  calculator: "ImageTransformationCalculator"
  input_stream: "INPUT:Input"
  output_stream: "IMAGE:converted_image"
  options {
    [mediapipe.ImageTransformationCalculatorOptions.ext] {
      output_format: SRGB
    }
  }
}

# ========== 2. Object detection ==========
node {
  calculator: "ObjectDetectionCalculator"
  input_stream: "IMAGE:converted_image"
  output_stream: "DETECTIONS:detections"
  options {
    [mediapipe.ObjectDetectionOptions.ext] {
      model_path: "/models/ssd_mobilenet_v2.tflite"
      score_threshold: 0.5
      max_detections: 10
    }
  }
}

# ========== 3. Visualization annotations ==========
node {
  calculator: "DetectionToAnnotationCalculator"
  input_stream: "DETECTIONS:detections"
  input_stream: "IMAGE:converted_image"
  output_stream: "ANNOTATIONS:annotations"
}

32.2 Object Detection Calculator

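// object_detection_calculator.cc
// Custom calculator for illustration; not part of stock MediaPipe.
// LoadTFLiteModel, CreateInterpreter, LoadLabels, ImageFrameToMat and
// CopyToInputTensor are hypothetical helpers assumed to be defined elsewhere.
#include <algorithm>
#include <memory>
#include <string>
#include <vector>

#include "mediapipe/framework/calculator_framework.h"
#include "mediapipe/framework/formats/detection.pb.h"
#include "mediapipe/framework/formats/image_frame.h"
#include "mediapipe/framework/port/opencv_imgproc_inc.h"
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/model.h"
// Hypothetical generated header for the custom ObjectDetectionOptions proto.
#include "object_detection_options.pb.h"

namespace mediapipe {

// ========== Object Detection Calculator ==========
class ObjectDetectionCalculator : public CalculatorBase {
 public:
  static absl::Status GetContract(CalculatorContract* cc) {
    cc->Inputs().Tag("IMAGE").Set<ImageFrame>();
    cc->Outputs().Tag("DETECTIONS").Set<std::vector<Detection>>();

    cc->Options<ObjectDetectionOptions>();
    return absl::OkStatus();
  }

  absl::Status Open(CalculatorContext* cc) override {
    const auto& options = cc->Options<ObjectDetectionOptions>();

    // Load the model and build an interpreter (hypothetical helpers)
    model_ = LoadTFLiteModel(options.model_path());
    if (model_ == nullptr) {
      return absl::NotFoundError("Failed to load TFLite model");
    }
    interpreter_ = CreateInterpreter(*model_);
    if (interpreter_ == nullptr ||
        interpreter_->AllocateTensors() != kTfLiteOk) {
      return absl::InternalError("Failed to set up the TFLite interpreter");
    }

    score_threshold_ = options.score_threshold();
    max_detections_ = options.max_detections();

    // Load class labels
    labels_ = LoadLabels(options.label_path());

    return absl::OkStatus();
  }

  absl::Status Process(CalculatorContext* cc) override {
    const auto& image = cc->Inputs().Tag("IMAGE").Get<ImageFrame>();

    // ========== 1. Preprocessing ==========
    cv::Mat input_mat = ImageFrameToMat(image);
    cv::Mat resized;
    cv::resize(input_mat, resized, cv::Size(320, 320));

    // Normalize uint8 [0, 255] to float [-1, 1] (float SSD MobileNet input range)
    resized.convertTo(resized, CV_32F, 1.0 / 127.5, -1.0);

    // ========== 2. Inference ==========
    CopyToInputTensor(resized, interpreter_->input_tensor(0));
    if (interpreter_->Invoke() != kTfLiteOk) {
      return absl::InternalError("TFLite inference failed");
    }

    // ========== 3. Postprocessing ==========
    // Output order of the TFLite_Detection_PostProcess op
    auto detections = ParseDetections(
        interpreter_->output_tensor(0),  // boxes
        interpreter_->output_tensor(1),  // classes
        interpreter_->output_tensor(2),  // scores
        interpreter_->output_tensor(3)); // num_detections

    // Drop low-score detections. Detection.score is a repeated field,
    // so compare the top score, score(0).
    detections.erase(
        std::remove_if(detections.begin(), detections.end(),
            [this](const Detection& d) {
              return d.score_size() == 0 || d.score(0) < score_threshold_;
            }),
        detections.end());

    // NMS (TFLite_Detection_PostProcess already applies NMS; this is a safeguard)
    detections = NonMaxSuppression(detections, 0.5f);

    // Cap the number of detections
    if (detections.size() > static_cast<size_t>(max_detections_)) {
      detections.resize(max_detections_);
    }

    cc->Outputs().Tag("DETECTIONS").AddPacket(
        MakePacket<std::vector<Detection>>(detections).At(cc->InputTimestamp()));

    return absl::OkStatus();
  }

 private:
  std::unique_ptr<tflite::FlatBufferModel> model_;
  std::unique_ptr<tflite::Interpreter> interpreter_;
  float score_threshold_ = 0.5f;
  int max_detections_ = 10;
  std::vector<std::string> labels_;

  // Converts the raw output tensors into Detection protos (defined elsewhere)
  std::vector<Detection> ParseDetections(
      TfLiteTensor* boxes, TfLiteTensor* classes,
      TfLiteTensor* scores, TfLiteTensor* num_detections);

  // Greedy NMS over score-sorted detections (defined elsewhere)
  std::vector<Detection> NonMaxSuppression(
      const std::vector<Detection>& detections, float nms_threshold);
};

REGISTER_CALCULATOR(ObjectDetectionCalculator);

}  // namespace mediapipe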
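The calculator's post-processing chain (normalize the input, drop low scores, run NMS, cap the count) can be sketched in plain Python. This is an illustration of greedy class-agnostic NMS with the same 0.5 thresholds, not MediaPipe's implementation; boxes are assumed to be (ymin, xmin, ymax, xmax) in normalized coordinates, the layout TFLite_Detection_PostProcess emits.

```python
def normalize_pixel(value):
    """Map a uint8 pixel in [0, 255] to [-1, 1], like convertTo(..., 1/127.5, -1)."""
    return value / 127.5 - 1.0


def iou(a, b):
    """Intersection over union of two (ymin, xmin, ymax, xmax) boxes."""
    inter_h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def postprocess(detections, score_threshold=0.5, iou_threshold=0.5, max_detections=10):
    """Score filter -> greedy NMS -> cap, mirroring the calculator's Process().

    detections: list of (box, class_id, score) tuples.
    """
    candidates = [d for d in detections if d[2] >= score_threshold]
    candidates.sort(key=lambda d: d[2], reverse=True)  # highest score first
    kept = []
    for det in candidates:
        # Keep a box only if it does not overlap an already-kept box too much
        if all(iou(det[0], k[0]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept[:max_detections]


if __name__ == "__main__":
    dets = [
        ((0.10, 0.10, 0.50, 0.50), 0, 0.90),   # person
        ((0.12, 0.11, 0.52, 0.50), 0, 0.80),   # near-duplicate of the person box
        ((0.60, 0.60, 0.80, 0.90), 67, 0.70),  # cell phone
        ((0.00, 0.00, 0.20, 0.20), 39, 0.30),  # bottle, below the score threshold
    ]
    for box, cls, score in postprocess(dets):
        print(cls, score)
```

Sorting by score before the greedy pass is what makes NMS keep the best box of each overlapping cluster: here the 0.80 duplicate (IoU about 0.88 with the 0.90 box) is suppressed, while the non-overlapping phone box survives.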

33. IMS in Practice: In-Cabin Object Detection

33.1 In-Cabin Object Detection Graph

# ims_cabin_object_detection_graph.pbtxt

input_stream: "OMS_IMAGE:oms_image"
output_stream: "OBJECTS:detected_objects"
output_stream: "ALERT:alert"

# ========== 1. Object detection ==========
node {
  calculator: "ObjectDetectionCalculator"
  input_stream: "IMAGE:oms_image"
  output_stream: "DETECTIONS:detections"
  options {
    [mediapipe.ObjectDetectionOptions.ext] {
      model_path: "/models/cabin_objects.tflite"
      score_threshold: 0.5
      max_detections: 20
    }
  }
}

# ========== 2. Object filtering ==========
node {
  calculator: "ObjectFilterCalculator"
  input_stream: "DETECTIONS:detections"
  output_stream: "OBJECTS:filtered_objects"
  options {
    [mediapipe.ObjectFilterOptions.ext] {
      allowed_classes: ["person", "backpack", "handbag", "suitcase", "bottle", "cup", "cell phone", "laptop", "book"]
    }
  }
}

# ========== 3. Left-behind object detection ==========
node {
  calculator: "LeftBehindDetectorCalculator"
  input_stream: "OBJECTS:filtered_objects"
  output_stream: "LEFT_BEHIND:left_behind_objects"
  options {
    [mediapipe.LeftBehindDetectorOptions.ext] {
      absence_threshold_frames: 30
      presence_threshold_frames: 10
    }
  }
}

# ========== 4. Alert generation ==========
node {
  calculator: "CabinObjectAlertCalculator"
  input_stream: "LEFT_BEHIND:left_behind_objects"
  output_stream: "OBJECTS:detected_objects"
  output_stream: "ALERT:alert"
}
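The graph references an ObjectFilterCalculator whose implementation is not shown. A minimal sketch of what it could do, assuming the model emits 0-based COCO class indices (as listed in 30.3) and that allowed_classes holds label names: resolve the names to indices once, then keep only matching detections. COCO_LABELS is an excerpt covering just the classes this article uses; a real deployment would load the full label file that matches the model (a custom model such as cabin_objects.tflite may use a different order).

```python
# Excerpt of the 0-based contiguous COCO-80 indices (not the full list)
COCO_LABELS = {
    0: "person", 24: "backpack", 26: "handbag", 28: "suitcase",
    39: "bottle", 41: "cup", 63: "laptop", 67: "cell phone", 73: "book",
    77: "teddy bear",
}

# Same allowlist as the ObjectFilterCalculator options in the graph
ALLOWED_CLASSES = ["person", "backpack", "handbag", "suitcase", "bottle",
                   "cup", "cell phone", "laptop", "book"]


def build_allowed_ids(allowed_names, labels):
    """Resolve allowed label names to class indices; fail loudly on typos."""
    name_to_id = {name: idx for idx, name in labels.items()}
    missing = [n for n in allowed_names if n not in name_to_id]
    if missing:
        raise ValueError(f"unknown class names: {missing}")
    return {name_to_id[n] for n in allowed_names}


def filter_detections(detections, allowed_ids):
    """Keep (box, class_id, score) tuples whose class is in the allowlist."""
    return [d for d in detections if d[1] in allowed_ids]


if __name__ == "__main__":
    allowed = build_allowed_ids(ALLOWED_CLASSES, COCO_LABELS)
    dets = [((0.1, 0.1, 0.5, 0.5), 0, 0.9),    # person: kept
            ((0.6, 0.6, 0.8, 0.9), 77, 0.7),   # teddy bear: dropped
            ((0.2, 0.3, 0.4, 0.5), 24, 0.6)]   # backpack: kept
    print([d[1] for d in filter_detections(dets, allowed)])  # [0, 24]
```

Resolving names up front turns a misspelled entry in allowed_classes into an immediate error instead of a class that silently never passes the filter.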

33.2 Left-Behind Object Detection

// left_behind_detector_calculator.cc
// Custom IMS calculator (not part of stock MediaPipe). LeftBehindResult and
// LeftBehindDetectorOptions are assumed to be custom protos compiled separately.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

#include "mediapipe/framework/calculator_framework.h"
#include "mediapipe/framework/formats/detection.pb.h"
// Hypothetical generated headers for the custom protos.
#include "left_behind_detector_options.pb.h"
#include "left_behind_result.pb.h"

namespace mediapipe {

// ========== Left-behind result (defined in left_behind_result.proto) ==========
// message LeftBehindResult {
//   repeated Detection objects = 1;
//   bool has_left_behind = 2;
//   uint64 timestamp_ms = 3;
// }

// ========== Left Behind Detector Calculator ==========
class LeftBehindDetectorCalculator : public CalculatorBase {
 public:
  static absl::Status GetContract(CalculatorContract* cc) {
    cc->Inputs().Tag("OBJECTS").Set<std::vector<Detection>>();
    cc->Outputs().Tag("LEFT_BEHIND").Set<LeftBehindResult>();

    cc->Options<LeftBehindDetectorOptions>();
    return absl::OkStatus();
  }

  absl::Status Open(CalculatorContext* cc) override {
    const auto& options = cc->Options<LeftBehindDetectorOptions>();

    absence_threshold_frames_ = options.absence_threshold_frames();
    presence_threshold_frames_ = options.presence_threshold_frames();

    return absl::OkStatus();
  }

  absl::Status Process(CalculatorContext* cc) override {
    if (cc->Inputs().Tag("OBJECTS").IsEmpty()) {
      return absl::OkStatus();
    }

    const auto& current_objects =
        cc->Inputs().Tag("OBJECTS").Get<std::vector<Detection>>();

    // MediaPipe timestamps are in microseconds; convert to milliseconds
    const uint64_t current_time = cc->InputTimestamp().Value() / 1000;

    // ========== 1. Update tracking state ==========

    // Mark every tracked object as not yet seen in this frame
    for (auto& [id, info] : tracked_objects_) {
      info.found_this_frame = false;
    }

    // Update with the objects detected in this frame
    for (const auto& obj : current_objects) {
      const std::string obj_id = GenerateObjectId(obj);
      auto it = tracked_objects_.find(obj_id);

      if (it == tracked_objects_.end()) {
        // New object
        TrackedObjectInfo info;
        info.first_seen = current_time;
        info.last_seen = current_time;
        info.presence_count = 1;
        info.found_this_frame = true;
        info.detection = obj;
        tracked_objects_[obj_id] = info;
      } else {
        // Known object: refresh its state and latest detection
        it->second.last_seen = current_time;
        it->second.presence_count++;
        it->second.found_this_frame = true;
        it->second.detection = obj;
      }
    }

    // ========== 2. Detect left-behind objects ==========
    // Evaluate the "person left" transition once per frame: CheckPersonLeft()
    // updates person_present_last_frame_, so calling it once per object would
    // report the transition only for the first object.
    const bool person_left = CheckPersonLeft(current_objects);

    std::vector<Detection> left_behind_objects;
    if (person_left) {
      for (const auto& [id, info] : tracked_objects_) {
        // A stable, currently visible, non-person object counts as left behind
        if (info.found_this_frame && !IsPerson(info.detection) &&
            info.presence_count >= presence_threshold_frames_) {
          left_behind_objects.push_back(info.detection);
        }
      }
    }

    // ========== 3. Prune stale tracks ==========
    // Converts frames to milliseconds assuming ~30 fps (~33 ms per frame)
    const uint64_t absence_ms =
        static_cast<uint64_t>(absence_threshold_frames_) * 33;
    for (auto it = tracked_objects_.begin(); it != tracked_objects_.end(); ) {
      if (!it->second.found_this_frame &&
          (current_time - it->second.last_seen) > absence_ms) {
        it = tracked_objects_.erase(it);
      } else {
        ++it;
      }
    }

    // ========== 4. Output ==========
    LeftBehindResult result;
    for (const auto& obj : left_behind_objects) {
      *result.add_objects() = obj;
    }
    result.set_has_left_behind(!left_behind_objects.empty());
    result.set_timestamp_ms(current_time);

    cc->Outputs().Tag("LEFT_BEHIND").AddPacket(
        MakePacket<LeftBehindResult>(result).At(cc->InputTimestamp()));

    return absl::OkStatus();
  }

 private:
  struct TrackedObjectInfo {
    uint64_t first_seen = 0;
    uint64_t last_seen = 0;
    int presence_count = 0;
    bool found_this_frame = false;
    Detection detection;
  };

  std::map<std::string, TrackedObjectInfo> tracked_objects_;
  int absence_threshold_frames_ = 30;
  int presence_threshold_frames_ = 10;
  bool person_present_last_frame_ = false;

  // Detection.label_id is a repeated field; use the first (top) label
  static int TopLabelId(const Detection& obj) {
    return obj.label_id_size() > 0 ? obj.label_id(0) : -1;
  }

  static bool IsPerson(const Detection& obj) {
    return TopLabelId(obj) == 0;  // COCO index 0 = person
  }

  std::string GenerateObjectId(const Detection& obj) {
    // Coarse ID from class + box center quantized to a 1% grid. Jitter across
    // a grid boundary yields a new ID; a production system would match tracks
    // by IoU or use a dedicated tracker instead.
    const auto& box = obj.location_data().relative_bounding_box();
    const float cx = box.xmin() + box.width() / 2;
    const float cy = box.ymin() + box.height() / 2;

    return std::to_string(TopLabelId(obj)) + "_" +
           std::to_string(static_cast<int>(cx * 100)) + "_" +
           std::to_string(static_cast<int>(cy * 100));
  }

  bool CheckPersonLeft(const std::vector<Detection>& current_objects) {
    // Is anyone visible in this frame?
    bool person_present = false;
    for (const auto& obj : current_objects) {
      if (IsPerson(obj)) {
        person_present = true;
        break;
      }
    }

    // A person left if someone was present last frame and nobody is now
    const bool person_left = person_present_last_frame_ && !person_present;
    person_present_last_frame_ = person_present;

    return person_left;
  }
};

REGISTER_CALCULATOR(LeftBehindDetectorCalculator);

}  // namespace mediapipe
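The left-behind logic can be exercised without MediaPipe. The sketch below is a simplified Python model of it with two deliberate simplifications: tracks are keyed by class index instead of quantized box position, and absence is counted in frames rather than milliseconds. It evaluates the person-left transition once per frame and reports only non-person objects that are visible in the current frame and have been seen for at least presence_threshold_frames frames.

```python
PERSON = 0  # COCO index for "person"


class LeftBehindDetector:
    """Simplified frame-based left-behind logic (tracks keyed by class id only)."""

    def __init__(self, presence_threshold_frames=10, absence_threshold_frames=30):
        self.presence_threshold = presence_threshold_frames
        self.absence_threshold = absence_threshold_frames
        self.tracks = {}  # class_id -> {"count": frames seen, "missing": frames unseen}
        self.person_last_frame = False

    def process(self, class_ids):
        """class_ids: class indices detected in this frame. Returns left-behind ids."""
        seen = set(class_ids)
        # Update or create tracks for objects visible in this frame
        for cid in seen:
            track = self.tracks.setdefault(cid, {"count": 0, "missing": 0})
            track["count"] += 1
            track["missing"] = 0
        # Age unseen tracks and prune the stale ones
        for cid in list(self.tracks):
            if cid not in seen:
                self.tracks[cid]["missing"] += 1
                if self.tracks[cid]["missing"] > self.absence_threshold:
                    del self.tracks[cid]

        # Person-left transition, evaluated exactly once per frame
        person_now = PERSON in seen
        person_left = self.person_last_frame and not person_now
        self.person_last_frame = person_now
        if not person_left:
            return []
        return sorted(cid for cid, t in self.tracks.items()
                      if cid != PERSON and cid in seen
                      and t["count"] >= self.presence_threshold)


if __name__ == "__main__":
    det = LeftBehindDetector(presence_threshold_frames=10)
    # 12 frames: person + phone (67); a bottle (39) appears only in the last 3
    for i in range(12):
        det.process([PERSON, 67] + ([39] if i >= 9 else []))
    print(det.process([67, 39]))  # [67]: the bottle was not seen long enough
```

Because the alert fires only on the frame where the person disappears, a single missed person detection can trigger it; debouncing the person signal over a few frames is a natural next step.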

34. Custom Model Training

34.1 Using the TensorFlow Object Detection API

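# ========== 1. Prepare the dataset ==========

# Create data directories
mkdir -p data/train data/val

# Convert annotations (VOC -> TFRecord). create_tfrecord.py stands for your own
# conversion script; the TF OD API ships dataset_tools/create_pascal_tf_record.py
# for the standard PASCAL VOC directory layout.
python create_tfrecord.py \
    --data_dir=/path/to/voc_data \
    --output_path=data/train/train.record \
    --label_map_path=data/label_map.pbtxt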

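# ========== 2. Configure training ==========

# Copy a sample config
cp models/research/object_detection/configs/tf2/ssd_mobilenet_v2_320x320_coco17_tpu-8.config \
   configs/custom_ssd.config

# Edit the config:
# - num_classes: number of custom classes
# - fine_tune_checkpoint: path to the pretrained checkpoint
#   (set fine_tune_checkpoint_type: "detection" to reuse the detector weights)
# - train_input_reader: training data path
# - eval_input_reader: validation data path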

# ========== 3. Train ==========

python models/research/object_detection/model_main_tf2.py \
    --model_dir=models/custom_ssd \
    --pipeline_config_path=configs/custom_ssd.config \
    --num_train_steps=50000

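# ========== 4. Export the model ==========

# For TFLite, export with export_tflite_graph_tf2.py: exporter_main_v2.py targets
# TF Serving, while the TF OD API's TFLite guide uses this TFLite-friendly exporter.
python models/research/object_detection/export_tflite_graph_tf2.py \
    --pipeline_config_path=configs/custom_ssd.config \
    --trained_checkpoint_dir=models/custom_ssd \
    --output_directory=exported_model

# ========== 5. Convert to TFLite ==========

# Convert the exported SavedModel to .tflite
# (convert_to_tflite.py: your own wrapper around tf.lite.TFLiteConverter.from_saved_model)
python convert_to_tflite.py \
    --saved_model_dir=exported_model/saved_model \
    --output_path=custom_ssd.tflite \
    --quantize  # optional: quantize the model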
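Step 1 needs a label map and normalized box coordinates. The sketch below covers both with the standard library only: it renders a label_map.pbtxt (IDs start at 1 because 0 is reserved for the background class) and parses one PASCAL VOC XML annotation into the normalized per-object lists that a TFRecord example stores under keys such as image/object/bbox/xmin. The class names and the sample annotation are made up for illustration, and writing the actual tf.train.Example is left out.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<annotation>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object><name>phone</name>
    <bndbox><xmin>64</xmin><ymin>48</ymin><xmax>320</xmax><ymax>240</ymax></bndbox>
  </object>
</annotation>"""


def make_label_map(class_names):
    """Render a TF OD API label_map.pbtxt; ids start at 1 (0 is background)."""
    items = []
    for idx, name in enumerate(class_names, start=1):
        items.append(f"item {{\n  id: {idx}\n  name: '{name}'\n}}\n")
    return "".join(items)


def parse_voc(xml_text, class_names):
    """Extract normalized boxes and 1-based labels from one VOC annotation."""
    root = ET.fromstring(xml_text)
    width = float(root.find("size/width").text)
    height = float(root.find("size/height").text)
    out = {"xmin": [], "ymin": [], "xmax": [], "ymax": [], "text": [], "label": []}
    for obj in root.findall("object"):
        name = obj.find("name").text
        box = obj.find("bndbox")
        # Pixel coordinates -> [0, 1], as the TFRecord bbox features expect
        out["xmin"].append(float(box.find("xmin").text) / width)
        out["ymin"].append(float(box.find("ymin").text) / height)
        out["xmax"].append(float(box.find("xmax").text) / width)
        out["ymax"].append(float(box.find("ymax").text) / height)
        out["text"].append(name)
        out["label"].append(class_names.index(name) + 1)  # matches the label map ids
    return out


if __name__ == "__main__":
    classes = ["bag", "phone"]  # hypothetical custom classes
    print(make_label_map(classes))
    print(parse_voc(SAMPLE, classes))
```

Deriving both the label map and the numeric labels from the same class list keeps the two in sync, which avoids the common off-by-one between 0-based model outputs and 1-based label map ids.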

34.2 Transfer Learning

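# ========== Transfer learning example ==========

import os

import tensorflow as tf
from object_detection import model_lib_v2
from object_detection.utils import config_util


def train_custom_detector():
    # Base pipeline config; its fine_tune_checkpoint should point at the
    # COCO-pretrained checkpoint that transfer learning starts from
    pipeline_config = 'configs/custom_ssd.config'
    model_dir = 'models/custom_ssd'

    # Parse the pipeline config into a dict of sub-configs
    configs = config_util.get_configs_from_pipeline_file(pipeline_config)
    model_config = configs['model']

    # Set the number of custom classes
    model_config.ssd.num_classes = 10

    # Save the modified config as <model_dir>/pipeline.config
    os.makedirs(model_dir, exist_ok=True)
    pipeline_proto = config_util.create_pipeline_proto_from_configs(configs)
    config_util.save_pipeline_config(pipeline_proto, model_dir)

    # Train from the saved config so the num_classes change takes effect
    strategy = tf.distribute.MirroredStrategy()
    with strategy.scope():
        model_lib_v2.train_loop(
            pipeline_config_path=os.path.join(model_dir, 'pipeline.config'),
            model_dir=model_dir,
            train_steps=50000,
        )


if __name__ == '__main__':
    train_custom_detector()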

35. Summary

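Topic	Notes
Models	SSD / EfficientDet
Classes	COCO 80 classes (customizable)
Postprocessing	NMS, score filtering
Customization	TensorFlow Object Detection API
IMS applications	In-cabin object detection, left-behind object detection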

Coming Next

MediaPipe Series 31: Image Segmentation

A deep dive into image segmentation, semantic segmentation, instance segmentation, and IMS occupant segmentation.

References

Google AI Edge, MediaPipe Object Detection documentation.
W. Liu et al., "SSD: Single Shot MultiBox Detector," ECCV 2016.
M. Tan et al., "EfficientDet: Scalable and Efficient Object Detection," CVPR 2020.
TensorFlow Object Detection API, GitHub repository (tensorflow/models).

Series progress: 30/55
Last updated: 2026-03-12

https://dapalm.com/2026/03/13/MediaPipe系列30-Object-Detection：通用目标检测/

Author: Mars
Published: March 13, 2026
