Paper Review
LMDrive: Closed-Loop End-to-End Driving with Large Language Models
From the Vision Encoder to the Control Process
1. Framework Overview
LMDrive integrates vision, language, and control to enable closed-loop, end-to-end autonomous driving. The system consists of a vision encoder that processes sensor data and a language model that interprets navigation and notice instructions to predict control signals.
- Inputs: Multi-modal sensor data (multi-view RGB images and LiDAR) and natural language instructions.
- Outputs: Control signals (steering, throttle, brake) derived from predicted waypoints, plus a flag indicating whether the current instruction has been completed (see the interface sketch below).
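To make the input/output interface concrete, below is a minimal sketch of one closed-loop step, assuming hypothetical names (`SensorInput`, `DriveOutput`, `DriveAgent`) that are not taken from the paper's code.

```python
from dataclasses import dataclass
from typing import List, Optional

import numpy as np


@dataclass
class SensorInput:
    """One timestep of multi-modal sensor data (shapes are illustrative)."""
    rgb_images: List[np.ndarray]   # multi-view camera images, e.g. 4 x (H, W, 3)
    lidar_points: np.ndarray       # raw point cloud, (N, 4): x, y, z, intensity


@dataclass
class DriveOutput:
    steer: float                   # [-1, 1]
    throttle: float                # [0, 1]
    brake: float                   # [0, 1]
    instruction_done: bool         # has the current instruction been completed?


class DriveAgent:
    """Hypothetical closed-loop wrapper: sensors + instructions in, control out."""

    def step(self, sensors: SensorInput, navigation_instruction: str,
             notice_instruction: Optional[str] = None) -> DriveOutput:
        # 1) vision encoder turns sensor data into visual tokens
        # 2) language model consumes visual tokens + tokenized instructions
        # 3) predicted waypoints go through a PID controller to yield control signals
        raise NotImplementedError
```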
2. Vision Encoder
The vision encoder processes the sensor data into features for the language model.
- Components:
- Sensor Encoding:
- Images are encoded via ResNet-50 to extract feature maps.
- LiDAR data is processed using PointPillars, creating ego-centric BEV (bird's-eye view) features.
- BEV Decoder:
- Transforms encoded features into visual tokens, representing BEV features, waypoints, and traffic light status.
- Pre-Training:
- The encoder is pre-trained on object detection, waypoint prediction, and traffic light status classification; these tasks ensure robust scene understanding (a minimal encoder sketch follows).
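A minimal sketch of the encoding path described above, assuming hypothetical module names, feature dimensions, and query counts; the LiDAR branch is replaced by a simple stand-in since PointPillars is not reproduced here, and multi-view images are treated as a single view for brevity.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50


class LidarBEVStub(nn.Module):
    """Stand-in for PointPillars: maps a rasterized BEV grid to BEV features.

    The real encoder voxelizes raw points into pillars; here the point cloud is
    assumed to be pre-rasterized into a (B, C, H, W) ego-centric BEV grid.
    """

    def __init__(self, in_channels: int = 10, out_channels: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, out_channels, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, bev_grid: torch.Tensor) -> torch.Tensor:
        return self.net(bev_grid)


class VisionEncoderSketch(nn.Module):
    """Images -> ResNet-50 feature maps, LiDAR -> BEV features, then a
    transformer decoder with learned queries emits the visual tokens
    (BEV / waypoint / traffic-light queries; the query count is illustrative)."""

    def __init__(self, d_model: int = 256, num_queries: int = 400):
        super().__init__()
        backbone = resnet50(weights=None)
        self.image_backbone = nn.Sequential(*list(backbone.children())[:-2])  # -> (B, 2048, h, w)
        self.image_proj = nn.Conv2d(2048, d_model, kernel_size=1)
        self.lidar_encoder = LidarBEVStub(out_channels=d_model)
        self.queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)

    def forward(self, images: torch.Tensor, bev_grid: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, H, W) -- one view for brevity; multi-view inputs can be
        # encoded per view and their tokens concatenated before the decoder.
        b = images.shape[0]
        img_tokens = self.image_proj(self.image_backbone(images)).flatten(2).transpose(1, 2)
        lidar_tokens = self.lidar_encoder(bev_grid).flatten(2).transpose(1, 2)
        memory = torch.cat([img_tokens, lidar_tokens], dim=1)      # fused sensor tokens
        tgt = self.queries.weight.unsqueeze(0).expand(b, -1, -1)   # learned queries
        return self.decoder(tgt, memory)                           # (B, num_queries, d_model)
```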
3. Language Model
The language model (LLaMA) serves as the central decision-making unit, integrating visual and textual information.
- Tokenization:
- The number of visual tokens per frame is reduced with a Q-Former to limit memory usage.
- Navigation instructions are tokenized via LLaMA’s tokenizer.
- Processing:
- Temporal consistency is achieved by integrating historical visual tokens (up to 40 frames).
- Control signals and instruction completion flags are predicted.
- Adapters:
- Bridge visual features and language tokens, enabling unified processing (a sketch of this token path follows the list).
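A minimal sketch of the token path, assuming a simplified Q-Former (learned queries cross-attending to per-frame visual tokens) and a plain linear adapter into the LLM embedding space; dimensions and module names are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn


class QFormerSketch(nn.Module):
    """Compress per-frame visual tokens into a small, fixed number of tokens."""

    def __init__(self, d_vision: int = 256, num_queries: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_vision))
        self.cross_attn = nn.MultiheadAttention(d_vision, num_heads=8, batch_first=True)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N, d_vision) for one frame
        b = visual_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        compressed, _ = self.cross_attn(q, visual_tokens, visual_tokens)
        return compressed                                   # (B, num_queries, d_vision)


class VisionToLLMAdapter(nn.Module):
    """Project compressed visual tokens into the LLM's embedding space so they
    can be interleaved with the tokenized instruction."""

    def __init__(self, d_vision: int = 256, d_llm: int = 4096):
        super().__init__()
        self.proj = nn.Linear(d_vision, d_llm)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(tokens)


# Usage over a history of frames: compress each frame, project into the LLM space,
# then concatenate along the sequence dimension together with the instruction embeddings.
qformer, adapter = QFormerSketch(), VisionToLLMAdapter()
frames = [torch.randn(1, 400, 256) for _ in range(5)]       # 5 historical frames
llm_visual_seq = torch.cat([adapter(qformer(f)) for f in frames], dim=1)
print(llm_visual_seq.shape)                                  # (1, 5 * 32, 4096)
```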
4. Control Signal Prediction
Predicted waypoints guide a PID controller to produce final control signals:
- Longitudinal Control: Manages velocity via throttle and brake.
- Lateral Control: Adjusts steering based on the predicted heading toward upcoming waypoints (a PID sketch follows below).
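The waypoint-to-control step can be sketched as two PID loops, one longitudinal and one lateral; the gains, target-speed heuristic, and sign conventions below are hypothetical and not taken from the paper.

```python
import numpy as np


class PID:
    def __init__(self, kp: float, ki: float, kd: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error: float, dt: float = 0.05) -> float:
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def waypoints_to_control(waypoints: np.ndarray, current_speed: float,
                         speed_pid: PID, steer_pid: PID):
    """waypoints: (K, 2) predicted ego-frame positions, K >= 2; x forward, y left."""
    # Longitudinal: target a speed proportional to the spacing of the next waypoints.
    target_speed = np.linalg.norm(waypoints[1] - waypoints[0]) * 2.0   # heuristic
    accel = speed_pid.step(target_speed - current_speed)
    throttle = float(np.clip(accel, 0.0, 1.0))
    brake = float(np.clip(-accel, 0.0, 1.0))

    # Lateral: steer toward the heading of the first predicted waypoint
    # (steering sign depends on the simulator's convention).
    heading_error = np.arctan2(waypoints[0][1], waypoints[0][0])
    steer = float(np.clip(steer_pid.step(heading_error), -1.0, 1.0))
    return steer, throttle, brake


steer, throttle, brake = waypoints_to_control(
    np.array([[2.0, 0.3], [4.0, 0.8]]), current_speed=3.0,
    speed_pid=PID(0.5, 0.05, 0.1), steer_pid=PID(1.0, 0.0, 0.2))
```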
5. Training Process
- Stage 1: Vision Encoder Pre-Training
- Focused on scene perception tasks using a dataset of 3M frames.
- Stage 2: Instruction-Finetuning
- The entire system is fine-tuned on language-guided driving data so that the predicted control aligns with the given instructions (a sketch of the two-stage schedule follows).
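A minimal sketch of the two-stage schedule, written under the assumption that the pre-trained encoder is kept frozen during instruction finetuning (a common choice, to be checked against the paper); the loss terms and the `perception_heads`/`llm_pipeline` interfaces are illustrative.

```python
import torch


def pretrain_vision_encoder(encoder, perception_heads, loader, epochs=1):
    """Stage 1: train the encoder on perception tasks (detection, waypoints,
    traffic light state) without any language supervision."""
    params = list(encoder.parameters()) + list(perception_heads.parameters())
    optim = torch.optim.AdamW(params, lr=1e-4)
    for _ in range(epochs):
        for batch in loader:
            feats = encoder(batch["images"], batch["bev_grid"])
            loss = perception_heads.loss(feats, batch["targets"])   # hypothetical API
            optim.zero_grad()
            loss.backward()
            optim.step()


def instruction_finetune(encoder, llm_pipeline, loader, epochs=1, freeze_encoder=True):
    """Stage 2: train the language-conditioned driving pipeline on
    instruction-annotated clips, optionally keeping the encoder frozen."""
    if freeze_encoder:
        for p in encoder.parameters():
            p.requires_grad_(False)
    optim = torch.optim.AdamW(
        [p for p in llm_pipeline.parameters() if p.requires_grad], lr=1e-5)
    for _ in range(epochs):
        for batch in loader:
            with torch.set_grad_enabled(not freeze_encoder):
                feats = encoder(batch["images"], batch["bev_grid"])
            pred = llm_pipeline(feats, batch["instruction"])
            loss = pred.waypoint_loss + pred.done_flag_loss          # hypothetical fields
            optim.zero_grad()
            loss.backward()
            optim.step()
```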
Dataset and Benchmark
- Dataset:
- Collected with the CARLA simulator, the dataset comprises 64K clips annotated with navigation and notice instructions (a hypothetical clip schema is sketched after this list).
- Includes complex scenarios (e.g., adversarial events, misleading instructions).
- Clips span 2–20 seconds.
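To make the annotation format concrete, here is a hypothetical schema for a single clip; the field names are illustrative and do not reflect the dataset's actual file layout.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NoticeInstruction:
    text: str            # e.g. "Watch out for the pedestrian ahead."
    frame_index: int     # frame at which the notice is issued


@dataclass
class Clip:
    """One language-annotated driving clip (2-20 s) collected in CARLA."""
    clip_id: str
    navigation_instruction: str                           # e.g. "Turn left at the next intersection."
    notice_instructions: List[NoticeInstruction] = field(default_factory=list)
    num_frames: int = 0                                    # multi-view RGB + LiDAR per frame
    is_adversarial: bool = False                           # adversarial event / misleading instruction
```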
- LangAuto Benchmark:
- Evaluates instruction-following driving across diverse conditions (e.g., weather, traffic scenarios).
Conclusion
LMDrive demonstrates a novel approach to autonomous driving by combining LLMs with multi-modal sensor data in a closed-loop framework. Its ability to interpret complex instructions and interact with humans sets a foundation for robust, explainable driving systems.