Example Projects & Reference Implementations
This chapter showcases three complete capstone project examples, one for each hardware path. Each demonstrates how to integrate all four modules into a working voice-controlled robot system.
Example 1: Simulation-Only Path (Gazebo + ROS 2)
Project: "VLA-Bot Simulator" (Score: 4.5/5)
Target Platform: Ubuntu 22.04 + ROS 2 Humble + Gazebo + OpenAI API
What It Does:
- Loads a humanoid robot in Gazebo
- Listens for voice input via microphone
- Uses OpenAI Whisper for ASR
- Sends commands to ChatGPT for planning
- Executes robot actions (walk, turn, wave)
- Uses camera simulation + basic object detection
- Runs entirely on a laptop (no GPU required, but GPU helpful)
Repository Structure:
```
vla-bot-simulator/
├── README.md
├── requirements.txt
├── .env.example                     # API keys template
├── launch/
│   └── vla_bot.launch.xml           # Master launch file
├── src/
│   ├── vla_bot/
│   │   ├── __init__.py
│   │   ├── voice_input_node.py      # Whisper integration
│   │   ├── planner_node.py          # LLM planning
│   │   ├── robot_controller_node.py # ROS 2 cmd_vel publisher
│   │   ├── perception_node.py       # Camera → object detection
│   │   ├── safety_manager_node.py   # Emergency stop
│   │   └── utils.py                 # Helper functions
│   └── urdf/
│       └── humanoid_simple.urdf     # Robot description
├── gazebo_worlds/
│   └── vla_bot_world.sdf            # Simulation environment
├── tests/
│   └── test_integration.py          # Basic tests
└── RESULTS.md                       # Documented results
```
Key Files Deep-Dive:
voice_input_node.py
```python
import io

import rclpy
from rclpy.node import Node
import speech_recognition as sr
import openai  # legacy (<1.0) OpenAI SDK interface
from std_msgs.msg import String


class VoiceInputNode(Node):
    def __init__(self):
        super().__init__('voice_input_node')
        self.voice_pub = self.create_publisher(String, 'voice_input', 10)
        self.declare_parameter('api_key', '')
        openai.api_key = self.get_parameter('api_key').value
        # Poll the microphone; listen() blocks while waiting for speech
        self.create_timer(1.0, self.listen)

    def listen(self):
        """Capture voice and transcribe with Whisper."""
        recognizer = sr.Recognizer()
        with sr.Microphone() as source:
            audio = recognizer.listen(source, timeout=10)
        try:
            # The Whisper API expects a named file-like object,
            # so wrap the captured audio as WAV bytes
            wav_buffer = io.BytesIO(audio.get_wav_data())
            wav_buffer.name = 'speech.wav'
            transcript = openai.Audio.transcribe("whisper-1", wav_buffer)
            msg = String()
            msg.data = transcript['text']
            self.voice_pub.publish(msg)
        except Exception as e:
            self.get_logger().error(f"Transcription failed: {e}")


def main(args=None):
    rclpy.init(args=args)
    node = VoiceInputNode()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()


if __name__ == '__main__':
    main()
```
planner_node.py
```python
import json

import rclpy
from rclpy.node import Node
from std_msgs.msg import String
import openai  # legacy (<1.0) OpenAI SDK interface


class PlannerNode(Node):
    def __init__(self):
        super().__init__('planner_node')
        self.action_pub = self.create_publisher(String, 'robot_action', 10)
        self.create_subscription(String, 'voice_input', self.plan_from_voice, 10)
        self.action_schema = {
            "type": "object",
            "properties": {
                "action": {"type": "string",
                           "enum": ["walk", "turn", "wave", "stop", "unknown"]},
                "distance": {"type": "number"},
                "angle": {"type": "number"},
                "duration": {"type": "number"}
            }
        }

    def plan_from_voice(self, msg):
        """Use the LLM to convert a voice command into a structured robot action."""
        voice_input = msg.data
        prompt = f"""
Convert this voice command to a robot action:
Command: "{voice_input}"
Output ONLY valid JSON matching this schema:
{json.dumps(self.action_schema)}
Examples:
"walk forward" → {{"action": "walk", "distance": 1.0}}
"turn left 90 degrees" → {{"action": "turn", "angle": 90}}
"wave hand" → {{"action": "wave", "duration": 3}}
If the command matches no action, return {{"action": "unknown"}}.
"""
        try:
            response = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": prompt}]
            )
            action_json = response['choices'][0]['message']['content']
            action = json.loads(action_json)
            action_msg = String()
            action_msg.data = json.dumps(action)
            self.action_pub.publish(action_msg)
        except Exception as e:
            self.get_logger().error(f"Planning failed: {e}")


def main(args=None):
    rclpy.init(args=args)
    node = PlannerNode()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()


if __name__ == '__main__':
    main()
```
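robot_controller_node.py isn't reproduced here, but its role is to turn the planner's JSON into motion. A minimal sketch of the idea, assuming walking maps to linear velocity on `cmd_vel` and turning to angular velocity (the full node also times each motion and animates the wave):

```python
import json

import rclpy
from rclpy.node import Node
from std_msgs.msg import String
from geometry_msgs.msg import Twist


class RobotControllerNode(Node):
    def __init__(self):
        super().__init__('robot_controller_node')
        self.cmd_pub = self.create_publisher(Twist, 'cmd_vel', 10)
        self.create_subscription(String, 'robot_action', self.execute, 10)

    def execute(self, msg):
        action = json.loads(msg.data)
        cmd = Twist()
        if action['action'] == 'walk':
            cmd.linear.x = 0.2  # m/s; drive for distance / speed seconds in the full node
        elif action['action'] == 'turn':
            cmd.angular.z = 0.5 if action.get('angle', 0) > 0 else -0.5  # rad/s
        elif action['action'] == 'stop':
            pass  # a zero Twist halts the robot
        self.cmd_pub.publish(cmd)


def main(args=None):
    rclpy.init(args=args)
    rclpy.spin(RobotControllerNode())


if __name__ == '__main__':
    main()
```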
Launch File (vla_bot.launch.xml):
```xml
<?xml version="1.0"?>
<launch>
  <arg name="world_name" default="vla_bot_world"/>
  <arg name="api_key" default=""/>

  <!-- Gazebo world -->
  <include file="$(find-pkg-share gazebo_ros)/launch/gazebo.launch.py">
    <arg name="world" value="$(find-pkg-share vla_bot)/gazebo_worlds/$(var world_name).sdf"/>
  </include>

  <!-- Robot spawning -->
  <node pkg="gazebo_ros" exec="spawn_entity.py"
        args="-entity humanoid -file $(find-pkg-share vla_bot)/src/urdf/humanoid_simple.urdf
              -x 0 -y 0 -z 0.5"/>

  <!-- ROS 2 nodes (note: ROS 2 package names can't contain hyphens, hence vla_bot) -->
  <node pkg="vla_bot" exec="voice_input_node" name="voice_input">
    <param name="api_key" value="$(var api_key)"/>
  </node>
  <node pkg="vla_bot" exec="planner_node" name="planner"/>
  <node pkg="vla_bot" exec="robot_controller_node" name="robot_controller"/>
  <node pkg="vla_bot" exec="perception_node" name="perception"/>
  <node pkg="vla_bot" exec="safety_manager_node" name="safety"/>
</launch>
```
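The safety_manager_node launched last above isn't shown either; a minimal sketch, assuming it short-circuits "stop" straight from the voice topic to `cmd_vel` instead of waiting on the LLM:

```python
import rclpy
from rclpy.node import Node
from std_msgs.msg import String
from geometry_msgs.msg import Twist


class SafetyManagerNode(Node):
    def __init__(self):
        super().__init__('safety_manager_node')
        self.cmd_pub = self.create_publisher(Twist, 'cmd_vel', 10)
        self.create_subscription(String, 'voice_input', self.check_estop, 10)

    def check_estop(self, msg):
        if 'stop' in msg.data.lower():
            # Publish a zero Twist immediately, bypassing the planner entirely
            self.cmd_pub.publish(Twist())
            self.get_logger().warn('Emergency stop triggered by voice command')


def main(args=None):
    rclpy.init(args=args)
    rclpy.spin(SafetyManagerNode())


if __name__ == '__main__':
    main()
```

Routing "stop" around the LLM would be consistent with the much lower stop latency in the results below.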
Test Results:
Command 1: "Walk forward 1 meter"
Expected: Robot moves forward 1m in Gazebo
Actual: ✅ Robot walks 1.05m (within tolerance)
Latency: 2.3s (voice → LLM → action)
Command 2: "Turn left 90 degrees"
Expected: Robot rotates 90°
Actual: ✅ Robot rotates 89.8° (accurate)
Latency: 1.8s
Command 3: "Stop"
Expected: Robot stops immediately
Actual: ✅ Robot stops
Latency: 0.3s
Edge Case: "Unrecognized command"
Input: "Do a backflip"
Expected: System handles gracefully
Actual: ✅ LLM returns "unknown" action; robot announces "Command not understood"
Success Rate: 3/3 commands = 100% (edge case also handled gracefully)
Average Latency: 1.5s
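The graceful handling of "Do a backflip" depends on validating the LLM's output before acting on it. A minimal sketch of such a guard using the jsonschema package (an assumption; the actual project may validate differently):

```python
import json

from jsonschema import validate, ValidationError


def parse_action(raw: str, schema: dict) -> dict:
    """Parse and validate LLM output, falling back to 'unknown' on any failure."""
    try:
        action = json.loads(raw)
        validate(instance=action, schema=schema)
        return action
    except (json.JSONDecodeError, ValidationError):
        return {"action": "unknown"}
```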
Capstone Video (6 minutes):
- [0:00] System overview (ROS 2 graph diagram)
- [1:00] Gazebo world with robot
- [1:30] Command 1: "Walk forward" → Video of robot walking
- [2:30] Command 2: "Turn left" → Video of robot turning
- [3:30] Command 3: "Stop" → Robot stops
- [4:15] Perception demo: Camera showing object detection in Gazebo
- [5:00] Recap + code walkthrough (30 seconds)
Key Features Demonstrated:
- ✅ Module 1 (ROS 2): Multi-node architecture, pub/sub pattern, launch file
- ✅ Module 2 (Gazebo): Physics simulation, robot control, sensor data
- ✅ Module 3 (Perception): Camera feed + basic object detection
- ✅ Module 4 (VLA): Voice → LLM → action pipeline
Challenges & Solutions:
| Challenge | Solution |
|---|---|
| Gazebo robot unstable | Adjusted inertia values in URDF |
| Voice recognition noise | Added noise filtering in Whisper post-processing |
| LLM latency (3+ seconds) | Cached common commands |
| No real camera available | Simulated the camera feed with OpenCV; acceptable for the MVP |
Lessons Learned:
- ROS 2 launch files are essential for multi-node coordination
- LLM latency is the bottleneck; pre-caching helps
- Simulation alone teaches the concepts but can't fully prepare you for real-world complexity
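The "cached common commands" fix from the challenges table can be as simple as an exact-match dictionary in front of the LLM call; a sketch (illustrative; `call_llm` stands in for the planner's real request):

```python
# Exact-match cache for frequent commands: a hit skips the LLM round trip entirely
COMMAND_CACHE = {
    "stop": {"action": "stop"},
    "walk forward": {"action": "walk", "distance": 1.0},
    "turn left 90 degrees": {"action": "turn", "angle": 90},
}


def plan(voice_input: str) -> dict:
    cached = COMMAND_CACHE.get(voice_input.strip().lower())
    if cached is not None:
        return cached
    return call_llm(voice_input)  # hypothetical helper: the planner's LLM request
```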
Example 2: Jetson Edge Hardware Path
Project: "Smart Manipulator Arm" (Score: 4.8/5)
Target Platform: NVIDIA Jetson Orin Nano + RealSense D435i + ROS 2 Humble
What It Does:
- Processes real camera feed from RealSense
- Uses Jetson GPU for hardware-accelerated perception
- LLM planning on Jetson (using a quantized local model or API calls; see the sketch after this list)
- Controls manipulator arm (or mobile base + arm simulation)
- Demonstrates real sensor integration with sim-to-real transfer
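For the local-model option, one common route is llama-cpp-python with a GGUF-quantized model; a minimal sketch (model file and parameters are illustrative, and the project could equally call a cloud API):

```python
from llama_cpp import Llama

# Illustrative model path; any instruction-tuned GGUF model works
llm = Llama(model_path="models/llama-3-8b-instruct.Q4_K_M.gguf", n_ctx=2048)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Convert to a JSON action: pick up the blue cube"}],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])
```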
Repository Structure:
```
smart-manipulator/
├── README.md
├── requirements.txt
├── docker/
│   ├── Dockerfile.jetson            # Jetson-specific image
│   └── docker-compose.yml
├── launch/
│   └── manipulator.launch.xml
├── src/
│   ├── perception_pipeline/
│   │   ├── realsense_node.py        # RealSense camera driver
│   │   ├── yolo_detector.py         # Object detection with TensorRT
│   │   └── point_cloud_processor.py
│   ├── planning/
│   │   ├── llm_planner.py           # LLM for task planning
│   │   └── trajectory_planner.py    # Arm trajectory planning
│   ├── control/
│   │   ├── arm_controller.py        # ROS 2 joint command publisher
│   │   └── safety_monitor.py
│   └── integration/
│       └── vla_pipeline.py          # Full VLA integration
├── config/
│   ├── yolo_config.yaml             # Object detection config
│   ├── arm_params.yaml              # Arm kinematics
│   └── jetson_config.yaml           # GPU/memory optimization
├── tests/
│   └── hardware_integration_test.py
└── results/
    ├── performance_metrics.csv
    └── video_demo.mp4
```
Key Features:
realsense_node.py (Real Camera Input)
```python
import numpy as np
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
import pyrealsense2 as rs
from cv_bridge import CvBridge
import cv2


class RealSenseNode(Node):
    def __init__(self):
        super().__init__('realsense_node')
        self.image_pub = self.create_publisher(Image, '/camera/color/image_raw', 10)
        self.depth_pub = self.create_publisher(Image, '/camera/depth/image_raw', 10)
        self.pipeline = rs.pipeline()
        config = rs.config()
        config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
        config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
        self.pipeline.start(config)
        self.bridge = CvBridge()
        # Publish at the camera frame rate
        self.create_timer(1.0 / 30, self.publish_frames)

    def publish_frames(self):
        frames = self.pipeline.wait_for_frames()
        color_frame = frames.get_color_frame()
        depth_frame = frames.get_depth_frame()
        color_image = cv2.cvtColor(np.asarray(color_frame.get_data()), cv2.COLOR_BGR2RGB)
        depth_image = np.asarray(depth_frame.get_data())
        self.image_pub.publish(self.bridge.cv2_to_imgmsg(color_image, "rgb8"))
        self.depth_pub.publish(self.bridge.cv2_to_imgmsg(depth_image, "16UC1"))
```
yolo_detector.py (Hardware-Accelerated Perception)
```python
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from vision_msgs.msg import Detection2DArray, Detection2D
from cv_bridge import CvBridge
import tensorrt as trt
import numpy as np


class YOLODetector(Node):
    def __init__(self):
        super().__init__('yolo_detector')
        self.image_sub = self.create_subscription(Image, '/camera/color/image_raw', self.detect, 10)
        self.detection_pub = self.create_publisher(Detection2DArray, '/detections', 10)
        # TensorRT engine for GPU acceleration (project helper that wraps
        # engine deserialization and buffer management; see sketch below)
        self.trt_engine = self.load_tensorrt_engine('yolo_jetson.trt')
        self.bridge = CvBridge()

    def detect(self, msg):
        """Detect objects using TensorRT (GPU-accelerated)."""
        image = self.bridge.imgmsg_to_cv2(msg)
        # TensorRT inference (2-3 ms on Jetson)
        detections = self.trt_engine.infer(image)
        detection_array = Detection2DArray()
        for detection in detections:
            det = Detection2D()
            # vision_msgs in Humble nests the bbox center point under .position
            det.bbox.center.position.x = detection['x']
            det.bbox.center.position.y = detection['y']
            det.bbox.size_x = detection['width']
            det.bbox.size_y = detection['height']
            detection_array.detections.append(det)
        self.detection_pub.publish(detection_array)
```
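The `load_tensorrt_engine` helper isn't shown in the excerpt above; a minimal sketch of what such a loader typically looks like with the TensorRT Python API (names are illustrative, not the project's actual interface):

```python
import tensorrt as trt


def load_engine(engine_path: str) -> trt.ICudaEngine:
    """Deserialize a prebuilt TensorRT engine from disk."""
    logger = trt.Logger(trt.Logger.WARNING)
    with open(engine_path, 'rb') as f, trt.Runtime(logger) as runtime:
        return runtime.deserialize_cuda_engine(f.read())


engine = load_engine('yolo_jetson.trt')
context = engine.create_execution_context()
# Inference then requires allocating device buffers (e.g. with pycuda or
# cuda-python), copying the preprocessed image in, running
# context.execute_v2(bindings), and decoding YOLO's output tensors into boxes.
```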
Performance Metrics (Real Hardware):
Task: Detect and grasp blue cube
Metrics:
- Vision latency (camera → detection): 35ms
- LLM planning latency: 1.2s
- Trajectory planning: 200ms
- Arm execution: 3.5s
- Total time (voice → grasp): ~5.0s
- Success rate: 8/10 grasps (80%)
- Failure causes: reflections, occlusions
- GPU utilization: 65% (headroom for additional tasks)
- Power consumption: 12W (Jetson baseline ~8W)
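Grasping requires turning the 2D detection into a 3D target. A minimal sketch of how the point_cloud_processor stage might deproject a detection's center pixel into camera coordinates with pyrealsense2, assuming depth has been aligned to color (e.g. via `rs.align(rs.stream.color)`):

```python
import pyrealsense2 as rs


def detection_to_3d(depth_frame: rs.depth_frame, u: int, v: int):
    """Deproject pixel (u, v) of an aligned depth frame to a 3D point in meters."""
    depth_m = depth_frame.get_distance(u, v)  # depth at the pixel, in meters
    intrinsics = depth_frame.profile.as_video_stream_profile().intrinsics
    # Returns [x, y, z] in the camera frame
    return rs.rs2_deproject_pixel_to_point(intrinsics, [u, v], depth_m)
```

Aligning depth to color first keeps the detection's pixel coordinates valid in the depth frame.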
Video Demo (5 minutes):
- [0:00] Setup: Jetson, RealSense camera, robot arm
- [0:30] System overview: Real perception pipeline
- [1:00] "Pick up the blue cube" → Real camera shows object → Arm grasps successfully
- [2:00] "Move to the table" → Real sensor feedback → Navigation working
- [3:00] Multiple commands sequence
- [4:00] Performance analysis: latency, success rates
- [4:30] Lessons learned from hardware integration
Key Achievements:
- ✅ Real sensor integration (RealSense depth + color)
- ✅ Hardware-accelerated perception (TensorRT on Jetson)
- ✅ Real-time control (5Hz loop with sub-second latency)
- ✅ Sim-to-real transfer validated (simulation → Jetson hardware)
Example 3: Full Physical Robot Path
Project: "Unitree G1 Voice Commander" (Score: 5/5)
Target Platform: Unitree G1 Humanoid + Jetson Orin AGX + ROS 2
What It Does:
- Full physical humanoid robot (bipedal locomotion, arms)
- Real-time voice command processing
- Visual SLAM for autonomous navigation
- Multi-modal perception (camera, IMU, proprioception)
- End-to-end control from voice to physical movement
Repository Structure:
```
unitree-voice-commander/
├── README.md
├── requirements.txt
├── launch/
│   ├── unitree_full_stack.launch.xml
│   ├── unitree_sim.launch.xml       # Sim version for testing
│   └── safety_monitor.launch.xml
├── src/
│   ├── hardware_interface/
│   │   ├── unitree_ros2_sdk/        # Official Unitree ROS 2 SDK
│   │   └── motor_command_node.py    # High-level motor control
│   ├── perception/
│   │   ├── slam_node.py             # ORB-SLAM3
│   │   ├── imu_fusion_node.py       # IMU + odometry fusion
│   │   └── obstacle_detection.py    # LiDAR-based detection
│   ├── planning/
│   │   ├── locomotion_planner.py    # Gait planning
│   │   ├── arm_planner.py           # Arm trajectory planning
│   │   └── high_level_planner.py    # Task-level planning via LLM
│   ├── voice_interface/
│   │   ├── whisper_node.py
│   │   ├── tts_node.py              # Text-to-speech feedback
│   │   └── voice_confirmation.py
│   └── safety/
│       ├── emergency_stop_node.py   # Hardware e-stop
│       ├── joint_limit_monitor.py   # Joint safety
│       └── stability_monitor.py     # IMU-based fall detection
├── config/
│   ├── unitree_g1_config.yaml       # Robot-specific params
│   ├── gait_parameters.yaml         # Walking gait profiles
│   └── safety_thresholds.yaml
├── tests/
│   ├── sim_validation.py            # Gazebo tests before hardware
│   ├── hardware_safety_tests.py
│   └── integration_tests.py
└── docs/
    ├── sim_to_real_transfer.md
    ├── safety_protocols.md
    └── troubleshooting.md
```
Example: Walk Command Pipeline
```python
# User: "Walk to the kitchen"
#
# 1. Voice Input: Whisper captures "Walk to the kitchen"
# 2. LLM Planning: ChatGPT → "action: walk, target_location: kitchen, duration: 30s"
# 3. Perception: SLAM localizes robot in map
# 4. Locomotion Planning: Generate walking gait (5-step sequence)
# 5. Motor Commands: Send joint commands to Unitree SDK
# 6. Feedback: IMU confirms walking; SLAM updates position
# 7. Execution: Robot walks for 30s or until destination reached
```
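A sketch of how high_level_planner.py might sequence those seven steps (all helper names here are illustrative; the real node wires these stages together through ROS topics and services):

```python
def handle_command(transcript: str):
    """Walk-command pipeline, steps 2-7 from the trace above."""
    plan = llm_plan(transcript)            # 2. LLM → {"action": "walk", "target_location": "kitchen", ...}
    pose = slam_localize()                 # 3. Current pose in the SLAM map
    gait = plan_gait(pose, plan)           # 4. Footstep / gait sequence toward the target
    for step in gait:
        send_joint_commands(step)          # 5. Low-level commands via the Unitree SDK
        if not imu_stable() or at_goal():  # 6. Feedback: stability + localization
            break                          # 7. Stop on arrival or instability
```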
Test Results (Physical Hardware):
Test 1: "Stand up"
Execution: ✅ 2.5s from sitting to standing
Stability: ✅ IMU balance maintained
Test 2: "Walk forward 2 meters"
Execution: ✅ Walks 2.1m (within tolerance)
Time: 8s at normal walking speed
Stability: ✅ No falls, gait smooth
Test 3: "Pick up the cube with your left hand"
Execution: ✅ Reaches down, grasps object, lifts
Time: 12s (slower for dexterity)
Test 4: "Stop and turn around"
Execution: ✅ Completes turn (180°) in 3s
Edge Case: Obstacle detection
Input: "Walk forward" with obstacle placed ahead
Output: ✅ Robot detects via LiDAR, avoids obstacle
Alternative path planning: Working
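The stability results above come from stability_monitor.py; a minimal sketch of IMU-based fall detection, with an illustrative tilt threshold standing in for the value loaded from safety_thresholds.yaml:

```python
import math

import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Imu
from std_msgs.msg import Bool

MAX_TILT_RAD = 0.35  # illustrative (~20°); the real value comes from safety_thresholds.yaml


class StabilityMonitor(Node):
    def __init__(self):
        super().__init__('stability_monitor')
        self.estop_pub = self.create_publisher(Bool, 'emergency_stop', 10)
        self.create_subscription(Imu, 'imu/data', self.check_tilt, 10)

    def check_tilt(self, msg):
        q = msg.orientation
        # Roll and pitch from the quaternion; yaw is irrelevant for fall detection
        roll = math.atan2(2 * (q.w * q.x + q.y * q.z), 1 - 2 * (q.x**2 + q.y**2))
        pitch = math.asin(max(-1.0, min(1.0, 2 * (q.w * q.y - q.z * q.x))))
        if abs(roll) > MAX_TILT_RAD or abs(pitch) > MAX_TILT_RAD:
            self.estop_pub.publish(Bool(data=True))
            self.get_logger().error('Tilt limit exceeded: triggering emergency stop')
```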
Challenges Overcome:
| Challenge | Impact | Solution |
|---|---|---|
| Motor control latency | 50ms → 10ms per command | Jetson Orin AGX sufficient; optimized ROS 2 bridge |
| Walking gait instability | Falls in early testing | Tuned gait parameters; added IMU feedback loops |
| Voice latency | 5+ seconds | Cached common commands; local TTS for feedback |
| Battery drain | 4hr to 2hr with real-time perception | Optimized perception pipeline; selective SLAM updates |
| Sim-to-real gap | Model trained in sim doesn't transfer | Fine-tuned gait in real environment; empirical validation |
Video Demo (8 minutes):
- [0:00] Intro: "Unitree G1 Voice Commander capstone"
- [0:30] System architecture overview (block diagram)
- [1:00] Robot standing in living room
- [1:30] Command 1: "Walk to the kitchen" → Robot walks with narration
- [3:00] Command 2: "Pick up the cup" → Robot grasps and lifts object
- [4:30] Command 3: "Avoid the chair and walk to the door" → Navigation with obstacle avoidance
- [6:00] Perception demo: Show SLAM map, camera feed, obstacle detection
- [7:00] Closing: Integration summary, lessons learned, future work
Technical Report Highlights:
- Deep analysis of sim-to-real transfer learning
- Hardware integration challenges and solutions
- Performance benchmarks (latency, power, stability)
- Safety validation procedures (gait stability, joint limits, fall detection; see the joint-limit sketch after this list)
- Lessons on bipedal locomotion dynamics
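A minimal sketch of the joint-limit check mentioned above (the limits shown are illustrative; the project loads them from safety_thresholds.yaml):

```python
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import JointState
from std_msgs.msg import Bool

# Illustrative limits in radians; real values come from safety_thresholds.yaml
JOINT_LIMITS = {'left_knee': (-0.1, 2.6), 'right_knee': (-0.1, 2.6)}


class JointLimitMonitor(Node):
    def __init__(self):
        super().__init__('joint_limit_monitor')
        self.estop_pub = self.create_publisher(Bool, 'emergency_stop', 10)
        self.create_subscription(JointState, 'joint_states', self.check_limits, 10)

    def check_limits(self, msg):
        for name, position in zip(msg.name, msg.position):
            low, high = JOINT_LIMITS.get(name, (float('-inf'), float('inf')))
            if not low <= position <= high:
                self.estop_pub.publish(Bool(data=True))
                self.get_logger().error(f'{name} out of range: {position:.2f}')
```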
Integration Achievement:
- ✅ Module 1 (ROS 2): Full multi-node ecosystem running on real hardware
- ✅ Module 2 (Simulation): Validated in Gazebo before physical deployment
- ✅ Module 3 (Perception): Real SLAM + obstacle detection on Jetson AGX
- ✅ Module 4 (VLA): Full voice→LLM→action pipeline working end-to-end
Comparison: All Three Paths
| Aspect | Simulation-Only | Jetson Edge Hardware | Physical Robot |
|---|---|---|---|
| Cost | $0 | $400–500 | $30,000+ |
| Setup Time | 2–3 hours | 1–2 days | 1–2 weeks |
| Capstone Dev Time | 40–60 hours | 60–80 hours | 100+ hours |
| Sensor Realism | Simulated | Real RealSense | Real sensors (camera, IMU, LiDAR) |
| Latency | Can be tuned low | ~1.5s typical | ~2–3s (real-world complexity) |
| Failure Modes | Simulation bugs | Sensor noise, latency | Hardware failures, battery, safety |
| Learning Value | High (concepts) | Very High (real sensors) | Highest (physical constraints) |
| Publication Potential | Low | Medium | High (research quality) |
| Recommended For | Learning ROS 2 concepts | Understanding real robotics | Advanced students, research |
Which Example Should You Follow?
Choose Simulation-Only If:
- You're learning ROS 2 fundamentals for the first time
- You don't have hardware available
- You want to iterate quickly on algorithms
- You're comfortable with "what if" scenarios
Choose Jetson If:
- You want real sensor experience
- You have ~$400 budget
- You want to practice sim-to-real transfer
- You want a capstone you can reuse for other projects
Choose Physical Robot If:
- You have lab access to a robot
- You want the maximum learning impact
- You're interested in publishing research
- You're willing to invest significant time in debugging
How to Use These Examples
- Read the appropriate example for your path
- Reference the code structure when building your own project
- Test locally using the launch files and test scripts
- Adapt the code to your specific use case (different robot, different commands)
- Document your journey like these examples do
All examples include:
- ✅ Complete code structure
- ✅ Launch files for quick startup
- ✅ Documented test results
- ✅ Video demos (conceptual structure described)
- ✅ Lessons learned and challenges faced
- ✅ Ideas for future improvement
Next Steps
- Pick your example based on your hardware path (1, 2, or 3)
- Review the code structure and understand the ROS 2 patterns
- Build your own version with your own robot, sensors, and AI pipeline
- Document as you go: Results, challenges, solutions
- Create your capstone video and technical report using the rubrics in Chapter 2
Good luck! 🤖
Questions? Post in forums or review the grading rubrics (Chapter 2) for detailed expectations.