ME 326 · Stanford University · Winter 2026

Voice-Commanded Mobile Pick-and-Place

Group 5 — Monday 6–7:30pm  ·  TidyBot2 with bimanual WX250s arms, RealSense depth camera, and a Gemini-powered natural language interface.


Overview

Problem Statement

Household and laboratory environments contain objects that constantly need to be retrieved, reorganized, or moved on demand. A robot that can understand natural spoken commands and autonomously execute multi-step manipulation tasks—locating, navigating to, and grasping objects—would be a meaningful real-world collaborator.

We built a fully integrated pipeline on the TidyBot2 platform: a holonomic mobile base with two 6-DOF WX250s arms, an Intel RealSense D435 depth camera on a pan-tilt mount, and an onboard Intel NUC running ROS2 Humble.

TidyBot2 with banana and bowl on tabletop

TidyBot2 with bimanual arms, pan-tilt RealSense camera, and test objects

🗣️

Task 1 — Object Retrieval

User says "retrieve the banana." Robot detects the object with YOLO, navigates to it, grasps it with the right arm, and returns to the starting position while holding the object.

📦

Task 2 — Pick and Place

User says "pick up the banana and place it in the bin." Robot detects, navigates, grasps, then drops the object 25 cm to its right—into a bowl or bin positioned beside the robot.

🤖

Task 3 — Bimanual Pillow Retrieval

Robot detects a red pillow using an HSV-based color detector, navigates to it, and performs a coordinated two-arm grasp to pick it up — demonstrating bimanual manipulation for large or deformable objects a single arm cannot reliably handle.

Core challenge: Bridging the gap between noisy real-time perception (depth sensor uncertainty, YOLO false positives) and reliable robot action (navigation accuracy, successful grasping) in an unstructured tabletop environment, all triggered by a single natural language utterance.

Method

System Architecture

The system is composed of specialized ROS2 nodes orchestrated by task-specific coordinators. Each node has a single responsibility; the coordinator sequences them via topics and state transitions.

Node · File · Role
nlp_interface_node.py tidybot_control/ Voice/text interface — Gemini parsing, command confirmation, and runtime object targeting
task1_coordinator.py tidybot_bringup/scripts/ Task 1 state machine — detect → navigate → pick up → return to start
task2_coordinator.py tidybot_bringup/scripts/ Task 2 state machine — detect → navigate → pick up → drop in bowl/bin
coordinator_node_task3.py tidybot_bringup/scripts/ Task 3 state machine — detect → navigate → (redetect) → bimanual pick up
detect_object_real.py tidybot_bringup/scripts/ YOLOv11 + RealSense depth → 3D object pose in base_link (Tasks 1 & 2)
detect_object_real_task3.py tidybot_bringup/scripts/ HSV color segmentation for red pillow detection (Task 3)
navigate_to_object.py tidybot_bringup/scripts/ Proportional controller → standoff 0.4m + 0.15m lateral offset (Tasks 1 & 2)
navigate_to_object_task3.py tidybot_bringup/scripts/ Proportional controller → standoff 0.3m, no lateral offset (Task 3)
task1_pickup.py tidybot_bringup/scripts/ Task 1 arm — approach → descend → grasp → lift (holds object)
task2_pickup.py tidybot_bringup/scripts/ Task 2 arm — approach → descend → grasp → lift → drop 25 cm right
pickup_task3.py tidybot_bringup/scripts/ Task 3 bimanual — R_APPROACH → R_DESCEND → R_GRASP → L_APPROACH → L_DESCEND → L_GRASP → LIFT → ROTATE → RELEASE

End-to-End Data Flow

System architecture diagram showing ROS2 node communication

Coordinator State Machines

Task 1 — Object Retrieval

IDLE
SEARCHING
NAVIGATING
PAUSE
PICKING_UP
RETURNING
DONE
IDLE

Task 2 — Pick and Place

IDLE
SEARCHING
NAVIGATING
PAUSE
PICKING_UP
DONE
IDLE

Task 3 — Bimanual Pillow Retrieval

IDLE
SEARCHING
NAVIGATING
REDETECTING
PICKING_UP
DONE
IDLE
Any state can transition to FAILED on timeout, which sends an e-stop and resets to IDLE.
IDLE

Waits for a voice command or manual trigger on /coordinator/start.

SEARCHING (30s)

Tasks 1&2: accumulates 3 confident YOLO detections (≥0.35 confidence), averages x/y into a stable nav goal. Task 3: accumulates 15 HSV pose samples and averages them before locking the navigation target.
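The accumulation step can be sketched as follows (an illustrative helper, not the exact code in the coordinators; the threshold and sample counts come from the description above):

```python
# Sketch of the SEARCHING accumulation logic: collect confident detections,
# then average x/y into a single stable navigation goal.

CONF_THRESHOLD = 0.35   # minimum YOLO confidence (Tasks 1&2)
REQUIRED_SAMPLES = 3    # Tasks 1&2 use 3 samples; Task 3 averages 15

def accumulate_goal(detections):
    """detections: iterable of (x, y, confidence) in the base_link frame.
    Returns an averaged (x, y) goal once enough confident samples arrive,
    or None if the stream ends first (the state would then time out)."""
    samples = []
    for x, y, conf in detections:
        if conf < CONF_THRESHOLD:
            continue  # reject low-confidence detections outright
        samples.append((x, y))
        if len(samples) == REQUIRED_SAMPLES:
            xs, ys = zip(*samples)
            return (sum(xs) / len(xs), sum(ys) / len(ys))
    return None
```

Averaging several detections before locking the goal smooths out per-frame jitter from both the detector and the depth sensor.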

NAVIGATING (90s)

Drives to standoff position and aligns yaw; tilts camera down on arrival.

PAUSE (Tasks 1&2 only)

3-second settling delay that lets stale perception data clear before pickup begins.

REDETECTING (Task 3 only · 30s)

Camera sweep across 6 pan-tilt positions for close-range re-detection of the pillow.

PICKING_UP (60–120s)

Triggers the task-specific pickup node. Task 1 holds; Task 2 drops 25 cm right; Task 3 bimanual grasp + rotate.

RETURNING (Task 1 only · 120s)

Drives back to the saved start position (0, 0).

DONE / FAILED

DONE returns to IDLE after 2s. FAILED publishes zero-velocity e-stop and resets immediately.
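The timeout-to-FAILED pattern shared by all three coordinators can be sketched as below (a minimal illustration, not the actual coordinator nodes; timeout values are taken from the state descriptions above, using the upper bound for PICKING_UP):

```python
# Minimal sketch of the coordinator state-machine pattern: each active state
# carries a timeout; exceeding it triggers an e-stop and a reset to IDLE.

STATE_TIMEOUTS = {          # seconds, per the per-state descriptions
    "SEARCHING": 30.0,
    "NAVIGATING": 90.0,
    "PICKING_UP": 120.0,
    "RETURNING": 120.0,
}

class Coordinator:
    def __init__(self):
        self.state = "IDLE"
        self.elapsed = 0.0
        self.estop_sent = False

    def transition(self, new_state):
        self.state = new_state
        self.elapsed = 0.0   # every transition restarts the timer

    def tick(self, dt):
        """Advance the state timer; fire FAILED -> IDLE on timeout."""
        self.elapsed += dt
        timeout = STATE_TIMEOUTS.get(self.state)
        if timeout is not None and self.elapsed > timeout:
            self.estop_sent = True   # the real node publishes zero velocity
            self.transition("IDLE")  # FAILED resets immediately
```

Resetting the timer on every transition means a slow-but-progressing run is never killed, while a stalled state always is.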

Node Details

🎯 Perception

Tasks 1&2: YOLOv11n on RGB frames, median depth patch, back-project to base_link via TF.
Task 3: HSV color segmentation for red (dual hue bands 0–10 & 170–180), largest contour above 500 px.

YOLOv11 detecting a banana with bounding box

YOLOv11 real-time detection on TidyBot2 camera feed
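The median-depth back-projection step can be sketched as below (a hypothetical helper; the real node reads intrinsics from CameraInfo and maps the result into base_link via TF):

```python
# Sketch of depth back-projection: take the median of a depth patch around
# the YOLO bounding-box center, then apply the pinhole camera model.

from statistics import median

def backproject(u, v, depth_patch, fx, fy, cx, cy):
    """u, v: pixel center of the detection; depth_patch: depth samples
    (meters) around (u, v); fx, fy, cx, cy: camera intrinsics.
    Returns (x, y, z) in the camera optical frame."""
    z = median(depth_patch)          # median rejects depth-sensor outliers
    x = (u - cx) * z / fx            # standard pinhole back-projection
    y = (v - cy) * z / fy
    return (x, y, z)                 # a TF transform then maps this point
                                     # into the base_link frame
```

The median over a patch, rather than a single pixel, is what keeps one dropped or reflected depth reading from corrupting the 3D pose.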

🧭 Navigation

Proportional controller at 50 Hz. Tasks 1&2: 0.4m standoff + 0.15m lateral offset. Task 3: 0.3m standoff, no lateral offset. Stop-and-rotate when >60° misaligned.
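The control law can be sketched as follows (the 0.4 m standoff and 60° threshold come from the text; the gains K_LIN and K_ANG are assumed values, not those in the real node):

```python
# Sketch of the proportional navigation controller with stop-and-rotate:
# drive toward the standoff point, but rotate in place when the heading
# error exceeds 60 degrees.

import math

K_LIN, K_ANG = 0.5, 1.0              # assumed gains for illustration
ROTATE_THRESHOLD = math.radians(60)  # stop-and-rotate threshold from the text

def control(dx, dy, yaw, standoff=0.4):
    """dx, dy: object position relative to the robot; yaw: current heading.
    Returns (v, w): forward and angular velocity commands."""
    heading_err = math.atan2(dy, dx) - yaw
    # wrap the error into [-pi, pi] so the robot turns the short way
    heading_err = math.atan2(math.sin(heading_err), math.cos(heading_err))
    dist_err = math.hypot(dx, dy) - standoff
    if abs(heading_err) > ROTATE_THRESHOLD:
        return 0.0, K_ANG * heading_err   # stop-and-rotate: no forward motion
    return K_LIN * dist_err, K_ANG * heading_err
```

Suppressing forward motion during large heading errors prevents the base from arcing wide around the goal instead of converging on the standoff point.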

🦾 Pickup

Task 1: APPROACH → DESCEND → GRASP → LIFT (holds object).
Task 2: Same + DROP 25 cm right into bowl/bin.
Task 3: R_APPROACH → R_DESCEND → R_GRASP → L_APPROACH → L_DESCEND → L_GRASP → LIFT → ROTATE 90° → RELEASE.
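Since the Task 3 pickup phases are strictly ordered, one simple way to drive them (an illustrative sketch, not the actual pickup_task3.py) is a list of phases advanced only when the previous arm motion reports completion:

```python
# Sketch of a list-driven phase sequencer for the bimanual pickup.

PHASES = ["R_APPROACH", "R_DESCEND", "R_GRASP",
          "L_APPROACH", "L_DESCEND", "L_GRASP",
          "LIFT", "ROTATE", "RELEASE"]

class PickupSequencer:
    def __init__(self):
        self.idx = 0

    @property
    def phase(self):
        # report DONE once every phase has completed
        return PHASES[self.idx] if self.idx < len(PHASES) else "DONE"

    def advance(self):
        """Call when the current phase's arm motion finishes."""
        if self.idx < len(PHASES):
            self.idx += 1
```

Keeping the sequence in a flat list makes it easy to reorder or drop phases during hardware tuning without touching the control logic.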

🗣️ NLP Interface

Records audio, transcribes via SpeechRecognition, passes text to Google Gemini. Returns structured JSON {action, object, target} to dynamically configure the detector.
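Before the reply configures the detector, it has to be validated; a minimal sketch (hypothetical parsing code — the field names {action, object, target} come from the text above, the rest is assumed):

```python
# Sketch of validating the Gemini reply into a structured command dict.

import json

REQUIRED_KEYS = {"action", "object", "target"}

def parse_command(reply: str):
    """Parse the model's JSON reply into a command dict, or return None
    so the interface can re-prompt the user."""
    try:
        cmd = json.loads(reply)
    except json.JSONDecodeError:
        return None             # model replied with non-JSON text
    if not REQUIRED_KEYS <= cmd.keys():
        return None             # missing one of the required fields
    return cmd
```

Returning None on any malformed reply keeps a bad LLM output from ever reaching the detector or coordinator.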


Results

Videos & Demos

Task Demonstrations

Task 1 — Voice-commanded object retrieval on real hardware

Task 2 — Sequential pick-and-place on real hardware

Task 3 — Bimanual pillow retrieval on real hardware

Technical Demos

Navigation in MuJoCo simulation

Camera Sweep RViz Visualization

Robot Navigation: Return to Origin


The Team

Team Contributions

Group 5 — Monday 6–7:30pm  ·  TA: Giuse Pham

Esteban Rincon
Manipulation · Perception Integration · Coordinator
  • Authored task1_pickup.py and task2_pickup.py end-to-end, including the full pickup pipelines (approach, descend, grasp, lift, drop)
  • Bridged real-time vision detection with arm execution in detect_object_real.py — object poses flow directly from YOLO into IK-based grasp targets
  • Designed & implemented task1_coordinator.py and task2_coordinator.py state machines orchestrating the full detect → navigate → pick → return/drop pipeline
  • Built adaptive pan-tilt camera sweep in task2_pickup.py for robust close-range re-detection before grasping
  • Led end-to-end testing and debugging across perception, manipulation, and coordinator layers on real hardware
  • Built and maintained the project website
James Cheng
Perception · NLP Integration · Navigation · Coordinator
  • Built the YOLO perception pipeline for real-time object detection, later upgraded from YOLOv8 to YOLOv11
  • Rewrote navigate_to_object.py with standoff positioning, lateral offset, and coordinate transforms for arm-reachable approach
  • Built the original coordinator pipeline (coordinator_node.py) orchestrating the full detect → navigate → pick end-to-end flow
  • Integrated perception with NLP and navigation nodes for voice-driven object targeting and autonomous approach
  • Contributed to bimanual manipulation (Task 3) — built test_bimanual.py for dual-arm testing
  • Developed the project website
Yazhou Zhang
Perception · NLP · Manipulation
  • Implemented depth-to-world coordinate projection using CameraInfo intrinsics and TF transforms for object localization in the robot base frame
  • Built the initial NLP interface for natural-language command parsing and conversational robot interaction
  • Integrated perception with the NLP node for voice-driven object targeting
  • Added RealSense depth-color alignment handling and topic fallback for more reliable real-hardware perception
  • Contributed real-hardware manipulation tuning, including left-arm calibration and grasp tolerance adjustments
Marco Vizcarra
Navigation · Base Motion · Simulation
  • Developed the foundational base-motion scripts (movement_1–4.py) for robot navigation and motion control.
  • Ran simulation and real-robot tests to validate behaviors and support deployment.
  • Helped with calibration and integration, including frame alignment, yaw offsets, and pose consistency.
  • Implemented safe_movement.py and test utilities for safer motion and debugging.
  • Improved robustness through iterative troubleshooting, odometry checks, and sim-to-real validation.
  • Provided the navigation base later used for higher-level robot behaviors.
Ke Wang
NLP Interface · Perception Integration
  • Built the NLP interface node, introducing Google Gemini for natural conversational interaction between humans and the robot
  • Designed the voice command parsing pipeline that converts natural language into structured commands ({action, object, target}) through multi-turn conversation context
  • Implemented 3D object position extraction from YOLO detections using camera intrinsics and depth images, enabling accurate real-world localization in the robot base frame
Becky Miller
Manipulation Assist · Task 3 Planning · Documentation
  • Collaborated early with Esteban on the manipulation pipeline
  • Developed Task 3 off-robot testing and code adaptation with Mathijs
  • Documentation: contributed to slides and the project website
Mathijs Ammerlaan
Task 3 — Perception · Navigation · Manipulation · Pipeline
  • Collaborated with Becky and Marco on Task 3 off-robot testing and code adaptation, extending the navigation and detection infrastructure for Task 3
  • Designed and built an HSV-based red pillow detector (detect_object_real_task3.py) with improved 3D pose estimation and a hardcoded-pose toggle for reliable hardware runs
  • Developed an offline tester with auto-scan support for rapid iteration on pillow detection without the robot
  • Authored the Task 3 navigation node (navigate_to_object_task3.py) with three targeted control fixes to prevent overshoot: stop-and-rotate when heading error exceeds 60°, proportional speed ramp in the final 0.2 m to avoid coasting past the standoff, and a tighter heading gate (30° vs 45°) to correct yaw earlier before forward motion; also added a post-arrival face_object alignment phase and fixed standalone mode for immediate pose acceptance
  • Implemented bimanual pillow pickup (pickup_task3.py) with hardcoded grasp positions for consistent two-arm grasping
  • Extended the coordinator for Task 3 with a skip_redetect parameter and resolved 3 critical pipeline bugs before hardware testing

Code

Codebase

All code is open-source and available on GitHub.

📁 Repository

github.com/jameszcheng/collaborative-robotics-2026-group5

ROS2 Humble workspace with MuJoCo simulation, full coordinator pipeline, perception, navigation, and manipulation nodes.

🔑 Key Files

  • scripts/task1_coordinator.py — Task 1 state machine
  • scripts/task2_coordinator.py — Task 2 state machine
  • scripts/coordinator_node_task3.py — Task 3 state machine
  • scripts/detect_object_real.py — YOLOv11 perception
  • scripts/detect_object_real_task3.py — HSV pillow detector
  • scripts/navigate_to_object.py — Navigation (Tasks 1&2)
  • scripts/navigate_to_object_task3.py — Navigation (Task 3)
  • scripts/task1_pickup.py — Task 1 arm (hold)
  • scripts/task2_pickup.py — Task 2 arm (drop)
  • scripts/pickup_task3.py — Task 3 bimanual pickup

Quick Start

# Build
cd ros2_ws && source /opt/ros/humble/setup.bash && colcon build

# Launch robot + task pipeline
source setup_env.bash
ros2 launch tidybot_bringup real.launch.py use_planner:=true  # Terminal 1
ros2 launch tidybot_bringup task1.launch.py                   # Terminal 2

# Manual trigger (no voice needed)
ros2 topic pub /coordinator/start std_msgs/String "data: banana" --once

System Requirements

Software

  • Ubuntu 22.04 · ROS2 Humble · Python 3.10+
  • YOLOv11 (ultralytics) · mink · MuJoCo
  • google-genai (Gemini)

Hardware

  • TidyBot2 holonomic base (3 DOF)
  • 2× WX250s 6-DOF arms (650mm reach)
  • Intel RealSense D435 on pan-tilt mount