ME 326 · Stanford University · Winter 2026

Voice-Commanded Mobile Pick-and-Place

Group 5 — Monday 6–7:30pm  ·  TidyBot2 with bimanual WX250s arms, RealSense depth camera, and a Gemini-powered natural language interface.


Overview

Problem Statement

Household and laboratory environments contain objects that constantly need to be retrieved, reorganized, or moved on demand. A robot that can understand natural spoken commands and autonomously execute multi-step manipulation tasks—locating, navigating to, and grasping objects—would be a meaningful real-world collaborator.

We built a fully integrated pipeline on the TidyBot2 platform: a holonomic mobile base with two 6-DOF WX250s arms, an Intel RealSense D435 depth camera on a pan-tilt mount, and an onboard Intel NUC running ROS2 Humble.

TidyBot2 with banana and bowl on tabletop

TidyBot2 with bimanual arms, pan-tilt RealSense camera, and test objects

🗣️

Task 1 — Object Retrieval

User says "retrieve the banana." Robot detects the object with YOLO, navigates to it, grasps it with the right arm, and returns to the starting position while holding the object.

📦

Task 2 — Pick and Place

User says "pick up the banana and place it in the bin." Robot detects, navigates, grasps, then drops the object 25 cm to its right—into a bowl or bin positioned beside the robot.

🤖

Task 3 — Bimanual Pillow Retrieval

Robot detects a red pillow using an HSV-based color detector, navigates to it, and performs a coordinated two-arm grasp to pick it up — demonstrating bimanual manipulation for large or deformable objects a single arm cannot reliably handle.

Core challenge: Bridging the gap between noisy real-time perception (depth sensor uncertainty, YOLO false positives) and reliable robot action (navigation accuracy, successful grasping) in an unstructured tabletop environment, all triggered by a single natural language utterance.

Method

System Architecture

The system is composed of specialized ROS2 nodes orchestrated by task-specific coordinators. Each node has a single responsibility; the coordinator sequences them via topics and state transitions.

Node · File · Role
nlp_interface_node.py tidybot_control/ Voice/text interface — Gemini parsing, command confirmation, and runtime object targeting
task1_coordinator.py tidybot_bringup/scripts/ Task 1 state machine — detect → navigate → pick up → return to start
task2_coordinator.py tidybot_bringup/scripts/ Task 2 state machine — detect → navigate → pick up → drop in bowl/bin
coordinator_node_task3.py tidybot_bringup/scripts/ Task 3 state machine — detect → navigate → (redetect) → bimanual pick up
detect_object_real.py tidybot_bringup/scripts/ YOLOv11 + RealSense depth → 3D object pose in base_link (Tasks 1 & 2)
detect_object_real_task3.py tidybot_bringup/scripts/ HSV color segmentation for red pillow detection (Task 3)
navigate_to_object.py tidybot_bringup/scripts/ Proportional controller → standoff 0.4m + 0.15m lateral offset (Tasks 1 & 2)
navigate_to_object_task3.py tidybot_bringup/scripts/ Proportional controller → standoff 0.3m, no lateral offset (Task 3)
task1_pickup.py tidybot_bringup/scripts/ Task 1 arm — approach → descend → grasp → lift (holds object)
task2_pickup.py tidybot_bringup/scripts/ Task 2 arm — approach → descend → grasp → lift → drop 25 cm right
pickup_task3.py tidybot_bringup/scripts/ Task 3 bimanual — R_APPROACH → R_DESCEND → R_GRASP → L_APPROACH → L_DESCEND → L_GRASP → LIFT → ROTATE → RELEASE

End-to-End Data Flow

System architecture diagram showing ROS2 node communication

Coordinator State Machines

Task 1 — Object Retrieval

IDLE
SEARCHING
NAVIGATING
PAUSE
PICKING_UP
RETURNING
DONE
IDLE

Task 2 — Pick and Place

IDLE
SEARCHING
NAVIGATING
PAUSE
PICKING_UP
DONE
IDLE

Task 3 — Bimanual Pillow Retrieval

IDLE
SEARCHING
NAVIGATING
REDETECTING
PICKING_UP
DONE
IDLE
Any state can transition to FAILED on timeout, which sends an e-stop and resets to IDLE.
IDLE

Waits for a voice command or manual trigger on /coordinator/start.

SEARCHING (30s)

Tasks 1&2: accumulates 3 confident YOLO detections (≥0.35 confidence), averages x/y into a stable nav goal. Task 3: accumulates 15 HSV pose samples and averages them before locking the navigation target.
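The accumulation step can be sketched as follows (an illustrative helper, not the exact code in the coordinators; the threshold and sample counts come from the description above):

```python
# Sketch of the SEARCHING accumulation logic: collect confident detections,
# then average x/y into a single stable navigation goal.

CONF_THRESHOLD = 0.35   # minimum YOLO confidence (Tasks 1&2)
REQUIRED_SAMPLES = 3    # Tasks 1&2 use 3 samples; Task 3 averages 15

def accumulate_goal(detections):
    """detections: iterable of (x, y, confidence) in the base_link frame.
    Returns an averaged (x, y) goal once enough confident samples arrive,
    or None if the stream ends first (the state would then time out)."""
    samples = []
    for x, y, conf in detections:
        if conf < CONF_THRESHOLD:
            continue  # reject low-confidence detections outright
        samples.append((x, y))
        if len(samples) == REQUIRED_SAMPLES:
            xs, ys = zip(*samples)
            return (sum(xs) / len(xs), sum(ys) / len(ys))
    return None
```

Averaging several detections before locking the goal smooths out per-frame jitter from both the detector and the depth sensor.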

NAVIGATING (90s)

Drives to standoff position and aligns yaw; tilts camera down on arrival.

PAUSE (Tasks 1&2 only)

3-second settling delay that lets stale perception data clear before pickup begins.

REDETECTING (Task 3 only · 30s)

Camera sweep across 6 pan-tilt positions for close-range re-detection of the pillow.

PICKING_UP (60–120s)

Triggers the task-specific pickup node. Task 1 holds; Task 2 drops 25 cm right; Task 3 bimanual grasp + rotate.

RETURNING (Task 1 only · 120s)

Drives back to the saved start position (0, 0).

DONE / FAILED

DONE returns to IDLE after 2s. FAILED publishes zero-velocity e-stop and resets immediately.
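The timeout-to-FAILED pattern shared by all three coordinators can be sketched as below (a minimal illustration, not the actual coordinator nodes; timeout values are taken from the state descriptions above, using the upper bound for PICKING_UP):

```python
# Minimal sketch of the coordinator state-machine pattern: each active state
# carries a timeout; exceeding it triggers an e-stop and a reset to IDLE.

STATE_TIMEOUTS = {          # seconds, per the per-state descriptions
    "SEARCHING": 30.0,
    "NAVIGATING": 90.0,
    "PICKING_UP": 120.0,
    "RETURNING": 120.0,
}

class Coordinator:
    def __init__(self):
        self.state = "IDLE"
        self.elapsed = 0.0
        self.estop_sent = False

    def transition(self, new_state):
        self.state = new_state
        self.elapsed = 0.0   # every transition restarts the timer

    def tick(self, dt):
        """Advance the state timer; fire FAILED -> IDLE on timeout."""
        self.elapsed += dt
        timeout = STATE_TIMEOUTS.get(self.state)
        if timeout is not None and self.elapsed > timeout:
            self.estop_sent = True   # the real node publishes zero velocity
            self.transition("IDLE")  # FAILED resets immediately
```

Resetting the timer on every transition means a slow-but-progressing run is never killed, while a stalled state always is.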

Node Details

🎯 Perception

Tasks 1&2: YOLOv11n on RGB frames, median depth patch, back-project to base_link via TF.
Task 3: HSV color segmentation for red (dual hue bands 0–10 & 170–180), largest contour above 500 px.

YOLOv11 detecting a banana with bounding box

YOLOv11 real-time detection on TidyBot2 camera feed
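The median-depth back-projection step can be sketched as below (a hypothetical helper; the real node reads intrinsics from CameraInfo and maps the result into base_link via TF):

```python
# Sketch of depth back-projection: take the median of a depth patch around
# the YOLO bounding-box center, then apply the pinhole camera model.

from statistics import median

def backproject(u, v, depth_patch, fx, fy, cx, cy):
    """u, v: pixel center of the detection; depth_patch: depth samples
    (meters) around (u, v); fx, fy, cx, cy: camera intrinsics.
    Returns (x, y, z) in the camera optical frame."""
    z = median(depth_patch)          # median rejects depth-sensor outliers
    x = (u - cx) * z / fx            # standard pinhole back-projection
    y = (v - cy) * z / fy
    return (x, y, z)                 # a TF transform then maps this point
                                     # into the base_link frame
```

The median over a patch, rather than a single pixel, is what keeps one dropped or reflected depth reading from corrupting the 3D pose.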

🧭 Navigation

Proportional controller at 50 Hz. Tasks 1&2: 0.4m standoff + 0.15m lateral offset. Task 3: 0.3m standoff, no lateral offset. Stop-and-rotate when >60° misaligned.
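The control law can be sketched as follows (the 0.4 m standoff and 60° threshold come from the text; the gains K_LIN and K_ANG are assumed values, not those in the real node):

```python
# Sketch of the proportional navigation controller with stop-and-rotate:
# drive toward the standoff point, but rotate in place when the heading
# error exceeds 60 degrees.

import math

K_LIN, K_ANG = 0.5, 1.0              # assumed gains for illustration
ROTATE_THRESHOLD = math.radians(60)  # stop-and-rotate threshold from the text

def control(dx, dy, yaw, standoff=0.4):
    """dx, dy: object position relative to the robot; yaw: current heading.
    Returns (v, w): forward and angular velocity commands."""
    heading_err = math.atan2(dy, dx) - yaw
    # wrap the error into [-pi, pi] so the robot turns the short way
    heading_err = math.atan2(math.sin(heading_err), math.cos(heading_err))
    dist_err = math.hypot(dx, dy) - standoff
    if abs(heading_err) > ROTATE_THRESHOLD:
        return 0.0, K_ANG * heading_err   # stop-and-rotate: no forward motion
    return K_LIN * dist_err, K_ANG * heading_err
```

Suppressing forward motion during large heading errors prevents the base from arcing wide around the goal instead of converging on the standoff point.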

🦾 Pickup

Task 1: APPROACH → DESCEND → GRASP → LIFT (holds object).
Task 2: Same + DROP 25 cm right into bowl/bin.
Task 3: R_APPROACH → R_DESCEND → R_GRASP → L_APPROACH → L_DESCEND → L_GRASP → LIFT → ROTATE 90° → RELEASE.
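Since the Task 3 pickup phases are strictly ordered, one simple way to drive them (an illustrative sketch, not the actual pickup_task3.py) is a list of phases advanced only when the previous arm motion reports completion:

```python
# Sketch of a list-driven phase sequencer for the bimanual pickup.

PHASES = ["R_APPROACH", "R_DESCEND", "R_GRASP",
          "L_APPROACH", "L_DESCEND", "L_GRASP",
          "LIFT", "ROTATE", "RELEASE"]

class PickupSequencer:
    def __init__(self):
        self.idx = 0

    @property
    def phase(self):
        # report DONE once every phase has completed
        return PHASES[self.idx] if self.idx < len(PHASES) else "DONE"

    def advance(self):
        """Call when the current phase's arm motion finishes."""
        if self.idx < len(PHASES):
            self.idx += 1
```

Keeping the sequence in a flat list makes it easy to reorder or drop phases during hardware tuning without touching the control logic.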

🗣️ NLP Interface

Records audio, transcribes via SpeechRecognition, passes text to Google Gemini. Returns structured JSON {action, object, target} to dynamically configure the detector.
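Before the reply configures the detector, it has to be validated; a minimal sketch (hypothetical parsing code — the field names {action, object, target} come from the text above, the rest is assumed):

```python
# Sketch of validating the Gemini reply into a structured command dict.

import json

REQUIRED_KEYS = {"action", "object", "target"}

def parse_command(reply: str):
    """Parse the model's JSON reply into a command dict, or return None
    so the interface can re-prompt the user."""
    try:
        cmd = json.loads(reply)
    except json.JSONDecodeError:
        return None             # model replied with non-JSON text
    if not REQUIRED_KEYS <= cmd.keys():
        return None             # missing one of the required fields
    return cmd
```

Returning None on any malformed reply keeps a bad LLM output from ever reaching the detector or coordinator.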


Results

Videos & Demos

Task Demonstrations

Task 1 — Voice-commanded object retrieval on real hardware

Task 2 — Sequential pick-and-place on real hardware

Task 3 — Bimanual pillow retrieval on real hardware

Technical Demos

Navigation in MuJoCo simulation

Camera Sweep RViz Visualization

Robot Navigation: Return to Origin


The Team

Team Contributions

Group 5 — Monday 6–7:30pm  ·  TA: Giuse Pham

Esteban Rincon
Manipulation · Perception Integration · Coordinator
  • Authored task1_pickup.py and task2_pickup.py end-to-end, including the full pickup pipelines (approach, descend, grasp, lift, drop)
  • Bridged real-time vision detection with arm execution in detect_object_real.py — object poses flow directly from YOLO into IK-based grasp targets
  • Designed & implemented task1_coordinator.py and task2_coordinator.py state machines orchestrating the full detect → navigate → pick → return/drop pipeline
  • Built adaptive pan-tilt camera sweep in task2_pickup.py for robust close-range re-detection before grasping
  • Led end-to-end testing and debugging across perception, manipulation, and coordinator layers on real hardware
  • Built and maintained the project website
James Cheng
Perception · NLP Integration · Navigation · Coordinator
  • Built the YOLO perception pipeline for real-time object detection, later upgraded from YOLOv8 to YOLOv11
  • Rewrote navigate_to_object.py with standoff positioning, lateral offset, and coordinate transforms for arm-reachable approach
  • Built the original coordinator pipeline (coordinator_node.py) orchestrating the full detect → navigate → pick end-to-end flow
  • Integrated perception with NLP and navigation nodes for voice-driven object targeting and autonomous approach
  • Contributed to bimanual manipulation (Task 3) — built test_bimanual.py for dual-arm testing
  • Developed the project website
Yazhou Zhang
Perception · NLP · Manipulation
  • Implemented depth-to-world coordinate projection using CameraInfo intrinsics and TF transforms for object localization in the robot base frame
  • Built the initial NLP interface for natural-language command parsing and conversational robot interaction
  • Integrated perception with the NLP node for voice-driven object targeting
  • Added RealSense depth-color alignment handling and topic fallback for more reliable real-hardware perception
  • Contributed real-hardware manipulation tuning, including left-arm calibration and grasp tolerance adjustments
Marco Vizcarra
Navigation · Base Motion · Simulation
  • Developed the foundational base-motion scripts (movement_1–4.py) for robot navigation and motion control.
  • Ran simulation and real-robot tests to validate behaviors and support deployment.
  • Helped with calibration and integration, including frame alignment, yaw offsets, and pose consistency.
  • Implemented safe_movement.py and test utilities for safer motion and debugging.
  • Improved robustness through iterative troubleshooting, odometry checks, and sim-to-real validation.
  • Provided the navigation base later used for higher-level robot behaviors.
Ke Wang
NLP Interface · Perception Integration
  • Built the NLP interface node, introducing Google Gemini for natural conversational interaction between humans and the robot
  • Designed the voice command parsing pipeline that converts natural language into structured commands ({action, object, target}) through multi-turn conversation context
  • Implemented 3D object position extraction from YOLO detections using camera intrinsics and depth images, enabling accurate real-world localization in the robot base frame
Becky Miller
Manipulation Assist · Task 3 Planning · Documentation
  • Collaborated early with Esteban on the manipulation pipeline
  • Developed Task 3 off-robot testing and code adaptation with Mathijs
  • Documentation: contributed to slides and the project website
Mathijs Ammerlaan
Task 3 — Perception · Navigation · Manipulation · Pipeline
  • Collaborated with Becky and Marco on Task 3 off-robot testing and code adaptation, extending the navigation and detection infrastructure for Task 3
  • Designed and built an HSV-based red pillow detector (detect_object_real_task3.py) with improved 3D pose estimation and a hardcoded-pose toggle for reliable hardware runs
  • Developed an offline tester with auto-scan support for rapid iteration on pillow detection without the robot
  • Authored the Task 3 navigation node (navigate_to_object_task3.py) with three targeted control fixes to prevent overshoot: stop-and-rotate when heading error exceeds 60°, proportional speed ramp in the final 0.2 m to avoid coasting past the standoff, and a tighter heading gate (30° vs 45°) to correct yaw earlier before forward motion; also added a post-arrival face_object alignment phase and fixed standalone mode for immediate pose acceptance
  • Implemented bimanual pillow pickup (pickup_task3.py) with hardcoded grasp positions for consistent two-arm grasping
  • Extended the coordinator for Task 3 with a skip_redetect parameter and resolved 3 critical pipeline bugs before hardware testing

Code

Codebase

All code is open-source and available on GitHub.

📁 Repository

github.com/jameszcheng/collaborative-robotics-2026-group5

ROS2 Humble workspace with MuJoCo simulation, full coordinator pipeline, perception, navigation, and manipulation nodes.

🔑 Key Files

  • scripts/task1_coordinator.py — Task 1 state machine
  • scripts/task2_coordinator.py — Task 2 state machine
  • scripts/coordinator_node_task3.py — Task 3 state machine
  • scripts/detect_object_real.py — YOLOv11 perception
  • scripts/detect_object_real_task3.py — HSV pillow detector
  • scripts/navigate_to_object.py — Navigation (Tasks 1&2)
  • scripts/navigate_to_object_task3.py — Navigation (Task 3)
  • scripts/task1_pickup.py — Task 1 arm (hold)
  • scripts/task2_pickup.py — Task 2 arm (drop)
  • scripts/pickup_task3.py — Task 3 bimanual pickup

Quick Start

# Build
cd ros2_ws && source /opt/ros/humble/setup.bash && colcon build

# Launch robot + task pipeline
source setup_env.bash
ros2 launch tidybot_bringup real.launch.py use_planner:=true  # Terminal 1
ros2 launch tidybot_bringup task1.launch.py                   # Terminal 2

# Manual trigger (no voice needed)
ros2 topic pub /coordinator/start std_msgs/String "data: banana" --once

System Requirements

Software

  • Ubuntu 22.04 · ROS2 Humble · Python 3.10+
  • YOLOv11 (ultralytics) · mink · MuJoCo
  • google-genai (Gemini)

Hardware

  • TidyBot2 holonomic base (3 DOF)
  • 2× WX250s 6-DOF arms (650mm reach)
  • Intel RealSense D435 on pan-tilt mount