<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://justintiensmith.github.io/vla-research/feed.xml" rel="self" type="application/atom+xml" /><link href="https://justintiensmith.github.io/vla-research/" rel="alternate" type="text/html" /><updated>2026-04-22T10:52:48+00:00</updated><id>https://justintiensmith.github.io/vla-research/feed.xml</id><title type="html">Exploring VLAs on real robots</title><subtitle>Weekly updates on our MSc dissertation exploring VLAs, with a focus on real-world benchmarking using the SO-101 robotic arm.</subtitle><author><name>justintiensmith &amp; mattpidden</name></author><entry><title type="html">Multicolored-Block-Pick-And-Place-Dataset</title><link href="https://justintiensmith.github.io/vla-research/2026/04/17/multicolored-block-pick-and-place-dataset.html" rel="alternate" type="text/html" title="Multicolored-Block-Pick-And-Place-Dataset" /><published>2026-04-17T00:00:00+00:00</published><updated>2026-04-17T00:00:00+00:00</updated><id>https://justintiensmith.github.io/vla-research/2026/04/17/multicolored-block-pick-and-place-dataset</id><content type="html" xml:base="https://justintiensmith.github.io/vla-research/2026/04/17/multicolored-block-pick-and-place-dataset.html"><![CDATA[<p>Today, we collected a new pick-and-place dataset with 200 episodes using four different colored blocks. Most of the episodes were collected by picking up the object on the left and placing it in a bin on the right.</p>

<p><img src="/vla-research/images/blocks.jpg" alt="Our multicolored blocks" width="600" /></p>

<p>Our first “toy” dataset contained only 50 episodes and used an apple as the object to pick and place in a ceramic bowl. The apple barely fit in the robot gripper, which made the setup less reliable. The world camera in that dataset was also mounted directly above the workspace, looking straight down. We suspected this initial setup was holding performance back: the dataset was quite small, and the top-down camera angle likely conveyed too little depth information.</p>

<p>To improve on that first dataset, we made three major changes. First, we repositioned the world camera so that it viewed the workspace from an angle, looking down from the front of the robot rather than directly overhead. Second, we replaced the apple with Duplo LEGO blocks, which fit much more naturally in the SO-101 gripper. Third, we collected 200 episodes instead of just 50.</p>

<p>The new dataset includes four block colors, with each color picked up 50 times. Of those 50 episodes, 25 involve only the target block, while the other 25 include distractor blocks. Within those distractor episodes, 10 include one additional block, 10 include two additional blocks, and 5 include three additional blocks.</p>
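<p>The split above can be written out as a simple episode plan. A minimal sketch in Python (the function and structure here are ours, purely illustrative; the actual episodes were collected by teleoperation):</p>

```python
# Illustrative sketch of the episode plan described above (not the actual
# collection script). Each colour gets 50 episodes: 25 with the target block
# alone, then 10/10/5 with one, two, and three distractor blocks.
COLORS = ["blue", "red", "green", "yellow"]
DISTRACTOR_PLAN = {0: 25, 1: 10, 2: 10, 3: 5}  # distractor count -> episodes

def build_plan():
    episodes = []
    for color in COLORS:
        for n_distractors, count in DISTRACTOR_PLAN.items():
            for _ in range(count):
                episodes.append({
                    "target": color,
                    "distractors": n_distractors,
                    "instruction": (
                        f"Pick up the {color} block and carefully "
                        "place it in the black bin."
                    ),
                })
    return episodes

plan = build_plan()
print(len(plan))  # 200 episodes in total
```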

<p>Each episode is paired with a language instruction corresponding to the target color. For example:</p>

<p><strong>Language instruction:</strong><br />
“Pick up the {blue, red, green, yellow} block and carefully place it in the black bin.”</p>

<video controls="" width="700">
  <source src="/vla-research/videos/pick-place-data.mp4" type="video/mp4" />
</video>

<p>Our dataset is available on Hugging Face here:<br />
<a href="https://huggingface.co/spaces/lerobot/visualize_dataset?path=%2Fjustintiensmith%2Fmulticolour_block_pick_place%2Fepisode_0">View the dataset</a></p>]]></content><author><name>justintiensmith &amp; mattpidden</name></author><summary type="html"><![CDATA[Today, we collected a new pick-and-place dataset with 200 episodes using four different colored blocks. Most of the episodes were collected by picking up the object on the left and placing it in a bin on the right.]]></summary></entry><entry><title type="html">Are VLAs Overhyped? First Results on a Real Robot</title><link href="https://justintiensmith.github.io/vla-research/2026/04/10/are-vlas-overhyped-first-results-on-a-real-robot.html" rel="alternate" type="text/html" title="Are VLAs Overhyped? First Results on a Real Robot" /><published>2026-04-10T00:00:00+00:00</published><updated>2026-04-10T00:00:00+00:00</updated><id>https://justintiensmith.github.io/vla-research/2026/04/10/are-vlas-overhyped-first-results-on-a-real-robot</id><content type="html" xml:base="https://justintiensmith.github.io/vla-research/2026/04/10/are-vlas-overhyped-first-results-on-a-real-robot.html"><![CDATA[<h2 id="context">Context</h2>

<p>This week was about getting a full end-to-end pipeline working: robot setup → data collection → training → real-world inference.</p>

<p>The goal was simple: get a Vision-Language-Action (VLA) model running on the SO-101 and see how quickly we could get <em>any</em> level of real-world performance on a basic pick-and-place task.</p>

<h2 id="the-setup">The Setup</h2>

<p>I set up the SO-101 robot arm and followed the LeRobot installation process to create the Python environment. The installation itself was mostly smooth, although getting everything working cleanly on the GPU cluster took some effort later on.</p>

<p>After setup, I calibrated the arms (joint limits, motor ranges) using the <code class="language-plaintext highlighter-rouge">lerobot-calibrate</code> CLI command, then spent time teleoperating with the leader arm to get comfortable controlling the follower.</p>
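<p>For reference, calibration and teleoperation go through the LeRobot CLI along these lines (the serial ports and arm IDs below are placeholders, not our actual values; check them against your own setup):</p>

```shell
# Calibrate the follower arm (joint limits, motor ranges).
# /dev/ttyACM0 and the ids are placeholders - substitute your own.
lerobot-calibrate \
  --robot.type=so101_follower \
  --robot.port=/dev/ttyACM0 \
  --robot.id=follower_arm

# Teleoperate the follower with the leader arm.
lerobot-teleoperate \
  --robot.type=so101_follower \
  --robot.port=/dev/ttyACM0 \
  --robot.id=follower_arm \
  --teleop.type=so101_leader \
  --teleop.port=/dev/ttyACM1 \
  --teleop.id=leader_arm
```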

<h2 id="the-task">The Task</h2>

<p>The task was:</p>

<blockquote>
  <p>Pick up an apple and place it into a white ceramic bowl.</p>
</blockquote>

<p>Each episode consisted of a left-to-right pick-and-place motion, with some variation in object and target positions.</p>

<h2 id="dataset-v1-20-episodes">Dataset v1: 20 Episodes</h2>

<p>The initial dataset consisted of <strong>20 teleoperated demonstrations</strong>, collected using <code class="language-plaintext highlighter-rouge">lerobot-record</code> with two camera views:</p>

<ul>
  <li>a wrist-mounted camera on the follower arm</li>
  <li>a world camera mounted above the workspace (top-down view)</li>
</ul>

<p>I initially started with just 20 demonstrations, under the impression that this would suffice for a VLA to learn the task. You can view the full dataset <a href="https://huggingface.co/spaces/lerobot/visualize_dataset?path=%2Fmattpidden%2Fsmol-vla-test-dataset%2Fepisode_0">here</a>.</p>
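<p>The recording command looked roughly like the following (ports, camera indices, and resolutions here are placeholders rather than our exact configuration; the camera names match the <code class="language-plaintext highlighter-rouge">claw</code>/<code class="language-plaintext highlighter-rouge">world</code> keys used later in training):</p>

```shell
# Record teleoperated episodes to a Hugging Face dataset.
# Ports, camera indices, and resolutions are placeholders.
lerobot-record \
  --robot.type=so101_follower \
  --robot.port=/dev/ttyACM0 \
  --robot.id=follower_arm \
  --robot.cameras="{ claw: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}, world: {type: opencv, index_or_path: 1, width: 640, height: 480, fps: 30} }" \
  --teleop.type=so101_leader \
  --teleop.port=/dev/ttyACM1 \
  --teleop.id=leader_arm \
  --dataset.repo_id=mattpidden/smol-vla-test-dataset \
  --dataset.num_episodes=20 \
  --dataset.single_task="Pick up the apple and place it into the white ceramic bowl."
```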

<video controls="" width="700">
  <source src="https://huggingface.co/datasets/mattpidden/smol-vla-test-dataset/resolve/main/videos/apple-dataset-timelapse.MOV" type="video/quicktime" />
</video>

<video controls="" width="700">
  <source src="https://huggingface.co/datasets/mattpidden/smol-vla-test-dataset/resolve/main/videos/observation.images.world/chunk-000/file-000.mp4" type="video/mp4" />
</video>

<p>The demonstrations were relatively unstructured, with moderate variation in object and target starting positions.</p>

<h2 id="training">Training</h2>

<p>I fine-tuned <strong>SmolVLA</strong> using LeRobot.</p>

<p>Training was done on Imperial’s GPU cluster using Slurm. This introduced some friction:</p>
<ul>
  <li>dependency issues during setup</li>
  <li>occasional out-of-memory errors</li>
</ul>

<p>Eventually, I got training running on an A40 node via the LeRobot CLI with the following command and flags.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>lerobot-train <span class="se">\</span>
  <span class="nt">--dataset</span>.repo_id<span class="o">=</span>mattpidden/smol-vla-test-dataset <span class="se">\</span>
  <span class="nt">--policy</span>.path<span class="o">=</span>lerobot/smolvla_base <span class="se">\</span>
  <span class="nt">--output_dir</span><span class="o">=</span>/vol/bitbucket/mdp25/outputs/smolvla2 <span class="se">\</span>
  <span class="nt">--job_name</span><span class="o">=</span>smolvla_training <span class="se">\</span>
  <span class="nt">--batch_size</span><span class="o">=</span>64 <span class="se">\</span>
  <span class="nt">--steps</span><span class="o">=</span>40000 <span class="se">\</span>
  <span class="nt">--save_freq</span><span class="o">=</span>1000 <span class="se">\</span>
  <span class="nt">--policy</span>.device<span class="o">=</span>cuda <span class="se">\</span>
  <span class="nt">--policy</span>.repo_id<span class="o">=</span>mattpidden/smolvla_apple_policy <span class="se">\</span>
  <span class="nt">--rename_map</span><span class="o">=</span><span class="s1">'{"observation.images.claw": "observation.images.camera1", "observation.images.world": "observation.images.camera2"}'</span>
</code></pre></div></div>

<p>For inference, I used a local RTX 4090 with the async server. Runtime performance was good (~0.1s per step), so latency was not a bottleneck.</p>

<h2 id="results-v1">Results (v1)</h2>

<p>Performance on the real robot was poor:</p>

<ul>
  <li>the robot consistently failed to grasp the apple</li>
  <li>often missed the object entirely</li>
  <li>when manually corrected into a grasp, it sometimes moved toward the bowl but dropped the object nearby</li>
</ul>

<p><strong>Success rate: ~0%</strong></p>

<p>That said, the behaviour wasn’t random. The model appeared to partially understand the task structure — moving toward the bowl after interacting with the object — but could not execute a reliable grasp.</p>

<p>Grasping was clearly the main failure point.</p>

<h2 id="dataset-v2-50-episodes">Dataset v2: 50 Episodes</h2>

<p>After reading other blog posts, it became clear that improving performance would require more and better data. I collected <strong>30 additional episodes</strong> (50 total), with a few key changes:</p>

<ul>
  <li>reduced task variability</li>
  <li>constrained apple and bowl positions to tighter regions</li>
  <li>more consistent demonstration trajectories</li>
</ul>

<p>The goal was to increase demonstration density rather than diversity.</p>
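<p>To make “tighter regions” concrete: for v2 we effectively narrowed the distribution of start positions. A toy sketch of the idea (the bounds here are illustrative, not our measured workspace):</p>

```python
import random

# Illustrative only: sample object start positions (in cm, relative to the
# robot base) from a broad region (v1-style) vs a constrained one (v2-style).
V1_BOUNDS = {"x": (-15.0, 15.0), "y": (20.0, 40.0)}  # broad, diverse
V2_BOUNDS = {"x": (-5.0, 5.0), "y": (25.0, 35.0)}    # tight, dense

def sample_position(bounds, rng=random):
    """Draw one (x, y) start position uniformly from the given region."""
    return (rng.uniform(*bounds["x"]), rng.uniform(*bounds["y"]))

def area(bounds):
    (x0, x1), (y0, y1) = bounds["x"], bounds["y"]
    return (x1 - x0) * (y1 - y0)

# With the same number of demonstrations, the tighter region yields far more
# demonstrations per unit of workspace area ("density over diversity").
print(area(V1_BOUNDS) / area(V2_BOUNDS))  # 6.0: v2 is 6x denser per episode
```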

<h2 id="results-v2">Results (v2)</h2>

<p>After fine-tuning on the expanded dataset:</p>

<ul>
  <li><strong>~30% success rate (3/10 trials)</strong></li>
  <li>a run was successful if the apple ended in the bowl within 60 seconds</li>
</ul>

<p>This was a clear improvement over v1. The model began to complete the full task in some cases, although behaviour was still inconsistent.</p>
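<p>Worth keeping in mind: with only 10 trials, a 30% point estimate carries wide uncertainty. A quick check with a Wilson score interval (a standard formula; this analysis is our own addition, not part of the evaluation protocol):</p>

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = p + z**2 / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (centre - margin) / denom, (centre + margin) / denom

lo, hi = wilson_interval(3, 10)
print(f"{lo:.2f} - {hi:.2f}")  # roughly 0.11 - 0.60
```

So 3/10 is consistent with a true success rate anywhere from roughly 10% to 60%; more trials would be needed to pin the number down.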

<h2 id="what-changed">What Changed?</h2>

<p>The main difference wasn’t just more data — it was <em>more consistent data</em>.</p>

<p>By reducing spatial variability, the model was better able to learn precise grasp locations rather than a broad, under-specified policy.</p>

<h2 id="takeaways">Takeaways</h2>

<p>A few things became clear:</p>

<ul>
  <li><strong>20 demonstrations is not enough</strong> for this task or policy</li>
  <li><strong>~50 demonstrations begins to produce usable behaviour</strong></li>
  <li><strong>grasping is the primary bottleneck</strong></li>
  <li><strong>dataset consistency matters more than raw diversity</strong></li>
</ul>

<h2 id="next-steps">Next Steps</h2>

<p>Next, I plan to:</p>

<ul>
  <li>improve the quality of grasp demonstrations</li>
  <li>increase dataset size further</li>
  <li>experiment with alternative policies (e.g. pi0.5, ACT)</li>
  <li>potentially adjust the camera setup and resolution</li>
</ul>]]></content><author><name>mattpidden</name></author><summary type="html"><![CDATA[Initial real-world experiments with VLAs on the SO-101, exploring the gap between few-shot expectations and practical performance.]]></summary></entry></feed>