<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://justintiensmith.github.io/vla-research/feed.xml" rel="self" type="application/atom+xml" /><link href="https://justintiensmith.github.io/vla-research/" rel="alternate" type="text/html" /><updated>2026-04-22T10:52:48+00:00</updated><id>https://justintiensmith.github.io/vla-research/feed.xml</id><title type="html">Exploring VLAs on real robots</title><subtitle>Weekly updates on our MSc dissertation exploring VLAs, with a focus on real-world benchmarking using the SO-101 robotic arm.</subtitle><author><name>justintiensmith &amp; mattpidden</name></author><entry><title type="html">Multicolored-Block-Pick-And-Place-Dataset</title><link href="https://justintiensmith.github.io/vla-research/2026/04/17/multicolored-block-pick-and-place-dataset.html" rel="alternate" type="text/html" title="Multicolored-Block-Pick-And-Place-Dataset" /><published>2026-04-17T00:00:00+00:00</published><updated>2026-04-17T00:00:00+00:00</updated><id>https://justintiensmith.github.io/vla-research/2026/04/17/multicolored-block-pick-and-place-dataset</id><content type="html" xml:base="https://justintiensmith.github.io/vla-research/2026/04/17/multicolored-block-pick-and-place-dataset.html"><![CDATA[<p>Today, we collected a new pick-and-place dataset with 200 episodes using four different colored blocks. Most of the episodes were collected by picking up the object on the left and placing it in a bin on the right.</p>

<p><img src="/vla-research/images/blocks.jpg" alt="Our multicolored blocks" width="600" /></p>

<p>Our first “toy” dataset contained only 50 episodes and used an apple as the object to pick and place in a ceramic bowl. The apple barely fit in the robot gripper, which made the setup less reliable. The world camera in that dataset was also mounted directly above the workspace, looking straight down. We suspected this initial setup was holding performance back: the dataset was quite small, and the top-down camera angle likely conveyed too little depth information.</p>

<p>To improve on that first dataset, we made three major changes. First, we repositioned the world camera so that it viewed the workspace from an angle, looking down from the front of the robot rather than directly overhead. Second, we replaced the apple with Duplo LEGO blocks, which fit much more naturally in the SO-101 gripper. Third, we collected 200 episodes instead of just 50.</p>

<p>The new dataset includes four block colors, with each color picked up 50 times. Of those 50 episodes, 25 involve only the target block, while the other 25 include distractor blocks. Within those distractor episodes, 10 include one additional block, 10 include two additional blocks, and 5 include three additional blocks.</p>
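<p>The split above can be written out as a simple episode plan. A minimal sketch in Python (the function and structure here are ours, purely illustrative; the actual episodes were collected by teleoperation):</p>

```python
# Illustrative sketch of the episode plan described above (not the actual
# collection script). Each colour gets 50 episodes: 25 with the target block
# alone, then 10/10/5 with one, two, and three distractor blocks.
COLORS = ["blue", "red", "green", "yellow"]
DISTRACTOR_PLAN = {0: 25, 1: 10, 2: 10, 3: 5}  # distractor count -> episodes

def build_plan():
    episodes = []
    for color in COLORS:
        for n_distractors, count in DISTRACTOR_PLAN.items():
            for _ in range(count):
                episodes.append({
                    "target": color,
                    "distractors": n_distractors,
                    "instruction": (
                        f"Pick up the {color} block and carefully "
                        "place it in the black bin."
                    ),
                })
    return episodes

plan = build_plan()
print(len(plan))  # 200 episodes in total
```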

<p>Each episode is paired with a language instruction corresponding to the target color. For example:</p>

<p><strong>Language instruction:</strong><br />
“Pick up the {blue, red, green, yellow} block and carefully place it in the black bin.”</p>

<video controls="" width="700">
  <source src="/vla-research/videos/pick-place-data.mp4" type="video/mp4" />
</video>

<p>Our dataset is available on Hugging Face here:<br />
<a href="https://huggingface.co/spaces/lerobot/visualize_dataset?path=%2Fjustintiensmith%2Fmulticolour_block_pick_place%2Fepisode_0">View the dataset</a></p>]]></content><author><name>justintiensmith &amp; mattpidden</name></author><summary type="html"><![CDATA[Today, we collected a new pick-and-place dataset with 200 episodes using four different colored blocks. Most of the episodes were collected by picking up the object on the left and placing it in a bin on the right.]]></summary></entry><entry><title type="html">Are VLAs Overhyped? First Results on a Real Robot</title><link href="https://justintiensmith.github.io/vla-research/2026/04/10/are-vlas-overhyped-first-results-on-a-real-robot.html" rel="alternate" type="text/html" title="Are VLAs Overhyped? First Results on a Real Robot" /><published>2026-04-10T00:00:00+00:00</published><updated>2026-04-10T00:00:00+00:00</updated><id>https://justintiensmith.github.io/vla-research/2026/04/10/are-vlas-overhyped-first-results-on-a-real-robot</id><content type="html" xml:base="https://justintiensmith.github.io/vla-research/2026/04/10/are-vlas-overhyped-first-results-on-a-real-robot.html"><![CDATA[<h2 id="context">Context</h2>

<p>This week was about getting a full end-to-end pipeline working: robot setup → data collection → training → real-world inference.</p>

<p>The goal was simple: get a Vision-Language-Action (VLA) model running on the SO-101 and see how quickly we could get <em>any</em> level of real-world performance on a basic pick-and-place task.</p>

<h2 id="the-setup">The Setup</h2>

<p>I set up the SO-101 robot arm and followed the LeRobot installation process to create the Python environment. The installation itself was mostly smooth, although getting everything working cleanly on the GPU cluster took some effort later on.</p>

<p>After setup, I calibrated the arms (joint limits, motor ranges) using the <code class="language-plaintext highlighter-rouge">lerobot-calibrate</code> CLI command, then spent time teleoperating with the leader arm to get comfortable controlling the follower.</p>
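<p>For reference, calibration and teleoperation go through the LeRobot CLI along these lines (the serial ports and arm IDs below are placeholders, not our actual values; check them against your own setup):</p>

```shell
# Calibrate the follower arm (joint limits, motor ranges).
# /dev/ttyACM0 and the ids are placeholders - substitute your own.
lerobot-calibrate \
  --robot.type=so101_follower \
  --robot.port=/dev/ttyACM0 \
  --robot.id=follower_arm

# Teleoperate the follower with the leader arm.
lerobot-teleoperate \
  --robot.type=so101_follower \
  --robot.port=/dev/ttyACM0 \
  --robot.id=follower_arm \
  --teleop.type=so101_leader \
  --teleop.port=/dev/ttyACM1 \
  --teleop.id=leader_arm
```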

<h2 id="the-task">The Task</h2>

<p>The task was:</p>

<blockquote>
  <p>Pick up an apple and place it into a white ceramic bowl.</p>
</blockquote>

<p>Each episode consisted of a left-to-right pick-and-place motion, with some variation in object and target positions.</p>

<h2 id="dataset-v1-20-episodes">Dataset v1: 20 Episodes</h2>

<p>The initial dataset consisted of <strong>20 teleoperated demonstrations</strong>, collected using <code class="language-plaintext highlighter-rouge">lerobot-record</code> with two camera views:</p>

<ul>
  <li>a wrist-mounted camera on the follower arm</li>
  <li>a world camera mounted above the workspace (top-down view)</li>
</ul>

<p>I initially started with just 20 demonstrations, under the impression that this would suffice for a VLA to learn the task. You can view the full dataset <a href="https://huggingface.co/spaces/lerobot/visualize_dataset?path=%2Fmattpidden%2Fsmol-vla-test-dataset%2Fepisode_0">here</a>.</p>
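<p>The recording command looked roughly like the following (ports, camera indices, and resolutions here are placeholders rather than our exact configuration; the camera names match the <code class="language-plaintext highlighter-rouge">claw</code>/<code class="language-plaintext highlighter-rouge">world</code> keys used later in training):</p>

```shell
# Record teleoperated episodes to a Hugging Face dataset.
# Ports, camera indices, and resolutions are placeholders.
lerobot-record \
  --robot.type=so101_follower \
  --robot.port=/dev/ttyACM0 \
  --robot.id=follower_arm \
  --robot.cameras="{ claw: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}, world: {type: opencv, index_or_path: 1, width: 640, height: 480, fps: 30} }" \
  --teleop.type=so101_leader \
  --teleop.port=/dev/ttyACM1 \
  --teleop.id=leader_arm \
  --dataset.repo_id=mattpidden/smol-vla-test-dataset \
  --dataset.num_episodes=20 \
  --dataset.single_task="Pick up the apple and place it into the white ceramic bowl."
```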

<video controls="" width="700">
  <source src="https://huggingface.co/datasets/mattpidden/smol-vla-test-dataset/resolve/main/videos/apple-dataset-timelapse.MOV" type="video/quicktime" />
</video>

<video controls="" width="700">
  <source src="https://huggingface.co/datasets/mattpidden/smol-vla-test-dataset/resolve/main/videos/observation.images.world/chunk-000/file-000.mp4" type="video/mp4" />
</video>

<p>The demonstrations were relatively unstructured, with moderate variation in object and target starting positions.</p>

<h2 id="training">Training</h2>

<p>I fine-tuned <strong>SmolVLA</strong> using LeRobot.</p>

<p>Training was done on Imperial’s GPU cluster using Slurm. This introduced some friction:</p>
<ul>
  <li>dependency issues during setup</li>
  <li>occasional out-of-memory errors</li>
</ul>

<p>Eventually, I got training running on an A40 node via the LeRobot CLI with the following command and flags.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>lerobot-train <span class="se">\</span>
  <span class="nt">--dataset</span>.repo_id<span class="o">=</span>mattpidden/smol-vla-test-dataset <span class="se">\</span>
  <span class="nt">--policy</span>.path<span class="o">=</span>lerobot/smolvla_base <span class="se">\</span>
  <span class="nt">--output_dir</span><span class="o">=</span>/vol/bitbucket/mdp25/outputs/smolvla2 <span class="se">\</span>
  <span class="nt">--job_name</span><span class="o">=</span>smolvla_training <span class="se">\</span>
  <span class="nt">--batch_size</span><span class="o">=</span>64 <span class="se">\</span>
  <span class="nt">--steps</span><span class="o">=</span>40000 <span class="se">\</span>
  <span class="nt">--save_freq</span><span class="o">=</span>1000 <span class="se">\</span>
  <span class="nt">--policy</span>.device<span class="o">=</span>cuda <span class="se">\</span>
  <span class="nt">--policy</span>.repo_id<span class="o">=</span>mattpidden/smolvla_apple_policy <span class="se">\</span>
  <span class="nt">--rename_map</span><span class="o">=</span><span class="s1">'{"observation.images.claw": "observation.images.camera1", "observation.images.world": "observation.images.camera2"}'</span>
</code></pre></div></div>

<p>For inference, I used a local RTX 4090 with the async server. Runtime performance was good (~0.1s per step), so latency was not a bottleneck.</p>

<h2 id="results-v1">Results (v1)</h2>

<p>Performance on the real robot was poor:</p>

<ul>
  <li>the robot consistently failed to grasp the apple</li>
  <li>often missed the object entirely</li>
  <li>when manually corrected into a grasp, it sometimes moved toward the bowl but dropped the object nearby</li>
</ul>

<p><strong>Success rate: ~0%</strong></p>

<p>That said, the behaviour wasn’t random. The model appeared to partially understand the task structure — moving toward the bowl after interacting with the object — but could not execute a reliable grasp.</p>

<p>Grasping was clearly the main failure point.</p>

<h2 id="dataset-v2-50-episodes">Dataset v2: 50 Episodes</h2>

<p>After reading other blog posts, it became clear that improving performance would require more and better data. I collected <strong>30 additional episodes</strong> (50 total), with a few key changes:</p>

<ul>
  <li>reduced task variability</li>
  <li>constrained apple and bowl positions to tighter regions</li>
  <li>more consistent demonstration trajectories</li>
</ul>

<p>The goal was to increase demonstration density rather than diversity.</p>
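<p>To make “tighter regions” concrete: for v2 we effectively narrowed the distribution of start positions. A toy sketch of the idea (the bounds here are illustrative, not our measured workspace):</p>

```python
import random

# Illustrative only: sample object start positions (in cm, relative to the
# robot base) from a broad region (v1-style) vs a constrained one (v2-style).
V1_BOUNDS = {"x": (-15.0, 15.0), "y": (20.0, 40.0)}  # broad, diverse
V2_BOUNDS = {"x": (-5.0, 5.0), "y": (25.0, 35.0)}    # tight, dense

def sample_position(bounds, rng=random):
    """Draw one (x, y) start position uniformly from the given region."""
    return (rng.uniform(*bounds["x"]), rng.uniform(*bounds["y"]))

def area(bounds):
    (x0, x1), (y0, y1) = bounds["x"], bounds["y"]
    return (x1 - x0) * (y1 - y0)

# With the same number of demonstrations, the tighter region yields far more
# demonstrations per unit of workspace area ("density over diversity").
print(area(V1_BOUNDS) / area(V2_BOUNDS))  # 6.0: v2 is 6x denser per episode
```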

<h2 id="results-v2">Results (v2)</h2>

<p>After fine-tuning on the expanded dataset:</p>

<ul>
  <li><strong>~30% success rate (3/10 trials)</strong></li>
  <li>a run was successful if the apple ended in the bowl within 60 seconds</li>
</ul>

<p>This was a clear improvement over v1. The model began to complete the full task in some cases, although behaviour was still inconsistent.</p>
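<p>Worth keeping in mind: with only 10 trials, a 30% point estimate carries wide uncertainty. A quick check with a Wilson score interval (a standard formula; this analysis is our own addition, not part of the evaluation protocol):</p>

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = p + z**2 / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (centre - margin) / denom, (centre + margin) / denom

lo, hi = wilson_interval(3, 10)
print(f"{lo:.2f} - {hi:.2f}")  # roughly 0.11 - 0.60
```

So 3/10 is consistent with a true success rate anywhere from roughly 10% to 60%; more trials would be needed to pin the number down.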

<h2 id="what-changed">What Changed?</h2>

<p>The main difference wasn’t just more data — it was <em>more consistent data</em>.</p>

<p>By reducing spatial variability, the model was better able to learn precise grasp locations rather than a broad, under-specified policy.</p>

<h2 id="takeaways">Takeaways</h2>

<p>A few things became clear:</p>

<ul>
  <li><strong>20 demonstrations is not enough</strong> for this task or policy</li>
  <li><strong>~50 demonstrations begins to produce usable behaviour</strong></li>
  <li><strong>grasping is the primary bottleneck</strong></li>
  <li><strong>dataset consistency matters more than raw diversity</strong></li>
</ul>

<h2 id="next-steps">Next Steps</h2>

<p>Next, I plan to:</p>

<ul>
  <li>improve the quality of grasp demonstrations</li>
  <li>increase dataset size further</li>
  <li>experiment with alternative policies (e.g. pi0.5, ACT)</li>
  <li>potentially adjust the camera setup and resolution</li>
</ul>]]></content><author><name>mattpidden</name></author><summary type="html"><![CDATA[Initial real-world experiments with VLAs on the SO-101, exploring the gap between few-shot expectations and practical performance.]]></summary></entry></feed>