Humanoid Robots: Dollars and GPTs
(This is Part 2. Part 1 covered the various ways robots can be human-like and is not required to enjoy Part 2.)
In Part 1, I talked about the various benefits and drawbacks of making robots that mimic human form, but I didn’t actually get to the core questions: “Why are there so many humanoids now?” and “Is this a good idea?”
To answer “why now?” we have to ask “what has changed?” Here is a timeline of what I consider the breakthroughs in the state of the art that enable what we see today.
2018: Boston Dynamics releases their first parkour video. Bipedal motion works well enough to count on.
2019: Convolutional deep neural networks make perception (object detection/localization) work well enough to do manipulation.
2022: ChatGPT convinces everyone that natural language is going to work well enough to power actual products.
2023: Tesla releases videos of their humanoid robots. VCs notice and funding for humanoids is suddenly much more attainable.
2024: Behavior cloning for manipulation starts to work well enough for tech demos and videos.
Other helpful general trends:
Batteries have gotten much more energy-dense and much less expensive.
Rapid prototyping has gotten much cheaper and faster (both from 3D printing and more automation in prototype-scale traditional manufacturing).
But is that enough? Are these the missing ingredients? To answer that, let's talk about money.
Making Money
The purpose of a robot company is to replace expensive human labor with cheap robot labor.
The value produced (the cost of the human labor saved minus the cost of running the robot) is split between the robot company and the customer. The customer saves money (which is why they bothered buying the robots) and pays the start-up using part of the money saved, so that number has to be big enough to make it worthwhile for both parties. Let's do some back-of-the-envelope math.
Cost to Run Robots:
Cost of a robot: we’ll take a very low-ball guess at $60k.
Lifetime of a robot: 3 years.
Which means we get an annual hardware cost of $20k.
Cost of Support
Let's say that you’ve got pretty good reliability despite being cutting edge hardware. Your robots only need service every 6 months on average (either routine maintenance or repairing something that has broken).
Each time that happens you have to ship it back to your repair facility (say $300 each way) and spend 3 hours repairing it ($300 in technician costs, amortizing in downtime for your technicians because they are not used 100% of the time). We’ll round up to $1k a repair. Plus the robot probably does no work for a whole week while the repair gets resolved, costing another $1k in lost revenue. $2k twice a year is $4k.
Cost of support: $4k per year per robot.
Cost of Supervision
Robots are less flexible and adaptable than humans, get confused easily and sometimes get stuck doing the wrong thing. The state of the art is to have a ‘call center’ of folks who can remotely supervise robots and unstick them. A huge lever on the economic viability is how many robots a single remote supervisor can handle. 1:1 is an obvious non-starter, but there is an inflection point around 5:1 where you can no longer have people watching camera feeds, ready to jump in and help; instead the robots have to recognize that they are in a pickle and ask for help. Let's say that we’ve solved that problem, for the most part, and a single remote operator can supervise 50 robots. That person costs $50k a year. That adds another $1k a year per robot.
We also have to have someone on-site to supervise and assign jobs, and do smaller troubleshooting and maintenance. Let’s say a 20-robot deployment takes 10% of one person’s time to manage. (We’ll dump the responsibility on our IT department because robots are kind of like computers.) 10% of $200k divided by 20 robots is another $1k per year.
Cost of supervision: $2k per robot per year
Total cost of operating the robot: $26k per year
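The tally above can be written out as a small calculation. This is only a sketch using the figures from this section; the function and parameter names are made up for illustration, and every default is one of the guesses above rather than measured data.

```python
def annual_cost_per_robot(
    hardware_cost=60_000,           # low-ball purchase price of one robot
    lifetime_years=3,               # expected service life
    repairs_per_year=2,             # one service visit every 6 months
    cost_per_repair=1_000,          # shipping both ways plus technician time, rounded up
    lost_revenue_per_repair=1_000,  # roughly a week of downtime
    operator_salary=50_000,         # remote supervisor
    robots_per_operator=50,
    onsite_salary=200_000,          # on-site person who manages the deployment
    onsite_fraction=0.10,           # share of their time spent on the robots
    robots_per_site=20,
):
    """Return the per-robot annual cost broken down by category."""
    hardware = hardware_cost / lifetime_years
    support = repairs_per_year * (cost_per_repair + lost_revenue_per_repair)
    supervision = (
        operator_salary / robots_per_operator
        + onsite_salary * onsite_fraction / robots_per_site
    )
    return {
        "hardware": hardware,
        "support": support,
        "supervision": supervision,
        "total": hardware + support + supervision,
    }


costs = annual_cost_per_robot()
for name, dollars in costs.items():
    print(f"{name:>12}: ${dollars:,.0f} per year")
```

With the defaults, the categories come to $20k, $4k and $2k, for a total of $26k. Changing `robots_per_operator` from 50 to 5 shows how quickly remote supervision starts to dominate.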
Value Created
The ‘cost of human labor saved’ is pretty straightforward. It is the fully loaded cost of the human. Fortunately for robotics startups, this is quite a bit higher than minimum wage. In Silicon Valley, when you add taxes, health-care, salaries for their supervisors and profit for the contracting company, someone to clean an office building can easily be more than $70 an hour. For this reason, labor intensive workplaces, like warehouses, tend to not be in the heart of Silicon Valley, and $30 an hour is probably a more reasonable national average.
Annual savings = hourly-cost x hours-per-shift x shifts-per-week x weeks-per-year
Making some reasonable assumptions:
$30 x 8 hours/shift x 6 shifts/week x 52 weeks/year
$74880. We’ll round up to $75k
$75k seems pretty darn good. With our cost of $26K that leaves $49k of value, meaning we get a 188% return on our investment. Not bad!
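Here is the same arithmetic as a short script, in case you want to plug in your own wage or shift numbers. It is a sketch: the variable names are invented, and the $26k cost is simply the operating-cost estimate from earlier in this section.

```python
def annual_labor_savings(hourly_cost=30, hours_per_shift=8,
                         shifts_per_week=6, weeks_per_year=52):
    """Fully loaded cost of the human labor that one robot would replace."""
    return hourly_cost * hours_per_shift * shifts_per_week * weeks_per_year


savings = annual_labor_savings()   # exact figure before rounding
rounded_savings = 75_000           # the rounded figure used in the text
robot_cost = 26_000                # annual operating cost per robot

net_value = rounded_savings - robot_cost
roi = net_value / robot_cost

print(f"exact savings: ${savings:,}")
print(f"net value:     ${net_value:,}")
print(f"return:        {roi:.0%}")
```

The exact savings come out to $74,880, the net value to $49,000, and the return to 188%.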
But wait, there’s less!
We’ve made an assumption that is really, really dodgy. We are assuming an hour of robot labor is as valuable as an hour of human labor. There are two big reasons to suspect that is not the case.
1: Robots are slow
There is a reason that you see a ton of sped-up robot videos. Robots work slowly. I love this video from 1X. It is an incredible (and incredibly honest) depiction of the state of the art.
The tech here is amazing, and although I’m sure that video took many takes to get, the robots must work pretty reliably to get that many things working all in a row. But you can see that the robots are doing pretty simple things about 2x-4x slower than a human would.
The ratio is better for carrying things long distances (humanoids are probably a little faster than half human walking speed, i.e. between 1x and 2x) and worse for fiddly manipulation things. This (incredible) video shows some very challenging manipulation tasks:
It is super impressive. This video shows things beyond what I thought was state of the art. But it also takes 45 seconds to put the shirt on a hanger, a task that probably takes a human less than three (a ratio of 15x or more). I remember doing a focused speed-up sprint on our table wiping at Everyday Robots. We parallelized motion and planning, increased arm speed, and pre-computed things so that they’d be ready when we needed them, and managed to get a table wipe down to about 45 seconds. Then I timed a human and they did a better job in 4.5 seconds (a ratio of 10x).
So if we are projecting the state of the art forward 2-3 years, we should probably give robots at least a 3x speed penalty, meaning it takes three robots to do the work of one human. Faster than that would be assuming that autonomous robots can do things faster than we can teleoperate them, which feels like an open research problem. Safety also gets harder if you let robots move that fast.
2: Robots are less adaptable
It turns out that actual jobs are made of dozens of individual tasks. Automation tends to automate tasks one at a time. This is a place where humanoids have more potential than other kinds of robots. In 2011 I was working at Anybots Inc and we were looking into teleoperated security. We were very enthusiastic until we interviewed someone who did after-hours security work. He told us that in addition to ‘walking around and looking for bad guys’, his job consisted of testing all the doors to make sure they were locked, unplugging the coffee machine if it had been left on, turning off lights, etc. We had only considered the main task, and had not realized the whole slew of smaller tasks that were part of that job.
If there is a human picking items in a warehouse and all the orders get picked, that person will switch to something like refilling or doing inventory or preparing labels or folding and taping shipping boxes, or any of the countless other tasks. Even a robot (humanoid or otherwise) that could physically do those tasks would go idle during that time unless you had programmed/trained those behaviors. My estimate is that the lack of flexibility costs at least 15% of the robot’s time.
So we take our 75K, subtract 15% for the flexibility penalty ($64k) and divide by 3 for the productivity ratio and we get $21k of value produced per robot per year.
Now we subtract our $26k costs and all of a sudden humans are cheaper than robots.
Womp-womp.
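The two penalties are easy to apply in code. This is a sketch built only from the numbers above (75k of labor value, a 15% flexibility penalty, a 3x speed penalty and $26k of annual cost); the function name and parameters are invented for illustration.

```python
def adjusted_value(full_value=75_000, flexibility_penalty=0.15, speed_ratio=3.0):
    """Annual value of one robot after discounting idle time and slowness.

    flexibility_penalty: fraction of the time the robot has nothing it can do.
    speed_ratio: how many times slower the robot is than a human.
    """
    return full_value * (1 - flexibility_penalty) / speed_ratio


annual_cost = 26_000
value = adjusted_value()
net = value - annual_cost

print(f"value per robot per year: ${value:,.0f}")
print(f"net per robot per year:   ${net:,.0f}")
```

Without the intermediate rounding the value comes to $21,250, so each robot loses a little under $5k a year compared with just paying the human.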
What can you do?
Your options are rough. You can try to find a customer that operates more than one shift per day, because the same robot can work both shifts. However, you then need to spend part of each shift charging instead of just charging between shifts, and your robot probably doesn’t last as many years and needs servicing more often if you run it more hours per day.
You can try to reduce the cost of your robot, but cutting the cost in half is going to be very hard. (Getting your cost down to $60k is already pretty optimistic. Shooting for $60k and accidentally ending up at $120k feels likely. $30k feels nearly impossible).
So the biggest lever you have is your productivity ratio. Getting from a slowdown of 3x-10x down to between 1x and 2x is the most plausible way to make the math work. So this seems like the most important metric humanoid robot startups should track.
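To see how the two levers compare, here is a small sweep over robot price and slowdown. It is a sketch with invented function names, and it makes one simplifying assumption that is not in the estimates above: support and supervision stay at $6k a year no matter what the robot costs or how fast it works.

```python
def annual_cost(robot_price=60_000, lifetime_years=3, support_and_supervision=6_000):
    """Hardware amortized over its lifetime plus a fixed $6k of support and supervision."""
    return robot_price / lifetime_years + support_and_supervision


def annual_value(speed_ratio, full_value=75_000, flexibility_penalty=0.15):
    """Value of the work one robot does per year at a given slowdown."""
    return full_value * (1 - flexibility_penalty) / speed_ratio


def break_even_speed_ratio(robot_price=60_000):
    """Slowdown at which the value produced exactly covers the annual cost."""
    return annual_value(1.0) / annual_cost(robot_price)


for price in (30_000, 60_000, 120_000):
    for ratio in (1.0, 2.0, 3.0):
        net = annual_value(ratio) - annual_cost(price)
        print(f"price ${price:>7,}  slowdown {ratio:.0f}x  net ${net:>9,.0f}/year")

print(f"break-even slowdown at $60k: {break_even_speed_ratio():.2f}x")
```

Under these assumptions a $60k robot breaks even at a slowdown of about 2.45x, which is why getting below 2x matters so much. Halving the price to $30k also gets you into the black at 3x, but as noted above that price looks much harder to hit.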
I made a “Robotics Startup CEO Simulator” website so you can fiddle with numbers and see how they work out:
What about ChatGPT? Doesn’t AI solve all of this?
LLMs (large language models like ChatGPT) do seem poised to solve two real problems in robotics.
The first is dealing with the really long tail of weird shit the world will throw at robots.
The second problem LLMs seem like a good fit for is high level planning to connect together lower level primitives.
Let's watch this video from Figure AI:
It's a very slick video showing some amazing tech. It can be hard to guess from videos of robot demos what's actually going on, but I commend @coreylynch (the human in the video) for explaining what we’re seeing in this tweet. The LLM is able to interpret spoken requests and select from pretrained execution models (he doesn't specify but these must be behavior-cloning.)
It’s funny, because I think the thing most people find impressive with the video is the conversational part, but that strikes me as a bit of demo showmanship. My guess is that the GPT input-output looks something like this:
Input:
    human_said: “hey, can I have something to eat?”,
    camera_image.png
    available_controllers:
        give_apple_to_person
        put_crumpled_paper_in_basket
        put_cup_in_drying_rack
        put_plate_in_drying_rack
Output:
    say: “Sure thing”
    selected_controller:
        give_apple_to_person
I say, “I bet it looks like that,” because that’s how I would do it. It is cool that LLMs are smart enough to have the context to select that controller, but (if I’m right about the general structure and generality) you can tell that the secret sauce here isn’t really the LLM. The bottleneck on being multi purpose is still the dexterity, not the natural language interface. Given the small list of things this robot has controllers for, the speech part could be faked with some speech-to-text and a bunch of regexes. It would be more brittle, but would still work. And each time you wanted to add another task you would spend hours collecting teleop demonstrations, hours training a new neural network and 3 minutes adding another regex. Sure, it would break if you said, “I’m hungry” instead of “Can I have something to eat”, but I bet it breaks now if you say “pass me the plate” because there is no give_human_plate controller.
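To make the regex point concrete, here is roughly what that brittle version could look like. This is purely a sketch: the controller names are the ones from my guess above, while the patterns and the `select_controller` function are invented for illustration and say nothing about how Figure's system actually works.

```python
import re

# One hand-written pattern per controller. The patterns are made up;
# a real system would need many more phrasings per task.
RULES = [
    (re.compile(r"\b(something to eat|snack|apple)\b", re.I), "give_apple_to_person"),
    (re.compile(r"\b(trash|garbage|paper)\b", re.I), "put_crumpled_paper_in_basket"),
    (re.compile(r"\bcup\b", re.I), "put_cup_in_drying_rack"),
    (re.compile(r"\bplate\b", re.I), "put_plate_in_drying_rack"),
]


def select_controller(transcript):
    """Return the first controller whose pattern matches, or None if nothing does."""
    for pattern, controller in RULES:
        if pattern.search(transcript):
            return controller
    return None


for phrase in ("hey, can I have something to eat?", "I'm hungry", "pass me the plate"):
    print(f"{phrase!r} -> {select_controller(phrase)}")
```

The first phrase selects `give_apple_to_person`, “I'm hungry” matches nothing, and “pass me the plate” picks `put_plate_in_drying_rack`, which is the wrong behavior. Smarter language understanding could catch that last case, but it still could not hand over the plate without a controller for it.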
It certainly feels like we’re living in the future when you watch the conversation, but the value the LLM is adding, in terms of the economics of the business, goes towards the 15% flexibility penalty. No, the thing that blows my socks off with that video is that absolutely gorgeous two handed plate grasp. It is truly a thing of beauty.
If you can tell robots to do other tasks in their down time (and the LLMs can figure out how to actually do those things and you have appropriate controllers to accomplish them), then they can be busy more of the time. But we only discounted 15% for flexibility. Even if LLMs gave you enough stuff to stay constantly busy that is a very small change to your viability, where the speed of productive work and cost of the robot are both big levers on viability.
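A quick comparison shows how small the flexibility lever is next to speed. This is a sketch using the back-of-the-envelope numbers from earlier (75k of labor value, $26k of annual cost, 15% flexibility penalty, 3x slowdown); the names are made up for illustration.

```python
ANNUAL_COST = 26_000   # operating cost per robot per year
FULL_VALUE = 75_000    # loaded cost of the human labor replaced


def value(flexibility_penalty, speed_ratio):
    """Annual value of one robot for a given idle fraction and slowdown."""
    return FULL_VALUE * (1 - flexibility_penalty) / speed_ratio


scenarios = {
    "baseline (15% idle, 3x slower)": value(0.15, 3.0),
    "perfect LLM (0% idle, 3x slower)": value(0.0, 3.0),
    "faster robot (15% idle, 2x slower)": value(0.15, 2.0),
}

for name, v in scenarios.items():
    print(f"{name:<36} value ${v:>7,.0f}  net ${v - ANNUAL_COST:>7,.0f}")
```

Even a robot that is never idle produces only $25k of value at a 3x slowdown, still short of the $26k cost, while a robot that is merely 2x slower clears it by almost $6k.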
That's why that plate grasp is what impresses me so much. It is very close to human speed.
Behavior Cloning Is All You Need?
Behavior cloning suddenly feels like perception did 10 years ago. You can make it work in narrow domains by overfitting to a small problem. In both cases you collect many hours of labeled data and end up with something that works nicely as long as it matches the data you collected very, very closely. But that's not what perception feels like today.
2014: Perceive this bottle on this background.
2019: Any bottle, any background.
2024: Pick up this plate from your right and put it into this basket on your left
2029: …?
Today you can basically assume perception will work well enough for your problem. If learned manipulation from demonstration is on that trajectory it is a big fucking deal. This could be our solution to the dexterity problem.
It’s not guaranteed that it is on the same trajectory. The tipping point for perception was internet-sized training sets, which will be much harder to come by for robotics. The two ways I can see it happening are:
1. Folks figure out how to package and sell “fine tuning” a robot to a particular task and location. You demonstrate/teleop a new task for a few hours and then the robot will be able to do that task in that place with minor variation. Make robot-plus-learning-toolkit into a product, sell that product to lots of people, and use everyone’s data to bootstrap more and more general models which give you better starting points for fine tuning which give you more performance which gives you more customers.
2. Figure out how to do transfer-learning from videos of humans doing things so you can use a data source like YouTube to bootstrap your model.
Option 2 is a reason you might want to bet on humanoids. If someone can make that transfer work (though it does not seem guaranteed that it will work) then I can imagine human-shaped robots helping with the transfer. If that's what's needed, then it is absolutely worth the extra cost and complexity of the human form. That's a lot of ‘if’s though.
Is it a Humanoid Revolution or a Bubble?
This is the big question, and I really don’t know. I’m voting with my time, working at Robust AI on a decidedly non-humanoid robot: one that puts a heavy emphasis on simplicity, low cost, and short term ROI. It's also one that skews much more heavily towards being a tool, rather than a servant. We are trying to build something more like a working dog than an employee.
I’m here because, after 8 years working on general purpose robots at Google X, I wanted to build something that we could use to find market fit, deploy and scale quickly. Humanoids are the high-risk-high-reward choice, and I’m currently excited to try something with a high probability of large deployments to real customers.
I’m not saying humanoids aren’t going to happen, but there are a lot of challenges to be solved before the economics of humanoids can work out. The progress is amazing but making the value larger than the cost is really hard: folks are going to have to nail both “very low cost robots” and “high productivity speeds”. On the other hand, technology is moving a lot faster than I thought it would a year ago. It is an absolutely incredible time to be a roboticist, and we won’t have to wait very long to find out what the future holds.
https://generalrobots.substack.com/p/humanoid-robots-dollars-and-gpts
Benjie Holson
Director of Robotics