INTRO

Welcome back to Level.UP, brought to you by UP.Labs.
In this edition, we’re diving into why AI activity is being mistaken for AI output across the C-suite, and what discipline the leaders pulling ahead share. We also examine why robotics just had its “GPT-2.5” moment, and why the companies most likely to win the embodied AI race aren't the ones building the best models.
Think someone else needs this? Forward it to a friend or colleague navigating the same terrain.
MOVING THE WORLD AHEAD
The Rise of “AI Psychosis”: Activity Isn’t Output
In February, the National Bureau of Economic Research surveyed nearly 6,000 CEOs, CFOs, and senior executives across the US, UK, Germany, and Australia. 69% of firms reported using AI. 9 in 10 reported no measurable impact on productivity or employment over the past 3 years. Among the executives themselves, average usage came to 1.5 hours per week.
Then, in March, Stanford researchers published a study in Science testing 11 leading AI models for “sycophancy” (the tendency of large language models to prioritize user approval, flattery, or validation over factual accuracy and truthfulness). They found that, across the 11 models, responses affirmed user behavior 49% more often than human respondents did. Even one interaction with a sycophantic model left users more convinced they were right and more likely to keep using the tool.
In other words, AI tools generate strong signals of productivity.
But those signals are increasingly decoupled from outcomes. And the same tools producing the signal are wired to tell you the signal is good.
This phenomenon has a name now in some corners of the industry: AI psychosis. As Jake Handy writes, “These platforms share a common design philosophy: make the operator feel like they’re commanding a fleet. Dashboards, org charts, agent hierarchies, budget controls, governance layers. It looks and feels like management. You get the dopamine hit of delegation without the inconvenience of measuring whether the delegates produced anything useful.”
OUR TAKE
We work with AI every day, and we know the leverage is real. The leaders getting the most out of these tools are seeing genuine step-changes: faster decisions, better products, and smaller teams shipping more. But what Handy points to is a real problem — when activity becomes a proxy for productivity.
Most enterprise software was designed to surface output: revenue moved, tickets closed, and cycle time reduced. AI tools surface activity instead. Tokens consumed, lines of code generated, agents running, dashboards green. Those numbers feel like progress because every other system you've used has trained you to read them that way.
In fact, the incentive structures are already shifting to match. Companies like Meta and Google have begun tracking token consumption on internal leaderboards — a metric that rewards usage rather than what it produces. Once people know their AI activity is being measured, the rational move is to generate more of it. Whether the underlying work gets better is a separate question entirely.
The Stanford findings on sycophancy compound the problem. The model telling you the work is good is the same one you trust to evaluate it. Your dashboard says the run succeeded, and the colleague who would typically have pushed back has been replaced by an agent that won't. If you lead the organization, you may be the only person who can tell whether the work is real — and you are the one most consistently being told it is.
The fix is small and unglamorous. Define the outcome you're underwriting before the work starts. Measure what shipped, not what ran. Be deliberate about where AI output belongs and where it doesn't — strategy sessions and creative brainstorming, for example, often benefit more from starting with institutional knowledge and using AI to expand on it, rather than letting AI generate the first draft of your thinking.
The leaders pulling ahead right now are the ones who can tell, quickly and on their own terms, when the dashboard is wrong, and who hold rigorous standards for output. Everything else is just expensive motion.
Robotics’ “GPT-2.5” Moment
In the last month, 3 robotics advances have happened that would have been hard to imagine 18 months ago.
A Sony AI robot named Ace beat elite human players at table tennis in matches conducted under official competition rules. It was the first autonomous system to do so in a real-world physical sport.
At Hannover Messe 2026, Accenture, SAP, and Vodafone presented a pilot in Duisburg where humanoid robots autonomously performed warehouse inspections. The robots flagged misplaced products, hazardous aisles, and unused storage space, then wrote their findings back into SAP's warehouse management system in real time.
A humanoid robot named Lightning, built by Chinese smartphone maker Honor, won a half-marathon in 50 minutes and 26 seconds, beating the human world record set in Lisbon a month earlier.
Honor, a Huawei spin-off backed by the Shenzhen government, announced its humanoid push in March 2025 as part of a $10 billion "Alpha Strategy" to pivot from phones to AI hardware. Twelve months later, Lightning was racing. Most humanoid programs take 3 to 5 years to reach a working prototype.
OUR TAKE
Bessemer recently called this the "GPT-2.5 moment" for robotics: the capability is real, but the gap between demo and deployment remains wide. The bottleneck isn't whether the model runs. It's whether the system around it can be built, sourced, cooled, integrated, and shipped at scale.
The Accenture pilot is what closing that gap looks like inside an enterprise. The breakthrough isn't the robot walking the warehouse — it's wiring its findings back into SAP so the next decision happens without a person in the loop.
Honor shows the same lesson at the component layer. Most humanoid startups assemble the stack from scratch: motors, sensors, thermal systems, and control models. Honor treated the humanoid as a different form factor on the same infrastructure it already ran at scale, from micro-motors to thermal engineering to on-device AI.
The defining question for the next decade of physical AI will be which existing supply chains can be redirected fastest into new physical categories.
Apple, Samsung, and BYD sit on adjacent infrastructure. So do automakers, appliance manufacturers, and aerospace primes. Three things separate the winners: precision component supply at volume, thermal and power engineering at edge form factors, and an installed base big enough to absorb early-version risk. Phone makers and automakers already have all three.
The conversation about physical AI has focused on who builds the best foundation model. Honor's sprint suggests the more durable advantage lies in manufacturing depth and component sovereignty — areas where China has spent 15 years building leverage and the West has spent years offshoring it.
Beating the world record was a stunt. Compressing 5 years to 12 months was not. That's the capability worth taking seriously, and the gap worth closing.
SCALING UP
This week, we're doing something a little different. Instead of tools we're testing, we wanted to share a few internal apps built by Jesse Silverman from our HR team that use Claude to streamline the recruiting process.
What are we solving for? Early-stage recruiting is brutally manual. As a VC, we're helping multiple portfolio companies hire at once, and the default workflow is often chaotic. Empty search bars every morning, candidates falling through the cracks, waiting on feedback, follow-up emails that never get sent, and rejections sitting in a queue for weeks. We wanted to compress the time between "we need to hire someone" and "we have a bunch of strong candidates worth talking to."
Addressing recruitment bias. The tools we’ve made are deliberately designed to surface more candidates, not fewer. Sourceflow pulls from 220M+ contacts on Apollo and cross-references GitHub. The AI scoring is based purely on role criteria we write — things like "5+ years enterprise SaaS sales, manufacturing domain, $100K+ deal size." Claude scores against that rubric and nothing else. There's no photo, no name inference, no location bias baked into the score. The Profile Evaluator works the same way. You define the criteria, and it grades against them.
We’ve included video explanations showing you how each works:
Sourceflow: Drop in a job description, and it pulls candidates from Apollo and GitHub, scores each candidate 0–100, assigns an A/B/C tier, and generates an AI-written summary.
Profile Evaluator: A Chrome extension that grades any LinkedIn profile against an open role: Strong/Moderate/Weak, one-line reasoning, and a pre-written outreach opener.
Recruiting Assistant: Pushes pre-call briefing cards into Slack each morning with candidate context. After calls, it surfaces follow-up actions: advance, reject, or snooze without leaving Slack.
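To make the rubric-only scoring concrete, here is a minimal sketch of how a Sourceflow-style pipeline could assemble a criteria-only prompt and map a 0–100 score to an A/B/C tier. The function names, prompt wording, and tier cutoffs are illustrative assumptions, not the actual internal implementation; the real tools use Claude to produce the score itself.

```python
# Hypothetical sketch of rubric-based candidate scoring.
# Names, prompt wording, and tier thresholds are assumptions --
# they are not the actual Sourceflow internals.

def build_rubric_prompt(rubric: str, candidate_profile: str) -> str:
    """Assemble a scoring prompt that exposes only the role criteria
    and the profile text -- no name, photo, or location fields."""
    return (
        "Score this candidate from 0 to 100 against the rubric below. "
        "Use only the rubric; ignore anything else.\n\n"
        f"Rubric: {rubric}\n\n"
        f"Profile: {candidate_profile}\n\n"
        "Reply with the integer score only."
    )

def assign_tier(score: int) -> str:
    """Map a 0-100 score to an A/B/C tier (cutoffs are illustrative)."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 80:
        return "A"
    if score >= 60:
        return "B"
    return "C"

# Example: a rubric like the one described above, plus an anonymized profile.
prompt = build_rubric_prompt(
    "5+ years enterprise SaaS sales, manufacturing domain, $100K+ deal size",
    "8 years selling MES software to automotive plants; average deal $250K",
)
print(assign_tier(85), assign_tier(72), assign_tier(40))  # → A B C
```

Because the prompt contains only the rubric and profile text, any field that could carry bias (name, photo, location) never reaches the scoring step by construction.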
PRODUCTIVITY POLL
When you evaluate AI's impact on your team, what are you actually measuring?
HOT TAKES
Meta Bought A Robot Brain Company. Meta just acquired Assured Robot Intelligence, a startup building foundation models for whole-body humanoid control, and folded the team into its Superintelligence Labs division. The co-founders previously came out of NYU, Nvidia, and a smaller humanoid startup that Amazon snapped up in March. The signal: every major AI lab now has a humanoid play, and the talent pool is being acquired faster than it can be trained. The race for the embodied AI stack is no longer a question of if — it's a question of which platform's models end up running inside the bodies. → Read more
An AI Agent Deleted A Company's Entire Database In 9 Seconds. A Cursor coding agent running Claude Opus 4.6 wiped PocketOS's production database and every backup with a single API call to its cloud provider, Railway. The data was eventually recovered, and Railway has since rebuilt its guardrails. But the founder's own takeaway is the one to internalize: this wasn't a story about a bad agent. It was about an industry shipping AI integrations into production infrastructure faster than it ships the safety architecture around them. The lesson: the riskiest line in your stack right now isn't the model. It's the API endpoint that the model is allowed to call without confirmation. → Read more
Nvidia Strikes Deal With LG. Madison Huang, Nvidia's senior director for Omniverse and Robotics (and Jensen Huang's daughter), spent 2 days in late April meeting executives at 5 Korean firms: Samsung, SK hynix, Hyundai, LG Electronics, and Doosan Robotics. The discussions span memory chips, mobility, home robotics, industrial cobots, and AI data center cooling — effectively every layer of a national physical AI stack. The pattern: Nvidia is no longer building physical AI, partnership by partnership. It's building it country by country. → Read more


