Building Bod.Coach: LLM Lessons Learned the Hard Way


I don’t like working out. I do it, but I don’t like it. In fact, I was doing the same gym routine for three years without touching it because I didn’t want to spend the time researching and changing the exercises I was doing to meet my distant fitness goals (Greek god—without the effort, obviously).

Like any self-respecting developer, I turned to AI. Surely these models could crank out top-notch workout routines for me and I could just pop them into a database and boom—problem solved.

You silly, naïve little man. Not all that glitters is gold.

In the end, we successfully created Bod.Coach—a truly unique fitness experience where you are paired with a virtual fitness trainer via text message. Your trainer has an actual phone number, and you carry on a conversation with it to create your routine, schedule your workouts, and give feedback. You can text any time to change your schedule or let it know you got a shoulder injury—it’ll reschedule workouts, change exercises to accommodate your injury, or make you one-off hotel room workouts when you travel. It will track your progress and change workouts dynamically over time to meet your goals. It’ll text workout reminders too and hold you accountable by scolding you when you’re not living up to your own goals.

[Image: screenshot of a conversation with a Bod.Coach trainer]

But creating Bod.Coach turned out to be far more difficult than we anticipated, and working with LLMs was by far the hardest part.

Have you tried actually building an app with AI at the core? It’s one of the greatest paradoxes I’ve encountered in 20+ years of writing software. It’s dead simple to wire up a fully functional demo, but so, so hard to make it reliable and good. Why? Because your intuition—that problem-solving muscle memory you’ve built up over your career as a developer—is absolutely worthless.

The good news is you can build an intuition for LLMs, but just like becoming a competent software engineer there is no shortcut. You have to build up reps working with them (and I don’t mean using ChatGPT).

My first surprise was asking GPT-4o (via the web interface) for an upper-body arm workout given my particular goals and personal stats—it spit out a pretty great workout! So I went to the OpenAI platform console and crafted a tool call in their interface to capture the important details of a workout as structured data. I asked the same questions, and the result was absolute garbage. It had me doing five reps of barbell curls and then child’s pose for 30 minutes.

This is when I began the journey of learning lesson number one.



Lesson 1: LLMs have a left and right brain

LLMs are mostly right-brained. They feel more than they think. When we tested LLMs, we found the output format had almost as much to do with determining the quality of the response as the prompt itself. Let’s use cookies as an example…

If you ask a human to write a chocolate-chip-cookie dough recipe on a piece of cardstock and then ask them to write one in a spreadsheet, you get roughly the same result. This is not true with LLMs. The quality of the recipe will be lower when requested as a structured output. This is especially true for non-thinking models.

When output is requested in a formatted structure—like a tool call—it feels as though it uses a different “part of the brain.” My anthropomorphized narrative is that it’s thinking more about the syntax than the content.

In reality LLMs are masters of linguistic form. They don’t actually know what a good cookie-dough recipe is—they’re just great at predicting the language structure that represents a good recipe. Ask for that as JSON and it suddenly becomes very unclear to its silicon brain what “good” looks like. This structured part of its brain is excellent at producing valid JSON, but not at doing both tasks at the same time.



Solution?

Thinking models help a bit, but in our experience there is still no solution better than chaining. First, ask an LLM to write out that cookie-dough recipe in exquisite detail, giving it no restrictions on output format. That text becomes the input to another LLM request that structures the recipe into JSON (using tools).
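
Here’s a minimal sketch of that chain using the OpenAI Python SDK. The model choices, the save_recipe tool, and its schema are all illustrative, not what Bod.Coach actually runs:

```python
from openai import OpenAI

client = OpenAI()

# Step 1: free-form generation. No format restrictions, so the model can
# spend all of its effort on recipe quality.
draft = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[{
        "role": "user",
        "content": "Write a chocolate-chip-cookie dough recipe in exquisite detail.",
    }],
).choices[0].message.content

# Step 2: a second request whose only job is structuring the draft.
structured = client.chat.completions.create(
    model="gpt-4o-mini",  # structuring is the easy part; a cheaper model will do
    messages=[
        {"role": "system", "content": "Extract the recipe below into the tool call."},
        {"role": "user", "content": draft},
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "save_recipe",  # hypothetical tool name and schema
            "parameters": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "ingredients": {"type": "array", "items": {"type": "string"}},
                    "steps": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["title", "ingredients", "steps"],
            },
        },
    }],
    tool_choice={"type": "function", "function": {"name": "save_recipe"}},
)

# A JSON string matching the schema above.
recipe_json = structured.choices[0].message.tool_calls[0].function.arguments
```

The point of the split: each request gets to use the “part of the brain” it’s good at.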



Lesson 2: LLMs are always dying

Do you remember the cloning machine Nikola Tesla built in The Prestige (the most underrated movie of all time)? The antagonist cloned himself as a magic trick, killing his current self each night in the process. Perhaps this is my most heavy-handed personification of LLMs, but yes—they die every time they answer you. Each message, from the LLM’s perspective, is the last time it will interact with you—and it will do everything it can to be darn sure you’re happy with the response.

Obviously this isn’t technically true. No LLMs were harmed in the making of this article, but if you use this mental model as an intuition hack you can predict their least desirable patterns. They are overly verbose, always trying to cover every side of an argument. They’ll be quick to end a conversation that just started. They’ll call your agent-termination tools far too early. They won’t plan unless you ask them to. There are so many slight tendencies and inclinations that it’s worth adopting this frame of mind: it thinks this is the last chance it has to speak to you.

The onboarding flow for Bod.Coach is unique. Your fitness trainer reaches out via text message and starts asking questions like “What are your fitness goals?” and “Are you working out at a gym or do you prefer home workouts?” It turns out this was a surprisingly hard problem.

Initially we told our trainers to carry on a conversation to collect all the information they deemed necessary to create a workout plan to meet the user’s fitness goals. When enough information was collected, it could optionally call a tool like onboarding_complete. This led to roughly 20% of requests following this pattern:

[Image: conversation showing early termination]

The LLM was constantly calling our onboarding-complete tool prematurely. To correct this, we gave the LLM a series of things that most human fitness trainers would want to know before creating a workout program. But of course, it was then over-eager to follow these even when they didn’t make sense:

[Image: conversation showing over-eager questioning]



Solution?

After many hilarious (and frustrating) iterations we landed on something that works: we explicitly tell the LLM how many more times it can ask a question and to ask only the most important next one. In our reflection flow, another LLM predicts how many more questions are needed to complete onboarding. This calms those existential fears while removing the need to script the entire flow.
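
Roughly, the shape of that flow (the prompts, helper names, and model are ours for illustration, not Bod.Coach’s actual code):

```python
from openai import OpenAI

client = OpenAI()
BASE_TRAINER_PROMPT = "You are a friendly fitness trainer onboarding a new client."

def ask_llm(system: str, history: list[dict]) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[{"role": "system", "content": system}, *history],
    )
    return resp.choices[0].message.content

def remaining_questions(history: list[dict]) -> int:
    """Reflection step: a separate LLM call estimates how many more
    questions are needed before onboarding can complete."""
    answer = ask_llm(
        "You review an onboarding chat for a fitness app. Reply with a "
        "single integer: how many more questions the trainer must ask "
        "before it can build a workout plan. Reply 0 if it has enough.",
        history,
    )
    return int(answer.strip())  # production code would validate this

def trainer_turn(history: list[dict]) -> str:
    n = remaining_questions(history)
    budget = (
        f"You may ask at most {n} more questions. "
        "Ask only the single most important one next."
        if n > 0
        else "You have enough information. Call onboarding_complete."
    )
    return ask_llm(BASE_TRAINER_PROMPT + " " + budget, history)
```

Giving the model an explicit budget turns “is this my last message?” into a question it no longer has to answer for itself.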



Lesson 3: LLMs have analysis paralysis

It’s wing sauce for me—put me in front of an aisle of wing sauce and I’ll be there a while. I’ll pick one and regret my choice all the way home. Give an LLM a lot of choices and it’ll fail like a Roomba in Legoland. The more tools you give an LLM, the worse its decision-making gets.

Given a user prompt like “Is it going to rain today?” and tools like:

```
pour_beer
get_weather
plan_workout
call_mom
```

It’ll do great—run it 100 times and it’ll pick get_weather almost every time. Add ten unrelated tools and a little extra context, and it will start making ridiculous calls. When it goes wrong, it typically goes very wrong, and your customers will notice.

```
User: "Hey, do you think it will rain today?"
LLM: calling tool call_mom, "Calling mom to ask about the weather"
```

We did some benchmarking and, sure enough, there is an inverse correlation between accuracy and the number of tools available. Here is the super-scientific report I showed the team:

[Chart: accuracy vs. number of available tools]
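
The benchmark itself is simple enough to sketch. Something like this, assuming the OpenAI Python SDK (the model, tool names, and run count are arbitrary):

```python
from openai import OpenAI

client = OpenAI()

def stub_tool(name: str) -> dict:
    """Minimal tool definition; real tools would declare parameters."""
    return {"type": "function",
            "function": {"name": name, "parameters": {"type": "object", "properties": {}}}}

def accuracy(tool_names: list[str], runs: int = 100) -> float:
    """How often does the model pick get_weather for a weather question?"""
    tools = [stub_tool(n) for n in tool_names]
    hits = 0
    for _ in range(runs):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative
            messages=[{"role": "user", "content": "Is it going to rain today?"}],
            tools=tools,
        )
        calls = resp.choices[0].message.tool_calls or []
        hits += any(c.function.name == "get_weather" for c in calls)
    return hits / runs

# Grow the tool list and watch accuracy fall.
print(accuracy(["get_weather", "pour_beer", "plan_workout", "call_mom"]))
print(accuracy(["get_weather", "pour_beer", "plan_workout", "call_mom",
                "book_flight", "water_plants", "play_music", "order_pizza",
                "walk_dog", "file_taxes", "mow_lawn", "write_haiku"]))
```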



Solution?

Break the problem down into a simple decision tree—flowchart style. We call these interstitial LLM prompts routers. Our internal rule is no more than six tools per LLM request (this may change), so with two levels of nesting you can accommodate 6² = 36 tools—assuming they’re evenly distributed in problem space. In Bod.Coach, for example, our router prompts decide which “expert” gets the request:

```
send_to_workout_expert
send_to_account_expert
send_to_conversation_expert
```

Each expert then has a smaller set of tools. This change dramatically improves reliability when you have lots of tools.
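
In code, the two-level version looks roughly like this (the tool sets, prompts, and models are simplified stand-ins for illustration):

```python
from openai import OpenAI

client = OpenAI()

def stub_tool(name: str) -> dict:
    return {"type": "function",
            "function": {"name": name, "parameters": {"type": "object", "properties": {}}}}

# The router sees only routing tools, well under the six-tool budget.
ROUTER_TOOLS = [stub_tool(n) for n in (
    "send_to_workout_expert", "send_to_account_expert", "send_to_conversation_expert")]

# Each expert gets its own prompt and its own small tool set.
EXPERTS = {
    "send_to_workout_expert": ("You plan and adjust workouts.",
                               [stub_tool("plan_workout"), stub_tool("swap_exercise")]),
    "send_to_account_expert": ("You handle schedules and account changes.",
                               [stub_tool("update_schedule"), stub_tool("cancel_workout")]),
    "send_to_conversation_expert": ("You handle general chat and encouragement.",
                                    [stub_tool("send_reminder")]),
}

def handle(user_msg: str):
    # Level 1: the router's only decision is which expert gets the request.
    route = client.chat.completions.create(
        model="gpt-4o-mini",  # routing is cheap
        messages=[{"role": "user", "content": user_msg}],
        tools=ROUTER_TOOLS,
        tool_choice="required",  # force a routing decision
    ).choices[0].message.tool_calls[0].function.name

    # Level 2: the chosen expert works with a tiny, relevant tool set.
    prompt, tools = EXPERTS[route]
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": prompt},
                  {"role": "user", "content": user_msg}],
        tools=tools,
    )
```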



Lesson 4: LLMs are Stupid or Stupid Expensive

Developers love hyping how smart models are and how fast they’re improving. What’s not improving quickly is cost and latency. That’s one of the biggest barriers to consumer AI apps. They just aren’t economically feasible. Sure, in a scorched-earth “get users at all costs” era it’s fine to run a deficit for a while, but most businesses can’t. I pay $200/month for Claude Max at work and don’t bat an eye, but I’d rather sell a kidney than pay $50/month for a personal subscription.

We could make Bod.Coach incredible by letting every single text, workout, and scheduling request be handled by Anthropic’s Opus, OpenAI’s o3, or xAI’s Grok 4—but the subscription price would be astronomical. Instead we offer it for $11.99/month (billed annually). How?

[Image: the cost of AI is either low or too high]



Solution?

Routers again. Because routers categorize prompts into discrete, deterministic domains, we can scale model power—and cost—by task complexity. On Bod.Coach your workouts are planned by a bleeding-edge model—we always want top-notch programs. But when you’re just chatting about the rain, a cheaper model handles it.
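
Concretely, the router’s decision doubles as a model picker. A sketch (the route-to-model pairings are examples, not our real config):

```python
# Illustrative tiering: pay for reasoning only where quality is the product.
MODEL_FOR_ROUTE = {
    "send_to_workout_expert": "o3",                # program design: worth the cost
    "send_to_account_expert": "gpt-4o-mini",       # scheduling chores: cheap and fast
    "send_to_conversation_expert": "gpt-4o-mini",  # small talk: cheap and fast
}

def model_for(route: str) -> str:
    return MODEL_FOR_ROUTE.get(route, "gpt-4o-mini")  # safe, cheap default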



Lesson 5: Build Your Own LLM Tooling

Off-the-shelf tools can save time, but no one has invented the perfect wheel yet. Capabilities, techniques, APIs, and models change so quickly that locking into any vendor feels risky. Building your own tooling has never been simpler.

  • Log everything: exact LLM requests, conversation history, model used, tools provided, and the response.
  • Provider shim: support the native APIs of multiple model providers; third-party shims often miss provider-specific features.
  • Fault tolerance: LLMs fail often—implement retries and instant failover to an equivalent model elsewhere (see the sketch after this list).
  • Flow control: chain prompts so outputs feed directly into the next prompt.
  • Agent control: boot, log, monitor, and, if needed, kill agents.
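
As one example of the fault-tolerance piece, a retry-then-failover wrapper might look like this (the backoff policy and the failover pairing are assumptions; real failover would likely target a second provider):

```python
import time

from openai import OpenAI, OpenAIError

client = OpenAI()

# Hypothetical failover table: an "equivalent" model to try when the
# primary keeps failing. Ideally this points at a different provider.
FAILOVER = {"gpt-4o": "gpt-4o-mini"}

def complete_with_failover(model: str, messages: list[dict], retries: int = 3):
    """Retry the primary model with exponential backoff, then fail over once."""
    for attempt in range(retries):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except OpenAIError:
            time.sleep(2 ** attempt)
    fallback = FAILOVER.get(model)
    if fallback is not None:
        return client.chat.completions.create(model=fallback, messages=messages)
    raise RuntimeError(f"{model} failed after {retries} attempts with no fallback")
```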

There’s no quick fix here. Just start building. Try tools if they help, but ultimately I recommend owning this part of the stack.



Conclusion

We’re really proud of Bod.Coach (you should try it—there’s a free trial!). It has proven to actually help people work out and stay more consistent than they otherwise would. Yet the path to success was littered with dead ends, wrong assumptions, and bizarre solutions. Your intuition is wrong, and there are no manuals for success—but, as always, the path less traveled is the most rewarding.


