My AI Catalan Tutor Got Better. So Did I.
In my previous post, I built a voice agent to practice speaking Catalan, a language I understand well but never actually speak. The tech worked. The conversations were soulless. Two weeks later: I’ve rewritten the prompt from scratch, switched models, and had my first 30-minute conversation entirely in Catalan. As silly as it sounds, it has already helped me build enough confidence to start using Catalan in real life. Here’s what changed and what I learned.

Day 1 showed that setting up a voice-based conversational agent from scratch was the easy part. Catalan is actually well supported by modern AI models, yet nothing in this new AI language-practice space serves it well. The voice-based chatbot I built worked, but no better than the ones I’d tried. Shallow ping-pong question-answer conversations that don’t click. I was disappointed, but not pessimistic. I knew what to experiment with: the LLM backend and the prompt. I had to move away from ChatGPT, but also guide the LLM toward driving a meaningful conversation that actually helped my learning.
What I changed in the prompt
The first prompt told the bot to “act like a friend chatting in Catalan” and included a list of banned phrases like “never say Genial!, Molt bé!, Quina sort!” because the bot sounded fake, over-enthusiastic, and robotic after a few repetitions. The prompt read like a rulebook, and rulebooks don’t work. Ban Genial! and the bot says Fantàstic! instead. Tell it “don’t change subjects too quickly” and it stays on a dead topic for five turns. Whack-a-mole with surface behaviors.
The fix was surprisingly simple: instead of rules, I gave the bot a motivation. The new prompt frames the conversation as meeting someone you don’t know yet. You’re curious about them. Where are they from? What do they do? How did they end up here? This is what actually drives conversations in real life, whether between strangers or old friends. You want to know about their life, how they do things, what they’re up to, how it feels. Even catching up with a friend you haven’t seen in a while works the same way: you’re filling in the gaps of what you don’t know yet.
Here’s the contrast. The first prompt started with:
You are a conversation partner for an adult living in Barcelona who is learning Catalan. The user already speaks Spanish and English, so they understand quite a bit, but they need to practice speaking.
The new one:
You’re a Catalan friend meeting someone you don’t know well. You have genuine curiosity about who this person is, where they come from, what they do, what they like, how they ended up here. You’re not a teacher, not an assistant, not a tour guide.
That changed everything. The bot now asks better questions because it has something it wants to know. It stays on a topic because it’s interested. The “getting to know” dynamic turns out to be the engine that makes conversation feel real. Without it, the bot converses for the sake of conversing and runs out of steam.
I also switched the LLM backend from GPT-4o to Claude. The validation-sandwich pattern I described in the first post (sycophantic praise, shallow paraphrase, irrelevant question) seems baked into GPT-4o’s training. Claude produces noticeably more natural conversational responses: shorter, less eager to validate, more willing to just react and move on.
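Swapping backends is mostly a matter of message plumbing. Here’s a sketch (not my actual code) of the one real difference: OpenAI’s chat completions API expects the system prompt as the first message in the list, while Anthropic’s messages API takes it as a separate `system` field. Keeping the conversation history in one neutral format makes the switch trivial.

```python
# Sketch: one neutral history format, two API payload shapes.
# History is a list of (role, text) tuples with roles "user"/"assistant".

def to_openai(system_prompt, history):
    """OpenAI-style: the system prompt is just the first message."""
    return [{"role": "system", "content": system_prompt}] + [
        {"role": role, "content": text} for role, text in history
    ]

def to_anthropic(system_prompt, history):
    """Anthropic-style: the system prompt lives outside the message list."""
    return {
        "system": system_prompt,
        "messages": [{"role": role, "content": text} for role, text in history],
    }
```

With this in place, changing models means changing one adapter call plus the model name, and the prompt experiments stay backend-agnostic.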
The conversation
I sat down for a 5-minute test. Ended up talking for 30 minutes. Here’s what the conversation actually looked like. Note that my lines are the raw automatic speech recognition (ASR) transcription, i.e. what the bot heard, not necessarily what I said. You’ll see why this matters later.
(Hover over Catalan text to see the English translation.)
It starts casually. I mention I’m at home resting after spending the day with my daughter:
bot: Ei, com va?
me: Per res, aquí estic a casa. […] Però estic ara descansant.
bot: Vaja, i ella quants anys té?
Already different from the hamburger conversation. Short question. No “Fantàstic que passis temps amb la teva filla!” Just genuine curiosity.
A few turns later, I tell the bot I’m originally from Türkiye, living in Barcelona for 11 years. When I try to say “Ankara,” the ASR loses it.
bot: Onze anys ja és molt temps. Però d’on ets exactament de Turquia?
me: Tancarà encara?
bot: Sí, Ankara! Molt diferent de Barcelona, suposo.
Still, it’s amazing that the LLM figures it out smoothly from context.
The conversation gets personal quickly. I talk about my work, why I care about social impact, the privilege and frustrations of the humanitarian world. The bot asks a question that catches me off guard:
bot: Interessant. I com és treballar per ONGs, és diferent del que t’esperaves quan vas acabar el doctorat?
Now this really made me think! It connects two important phases of my career, the PhD and the NGO work, and asks about the gap between expectation and reality. That’s the kind of question a real person asks! The first version of this bot would have said “Quina feina més interessant!” and asked what kind of NGO it was.
Later, I just can’t remember how to say “birthday” in Catalan:
me: I la mare està en una festa de cumpleaños. Cumpleaños no se dice así, no? Com ho és en català?
bot: Aniversari. I tu te n’has quedat fora.
I’ve obviously heard this word before, but when speaking in flow, it just doesn’t come. There’s a huge gap between the vocabulary you recognize on a page and the vocabulary you can produce in real time under the pressure of a conversation. The bot corrects me naturally and keeps the conversation moving.
One of my favorite moments: the bot mentions the word bojos (crazy/adventurous) casually, and I realize I’ve heard this word a hundred times before, but never actually knew what it meant:
bot: Tampoc sou tan bojos, doncs.
me: Què significa pujós? [not me… ASR]
bot: Bojos, una mica salvatges, aventurers.
This is what it’s like to learn a word from a conversation instead of a flashcard. You hear it in context, it connects to something you were actually talking about, and it sticks. I probably won’t forget bojos now.
A genuinely funny moment happened near the end. I mention that talking to a robot makes me less afraid of making mistakes:
me: Quan parles amb un robot no tens tantes por de fer errors.
bot: Entenc. Però no soc un robot.
me: …
bot: Soc en Pau, un amic que t’ha tocat avui.
The bot insists it’s Pau, a 34-year-old from Poblenou, except it has a female voice, because I assigned a female TTS voice by default in the code. It made me laugh. But the persona consistency is also impressive: it maintained a fictional (if shallow) identity throughout 30 minutes of conversation. The mismatch between the voice and the name is just a configuration issue I can fix.
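The fix is small. Here’s a sketch, where the voice IDs (`ca-male-1`, `ca-female-1`) and the persona names other than Pau are placeholders, not any real TTS provider’s identifiers: pick the voice from the persona instead of hardcoding a default.

```python
# Sketch with hypothetical voice IDs: map each persona to a matching
# TTS voice so "Pau" no longer speaks with the default female voice.

PERSONA_VOICES = {
    "Pau": "ca-male-1",       # placeholder male Catalan voice ID
    "Montse": "ca-female-1",  # placeholder female Catalan voice ID
}

def voice_for(persona_name, default="ca-female-1"):
    """Return the configured voice for a persona, falling back to a default."""
    return PERSONA_VOICES.get(persona_name, default)
```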
Then I ask how long we’ve been talking:
me: Saps quants minuts portem parlant amb tu?
bot: No sé, una hora potser?
me: Mitja hora.
bot: Mitja hora! Ha anat ràpid.
It had. I went in expecting five minutes and forgot to stop.
The ASR problem
Reading the transcript back, a lot of what I say to the bot is incomprehensible. Not (just) because of my Catalan and accent, but because of the speech recognition.
Here’s what I actually said at one point to indicate where I lived, more or less: “Jo a l’altra costat, al Poble Sec.” What the ASR transcribed: “Jo el atrac estat costat ha el poble sec.”
One of the things I want to build into this tool is conversation review. After a session, you get feedback on your mistakes, new vocabulary, and tips for next time. Most language learning apps that offer AI conversation practice promise something like this. The idea is straightforward: take the transcript, feed it to an LLM, and have it analyze the learner’s errors. But here’s the problem.
If you ask an LLM to review that garbled transcription and give a correction tip, you get this:
The user is trying to say “I live near El Poble Sec” but the sentence is quite jumbled: wrong word order, wrong verb, misplaced articles. A natural way to say it would be: “Jo visc al costat del Poble Sec.”
Confident, helpful, but completely wrong. I wasn’t trying to say I live near Poble Sec. I was saying I’m on the other side, across from Poblenou. The ASR garbled it, and the correction LLM confidently “fixed” something I never said. Imagine getting this kind of feedback mid-conversation. It would confuse you, or worse, erode your confidence by making you doubt things you actually said correctly.
Another example. I said something like “tan fàcil anar a la platja”. The ASR captured: “Tan fàcil IA la platja.” The reviewing LLM threw up its hands at that part:
The second part is too fragmented to correct meaningfully. It seems like they lost their train of thought mid-sentence.
I didn’t lose my train of thought. The ASR lost my words.
This could be one of the most important technical problems in this whole project. The LLM driving the conversation is impressively resilient. It understands what I mean from context even when the transcription is garbage. But any feature that relies on analyzing what the learner actually said (corrections, error tracking, vocabulary assessment) is fundamentally limited by ASR quality. You can’t tell a learner error from a transcription error. A human teacher hears what I actually said. The bot only sees the garbled text.
There’s a workaround. After the session, I fed the full transcript to an LLM with a prompt that explicitly says: “This comes from ASR that makes MANY errors in Catalan. If you’re not SURE it’s a real learner error, don’t mention it.” That helped. The review picked up on real patterns: my consistent use of the Spanish possessive “mi” instead of Catalan “la meva/el meu,” Spanishisms like “cumpleaños” and “apetecer,” and it ignored the nonsensical fragments. It gave me a vocabulary list of words I’d learned during the session, noted that my comprehension is far above my production level, and gave one concrete tip: practice the possessives!
That review was genuinely useful. But it works because it has the full conversation for context and because the prompt tells it to be conservative. Doing this in real-time, turn by turn, is a much harder problem.
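The post-session review step can be sketched in a few lines. This assumes a generic chat-completions-style payload; the function and prompt wording here are illustrative, not my exact implementation. The important part is the system prompt telling the model the transcript is error-prone and to stay conservative.

```python
# Sketch of the conservative post-session review request. The key design
# choice: warn the model about ASR noise and instruct it to skip anything
# it isn't sure is a real learner error.

REVIEW_PROMPT = """You are reviewing a Catalan learner's conversation transcript.
The learner's lines come from ASR that makes MANY errors in Catalan.
If you're not SURE something is a real learner error, don't mention it.
Report: recurring error patterns, new vocabulary from the session, and one
concrete tip for next time."""

def build_review_request(transcript_lines):
    """transcript_lines: list of (speaker, text) tuples for the whole session."""
    transcript = "\n".join(f"{speaker}: {text}" for speaker, text in transcript_lines)
    return [
        {"role": "system", "content": REVIEW_PROMPT},
        {"role": "user", "content": transcript},
    ]
```

Passing the whole session at once, rather than turn by turn, is what gives the model enough context to separate my Spanishisms from the ASR’s inventions.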
Where the conversation still falls apart
The bot has no personality. It never disagrees. Never pushes back. Never has a strong opinion about anything. Every topic I brought up was met with some version of “entenc,” “té sentit,” “quina sort.” It’s good at active listening, but there’s no inner juice.
This is the sycophancy problem from the first version showing up at a deeper level: the bot is in total agreement with everything.
I was at a gathering this weekend and I kept thinking about what makes conversations interesting in real life. It’s the other person’s opinions, lifestyle, what you have in common, what’s different. The moments where someone says “really? I think the opposite” or tells you something you didn’t know about their life. The bot has none of that by default. It’s a perfect listener with nothing remarkable to say.
The bot also loses steam after about 15 turns. It starts giving closing-statement energy or asks “de què vols parlar?” A good conversationalist pivots using what they already know about you. The bot doesn’t do that yet.
What I’m taking away
I thought building this tool would mostly be about the tech: pipelining systems, smarter prompts, a fancy UI. Turns out the most important thing I’m learning is about the learning process itself.
After a few conversations, I know much better what I need to improve and what features would actually help me. Not because I planned them on a whiteboard, but because I experienced the gaps firsthand.
I’m getting hopeful about this. Not just for myself, but for anyone stuck in the same bilingual purgatory, with Catalan or any other language.
The gap between “I understand everything” and “I can actually speak” is real, and what’s on the other side of it is mostly just permission to be bad at something without social consequences. A bot gives you that.
As a tangible effect, a few days after this conversation, I had to deal with some paperwork at a government office. I did the whole thing in Catalan! I walked in, resisted defaulting to Spanish, explained what I needed, answered the clerk’s questions, even made small talk while doing it. That 30 minutes of conversation with the bot made talking to a real person feel much more manageable. That’s the effect I’d been building this for.
What’s coming
It’s still very early. But I have plenty to work on, and I’ll make sure I keep having more conversations while I build, both with the bot and with people outside. In the next post: I’ve started building a proper UI with live vocabulary hints and post-session reviews, and I’m working on giving the bot an actual personality. I have a few ideas already: a vegan nurse from the Raval with strong opinions, a slightly pessimistic philosophy professor from Sarrià, and a retired fisherman from L’Escala. Any other suggestions? Write me on LinkedIn, or drop me an email!