Teaching Cooklang to read Polish

I love coding and I love cooking, but most recipes are not written for my engineer's mind. I really don't care that you got the recipe from your grandma when you were 6 and on vacation in the countryside. That's nice, but it adds nothing to the recipe itself.

Moreover, most recipes are written in a non-intuitive way: there is some assumed number of servings, then a list of products with mixed units (sometimes grams, sometimes milliliters or, in the worst case, "cups", whatever that means). For the things I cook often I don't follow any recipe and just cook intuitively, but for something I make once in a blue moon, or for a recipe with two hundred ingredients, I want a recipe, and one that works for me.

The perfect recipe:

  • has atomic steps: not "add X, Y, Z and mix and do 10 other things" in a single step; that should be split into separate points
  • states quantities inline: I hate the constant lookup when a step says "add X" and I have to go back to the ingredient list to find out how much; the perfect recipe says "add X (Y units of that thing)"
  • gives units both ways: "100 milliliters of cream" makes perfect sense until all the other ingredients are given by weight; that's why I always ask AI to convert each ingredient using its density, so I get "X milliliters (or Y grams) of cream"
  • is written for one serving, as it's easier to multiply everything than to divide
  • lists all the cookware I need, with sizes: I hate starting a new recipe only to find out that the pan I picked is too small
  • is in chronological order: yes, I know you should read the whole recipe before starting; nevertheless, if something takes 40 min to be ready, it should be indicated at the beginning, not somewhere in the middle, so you can follow the steps in order and be sure the timing works out

And now for the solution: a project called... Cooklang!

Cooklang

Let me quote the description from the official Cooklang page:

Cooklang is a simple, human-readable text format for writing recipes that can be understood by both cooks and computers.

In practice it looks like that:

Why is it so cool? Because you can process such a recipe file and Cooklang will automatically extract the list of ingredients and cookware, scale it to the number of servings and, if you are planning a bigger meal, combine recipes and generate a nice shopping list 😀

A couple of years ago you had to convert most recipes found on the internet to Cooklang by hand, but now, with the power of AI, it's just a simple prompt with my requirements from above plus the requirement to output a valid Cooklang file, and you're ready to go... unless you speak Polish and use Polish recipes!

What's the issue with Polish and Cooklang?

Polish is a highly inflected language: the same noun can appear in 7 different grammatical cases. The word for "onion" alone has forms like cebula, cebuli, cebulę, cebulą, cebulo... This is a nightmare for any parser that needs to deduplicate ingredient names.

Take a look at a sample fragment of a recipe in Cooklang in Polish:

Podsmaż @cebulę{1 szt} na #patelni{} przez ~{5 min}.
Dodaj @cebula{pokrojoną} i @pomidory{2 szt}.

We have a couple of issues here:

  • onion - "cebulę", "cebula" - basic form (nominative case) "cebula"
  • pan - "patelni" - basic form "patelnia"
  • tomatoes - "pomidory" - plural for "pomidor"
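To make the problem concrete, here is a minimal sketch of what happens when the names from the fragment above are deduplicated as plain strings. The `baseForms` table is a hand-written stand-in used only for this illustration, not part of any real dictionary or of Cooklang itself.

```typescript
// Ingredient names exactly as they appear in the sample fragment.
const rawNames: string[] = ['cebulę', 'cebula', 'pomidory'];

// Naive deduplication: compare the strings as written.
const naive = new Set(rawNames);

// Hypothetical hand-written lookup table, only for this illustration.
const baseForms: Record<string, string> = {
  'cebulę': 'cebula',
  'cebula': 'cebula',
  'pomidory': 'pomidor',
};

// Deduplication after mapping every name to its base form.
const normalized = new Set(rawNames.map((name) => baseForms[name] ?? name));

console.log(Array.from(naive));      // three entries, onion is listed twice
console.log(Array.from(normalized)); // two entries: cebula, pomidor
```

The whole task is therefore to replace the hand-written table with something that knows every Polish word form.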

Finding a solution

If you have ever worked with full-text search, the solution is obvious: you need lemmatization, which is, as Wikipedia says, "grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word's lemma, or dictionary form".

First we need a dictionary. Fortunately there is one: https://sjp.pl/sl/odmiany/, available under multiple open licenses to choose from.

Each line of the dictionary file (odm.txt) lists the forms of one word separated by commas, and the first word in each line is the base form (nominative case). The entire file is loaded into a Map<form, lemma> at startup, then every ingredient name is looked up before deduplication:

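import * as fs from 'node:fs';
import { createInterface } from 'node:readline';

async function getLemmaMap(): Promise<Map<string, string>> {
  const lemmaMap = new Map<string, string>();
  // Stream the file line by line instead of reading it all into one string.
  const rl = createInterface({ input: fs.createReadStream('odm.txt') });
  for await (const line of rl) {
    const forms = line.trim().split(', ');
    if (forms.length < 2) continue; // a word with no other forms needs no mapping
    const lemma = forms[0]; // the first word on the line is the base form
    for (const form of forms) {
      // The first mapping wins when the same form appears on several lines.
      if (!lemmaMap.has(form)) lemmaMap.set(form, lemma);
    }
  }
  return lemmaMap;
}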
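To see what the loop does to individual lines, here is a small sketch that runs the same logic on three lines. The lines are hand-written in the "lemma, form, form" layout described above and are not copied from odm.txt; `addLine` and `demoMap` are names made up for this example.

```typescript
// Same per-line logic as in the loader, extracted so it can run without the file.
function addLine(line: string, target: Map<string, string>): void {
  const forms = line.trim().split(', ');
  if (forms.length < 2) return;
  const lemma = forms[0];
  for (const form of forms) {
    if (!target.has(form)) target.set(form, lemma); // first mapping wins
  }
}

// Illustrative lines only; the real file has far more forms per word.
const sampleLines = [
  'cebula, cebuli, cebulę, cebulą, cebulo',
  'olać, olej, olejcie',
  'olej, oleju, olejem',
];

const demoMap = new Map<string, string>();
for (const line of sampleLines) addLine(line, demoMap);

console.log(demoMap.get('cebulę')); // cebula
console.log(demoMap.get('oleju'));  // olej
console.log(demoMap.get('olej'));   // olać, because the verb line came first
```

The last lookup shows a side effect of "first mapping wins": a form shared by two words ends up pointing at whichever line the file lists first.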

Loading takes ~3 seconds, so the next step is to cache the map in memory, so that only the first call pays that cost:

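import * as fs from 'node:fs';
import * as path from 'node:path';
import { createInterface } from 'node:readline';
import { fileURLToPath } from 'node:url';

// __dirname does not exist in ES modules (the script is run with --esm), so derive it.
const moduleDir = path.dirname(fileURLToPath(import.meta.url));

let lemmaMap: Map<string, string> | null = null;

async function getLemmaMap(): Promise<Map<string, string>> {
  if (lemmaMap) return lemmaMap; // already loaded, skip the file read
  const map = new Map<string, string>();
  const rl = createInterface({ input: fs.createReadStream(path.join(moduleDir, 'odm.txt')) });
  for await (const line of rl) {
    const forms = line.trim().split(', ');
    if (forms.length < 2) continue;
    const lemma = forms[0];
    for (const form of forms) {
      if (!map.has(form)) map.set(form, lemma);
    }
  }
  lemmaMap = map; // publish the cache only once the map is complete
  return map;
}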

And we are almost there, but Polish has a couple more surprises for you... 🗡️

Rejecting bad lemmas

The dictionary covers all parts of speech, which causes false positives. For example, olej (oil) is also the imperative form of the verb olać (to ignore), so the dictionary maps it to olać. Similarly, sól (salt) maps to solić (to salt).

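A lemma is rejected if it is a verb infinitive (ends in ć) but the original word is not, which covers both sól → solić and olej → olać. Rejecting every lemma that is longer than the original word would also catch sól → solić, but it would throw away valid noun lemmas too (patelni → patelnia is one character longer), so length is not used as a criterion:

function lemmatize(word: string, map: Map<string, string>): string {
  const w = word.toLowerCase();
  const lemma = map.get(w);
  if (!lemma) return w; // unknown word: keep it as written
  // A verb infinitive offered as the lemma of a non-verb is a false positive.
  if (lemma.endsWith('ć') && !w.endsWith('ć')) return w;
  return lemma;
}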
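The rejection is easy to try out without loading the full dictionary. The sketch below uses a hand-built map; the entries in `tinyMap` are picked for illustration, and `pickLemma` is a stand-in name that restates only the infinitive check so the snippet runs on its own.

```typescript
// Hand-built stand-in for the real dictionary map (form -> lemma).
const tinyMap = new Map<string, string>([
  ['olej', 'olać'],         // noun "oil" shadowed by the verb "olać"
  ['sól', 'solić'],         // noun "salt" shadowed by the verb "solić"
  ['cebulę', 'cebula'],
  ['czosnkiem', 'czosnek'],
]);

function pickLemma(word: string, map: Map<string, string>): string {
  const w = word.toLowerCase();
  const lemma = map.get(w);
  if (!lemma) return w;
  // Reject a verb infinitive proposed for a word that is not an infinitive.
  if (lemma.endsWith('ć') && !w.endsWith('ć')) return w;
  return lemma;
}

for (const word of ['olej', 'sól', 'Cebulę', 'czosnkiem', 'mąka']) {
  console.log(`${word} -> ${pickLemma(word, tinyMap)}`);
}
// olej -> olej
// sól -> sól
// Cebulę -> cebula
// czosnkiem -> czosnek
// mąka -> mąka
```

Words missing from the map, like mąka here, simply pass through in lowercase.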

The parser

And then we need our own parser, which processes a Cooklang recipe with those Polish nuances in mind:

async function parseRecipe(content: string): Promise<ParsedRecipe> {
  const map = await getLemmaMap();
  const result: ParsedRecipe = { metadata: {}, ingredients: [], equipment: [], timers: [], steps: [] };
  const ingredientMap = new Map<string, Ingredient>();
  let currentStep = '';

  for (const line of content.split('\n')) {
    if (line.startsWith('>>')) {
      const m = line.match(/^>>\s*([^:]+):\s*(.+)$/);
      if (m) result.metadata[m[1].trim()] = m[2].trim();
      continue;
    }

    if (line.trim() === '') {
      if (currentStep.trim()) { result.steps.push(currentStep.trim()); currentStep = ''; }
      continue;
    }

    let m: RegExpExecArray | null;

    // Ingredients: @name{quantity}
    const ingRe = /@([^{]+)\{([^}]*)\}(?:\{([^}]*)\})?/g;
    while ((m = ingRe.exec(line)) !== null) {
      const key = lemmatize(m[1].trim(), map);
      if (!ingredientMap.has(key))
        ingredientMap.set(key, { name: key, quantity: m[2].trim(), description: m[3]?.trim() });
    }

    // Cookware: #name{}
    const eqRe = /#([^{]+)\{\}/g;
    while ((m = eqRe.exec(line)) !== null)
      result.equipment.push({ name: lemmatize(m[1].trim(), map) });

    // Timers: ~{duration}
    const timerRe = /~\{([^}]+)\}/g;
    while ((m = timerRe.exec(line)) !== null)
      result.timers.push({ duration: m[1].trim() });

    currentStep += line + ' ';
  }

  if (currentStep.trim()) result.steps.push(currentStep.trim());
  result.ingredients = Array.from(ingredientMap.values());
  return result;
}
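The parser refers to ParsedRecipe and Ingredient types that are not shown. A plausible sketch of their shapes, inferred only from how the parser uses them, could look like this; the names `Equipment`, `Timer` and `emptyRecipe` are assumptions made for the example.

```typescript
interface Ingredient {
  name: string;          // lemmatized ingredient name
  quantity: string;      // raw text from the first pair of braces, may be empty
  description?: string;  // text from the optional second pair of braces
}

interface Equipment {
  name: string;          // lemmatized cookware name
}

interface Timer {
  duration: string;      // raw text such as "20 min"
}

interface ParsedRecipe {
  metadata: Record<string, string>;
  ingredients: Ingredient[];
  equipment: Equipment[];
  timers: Timer[];
  steps: string[];
}

// Hypothetical helper: an empty recipe with the same shape the parser starts from.
function emptyRecipe(): ParsedRecipe {
  return { metadata: {}, ingredients: [], equipment: [], timers: [], steps: [] };
}

console.log(JSON.stringify(emptyRecipe()));
```

Keeping quantity as a raw string matches what the parser stores; turning it into a number and a unit would be a separate step.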

(yup, AI wrote that, but the rest of this post is, as always, handwritten)

Let's run it!

Now some file-reading boilerplate:

const content = fs.readFileSync(process.argv[2] ?? 'recipe.cook', 'utf-8');
const recipe = await parseRecipe(content);
console.log(JSON.stringify(recipe, null, 2));

Then let's create recipe.cook for tomato soup:

>> title: Zupa pomidorowa

Pokrój @pomidory{4 szt} i @cebulę{1 szt}.
Podsmaż @cebula{pokrojoną} z @czosnkiem{} na #patelni{}.
Dodaj @pomidor{pokrojony} i gotuj przez ~{20 min}.
Podawaj z @śmietaną{2 łyżki}.

Run it:

npx ts-node --esm parser.ts recipe.cook

And enjoy:

{
  "metadata": { "title": "Zupa pomidorowa" },
  "ingredients": [
    { "name": "pomidor", "quantity": "4 szt" },
    { "name": "cebula",  "quantity": "1 szt" },
    { "name": "czosnek", "quantity": "" },
    { "name": "śmietana","quantity": "2 łyżki" }
  ],
  "equipment": [{ "name": "patelnia" }],
  "timers":    [{ "duration": "20 min" }],
  "steps": [...]
}

@pomidory, @pomidor → one ingredient pomidor. @cebulę, @cebula → one ingredient cebula. @czosnkiem → czosnek. #patelni → patelnia.

Conclusions

As you can see, using Polish in Cooklang is doable, but unfortunately it required my own parser, which is OK for simple use cases but does not offer all the features of Cooklang (e.g. it lacks recipe scaling and combining multiple recipes). So it's not perfect, but for now... let's call it a day :)
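For the record, scaling is probably the easier of the two missing features. Below is a rough sketch under the assumption that quantities look like "4 szt" or "2,5 łyżki" (a number followed by a unit); `parseQuantity` and `scaleQuantity` are hypothetical helpers, not part of Cooklang or of the parser above.

```typescript
interface Quantity {
  amount: number;
  unit: string;
}

// Parse strings such as "4 szt" or "2,5 łyżki" into a number and a unit.
// Returns null when the text does not start with a number (e.g. "pokrojoną").
function parseQuantity(raw: string): Quantity | null {
  const m = raw.trim().match(/^(\d+(?:[.,]\d+)?)\s*(.*)$/);
  if (!m) return null;
  return { amount: Number(m[1].replace(',', '.')), unit: m[2].trim() };
}

// Multiply the numeric part by a factor; leave non-numeric quantities untouched.
function scaleQuantity(raw: string, factor: number): string {
  const q = parseQuantity(raw);
  if (!q) return raw;
  const scaled = Math.round(q.amount * factor * 100) / 100;
  return q.unit ? `${scaled} ${q.unit}` : String(scaled);
}

console.log(scaleQuantity('4 szt', 2));      // 8 szt
console.log(scaleQuantity('2 łyżki', 1.5));  // 3 łyżki
console.log(scaleQuantity('2,5 łyżki', 2));  // 5 łyżki
console.log(scaleQuantity('pokrojoną', 3));  // pokrojoną
```

Note the third result: correct Polish would be "5 łyżek", so even the units would need inflecting after scaling, which is a problem for another day.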