Category Archives: Programming

My passion for languages

For me, languages are everywhere:

  • When programming using C#
  • When I talk in sentences with someone
  • When I think about my past as a sequence of events
  • When I am calculating 1+2, the language of math
  • When I cook after a receipt, executing one action after the other

What is language?

So what is language? There are several aspects, I like to list:

  • Recognizing a sequence of perceptible units (reading a program, hearing a song, touching some texture)
  • Create and interact with a model of the sequence (evaluate 1+2, executing a program, think about physical objects)
  • Producing a sequence of perceptible units (talking with others, cooking)

One thing all these aspects have in common: the model or the sequence is listening to laws or rules which specify the underlying language. Some examples:

  • In math, 1+2 always evaluates to 3
  • In imperative programming languages, an if keyword is followed by a left round parenthesis, followed by a boolean expression
  • After each „Alice?!“, there must follow an „Who the fuck is Alice?“

So, not every sequence of perceptible units or so called sentence is part of a language. A language is defined by the set of sentences that are valid within it.

How hard is it to recognize a sentence?

But how to recognize, whether a sentence is in a language? In computer science, the problem is well-understood and can be split into four classes of language groups: the Chomsky hierarchy.

  • Regular expressions are the easiest group. They can be described with one rule that can reference itself and only grows to left or right. Example: the language of all LOLs (LOL, LOLOL, LOLOLOL, …) can be described as
    • T ::= „LOL“|T „OL“
  • Context-free languages have an arbitrary amount of rules, where the left hand-side of the rule is only a name that can be reused on the right side of all other rules. Like the language of all matching parentheses. Notice, that the rules grow in both directions, even if it is one rule!
    • P ::= „(“ P „)“ | P P | e
  • Context-sensitive and recursively enumerable languages are harder to recognize. They also have rules. These rules can have more than one symbol on the left side.

Programming languages are normally in the last two hard classes (worse than O(n^3) where n is the length of input), so language engineers use context-free and regular languages to recognize them and add logic programmatically to validate the resulting models to add the context-sensivity.

How to recognize these simple languages?

Language engineers use finite automata and push-down automata for this job. That is a very theoretical answer. My pragmatic self recommends parser generators like ANTLR for this job. The idea is, why wasting time with programming when you already know the rules? Write down the rules and derive the program!

There is a lot to write about language engineering. Text-to-model transformations are a small piece of the cake called languages. It is an important piece, but there is much more to discover. Remember that the hardest sentence is free of context, like 5+9=145. After recognizing the text you still have to bring in the missing context (the equation is wrong, because 5+9 is 14 and not 145).

There can be model rules like

  • Operator precedence
  • Implicit conversions to other types of models
  • Inherited behavior between models
  • Execution rules depending on the environment
  • Or or or…

I hope that I was able to show you what makes language engineering so interesting for me.

Extract all images from a PDF

If you would like to do that, just use my notebook here.

This notebook was a little helper for me, when I received a family tree from my cousin. I could not convince her to give me the raw images, so I extracted the from the tree PDF. All you need to do is opening a PDF file with the button below. The extraction will start immediately. Output is a zip file with all images in JPEG format.

It all happens on client side, so you need no app for this easy task.