Programming | create emotions

For me, languages are everywhere:

When programming using C#
When I talk in sentences with someone
When I think about my past as a sequence of events
When I am calculating 1+2, the language of math
When I cook after a receipt, executing one action after the other

What is language?

So what is language? There are several aspects, I like to list:

Recognizing a sequence of perceptible units (reading a program, hearing a song, touching some texture)
Create and interact with a model of the sequence (evaluate 1+2, executing a program, think about physical objects)
Producing a sequence of perceptible units (talking with others, cooking)

One thing all these aspects have in common: the model or the sequence is listening to laws or rules which specify the underlying language. Some examples:

In math, 1+2 always evaluates to 3
In imperative programming languages, an if keyword is followed by a left round parenthesis, followed by a boolean expression
After each „Alice?!“, there must follow an „Who the fuck is Alice?“

So, not every sequence of perceptible units or so called sentence is part of a language. A language is defined by the set of sentences that are valid within it.

How hard is it to recognize a sentence?

But how to recognize, whether a sentence is in a language? In computer science, the problem is well-understood and can be split into four classes of language groups: the Chomsky hierarchy.

Regular expressions are the easiest group. They can be described with one rule that can reference itself and only grows to left or right. Example: the language of all LOLs (LOL, LOLOL, LOLOLOL, …) can be described as
- T ::= „LOL“|T „OL“
Context-free languages have an arbitrary amount of rules, where the left hand-side of the rule is only a name that can be reused on the right side of all other rules. Like the language of all matching parentheses. Notice, that the rules grow in both directions, even if it is one rule!
- P ::= „(“ P „)“ | P P | e
Context-sensitive and recursively enumerable languages are harder to recognize. They also have rules. These rules can have more than one symbol on the left side.

Programming languages are normally in the last two hard classes (worse than O(n^3) where n is the length of input), so language engineers use context-free and regular languages to recognize them and add logic programmatically to validate the resulting models to add the context-sensivity.

How to recognize these simple languages?

Language engineers use finite automata and push-down automata for this job. That is a very theoretical answer. My pragmatic self recommends parser generators like ANTLR for this job. The idea is, why wasting time with programming when you already know the rules? Write down the rules and derive the program!

There is a lot to write about language engineering. Text-to-model transformations are a small piece of the cake called languages. It is an important piece, but there is much more to discover. Remember that the hardest sentence is free of context, like 5+9=145. After recognizing the text you still have to bring in the missing context (the equation is wrong, because 5+9 is 14 and not 145).

There can be model rules like

Operator precedence
Implicit conversions to other types of models
Inherited behavior between models
Execution rules depending on the environment
Or or or…

I hope that I was able to show you what makes language engineering so interesting for me.

create emotions

…and source code

Category Archives: Programming

My passion for languages

What is language?

How hard is it to recognize a sentence?

How to recognize these simple languages?

Extract all images from a PDF

Let‘s do something else