Dylan Carroll
Dylan Carroll

Markov-Chain Text Generator

June 2020


I had the idea for this text generator a few months before I decided to try to implement it. At the time, I wasn't thinking about it in terms of a Markov-chain, but I have since decided that that is the best way to describe the way it functions.

In the training stage, the program analyzes a text document with the training data. For each character in the data, the program looks at the previous 6 characters. For each combination of 6 characters present in the data, the program keeps a tally of the number of occurrences of the character that came next. Essentially, the program creates a probability distribution of all characters based on the previous 6 characters. The number 6 is arbitrary, and the value can be assigned at runtime by a command-line argument. I chose 6 because it resulted in the best speed for how coherent the output text was.

In the generation stage, the program takes 6 characters of text (or whatever the specified value is) as an input. It then repeatedly selects random character based on the probability that was calculated in the training stage and the last 6 characters that were generated.

I was surprised to find how coherent the generator was. It has no understanding of sentence structure, grammar, or meaning, but the overwhelming majority of words are real words and many neighboring pairs of words seem to make sense together. Small segments of each sentence feel like they will begin to make sense if I continue reading, but when I try to understand any meaning from the sentence as a whole, any sense of coherence beaks down, which makes the text confusing to read. The length of sentences seems to match that of the input data, which I assume is just randomly based on the average probability that any particular word has a period afterwards.

I generated text based on a few different sets of training data including The Bee Movie Script, The Book of Genesis, and Frankenstein. I have example text from the Bee Movie and Genesis on my website.