Code Explains: What IS a Markov Chain?
This year's anniversary post might be a little delayed, as I'm still working on what I need for it. In the meantime, here's something I wrote up earlier.
I've done several anniversary journals now where I include some output from a Markov Chain I have trained. As I've stated before, they're not GenAI. They're named after a Russian mathematician, Andrey Markov, who (at least according to Wikipedia) first wrote about them in 1906.
But I haven't said much about what they are, or how they work. So today I'm posting a primer on Markov Chains so you have a better idea of how and why they produce such semi-coherent absurdity.
What is a Markov Chain?
To put it very simply, a Markov Chain is a set of instructions for generating random output. The randomness is weighted so that you don't get complete nonsense, but what you get out of it is still purely down to random chance.
It is not smart. It does not have any understanding of meaning, context, or topic like an LLM does. It has zero intelligence, artificial or otherwise. It simply has a pre-defined set of states it can be in, and it moves between them completely randomly. Imagine a board game where you roll two dice to move, and for each possible result there is a specific square you get taken to from the one you're on. Where you end up is weighted – 7 is most likely, while 2/12 are least likely – but it's still just down to randomness. Now imagine that, but with thousands of squares and dice, and each square is connected to hundreds of other squares, and that's basically a Markov Chain.
As you might expect from a 120-year-old mathematical model, they are relatively simple compared to the stuff we see today. But that simplicity also means it only took me one afternoon to write a Python script (using the Markovify Python library) that trained a Markov Chain and could generate output from it, with a training process that only took a minute or so. I've been refining it a bit over the years, but the core of it hasn't changed much.
In the specific case of what I'm doing with these anniversary journals, I am using a Markov Chain to generate sentences of text (which is also what the Markovify library is specifically written to do). These sentences are essentially random, but the training data I provide (i.e., my stories) weights the probabilities of which words appear, and in what order, to generate sentences that kinda, sorta resemble English (because that's what my stories are written in; Markov Chains work for any language), in my style of writing.
Hopefully, anyway. RNG permitting.
Generation
Markov Chains function in two parts: training the model, and generating outputs from the model. I will cover generation first and then pull back the curtain on training.
So, we'll first pretend we simply have a model. What is the model? Essentially, it's a big list of two things: what state we are currently in, and what states we can move to, with what probability. For the purposes of generating text, the current state is the last word we generated (with a special state for the first word when we're starting a sentence). The list of states we can move to is a series of words, each with a probability of being picked next.
For instance, let's say our model has a 50% chance of starting a sentence with 'The', and a 50% chance of starting a sentence with 'But'. We flip our coin, it comes up heads, and we start with "The".
The model now says that, if you're on 'The', there's a 25% chance of 'next', a 25% chance of 'one', and a 50% chance of 'dragon'. We roll and it comes up on 'one', so our sentence is now 'The one', and we look up what we do when we're on 'one'.
Eventually we reach a state in our model that is considered sentence-ending, and when we hit that we consider the sentence completed and the generated text is outputted. So let's say that 'one' has a chance of being followed by 'day.' which is considered the end of a sentence, and it comes up when we roll for the next state. This gives us the sentence "The one day." which… is a sentence, if not a very interesting or coherent one.
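If it helps to see that loop as code, here's a minimal sketch in Python. The model is hand-written to match the coin-flip example above, and the extra words ('dragon', 'roared.' and so on) are just invented filler so the toy chain has somewhere to go – a real model would have thousands of entries.

```python
import random

# A toy state-size-1 model: current word -> possible next words and their weights.
# "__BEGIN__" is a made-up marker for the start-of-sentence state; words ending
# in a full stop are treated as sentence-ending states.
MODEL = {
    "__BEGIN__": {"The": 0.5, "But": 0.5},
    "The": {"next": 0.25, "one": 0.25, "dragon": 0.5},
    "next": {"day.": 1.0},
    "one": {"day.": 1.0},
    "dragon": {"roared.": 1.0},
    "But": {"then": 1.0},
    "then": {"it": 1.0},
    "it": {"stopped.": 1.0},
}

def generate_sentence(model):
    """Walk the chain from the start state, rolling weighted dice at each step."""
    words = []
    state = "__BEGIN__"
    while True:
        options = model[state]
        next_word = random.choices(list(options), weights=list(options.values()))[0]
        words.append(next_word)
        if next_word.endswith("."):   # reached a sentence-ending state
            return " ".join(words)
        state = next_word

print(generate_sentence(MODEL))   # e.g. "The one day." or "But then it stopped."
```

Run it a few times and the output changes, because every step really is just a weighted dice roll.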
But if we run through the model again, we'll get a different output. Our initial coinflip might start us off on 'But' instead, or perhaps we get a different third word and we get a sentence starting with 'The one and' which keeps going from there for several more words.
A real model, of course, is operating on a much greater scale. An English dictionary has about 170,000 words, and while the number of words that actually show up in common usage is far fewer than that, it's still going to be five digits. Any given word could be followed by hundreds or even thousands of other words with varying probabilities, but there's also context to consider – the words and ordering used in scientific journals would be vastly different from what you'd see in, say, erotic furry fiction. You know, just as a totally arbitrary example. ;)
So, how do you build a model that represents the type of text you want to generate? Well, you need a sample of that text. You can't write a scientific paper without having seen what they look like, and computer models are no different. Which brings us to…
Training
I've shown you how a model is used, but how do we get it? We need to build up that big list of words and probabilities somehow.
Let's start with a single sentence: "The fox goes murr." We don't care that 'murr' isn't a word in the English language, as Markov Chains are language-agnostic and don't care either. The upshot of this is that character names work just fine too. Anyway: we see that our word order is "The", "fox", "goes", "murr". Each word is a single token as far as the model is concerned.
The problem here is that this is just one sentence. If we build a model from this, every sentence we get will be "The fox goes murr" because that's all the input data we gave. So let's add a second sentence to our training data, "The lizard goes mlem". With these two sentences together our model is something like:
(Start of sentence) -> "The", 100% chance
"The" -> "lizard" or "fox", 50% chance each
"lizard" -> "goes", 100% chance
"fox" -> "goes", 100% chance
"goes" -> "murr" or "mlem", 50% chance each
With this model, it has a chance of generating either of the original sentences, but if the random values come up right, it can generate new sentences that were not in the original input: "The fox goes mlem" (I'm sure they can) or "The lizard goes murr" (I don't know about that one…)
Furthermore, this model will never generate sentences about wolves or birds or fish, because there are none of those in the training data – it can only generate words it's seen before. If we want to have sentences about other animals, we need to include them in the training data.
If I throw "The fox goes yip" into the training data, the additional 'fox' following 'The' now means there's a 2-in-3 chance of 'fox' and a 1-in-3 chance of 'lizard', so the probabilities are no longer split evenly. Which makes sense: we have two sentences about foxes and only one about lizards. If I kept adding more sentences about silly fox noises without anything else, the model would accordingly be built under the assumption that, generally, we want to generate sentences about foxes, but once in a blue moon we'll have a sentence about lizards instead.
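Here's a rough sketch of that counting process (not my actual script, just the idea): it builds the state-size-1 table from those three sentences and reproduces the 2-in-3 / 1-in-3 split.

```python
from collections import defaultdict

# The three training sentences from the example above.
sentences = [
    "The fox goes murr",
    "The lizard goes mlem",
    "The fox goes yip",
]

BEGIN = "__BEGIN__"   # made-up marker for the start-of-sentence state

# Count how often each word follows each state (state size 1: one previous word).
counts = defaultdict(lambda: defaultdict(int))
for sentence in sentences:
    previous = BEGIN
    for word in sentence.split():
        counts[previous][word] += 1
        previous = word

# Turn the raw counts into probabilities.
model = {
    state: {word: n / sum(followers.values()) for word, n in followers.items()}
    for state, followers in counts.items()
}

print(model[BEGIN])   # {'The': 1.0}
print(model["The"])   # {'fox': 0.666..., 'lizard': 0.333...}
print(model["goes"])  # {'murr': 0.333..., 'mlem': 0.333..., 'yip': 0.333...}
```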
This is simple enough for an explanatory example, but natural language is complex. So we need more training data – a LOT of data. Usually you would use a 'corpus' which is considered a representative sample of text; for instance, the Brown Corpus is a collection of 500 samples of text each over 2000 words, such as letters, news articles, scientific journals, and more. It's a little over a million words in size.
I, of course, crossed that million-word threshold quite some time ago (I'm almost at 1.7 million words as of October 2025), so MY corpus is built entirely, 100% from my own work – there is absolutely zero outside data in the model that I have trained. Not one word. The end result is a model that will generate sentences in the style of the author known as Codelizard… or at least, try to, because at the end of the day it's just rolling dice and randomly deciding which word to take next from the options presented in the model.
For example, if I crack open the model I trained for last year, then the first thing I notice… is that PyCharm insists on read-only mode because the model, a 24MB JSON file, is too large for it to edit. Whoops.
The second thing I notice at a glance is that "Fidget" has a weight of 128 for beginning a sentence, which means that in the training data I supplied, "Fidget" is the first word in 128 sentences. The chance of that actually happening in the output is going to be 128 in ((whatever the grand total sentence count is)).
Training is a separate step from generation – once a model is trained, it can be re-used multiple times. This is nice because training takes a lot more time, as it has to look through every file I provide it for training data and catalogue the chance of every word following every other word. Generation is just taking the pre-made model and rolling a bunch of dice, and is much faster.
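For the curious, the Markovify side of this is only a handful of lines. What follows is a simplified sketch rather than my actual script – the file paths are placeholders – but markovify.Text, to_json and from_json are the library calls it's built around.

```python
import glob
import markovify

# Training: read every story file and build one model from the combined text.
# ("stories/*.txt" and "model.json" are placeholder paths.)
corpus = ""
for path in glob.glob("stories/*.txt"):
    with open(path, encoding="utf-8") as f:
        corpus += f.read() + "\n"

model = markovify.Text(corpus, state_size=2)

# Save the trained model so generation can reuse it without retraining.
with open("model.json", "w", encoding="utf-8") as f:
    f.write(model.to_json())

# Generation: load the saved model and roll the dice.
with open("model.json", encoding="utf-8") as f:
    model = markovify.Text.from_json(f.read())

for _ in range(5):
    # make_sentence() can return None if it fails to produce anything
    # sufficiently original within its default number of tries.
    print(model.make_sentence())
```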
State Size
Now, the thing is, if we were to look at just the previous word, we would be very likely to get nonsense, like the "Code took a home should be granted" I got just now while generating some sentences with such a model.
We can get more coherent results by adjusting the 'state size', which represents how many previous words we look at when selecting the next word. This number is chosen at the time we train the model.
By default in Markovify, this is 2 – if we're up to "Code stared at", the model will look at what follows the pair of words "stared at" without caring that it's Code doing the staring. To do this the model must have an entry for that pair of words specifically, which is why the state size must be specified when training, not just generating.
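Under the hood, that just means the model's lookup keys are word pairs instead of single words. Something like this, where the follow-on words and counts are invented purely for illustration:

```python
# With state_size=2, the "current state" is the last TWO words, so the lookup
# key is a pair. These entries are invented examples, not real model data.
model_state_size_2 = {
    ("Code", "stared"): {"at": 19, "blankly": 3},
    ("stared", "at"): {"the": 41, "him": 12, "Fidget": 7},
}

# Choosing the word after "Code stared at" only consults the ("stared", "at") entry:
options = model_state_size_2[("stared", "at")]
```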
When I outlined above how generation works with the animal noises examples, I only used a state size of 1 for simplicity. It's good for teaching how Markov Chains work, but with a real data set you will get nonsensical (though highly varied) results.
For my actual generations, I normally use 2, but starting this year I'm going to try bumping the state size from 2 to 3. A higher state size will get you more coherent sentences, but it also makes it more likely for sentences from the training data to be re-generated verbatim: any given combination of words becomes rarer as it gets longer, so more and more states have only ever been followed by one word in the training data, leaving the chain no choice but to replay the original text. (This can still happen with smaller state sizes when very uncommon words get used, but it's less likely.)
For example, a state size of 3 considers the entirety of the "Code stared at" sentence fragment above, and thus would only generate another word based on things Code has previously stared at… unless it picks 'the' as the next word, in which case it'll be generating a word to come after 'stared at the' leading into any singular object that anyone has previously stared at in one of my stories.
The other downside of increasing the state size is that it also increases the size of the model because there are more potential states to be in. For comparison, my models made from the first 6 years of stories are:
- 5MB for a state size of 1
- 24MB for a state size of 2
- 32MB for a state size of 3
Improvements
That covers the basics, but there's more you can do with this.
First, when I'm generating a sentence, I can nudge it in a particular direction by providing one or more words to start the sentence, which I have done before by telling it to start with "Code" and seeing what nonsense the chain decides I should be doing this time. I could give it more than that, such as "Code grabbed", which, when I tested it, resulted in a surprisingly SFW spread of towels, remotes, and Zhen's arm being grabbed onto.
There's also nothing really stopping the generation from just going on indefinitely in an unreadable run-on sentence, which can and does happen, so I can provide a maximum sentence length to the generator and tell it to stop after a certain number of words.
Starting this year I have also added some pre-processing and post-processing steps. By default every word is a single token, so "Code" and "Code's" would actually be tracked separately and have different probabilities. One step I took is to split the possessive "'s" off from my name – and everywhere else it appears – making it its own token with its own set of probabilities, so that all names share the chances for what comes after them. Then in the output I remove the space between a name and the 's to make them a single word again. I did the same thing with non-terminal punctuation such as commas, colons and semicolons, which would otherwise be counted as part of the previous word. This adds more variety to the output. It doesn't distinguish between possessives and contractions, but due to how Markov Chains work, it will never slap an 's somewhere it shouldn't be, because 's can only follow words it has followed in the training data.
I also do a little cleanup to remove certain characters from the text entirely such as the < and > symbols (which I've used in some early stories to indicate written or telepathic messages) and my twenty-hyphen section breaks which aren't a sentence but would otherwise show up as one if I didn't strip them from the training data.
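Roughly, that pre- and post-processing looks something like the sketch below. The regular expressions are illustrative rather than copied from my script, but the idea is the same: split tokens apart before training, glue them back together after generating.

```python
import re

def preprocess(text):
    """Split possessive 's and non-terminal punctuation into their own tokens,
    and strip characters that shouldn't reach the training data at all."""
    text = text.replace("<", "").replace(">", "")   # old message markers
    text = re.sub(r"-{20,}", "", text)              # twenty-hyphen section breaks
    text = re.sub(r"'s\b", " 's", text)             # "Code's" -> "Code 's"
    text = re.sub(r"\s*([,;:])", r" \1", text)      # commas/colons/semicolons as tokens
    return text

def postprocess(sentence):
    """Re-attach the split tokens to the preceding word in the generated output."""
    sentence = re.sub(r" 's\b", "'s", sentence)
    sentence = re.sub(r" ([,;:])", r"\1", sentence)
    return sentence
```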
Finally, I can also turn 'strict' mode on/off when generating. With it on, it will only start a sentence with a word it has seen starting a sentence before; with it off, it will ignore that restriction. I typically leave this on when generating a whole sentence from nothing (as I'm more likely to get sentence fragments otherwise) and turn it off when I give it a starting word to turn into a sentence (to improve the variety of the output).
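In Markovify terms, those generation options map onto calls roughly like this (both functions are real library calls; my script just wraps them with extra logic like the pre/post-processing above):

```python
# Cap the length of generated sentences so run-ons get cut off.
print(model.make_short_sentence(max_chars=140))

# Nudge the chain by supplying the opening words. With strict=True the opening
# must be something the model has seen start a sentence; strict=False relaxes
# that and lets the opening come from mid-sentence occurrences as well.
# (This raises an error if the opening words never appear in the training data.)
print(model.make_sentence_with_start("Code grabbed", strict=False))
```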
If I wanted to go the extra mile, as part of the preprocessing I'd run the text through a stemmer to cut each word down to just its base form. This would make the output very grammatically incorrect, but would further reduce confusion by treating words like "stood", "stand", and "standing" as different forms of the same word rather than separate words. That's a little further than I'm going to go, though, since I don't want to make more effort for myself re-formatting the output and correcting its grammar any more than necessary.
Conclusion
The end result of this is a big ol' random text generator that only uses words I've used before and strings them together into an order that will usually look like something I'd write. More or less. If you've ever used (or been forced by an employer to use) an LLM, you can see the differences and limitations of this model.
As I've stated in previous journals, I do this for amusement purposes only, for my anniversary posts and for times I just want to chuckle by subjecting a character to the whims of RNG. I do not use it for writing new stories, especially since a good 80-90% of what's generated is nonsense, sentences regenerated from the original input, or boring 'normal' sentences – what you see in my anniversary posts is me cherry-picking the best of 100 or so generated sentences.
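The cherry-picking itself is nothing fancy – something along these lines, generating a batch and letting me pick out the funny ones by hand:

```python
# Generate a batch of candidates to choose from. make_sentence() returns None
# when it fails its originality checks, so those get filtered out.
candidates = [model.make_sentence() for _ in range(100)]
for sentence in filter(None, candidates):
    print(sentence)
```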
The sweet spot of a Markov generation is when it has a valid sentence except for one word out of place that makes it hilariously weird. It has zero understanding of narration or story and will mix and match characters from different stories with ease, bouncing between topics and subject matter. If you get a laugh out of the Markov Chain generations I put in my anniversary posts, then my mission is accomplished!