The GPT-2 language model is a wonderful bit of tech - a cutting-edge pretrained language model which performs surprisingly well on tasks given only a few examples. The model can be fine-tuned on a (small) dataset to produce material more closely resembling the given data, leading to hilarious projects such as poetry generation, a more accurate subreddit simulator, and AI-generated text adventures with AI Dungeon.
Ever since discovering GPT-2 I’ve wanted to find some interesting datasets to use it with. The best datasets tend to be roughly 1 to 10 megabytes in size, since there needs to be enough data to fine-tune the model reasonably well without overfitting. This limits us slightly in terms of what we can do - initially, I tried fine-tuning the model on some logs of group chats, but these weren’t quite big enough. The results produced were still hilarious, but overfitting became a problem if training went on for too long.
As it turns out, the King James Version (KJV) of the Bible is in the public domain, and can be downloaded in plain-text form from Project Gutenberg. It also happens to be just over 4 megabytes in size. Perfect!
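A minimal sketch of how the downloaded file could be prepared before fine-tuning. It assumes the Project Gutenberg file wraps the book between lines beginning with "*** START OF" and "*** END OF" (the exact marker wording varies between files), and the function names are hypothetical. It strips the licence text and checks the result against the rough 1 to 10 megabyte range mentioned earlier.

```python
# Assumed marker prefixes; individual Project Gutenberg files may word these differently.
START_MARKER = "*** START OF"
END_MARKER = "*** END OF"


def strip_gutenberg_boilerplate(text: str) -> str:
    """Return only the text between the START and END marker lines.

    If either marker is missing, the text is returned unchanged.
    """
    lines = text.splitlines()
    start = next((i for i, line in enumerate(lines) if line.startswith(START_MARKER)), None)
    end = next((i for i, line in enumerate(lines) if line.startswith(END_MARKER)), None)
    if start is None or end is None or end <= start:
        return text
    return "\n".join(lines[start + 1:end]).strip() + "\n"


def size_in_megabytes(text: str) -> float:
    """Size of the text when encoded as UTF-8, in megabytes (10^6 bytes)."""
    return len(text.encode("utf-8")) / 1_000_000


def in_target_range(text: str, low_mb: float = 1.0, high_mb: float = 10.0) -> bool:
    """Check the dataset against the rough 1-10 MB rule of thumb."""
    return low_mb <= size_in_megabytes(text) <= high_mb
```

Reading the file with open(path, encoding="utf-8") and passing the contents through strip_gutenberg_boilerplate before uploading keeps the licence text out of the training data, so the model is less likely to reproduce it in its samples.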
I used this Jupyter notebook on Google Colaboratory, which allowed me to fine-tune and run the model on Google’s GPUs - my laptop isn’t the most powerful thing, and fine-tuning GPT-2 would have been slow and may have even caused problems with overheating. Colaboratory’s GPUs are fast, and we’re given more than enough free GPU time to fine-tune and run the model.
Training the model on Colaboratory didn’t take long - the notebook has everything set up so you can download the base GPT-2 model in just a few clicks, and from there it didn’t take much work to mount Google Drive as a folder to load the dataset. Then we simply start training, and we’re off! The training script automatically generates samples every 100 epochs, so it doesn’t take long to see results, and you can quickly check for problems like overfitting.
I’ve hand-picked some of the funniest results - you can read the entire output here.
Even after the first 100 epochs the model had done an impressive job of matching the writing style of the KJV Bible - each “verse” it generated was on a new line, and it had even done a decent job of counting verse numbers!
10:19 Now there must needs be for you that to be a man: the beginning of all your being is the end.
10:20 You must need no clothes: I will teach you what I have seen: how your feet must need your heels.
10:21 In my name will I set the end of my glory upon you: it is mine; and I will give you peace: it is from my mind.
10:22 The LORD your God, the YHWH your God, hath heard my words: he is my salvation, saith the LORD.
9:31 You shall set on the place of the LORD your God.
9:32 They shall take my name among my names, saith the LORD, because I will make your name a name in the midst of the land, in your midst.
9:33 I will do as the LORD hath done in the wilderness, and as the LORD himself spake, and brought the people unto you: 9:34
9:35 In the midst of the people of Israel will I raise the dead; I will cleanse their land from them that eat slain, saith the LORD.
9:36 The desolate of the earth shall be my inheritance, saith the LORD; and the desolate shall I call to Zion.
9:37 I will give you that thou mayest possess the land; I will give you that you may have Zion, saith the LORD.
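The verse counting in the samples above can be checked mechanically. The following is a rough sketch rather than anything from the notebook: it assumes every verse starts a line with a chapter:verse reference, and the function names are made up for illustration. It pulls out those references and measures how often the verse number goes up by exactly one within the same chapter.

```python
import re
from typing import List, Tuple

# A line counts as a verse if it begins with "chapter:verse" followed by whitespace.
VERSE_RE = re.compile(r"^(\d+):(\d+)\s")


def verse_refs(sample: str) -> List[Tuple[int, int]]:
    """Collect (chapter, verse) pairs from the start of each line."""
    refs = []
    for line in sample.splitlines():
        match = VERSE_RE.match(line)
        if match:
            refs.append((int(match.group(1)), int(match.group(2))))
    return refs


def sequential_fraction(refs: List[Tuple[int, int]]) -> float:
    """Fraction of neighbouring references that stay in the same chapter
    and increase the verse number by exactly one."""
    if len(refs) < 2:
        return 0.0
    good = sum(
        1
        for (c1, v1), (c2, v2) in zip(refs, refs[1:])
        if c1 == c2 and v2 == v1 + 1
    )
    return good / (len(refs) - 1)
```

On the sample above this would count the jump from 10:22 to 9:31 as a miss, along with the 9:34 that ended up on the end of the previous line instead of starting its own.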
Devolving into Gibberish
At 200 epochs things start looking a bit more worrying - the style is still just fine, but the content doesn’t make as much sense.
10:19 Now the LORD hath laid many things of many things to be done; 12:25 But I will not write my mouth; for thy words are a stone; and of the stone is he cast: wherefore sayest thou that man shall not write his mouth?
12:26 Wherefore, let them that shall do the commandments of him that sent me speak in parables; 12:27 Wherefore if a man be righteous, having knowledge, he shall write his discourse with the pen.
12:28 The scripture is plain indeed from the beginning, and shall not be concealed from the people, neither shall its parables be hid from the eyes of them that seek after them.
13:1 So shall it seem to thee, when ye shall read the parables which I shall tell you, and shalt know that there is none other god but he that broughtest them, that in whom the heart is at rest, and the words which are written are not vain.
At 300 epochs it’s learned to insert its own line breaks within verses, but the content is equally questionable:
1:14 And the LORD said unto Moses, Behold, I will be merciful unto you when I send thee, and when thou goest down from the mountain.
1:15 And the children of Israel heard it, and they also stood to see, and saw.
1:16 But the LORD was with child, and could not bring himself to go down beyond the mount Sinai. Yet all the sons of Israel spake against him.
1:17 And they came unto Moses, and he went down to the valley of mount Sinai.
1:18 And it came to pass, as he went, that the people arose and went up from hence out of the house of the LORD.
1:19 And Moses said unto the children of Israel, Behold, the sons of Moses are faint, and yet they will bear thee the children of Israel.
At 400 epochs the model gets in a bit of a rut involving gold, silver, brass, and iron (possibly my favourite part of this experiment):
11:3 And the LORD GOD also did great works; and there was a great abundance of money: because that he had no more than to repair, he repaired to the land of his possession.
11:4 Thus were all such works done in the house of the LORD, as he had said; and the house of the LORD was filled with gold, and silver, and brass, and iron.
11:5 Because the LORD GOD had no choice, and could not give it him: so the LORD GOD gave his spirit to the house of the house of Abishag.
11:6 Thus were all his works done in the house of the LORD, and the house of the LORD was filled with gold, and silver, and brass, and iron.
11:7 In the house of the LORD was the house of Abishag clean: for he brought an abundance of gold and silver, and of brass, and of iron, and of brass, and of iron, and of brass, and of iron, and of brass; all that came out of Egypt, and went into Babylon.
11:8 As they came down in like manner, and entered their tabernacle at the door of the house of the LORD, so they entered into the city with the same abundance that came out of Egypt: 11:9 And the house of the LORD was filled with gold, and of silver, and of brass, and of iron; and the men of the house of the LORD continued until the coming of the kings of Babylon.
11:16 But he went out from him, and came to the house of the LORD in a little time when he had spoken, and brought forth a very great amount of gold and silver.
11:17 And the servant that was with him found a golden coin, a golden penny, and ten thousand pieces of silver.
11:18 The servant took the money, and cast it on the altar, with ten thousand pieces of silver: and it was finished before he went out of the house of the LORD.
11:19 He went into the house of the LORD in the night, and filled his tabernacle.
11:20 And the LORD God of hosts delivered him from the hand of the king, which had turned his kingdom into a people; and he rejoiced greatly.
11:21 And the servant came out of the tabernacle, and sat upon a high place: and the LORD God of hosts delivered him out of the hand of the king, which had turned his kingdom into a people.
11:22 And the servant went out, and came to his own house, and looked down upon his house: and there he found a strange token from God.
11:23 And he came and saw the great storehouse full of gold, two thousand and three hundred thousand and two thousand pieces; ten thousand and three thousand pieces.
11:24 And there was a silver coin, and two thousand pieces were found.
11:25 And he found out the name of the god that was served; and he said, Lord, I am he: and he was pleased to give him that coin, because it was of his own good will.
From these samples it’s clear how versatile this model is - with very little training it learned the writing style and subject of the source material. At a glance it could appear to be the original text! However, even after sufficient training the model still slips up in ways that betray the AI-generated nature of the text - we see it get stuck in loops, abruptly change topic, or in some cases just generate nonsense.
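The loops in particular are easy to spot automatically. Here is a small sketch of one way to do it: counting word n-grams that occur more than once in a sample. This is an illustrative heuristic, not something the training script does, and the function names and the default n-gram length of 3 are arbitrary choices.

```python
from collections import Counter
from typing import List, Tuple


def _ngrams(text: str, n: int) -> List[Tuple[str, ...]]:
    """Lower-case the text, split on whitespace, and return all word n-grams.
    Punctuation stays attached to the words."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def repeated_ngrams(text: str, n: int = 3, min_count: int = 2) -> List[Tuple[str, int]]:
    """List n-grams that occur at least min_count times, most frequent first."""
    counts = Counter(" ".join(gram) for gram in _ngrams(text, n))
    return [(gram, count) for gram, count in counts.most_common() if count >= min_count]


def repetition_ratio(text: str, n: int = 3) -> float:
    """Share of n-grams that repeat an earlier n-gram (0.0 means no repeats)."""
    grams = _ngrams(text, n)
    if not grams:
        return 0.0
    return 1 - len(set(grams)) / len(grams)
```

A verse like the "and of brass, and of iron" run at 400 epochs scores far higher on repetition_ratio than ordinary prose, so a threshold on this value could be used to flag samples where the model has fallen into a rut.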
I fine-tuned the version of GPT-2 with 345 million parameters, but a version has since been released with 1.5 billion parameters. I’m sure the larger model could do a better job of generating believable texts, but it would likely take a lot longer to fine-tune and run. The smaller model is a good compromise where interesting texts can be generated relatively cheaply.
OpenAI have recently announced the GPT-3 model, which was trained on an even larger dataset and has 175 billion parameters. However, this model has yet to be released to the public due to concerns around it being used to generate fake news and spam - plus it would be very expensive to run! I’m excited to see what projects come out of OpenAI’s use of this new model, as well as from hobbyists when the model is eventually released.
I’d also like to have a go at using GPT-2 on some other datasets - for instance, writing a web scraper for MediaWiki sites (such as Wikipedia and some smaller wiki sites) would allow us to generate fake wiki pages on the fly.
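As a starting point for that scraper, here is a hedged sketch that asks a MediaWiki site for the plain text of a page through its api.php endpoint. It assumes the wiki has the TextExtracts extension installed (Wikipedia does; smaller wikis may not), which is what provides prop=extracts. The function names and the User-Agent string are placeholders of my own.

```python
import json
from typing import List
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def extract_url(api_endpoint: str, title: str) -> str:
    """Build a query URL asking for the plain-text extract of one page."""
    params = {
        "action": "query",
        "prop": "extracts",     # provided by the TextExtracts extension (assumed installed)
        "explaintext": 1,       # plain text instead of HTML
        "format": "json",
        "formatversion": 2,     # makes "pages" a list rather than a dict keyed by page id
        "titles": title,
    }
    return api_endpoint + "?" + urlencode(params)


def parse_extracts(payload: dict) -> List[str]:
    """Pull the extract text out of a decoded API response, skipping missing pages."""
    pages = payload.get("query", {}).get("pages", [])
    return [page["extract"] for page in pages if page.get("extract")]


def fetch_plain_text(api_endpoint: str, title: str) -> List[str]:
    """Download and parse one page. Needs network access."""
    request = Request(
        extract_url(api_endpoint, title),
        headers={"User-Agent": "gpt2-dataset-sketch/0.1 (placeholder contact)"},
    )
    with urlopen(request) as response:
        return parse_extracts(json.load(response))
```

For Wikipedia the endpoint would be https://en.wikipedia.org/w/api.php. Looping fetch_plain_text over a list of titles and joining the results into one text file would give a dataset in the same shape as the Bible text used here.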
Beyond messing around with datasets to generate funny/interesting outputs, I think language models like this have a lot of potential in natural language processing and creating more realistic speech engines. I can’t wait to see what the future holds in this field.