HeaRNNthstone: Generating Hearthstone cards with an LSTM network. Part two
--
In this second part of the blog we will look at how to pre-process the textual data for modelling. The next part will focus on the actual deep learning stuff!
In part 1 we “spoke” to the Hearthstone API to get card information. What we got was a bunch of JSONs with card texts inside. Before we can do something cool with them using machine learning, we first need to pre-process these texts!
I thought the most interesting part of a Hearthstone card to generate would be its card text with the effect, as I think it offers the most room for creativity for our network ;) But first, let’s take care of the necessary LSTM text pre-processing steps!
When predicting with LSTMs, it is necessary to define the sequence length: it serves as the LSTM’s memory. Essentially, this is the number of characters we want to use when predicting the next one! This is the part that makes recurrent neural networks so powerful. I looked at the average text length in our data to make an educated guess:
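A minimal sketch of how that guess could look; card_texts (the list of card effect texts collected in part 1) is an assumed name, and I round the result to the 57 characters used later in this post:

import numpy as np

# card_texts is assumed to be the list of card effect texts collected in part 1.
average_length = np.mean([len(text) for text in card_texts])
print("Average card text length:", average_length)

# Use a fixed sequence length close to that average; 57 characters in this post.
SEQUENCE_LENGTH = 57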
Our generative model is a character-based one, so our input data is a huge list of characters. Let’s convert the list of sentences to a list of characters first. However, the actual input to the network needs to be numerical, because machines do not understand pure human language yet. So what we need is to index all possible characters in our dataset! We create a set of all characters (thereby deduplicating the list) and then build two dictionaries: one mapping each character to a numeric index and one the other way around. This way we can always switch between numerical and string representations. One very important point: we need to make sure that the character mapping is exactly the same every time we use the trained model! So we need to either sort our set or persist the mapping in some way (for example by saving the mappings to disk with pickle, which I did):
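A sketch of building and persisting those mappings, reusing card_texts from above; char_to_index, index_to_char and the pickle file name are illustrative, while all_text_unique_chars is the variable used further below:

import pickle

# Join all card texts into one long string of characters.
all_text = " ".join(card_texts)

# A set deduplicates the characters; sorting keeps the mapping reproducible across runs.
all_text_unique_chars = sorted(set(all_text))

# Two dictionaries so we can always switch between string and numerical representations.
char_to_index = {char: index for index, char in enumerate(all_text_unique_chars)}
index_to_char = {index: char for index, char in enumerate(all_text_unique_chars)}

# Persist the mappings so the trained model always sees the exact same indexing.
with open("char_mappings.pkl", "wb") as file:
    pickle.dump((char_to_index, index_to_char), file)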
I borrowed some of the following code samples and inspiration from this nice project on GitHub:
Make sure to check it out!
Now that we have all the needed mappings, let’s look at reshaping the data into the format the LSTM expects.
An LSTM expects input of the shape (batch_size, length_of_sequence, number_features). Here’s what each of these dimensions means:
batch_size: the number of sequences fed into the network in a single weight-update iteration, just as in a regular feedforward neural network
length_of_sequence: the number of unrolled network steps, i.e. the memory: how many previous steps the network looks at for each prediction. In our example, we want to predict a character given the 57 previous characters
number_features: the length of one featurized element. In the case of images it could be padded, standardized vectors of pixels. In the case of text it is the size of our vocabulary, because each input character is going to be represented as a one-hot vector over every char in our vocabulary.
We set the number of features to the total number of unique characters:
NUMBER_FEATURES = len(all_text_unique_chars)
Now we can initialise our X (input) and y (output) for the network by creating correctly shaped zero arrays, which we will fill iteratively. We find the batch size by dividing the length of our whole dataset by the length of one sequence. This way the batch size is equal to the number of sequences in our dataset. In practice this means that all of our data is contained in one batch during training:
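A sketch of that initialisation, reusing SEQUENCE_LENGTH, all_text and NUMBER_FEATURES from above (BATCH_SIZE is an assumed name):

import numpy as np

# Batch size = number of full sequences that fit in the corpus.
BATCH_SIZE = len(all_text) // SEQUENCE_LENGTH

# Correctly shaped zero arrays for the one-hot encoded inputs and shifted outputs.
X = np.zeros((BATCH_SIZE, SEQUENCE_LENGTH, NUMBER_FEATURES))
y = np.zeros((BATCH_SIZE, SEQUENCE_LENGTH, NUMBER_FEATURES))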
The idea is that we want to predict a sequence of the same length as our input, but shifted by one character. This is common when using LSTMs for prediction, as it makes shaping the data convenient. In effect we are predicting one character at a time. For an input sequence [c, a, r, r, o] we want the sequence [a, r, r, o, t] as output. This way, each char in the input has the following char as its label (c -> a, a -> r, r -> r, etc.).
The last, and biggest, part of the data pre-processing is actually creating the proper sequences for the LSTM. I will walk through the code step by step in the Python comments:
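Here is a sketch of how X and y can be filled, with the explanation in the Python comments (the loop structure below is my own reconstruction using the names introduced above, not necessarily the original code):

for sequence_index in range(BATCH_SIZE):
    # Take one window of SEQUENCE_LENGTH characters as the input sequence.
    start = sequence_index * SEQUENCE_LENGTH
    input_chars = all_text[start:start + SEQUENCE_LENGTH]

    # The target is the same window shifted one character to the right.
    target_chars = all_text[start + 1:start + SEQUENCE_LENGTH + 1]

    # One-hot encode every (input, target) character pair.
    # zip stops early if the shifted target runs past the end of the text.
    for char_position, (input_char, target_char) in enumerate(zip(input_chars, target_chars)):
        X[sequence_index, char_position, char_to_index[input_char]] = 1
        y[sequence_index, char_position, char_to_index[target_char]] = 1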
After doing this, X is a fully prepared 3D NumPy array: the first dimension corresponds to the number of sequences, each sequence holds an array for each of its characters, and each character is itself a one-hot encoded array.
Our data is now ready to be modelled! In the next part of this blog we will build the actual neural network and produce some very nice results with it!
Stay tuned and happy coding!
P.S. PART ONE: https://bit.ly/2O61vLn