Building serverless text prediction from training to deployment
What we'll be doing
We'll build a simple text prediction model trained on the book Pride and Prejudice. We'll then create an API that takes a prediction query and returns the three words most likely to come next. Finally, we'll deploy it to the cloud so it can be used in other projects.
Prerequisites
- Pipenv - for simplified dependency management
- The Nitric CLI
- (optional) Your choice of an AWS, GCP or Azure account
Getting started
We'll start by creating a new project for our API.
nitric new prediction-api py-starter-pipenv
Next, open the project in your editor of choice.
cd prediction-api
Make sure all dependencies are resolved using Pipenv:
pipenv install --dev
Exploring our data
We'll start by downloading the Pride and Prejudice text file from Project Gutenberg. This will form the basis of our training data and, as you'll find at the end, it gives the predictions a Jane Austen spin.
An important first step in training a model is exploring and pre-processing the training data. After spending some time looking through the text, we find that Project Gutenberg adds a header and a footer. As the book is three separate volumes, there are volume headers that also need to be removed. Along with these, we need to remove chapter headings, punctuation, and contractions, and convert all numbers to number words, e.g. 8 -> eight. This keeps the training data as consistent as possible, which makes the predictions more cohesive.
To start we can manually remove the header and footer. The header starts with "The Project Gutenberg eBook, Pride and Prejudice, by Jane Austen, Edited" and ends with "CHAPTER I.". The footer starts with "Transcriber's note:" and ends with "subscribe to our email newsletter to hear about new eBooks".
We can then either manually remove the section headers, or do it programmatically.
def remove_section_headers(lines: list[str]):
    section = False
    new_lines = []
    for line in lines:
        # Skip everything between an end-of-volume line and the next chapter heading
        if line.lower().startswith(("end of the second", "end of vol")):
            section = True
        elif line.lower().startswith("chapter") and section:
            section = False
        if not section:
            new_lines.append(line)
    return new_lines
Removing the chapters.
import re

def remove_chapters(data: str):
    return str(re.sub('(CHAPTER .+)', '', data))
Remove contractions.
def remove_contractions(data: str) -> str:
    return (data
        .replace("shan't", "shall not")
        .replace("here's", "here is")
        .replace("you'll", "you will")
        .replace("what's", "what is")
        .replace("don't", "do not")
        .replace("i'm", "i am")
        .replace("there's", "there is"))
Remove punctuation.
import string

def remove_punctuation(data: str) -> str:
    # Replace every punctuation character with a space
    return data.translate(str.maketrans(string.punctuation, ' ' * len(string.punctuation)))
Convert numbers to words using num2words. This means we have to install it.
pipenv install num2words
We can then write our convert numbers function.
from num2words import num2words

def convert_numbers(data: str) -> str:
    numberless_data = []
    for word in data.split():
        if str.isdigit(word):
            numberless_data.append(num2words(word))
        else:
            numberless_data.append(word)
    return " ".join(numberless_data)
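To see how these helpers combine, here is a quick check on a small made-up sample, applied in the same order as the pipeline below (the string is hypothetical; expected results are shown as comments):

sample = "Don't wait for 3 days."
sample = sample.lower()               # "don't wait for 3 days."
sample = remove_contractions(sample)  # "do not wait for 3 days."
sample = remove_punctuation(sample)   # "do not wait for 3 days "
sample = convert_numbers(sample)      # "do not wait for three days"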
Putting it all together we can get our cleaned data.
# Open text data and read it into an array
lines = []
with open("data.txt", "r") as file:
    for line in file:
        lines.append(line)

data = remove_section_headers(lines)
data = remove_chapters(" ".join(data))
data = data.lower()
data = remove_contractions(data)
data = remove_punctuation(data)
data = convert_numbers(data)

# Save the cleaned data in a new file
with open('clean_data.txt', 'w') as f:
    f.write(data)
Before we're done, we'll want to tokenize the data so it can be processed by the model. After the tokenizer is fit to the text, we'll save it so we can use it later. To tokenize the data, we'll use Keras' pre-processing module, which means installing Keras.
pipenv install keras
We can then create and fit the tokenizer to the text. We will initialize the Out of Vocabulary (OOV) token as <oov>.
import pickle
from keras.preprocessing.text import Tokenizer

# Tokenize the data and fit it to the text
tokenizer = Tokenizer(oov_token='<oov>')
tokenizer.fit_on_texts(data.split())

# Save tokenizer
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
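If you want to confirm the tokenizer looks sensible, you can inspect its word index (the exact values depend on your data; the numbers below are illustrative):

print(len(tokenizer.word_index))        # vocabulary size, several thousand words
print(tokenizer.word_index.get('the'))  # a very common word gets a low token number, e.g. 2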
Training the model
To train the model, we will use a Bidirectional Long Short-Term Memory recurrent neural network, or Bi-LSTM for short. This type of recurrent neural network is well suited to this problem as it can store both long-term and short-term state, which lets the network keep the context of the previous words in the sentence.
Start by loading the tokenizer from the pre-processing stage.
import pickle

with open('tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)
We can then create all the input sequences to train our model. This works by taking every run of 6 consecutive words in the text. First, add numpy as a dependency.
pipenv install numpy
Then we'll write the function to create the input sequences from the data.
import numpy as np
from keras.utils import pad_sequences

def create_input_sequences(data: list[str], n_gram_size=6):
    # Create n-gram input sequences based on an n-gram size of 6
    input_sequences = []
    token_list = tokenizer.texts_to_sequences([data])[0]

    # Sliding iteration which takes every 6 words in a row as an input sequence
    for i in range(1, len(token_list) - n_gram_size):
        n_gram_sequence = token_list[i:i + n_gram_size]
        input_sequences.append(n_gram_sequence)

    # Pad sequences to a uniform length
    max_sequence_len = max([len(x) for x in input_sequences])
    return np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')), max_sequence_len
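To sanity check the sliding window, you can inspect the result once data has been loaded (as in the final training script further down); the shape shown is illustrative:

input_sequences, max_sequence_len = create_input_sequences(data)
print(input_sequences.shape)  # (number_of_windows, 6)
print(max_sequence_len)       # 6, as every window is the same length
print(input_sequences[0])     # the first 6-token window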
We'll then split the input sequences into features and labels, and split those into training and testing data. This uses scikit-learn's train_test_split, so install it first with pipenv install scikit-learn.
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split

# Create the features and labels and split the data into training and testing
# Note: total_words is defined at module level when we load the data below
def create_training_data(input_sequences):
    # Create features and labels
    xs, labels = input_sequences[:, :-1], input_sequences[:, -1]
    ys = to_categorical(labels, num_classes=total_words)

    # Split data
    return train_test_split(xs, ys, test_size=0.1, shuffle=True)
The next part is compiling and fitting the model. We will use the X and y training data, along with the vocabulary size and maximum sequence length. We are using an Adam optimizer, a reduce-learning-rate-on-plateau callback, and a checkpoint callback that saves the best model.
# Create callbacks
checkpoint = ModelCheckpoint("model.h5", monitor='loss', verbose=1, save_best_only=True, mode='auto')
reduce = ReduceLROnPlateau(monitor='loss', factor=0.2, patience=3, min_lr=0.0001, verbose=1)

# Create optimiser
optimizer = Adam(learning_rate=0.01)
Then we will add layers to the sequential model.
# Create model
model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len - 1))
model.add(Bidirectional(LSTM(512)))
model.add(Dense(total_words, activation='softmax'))
model.summary()
Putting it all together, we compile the model and fit it to the training data.
from keras.models import Sequential
from keras.layers import LSTM, Dense, Embedding, Bidirectional
from keras.optimizers import Adam
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau

# Train the model
def train_model(X_train, y_train, total_words, max_sequence_len):
    # Create callbacks
    checkpoint = ModelCheckpoint("model.h5", monitor='loss', verbose=1, save_best_only=True, mode='auto')
    reduce = ReduceLROnPlateau(monitor='loss', factor=0.2, patience=3, min_lr=0.0001, verbose=1)

    # Create optimiser
    optimizer = Adam(learning_rate=0.01)

    # Create model
    model = Sequential()
    model.add(Embedding(total_words, 100, input_length=max_sequence_len - 1))
    model.add(Bidirectional(LSTM(512)))
    model.add(Dense(total_words, activation='softmax'))
    model.summary()

    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=20, batch_size=2000, callbacks=[checkpoint, reduce])
With all these functions defined, we can train our model on the cleaned data.
data = open('clean_data.txt', 'r').read().split(' ')

total_words = len(tokenizer.word_index) + 1

input_sequences, max_sequence_len = create_input_sequences(data)
X_train, X_test, y_train, y_test = create_training_data(input_sequences)

train_model(X_train, y_train, total_words, max_sequence_len)
The model checkpoint callback will save the model as model.h5. We will then be able to load the model when we create our API.
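The test split from create_training_data isn't used during training, so as an optional step you can evaluate the checkpointed model against it for a rough accuracy figure. This is just a sketch; model.h5 is the file written by the checkpoint callback above:

from keras.models import load_model

# Evaluate the best saved model on the held-out test data
best_model = load_model('model.h5')
loss, accuracy = best_model.evaluate(X_test, y_test, verbose=0)
print(f'test loss: {loss:.3f}, test accuracy: {accuracy:.3f}')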
Predicting text
Starting with the hello.py file, we will first load the model and tokenizer. This is done with dynamic imports to reduce the cold start time when it's deployed.
import pickle
import importlib

model = None
tokenizer = None

def load_tokenizer():
    global tokenizer
    if tokenizer is None:
        # Load the tokenizer
        with open('prediction/tokenizer.pickle', 'rb') as handle:
            tokenizer = pickle.load(handle)
    return tokenizer

def load_model():
    global model
    if model is None:
        models = importlib.import_module("keras.models")
        # Load the model
        model = models.load_model('prediction/model.h5')
    return model
Once the model is loaded, we can write a function to predict the three most likely next words. This uses the tokenizer to create the same token list that was used to train the model. We can then get a prediction over all possible words, which we reduce down to the top 3. We then find the actual words by looking up each token in the tokenizer's dictionary. The tokenizer word index is in the form { "word": token_num }, e.g. { "the": 1, "and": 2 }. The predictions we receive will be an array of token numbers.
import numpy as np

# Predict text based on a set of seed text
# Returns a list of 3 top choices for the next word
def predict_text(seed_text: str) -> list[str]:
    # Dynamically load the utils
    utils = importlib.import_module("keras.utils")

    # Convert the seed text into a token list using the same process as the previous tokenization
    token_list = load_tokenizer().texts_to_sequences([seed_text])[0]
    token_list = utils.pad_sequences([token_list], maxlen=5, padding='pre')

    # Make the prediction
    m = load_model()
    predict_x = m.predict(token_list, batch_size=500, verbose=0)

    # Find the top three words
    predict_x = np.argpartition(predict_x, -3, axis=1)[0][-3:]

    # Reverse the list so the most likely word is first
    predictions = list(predict_x)
    predictions.reverse()

    # Iterate over the predicted tokens, and find the matching word in the tokenizer dictionary
    output_words = []
    for prediction in predictions:
        for word, index in load_tokenizer().word_index.items():
            if prediction == index:
                output_words.append(word)
                break
    return output_words
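We can give the function a quick try before wiring up the API (the suggested words depend entirely on how your training run went; the output below is just an example):

print(predict_text("where should I"))
# e.g. ['have', 'think', 'say']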
Creating the API
Using the predictive text function, we can create our API. First we will make sure that the necessary modules are imported.
from nitric.resources import api
from nitric.application import Nitric
from nitric.context import HttpContext
We will then define the API and our first route.
mainApi = api("main")

@mainApi.get("/prediction")
async def create_prediction(ctx: HttpContext) -> None:
    pass

Nitric.run()
Within this function block we define the code that runs on each request. We will accept the prompt to predict from via the query parameters, which means requests are in the form: /prediction?prompt=where should I.
@mainApi.get("/prediction")async def create_prediction(ctx: HttpContext) -> None:prompt = ctx.req.query.get("prompt")if prompt is None:returnprompt = " ".join(prompt)Nitric.run()
With the user's prompt we can then run the prediction and return the result.
@mainApi.get("/prediction")async def create_prediction(ctx: HttpContext) -> None:...prompt = " ".join(prompt)prediction = predict_text(prompt)ctx.res.body = f"{prompt} {prediction}"Nitric.run()
That's all there is to it. To test the function locally, start the Nitric server.
nitric start
You can then make a request to the API using any HTTP client.
curl "http://localhost:4001/prediction?prompt=what%20should%20I"What should I ['have', 'think', 'say']
Deploy to the cloud
Set up your credentials and any other cloud-specific configuration:
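For example, since this guide deploys to GCP, you might authenticate with the gcloud CLI (AWS and Azure have the equivalent aws configure and az login flows):

gcloud auth application-default login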
Create your stack. This is an environment configuration file for the cloud provider your project will be deployed to.
nitric stack new
This project will run perfectly fine with the default memory configuration of 512 MB. However, to get near-instant predictions we will increase the memory to 1 GB. In the newly created stack file we want to add some config.
name: dev
provider: gcp
region: us-west2
project: gcp-project-123456
config:
  default:
    memory: 1024
You can then deploy using the following command:
nitric up
To undeploy run the following command:
nitric down