A TensorFlow Tutorial: Email Classification
This code/post was written in conjunction with Michael Capizzi. Sections of the original code on which this is based were written with Joe Meyer.
Update: November 2, 2017 - New script for raw text feature extraction read_corpus.py
Update: March 8, 2017 - Now supports TensorFlow 1.0
Quick Start
You can get the code and data discussed in this post (as well as presentation slides from the Tucson Data Science Meetup) by cloning the following repo:
Dependencies
Once you have the code and data, you can run a training session and get some output with the following:
Introduction
This tutorial is meant for those who want to get to know the Flow of TensorFlow. Ideally, you already know some of the Tensor of TensorFlow. That is, in this tutorial we aren’t going to go deep into any of the linear algebra, calculus, and statistics which are used in machine learning.
Don’t worry though, if you don’t have that background you should still be able to follow this tutorial. If you’re interested in learning more about the math, there are plenty of good places to get an introduction to the algorithms used in machine learning. This tutorial from Stanford University about artificial neural nets is especially good. We’re going to be using a simple logistic regression classifier here, but many of the concepts are the same.
Email Classification
To ground this tutorial in some real-world application, we decided to use a common beginner problem from Natural Language Processing (NLP): email classification. The idea is simple: given an email you’ve never seen before, determine whether that email is Spam or not (aka Ham).
For us humans, this is a pretty easy thing to do. If you open an email and see the words “Nigerian prince” or “weight-loss magic”, you don’t need to read the rest of the email because you already know it’s Spam.
While this task is easy for humans, it’s much harder to write a program that can correctly classify an email as Spam or Ham. You could collect a list of words you think are highly correlated with Spam emails, give that list to the computer, and tell the computer to check every email for those words. If the computer finds a word from the list in an email, then that email gets classified as Spam. If the computer did not find any of those words in an email, then the email gets classified as Ham.
Sadly, this simple approach doesn’t work well in practice. There are lots of Spam words you will miss, and some of the Spam words in your list will also occur in regular, Ham emails. Not only will this approach work poorly, it will take you a long time to compose a good list of Spam words by hand. So, why don’t we do something a little smarter by using machine learning? Instead of telling the program which words we think are important, let’s let the program learn which words are actually important.
To tackle this problem, we start with a collection of sample emails (i.e. a text corpus). In this corpus, each email has already been labeled as Spam or Ham. Since we are making use of these labels in the training phase, this is a supervised learning task. This is called supervised learning because we are (in a sense) supervising the program as it learns what Spam emails look like and what Ham emails look like.
During the training phase, we present these emails and their labels to the program. For each email, the program says whether it thought the email was Spam or Ham. After the program makes a prediction, we tell the program what the label of the email actually was. The program then changes its configuration so as to make a better prediction the next time around. This process is done iteratively until either the program can’t do any better or we get impatient and just tell the program to stop.
On to the Script
The beginning of our script starts with importing a few needed dependencies (Python packages and modules). If you want to see where these packages get used, just do a CTRL+F search for them in the script. If you want to learn what the packages are, just do a Google search for them.
Next, we have some code for importing the data for our Spam and Ham emails. For the sake of this tutorial, we have pre-processed the emails to be in an easy to work with format.
The X-matrices (i.e. feature matrices) have the shape (number of emails, number of words). Each row represents an email, and each column represents a word. Each cell in the matrix contains a non-negative integer (i.e. an element of \(\mathbb{N}\)) which is the count of how many times a given word occurred in a given email.
|            | great      | dog        | pill       | \(\cdots\) |
|------------|------------|------------|------------|------------|
| email_001  | 0          | 1          | 0          | \(\cdots\) |
| email_002  | 3          | 0          | 5          | \(\cdots\) |
| email_003  | 0          | 0          | 0          | \(\cdots\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) |
Similarly, we have a matrix which holds the labels for our data. In this case, the matrix has two columns, one for Spam and one for Ham. There can only be a 1 or a 0 in each cell, where 1 means that column is the correct label for the email. Like in the feature matrix, each row in the matrix represents an email.
|            | Spam       | Ham        |
|------------|------------|------------|
| email_001  | 0          | 1          |
| email_002  | 1          | 0          |
| email_003  | 1          | 0          |
| \(\vdots\) | \(\vdots\) | \(\vdots\) |
In these illustrations the matrices have row and column headers, but the actual matrices we feed into TensorFlow have none.
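To make that concrete, here is what those two matrices would look like as NumPy arrays for the three example emails above. This is a purely illustrative toy example with a three-word vocabulary, not the real data, and the names trainX/trainY are just placeholders:

```python
import numpy as np

# Word counts: one row per email, one column per vocabulary word.
trainX = np.array([[0, 1, 0],     # email_001
                   [3, 0, 5],     # email_002
                   [0, 0, 0]])    # email_003

# One-hot [Spam, Ham] labels: one row per email.
trainY = np.array([[0, 1],        # email_001 is Ham
                   [1, 0],        # email_002 is Spam
                   [1, 0]])       # email_003 is Spam
```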
The import_data() function first checks if the data directory “data/” exists in your current working directory or not. If it doesn’t exist, the code tries to extract the archived data from the file “data.tar.gz”, which is expected to be in your current working directory.
You need to have either the “data/” directory or the tarred file “data.tar.gz” in your working directory for the script to work.
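Here is a minimal sketch of what such an import_data() function might look like. The file names inside the archive and the tab-delimited .csv format are assumptions for illustration, not necessarily what the repository uses:

```python
import os
import tarfile
import numpy as np

def import_data():
    """Extract data.tar.gz if needed, then load the feature and label matrices."""
    if not os.path.exists("data/"):
        # The archive is expected to sit in the current working directory.
        with tarfile.open("data.tar.gz") as tar:
            tar.extractall()
    # File names below are illustrative; adjust them to match the archive.
    trainX = np.genfromtxt("data/trainX.csv", delimiter="\t")
    trainY = np.genfromtxt("data/trainY.csv", delimiter="\t")
    testX = np.genfromtxt("data/testX.csv", delimiter="\t")
    testY = np.genfromtxt("data/testY.csv", delimiter="\t")
    return trainX, trainY, testX, testY
```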
Now that we can load in the data, let’s move on to the code that sets our global parameters. These are values that are either (a) specific to the data set or (b) specific to the training procedure. Practically speaking, if you’re using the provided email data set, you will only be interested in adjusting the training session parameters.
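As a rough sketch, the data-specific and training-specific parameters might look something like this. The exact names and values are illustrative, not necessarily those in the repository:

```python
trainX, trainY, testX, testY = import_data()

# Data-set parameters: read straight off the matrices.
numFeatures = trainX.shape[1]   # one column per vocabulary word
numLabels = trainY.shape[1]     # two columns: Spam and Ham

# Training-session parameters: the knobs you would actually tune.
learningRate = 0.0008           # step size for gradient descent
numEpochs = 27000               # number of training iterations
```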
TensorFlow Placeholders
Next we have a block of code for defining our TensorFlow placeholders. These placeholders will hold our email data (both the features and label matrices), and help pass them along to different parts of the algorithm. You can think of placeholders as empty shells (i.e. empty tensors) into which we insert our data. As such, when we define the placeholders we need to give them shapes which correspond to the shape of our data.
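In code, the placeholders look roughly like this (variable names are illustrative; numFeatures and numLabels come from the parameter sketch above, and None leaves the number of emails unspecified):

```python
import tensorflow as tf

# Feature placeholder: one row per email, one column per word count.
X = tf.placeholder(tf.float32, [None, numFeatures])
# Label placeholder: one one-hot [Spam, Ham] row per email.
yGold = tf.placeholder(tf.float32, [None, numLabels])
```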
The way TensorFlow allows us to insert data into these placeholders is by “feeding” the placeholders the data via a “feed_dict”. If you do a CTRL+F search for “feed”, you will see this happening in the actual training step.
This is a nice feature of TensorFlow because we can create an algorithm which accepts data and knows something about the shape of the data while still being agnostic about the amount of data going in. This means that when we insert “batches” of data in training, we can easily adjust how many examples we train on in a single step without changing the entire algorithm.
In TensorFlow terms, we build the computation graph and then insert data as we wish. This keeps a clean division between the computations of our algorithm and the data we’re doing the computations on.
TensorFlow Variables
Next, we define some TensorFlow variables as our parameters. These variables will hold the weights and biases of our logistic regression and they will be continually updated during training. Unlike the immutable TensorFlow constants, TensorFlow variables can change their values within a session.
This is essential for any optimization algorithm to work (we will use gradient descent). These variables are the objects which define the structure of our regression model, and we can save them after they’ve been trained so we can reuse them later.
Variables, like all data in TensorFlow, are represented as tensors. However, unlike our placeholders above which are essentially empty shells waiting to be fed data, TensorFlow variables need to be initialized with values.
That’s why in the code below which initializes the variables, we use tf.random_normal to sample from a normal distribution and fill a tensor of the given shape with random values. Both our weights and bias term are initialized randomly and updated during training.
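A sketch of those variable definitions, with weights and bias drawn from a normal distribution (shapes follow the placeholder sketch above; the small standard deviation is an illustrative choice):

```python
# Weights: one value per (word, label) pair.
weights = tf.Variable(tf.random_normal([numFeatures, numLabels],
                                       mean=0.0, stddev=0.1,
                                       name="weights"))
# Bias: one value per label, broadcast across all emails.
bias = tf.Variable(tf.random_normal([1, numLabels],
                                    mean=0.0, stddev=0.1,
                                    name="bias"))
```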
TensorFlow Ops
Up until this point, we have defined the different tensors which will hold important information. Specifically, we have defined (1) the tensors which hold the email data (feature and label matrices) as well as (2) the tensors which hold our regression weights and biases.
Now we will switch gears to define the computations which will act upon those tensors. In TensorFlow terms, these computations are called operations (or “ops” for short). These ops will be the nodes in our computational graph. Ops take tensors as input and give back tensors as output.
In the illustration below, the green nodes represent our TensorFlow placeholders (holding features and labels), the blue nodes represent our TensorFlow variables (holding weights and biases), and the pink nodes represent our TensorFlow Ops (operations on our tensors).
We will start with the operations involved in the prediction phase (i.e. the logistic regression itself). First, we need to initialize our weights and biases with random values via a TensorFlow Op. TensorFlow has a special built-in Op for just this, since this is a step you will most likely have to perform every time you use TensorFlow. Like all our other Ops, this Initialization Op will become a node in our computational graph, and when we put the graph into a session, the Op will run and create the variables.
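In TensorFlow 1.x, that built-in Op is tf.global_variables_initializer():

```python
# Assigns the random starting values to every tf.Variable defined above
# once it is run inside a session.
init_OP = tf.global_variables_initializer()
```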
Next, we have our ops which define the logistic regression function. Logistic regression is typically thought of as a single equation, \(\mathbf{\hat{y}} = \mathrm{sig}(\mathbf{WX} + \mathbf{b})\), but for the sake of clarity we have broken it into its three main components: (1) a weights-times-features operation, (2) the addition of a bias term to the weighted features, and (3) the application of a sigmoid function. As such, you will find these components defined as three separate ops below.
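Split into three ops, the regression looks roughly like this (op names are illustrative):

```python
# (1) weights applied to the word counts
apply_weights_OP = tf.matmul(X, weights, name="apply_weights")
# (2) bias added to the weighted features
add_bias_OP = tf.add(apply_weights_OP, bias, name="add_bias")
# (3) sigmoid squashes the result into (0, 1) scores for Spam and Ham
activation_OP = tf.nn.sigmoid(add_bias_OP, name="activation")
```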
Next we define our cost operation (a squared-error cost). For this, we use TensorFlow’s built-in L2 loss function, \(\frac{1}{2} \sum_{i=1}^{N} (\mathbf{\hat{y}_i}-\mathbf{y_i})^2\).
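tf.nn.l2_loss computes exactly that half-sum-of-squares, so the cost op can be a single line:

```python
# Squared-error cost between the predictions and the gold labels.
cost_OP = tf.nn.l2_loss(activation_OP - yGold, name="squared_error_cost")
```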
Next, we define the operation that optimizes our cost function. We are using gradient descent, for which TensorFlow has a built-in op.
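A minimal version of that training op, using the learningRate parameter from the sketch above:

```python
# One training step: nudge the weights and bias downhill on the cost surface.
training_OP = tf.train.GradientDescentOptimizer(learningRate).minimize(cost_OP)
```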
At this point, we have defined everything we need to put our data into a computational graph and put the computational graph into a TensorFlow session to start training.
Visualizations with Matplotlib
The following code block is not central to TensorFlow, but we wanted to include a graph of our training progress which updates in real time.
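A bare-bones version of such a live plot might look like this; it assumes matplotlib is installed, and the real script’s plotting code may differ:

```python
import matplotlib.pyplot as plt

plt.ion()                      # interactive mode: figure updates without blocking
fig, ax = plt.subplots()
ax.set_xlabel("epoch")
ax.set_ylabel("cost")
epochs, costs = [], []

def update_plot(epoch, cost):
    """Append the newest cost value and redraw the figure."""
    epochs.append(epoch)
    costs.append(cost)
    ax.plot(epochs, costs, "b-")
    plt.pause(0.001)           # give the GUI a moment to redraw
```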
TensorFlow Sessions
Now that we’ve done all of the important definitions of our tensors and ops, we need to send them to some hardware to run them. This is where the TensorFlow Session comes in handy. In the words of the official documentation:
To compute anything, a graph must be launched in a Session. A Session places the graph ops onto Devices, such as CPUs or GPUs, and provides methods to execute them.
Now, let’s create a TensorFlow session and do some training!
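Pulling the pieces from the earlier sketches together, a condensed version of the session and training loop looks roughly like this (epoch count and print frequency are illustrative):

```python
sess = tf.Session()
sess.run(init_OP)   # run the Initialization Op before anything else

for epoch in range(numEpochs):
    # Feed the whole training set into the placeholders for one update step.
    _, current_cost = sess.run([training_OP, cost_OP],
                               feed_dict={X: trainX, yGold: trainY})
    if epoch % 100 == 0:
        print("epoch %d: cost %f" % (epoch, current_cost))
```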
Now that we have a trained logistic regression model for email classification, let’s save it (i.e. save the weights and biases) so that we can use it again.
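Saving the trained weights and biases takes a couple of lines with tf.train.Saver (the checkpoint file name below is illustrative):

```python
saver = tf.train.Saver()
# Writes the current values of all tf.Variables to a checkpoint file.
saver.save(sess, "trained_variables.ckpt")
```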
Resources
TensorBoard Instructions
Using TensorBoard with this data (see the sketch after these steps for how the log directory gets written):
- run: tensorboard --logdir=/path/to/log-directory
- open your browser to http://localhost:6006/
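For TensorBoard to have anything to show, the script needs to write summaries into that log directory during training. A minimal sketch (the directory name and the choice of summaries are illustrative):

```python
# Create a writer that also records the computational graph.
writer = tf.summary.FileWriter("summary_logs", sess.graph)
# Track the cost value as a scalar summary.
cost_summary_OP = tf.summary.scalar("cost", cost_OP)
# Inside the training loop you would evaluate cost_summary_OP and call:
#   writer.add_summary(summary_result, epoch)
```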
External Resources
- A very good step-by-step learning guide.
- A graph visualization tutorial.
- Scikit-Learn style wrapper for TensorFlow.
- Another TensorFlow tutorial.
- Some more TensorFlow examples.
- Official resources recommended by TensorFlow.