Build a Natural Language Processor and Keep Things Sorted



Our inbound parse webhook makes it easy to start receiving emails into your application. Once you dig in you can start creating some really awesome internal tools to deal with what’s being sent to you.

One of the types of emails you might be receiving is from your customers, and if you have a lot of customers this is probably going to amount to a huge pile of emails that may or may not get addressed in a timely manner, especially if you have no way of knowing which emails you should be prioritising.

But what if you did know?

In this article I’m going to introduce you to the basics of Natural Language Processing, and more specifically how to perform sentiment analysis on inbound emails so they can be better prioritised, which helps not only your users but you as well.

Sure, there are services out there that can do this for you, and plenty of customer satisfaction services to keep track of your users’ sentiment towards your product. I thought it would be fun to try building my own without having to start completely from scratch, so I’m sharing that process with you. Let’s get into it.

What is Natural Language Processing?

“Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages”, is what Wikipedia has to say on the subject. I would say that NLP is best described simply as “making computers understand what human beings are talking about”.

You’ve probably interacted with NLP before. It’s used to power those automated assistants that sometimes pop up on websites asking if you need help with your purchase. Have you ever felt they didn’t quite understand what you were asking?

The key term I want to focus on from the Wikipedia definition is ‘artificial intelligence’. In order to process natural language, you need to provide the system with some idea of what it’s supposed to be looking for. This is often called ‘training data’ and it’s at the core of all NLP. Your training data, when collated into one place, is called a ‘model’.

Before we get to that point, I’ll give you a high level view of what we’re going to be building.

Bits ‘n’ bobs of NLP

Sketch of the NLP system we're building

Our system is made up of the following:

  1. Training App – A simple way of collecting our initial training data.
  2. Training Data Store – Any old DB will do (I’m using MongoDB).
  3. Google Prediction API – The magic. Handling our training data and giving us an API to query against.
  4. Your App – The service that receives inbound emails and needs to categorise them.

Pretty simple. Even simpler thanks to Google’s Prediction API, which saves us the time of building this whole thing from scratch: it will take our training data, learn from it, and provide us with an API to query when we want to categorise those inbound emails.

Putting on the training wheels

Our app is going to categorise emails as one of the following:

  • Positive
  • Negative
  • Indifferent
  • Both positive and negative

In order to properly train the model we need to gather lots of examples of emails that fall into the above categories. You could go through your inbox and do this, but it would take ages, so it’s best to gather examples from many sources. To do that, we’ll need a simple app that looks something like this:

MoarData – the training data collection app

It’s a single page app that stores a lump of text and a category association in a database. I’ve created a simple version that you can easily deploy to Heroku; you can find it on GitHub.

Once you have this up and running, send it around to as many people as possible and get them to enter a piece of text from an email and select a category they think that text falls into. Congratulations, you’re building a training data set!
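To make that concrete, here’s a minimal sketch of the storage side of such an app, assuming Flask and PyMongo rather than whatever the GitHub version uses; the route, field names and category strings are all illustrative.

# A minimal sketch of the training app's storage endpoint.
# Assumes Flask and PyMongo; the route and field names are illustrative.
from flask import Flask, request, jsonify
from pymongo import MongoClient

app = Flask(__name__)
db = MongoClient("mongodb://localhost:27017")["training"]

CATEGORIES = {"positive", "negative", "indifferent", "both positive and negative"}

@app.route("/samples", methods=["POST"])
def add_sample():
    text = request.form.get("text", "").strip()
    category = request.form.get("category", "").strip().lower()
    if not text or category not in CATEGORIES:
        return jsonify(error="text and a valid category are required"), 400
    # Each training example is just a category label plus the raw email text.
    db.samples.insert_one({"text": text, "category": category})
    return jsonify(ok=True), 201

if __name__ == "__main__":
    app.run(port=5000)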

How much data will you need? How long is a piece of string? The answer is as much as possible. I’ve heard it said that 3–5MB of data is a good start, but any amount of data will get you going. The more data you have, the better your predictions will be.

It’s time to learn

Next, we need to upload our training data so the Google Prediction API can access it. Google requires you to export a CSV file of your data and put it on their Cloud Storage service.
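The Prediction API expects each row to carry the label in the first column and the text in the second, with string values double-quoted and no header row. A sketch of an export script, assuming the MongoDB collection from the training app sketch above, might look like this:

# A sketch of exporting the collected samples to the CSV layout the
# Prediction API expects: label first, text second, no header row.
# Assumes the "samples" collection from the training app sketch above.
import csv
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["training"]

with open("training-data.csv", "w", newline="") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    for sample in db.samples.find():
        # Produces rows like: "negative","I still haven't heard back about my refund..."
        writer.writerow([sample["category"], sample["text"]])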

Ensure you’ve completed the following to get set up on the Google side:

  1. Log in to your API Console.
  2. Turn on Google Cloud Storage and the Prediction API (there may be some small associated costs).

Next, upload your training data CSV file to Cloud Storage. You can do this via the browser by creating a new bucket and uploading the file directly.
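If you’d rather not use the browser, the same thing can be done from the command line with the gsutil tool (the bucket name is a placeholder):

$ gsutil mb gs://{YOUR_BUCKET_NAME}
$ gsutil cp training-data.csv gs://{YOUR_BUCKET_NAME}/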

Training data in the Cloud Storage bucket

Now we need to tell the Prediction API where our training data is so it can pick it up and train a new model for us to query against. To do that, open the Prediction API explorer and complete the following steps:

  1. Enable “Authorize requests using OAuth 2.0.”
  2. Select the trainedmodels.insert method.
  3. Enter your project number (Google shows you where to get this).
  4. In the request body, choose the predefined elements ‘id’ and ‘storageDataLocation’ and fill them out as shown below.

The trainedmodels.insert request body filled out in the API explorer
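If you prefer to skip the explorer entirely, the equivalent trainedmodels.insert request can be made directly with curl; everything in curly braces below is a placeholder, and the request needs an OAuth 2.0 access token:

$ curl -X POST "https://www.googleapis.com/prediction/v1.6/projects/{YOUR_PROJECT_NUMBER}/trainedmodels" \
    -H "Authorization: Bearer {YOUR_OAUTH_TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{"id": "{YOUR_DATA_SET_ID}", "storageDataLocation": "{YOUR_BUCKET_NAME}/training-data.csv"}'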

After clicking ‘Execute’ your model will begin training. This will be pretty instantaneous if you’re using a small amount of training data. You can check on the status of your training by running the prediction.trainedmodels.get method from the API explorer, or by running the following curl command; when the trainingStatus field in the response reads DONE, the model is ready:

$ curl -X GET "https://www.googleapis.com/prediction/v1.6/projects/{YOUR_PROJECT_NUMBER}/trainedmodels/{YOUR_DATA_SET_ID}?key={YOUR_API_KEY}"

Now you have a fully trained model to query against! Go and have a coffee and a snack.

Answers for your questions

Here’s where we’re at:

- 1 x Data gathering application
- 1 x Store of training data on Google’s Cloud Storage
- 1 x Trained model running on Google’s Prediction API

The only thing left to do is to link up the emails we get through the inbound parse webhook with the Prediction API so they can be properly categorised. The example below shows how that can be achieved.
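Here’s a minimal sketch of that glue code, assuming Flask and the google-api-python-client rather than whatever the original gist used; the route, project number and model id are placeholders, and classifyEmail() is just an illustrative helper.

# A sketch of wiring the inbound parse webhook to the Prediction API.
# Assumes Flask, google-auth and google-api-python-client are installed.
from flask import Flask, request
import google.auth
from googleapiclient.discovery import build

app = Flask(__name__)

PROJECT_NUMBER = "YOUR_PROJECT_NUMBER"   # placeholder
MODEL_ID = "YOUR_DATA_SET_ID"            # the id used in trainedmodels.insert

credentials, _ = google.auth.default()
prediction = build("prediction", "v1.6", credentials=credentials)

def classifyEmail(text):
    # Ask the trained model to classify a single piece of text.
    body = {"input": {"csvInstance": [text]}}
    return prediction.trainedmodels().predict(
        project=PROJECT_NUMBER, id=MODEL_ID, body=body).execute()

@app.route("/inbound", methods=["POST"])
def inbound():
    # The inbound parse webhook POSTs the email as form fields; the
    # plain text body arrives in the "text" field.
    email_text = request.form.get("text", "")
    result = classifyEmail(email_text)
    # outputLabel is the Prediction API's best guess at the category.
    print(result.get("outputLabel"))
    return "", 200

if __name__ == "__main__":
    app.run(port=3000)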

 

Above we do the following:

  1. Specify an endpoint to receive emails on.
  2. Grab the body text of the email and pass it to the Google Prediction API via our classifyEmail() function.
  3. Classify the email and print the Prediction API’s best guess (outputLabel) to the console.

Of course you would want to do something better than just printing out the result to the console. You should store the result alongside the email in the database so emails can be sorted negative first, should those be the ones you deem most important to deal with.
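Continuing with the PyMongo setup assumed in the earlier sketches, storing and sorting might look something like this (the collection and field names are illustrative):

# Persist the predicted label next to the email, then fetch negative-first.
# Continues the PyMongo setup assumed in the earlier sketches.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["training"]

# Negative emails float to the top of the queue; adjust the ordering to taste.
PRIORITY = {"negative": 0, "both positive and negative": 1, "indifferent": 2, "positive": 3}

def store_classified_email(sender, text, output_label):
    db.emails.insert_one({"from": sender, "text": text, "sentiment": output_label})

def emails_by_urgency():
    return sorted(db.emails.find(), key=lambda e: PRIORITY.get(e.get("sentiment"), 2))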

How to continue from here

Thanks to the Google Prediction API, most of the heavy lifting of NLP has been spared, allowing us to concentrate on achieving results quickly. I was able to get this up and running from their API console and curl in a couple of hours (Note to Google: nice API console, terrible user experience jumping between the Cloud Storage, Prediction and Billing stuff you need to do to get this working!).

Good NLP requires nurturing. If you refer to the sketch of the system diagram above you’ll see that there’s an arrow from ‘Your App’ back to the ‘Training App’. What I’m illustrating here is that you should constantly build up your training data set with real world, accurate results that are as domain specific as possible. If you do this, you will only see better classification of emails as time progresses.

Finally, NLP and sentiment analysis are huge topics, and it wouldn’t have been possible to cover everything involved in this post without it becoming somewhat novel-like. This has been a high level view, showing you how you can add the power of prediction via training data to your application using readily available APIs and without a huge amount of knowledge of the field.

Go forth and experiment. I am tremendously excited about the possibilities that APIs like Google’s Prediction API open up and I can’t wait to see what people start doing with them.



Martyn Davies is a Developer Evangelist at SendGrid and a creative developer based in London. He has worked in technology for over 14 years with a background in both the music industry and technology. A serial hackathon organiser, mentor and startup advisor, you’ll find him presenting, demoing, hacking and chatting at hack days, conferences and meetups in the UK & Europe on a regular basis.
