Sentiment analysis example using FastText

Have you ever wondered how, just after posting a status about a hotel, mentioning a page in a comment, or recommending a product to a friend on Messenger, Facebook starts showing you ads about it? Well, I can assure you there's no magic behind it: it's just Facebook (and many other companies) using AI to understand the enormous amount of text data it's collecting to serve its users better!

There's no doubt that we're currently living in a world where the most valuable resource is no longer oil, but data! Understanding, classifying and representing this data is the challenge that Natural Language Processing (NLP) tries to solve.
In this article, we will take a look at FastText, Facebook's open source library for fast text representation and classification.

Twitter Sentiment Analysis using FastText

One of the most common applications of NLP is sentiment analysis, where thousands of text documents can be processed for sentiment in seconds, compared to the hours it would take a team of people to complete the same task manually.

Throughout this article we're going to use the sentiment140 dataset, which contains 1,600,000 tweets extracted using the Twitter API. The extracted tweets have been annotated with one of 2 values: 0 = negative and 4 = positive. Our objective will be to classify a given tweet (text) as positive or negative using fastText.

Installing FastText

FastText is a library created by the Facebook Research Team for efficient learning of word representations and sentence classification. I've used the latest stable release available at the time of writing this post, which is 0.2.0.
The README.md file contains the steps required to build fastText, which I also describe below:

$ wget https://github.com/facebookresearch/fastText/archive/v0.2.0.zip
$ unzip v0.2.0.zip
$ cd fastText-0.2.0
$ make

This will generate object files for all the classes and the main binary fasttext that we're going to use throughout this post.

Tweets Preprocessing and Cleaning

Our dataset is basically a csv file of 1,600,000 tweets, composed of 6 fields:

  • target: the polarity of the tweet (0 = negative, 4 = positive)
  • ids: The id of the tweet
  • date: the date of the tweet
  • flag: The query. If there is no query, then this value is NO_QUERY.
  • user: the user that tweeted
  • text: the text of the tweet
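As a quick illustration of this layout, here is a minimal sketch that parses one row with Python's csv module. The row itself is made up for the example (it is not taken from the dataset), and I'm assuming the file has no header line, which is why the field names are passed explicitly.

```python
import csv
import io

# Field names in the order they appear in each row (assumed: no header line)
FIELDS = ['target', 'id', 'date', 'flag', 'user', 'text']

# A made-up row in the same six-column layout, for illustration only
sample_csv = '"0","1","some date","NO_QUERY","some_user","just a made-up tweet, with a comma"\n'

# DictReader maps each column to its field name; quoted commas stay inside the text
rows = list(csv.DictReader(io.StringIO(sample_csv), fieldnames=FIELDS))
print(rows[0]['target'], rows[0]['text'])
```

Passing fieldnames explicitly means the first line of the file is treated as data rather than as a header.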

Obviously we don't need all these fields, so we're going to keep only the target and the text. We're also going to prefix each target with __label__ (giving, for example, __label__0), because this prefix is how fastText tells a label apart from a word. The last step is to clean the text of unneeded elements: links, usernames, hashtag markers ...

Furthermore, we need to split the data into train and validation. We will use the validation set to evaluate how good the learned classifier is on new data.

All of these requirements are implemented in the script below:

import csv
import re

# Open the input CSV and both output files; the with block closes them on exit.
# The outputs are written as UTF-8, which is the encoding fastText expects.
with open('sentiment140.1600000.csv', mode='r', encoding='ISO-8859-1') as csv_file, \
        open('tweets.train', 'w', encoding='utf-8') as train, \
        open('tweets.valid', 'w', encoding='utf-8') as valid:
    csv_reader = csv.DictReader(csv_file, fieldnames=['target', 'id', 'date', 'flag', 'user', 'text'])
    for line, row in enumerate(csv_reader, start=1):
        # Clean the tweet text
        # First we lower case the text
        text = row["text"].lower()
        # Remove links
        text = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', '', text)
        # Remove usernames
        text = re.sub(r'@[^\s]+', '', text)
        # Replace hashtags by just the word
        text = re.sub(r'#([^\s]+)', r'\1', text)
        # Collapse multiple white spaces into a single white space
        text = re.sub(r'[\s]+', ' ', text)
        # Additional clean up: remove words of 3 characters or fewer (and the
        # non-word characters right before them), then trim both ends
        text = re.sub(r'\W*\b\w{1,3}\b', '', text)
        text = text.strip()
        # Split data into train and validation: every 16th tweet goes to validation
        if line % 16 == 0:
            print(f'__label__{row["target"]} {text}', file=valid)
        else:
            print(f'__label__{row["target"]} {text}', file=train)
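To see what these cleaning steps actually do, here is a small sketch that applies the same regular expressions to a couple of made-up tweets. The function names clean_tweet and to_fasttext_line are my own, introduced only for this illustration.

```python
import re

def clean_tweet(text):
    """Apply the same cleaning steps as the script, in the same order."""
    text = text.lower()
    text = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', '', text)  # links
    text = re.sub(r'@[^\s]+', '', text)                            # usernames
    text = re.sub(r'#([^\s]+)', r'\1', text)                       # '#word' -> 'word'
    text = re.sub(r'[\s]+', ' ', text)                             # collapse whitespace
    text = re.sub(r'\W*\b\w{1,3}\b', '', text)                     # words of 1 to 3 chars
    return text.strip()

def to_fasttext_line(target, text):
    """Build one training line: the label prefix, then the cleaned text."""
    return f'__label__{target} {clean_tweet(text)}'

# Made-up tweets, for illustration only
print(to_fasttext_line(4, '@bob Loving the #sunshine today http://t.co/abc'))
# prints: __label__4 loving sunshine today
print(clean_tweet('not happy at all'))
# prints: happy
```

The second example shows a side effect worth keeping in mind: the short-word rule also drops negations such as "not", which may cost some accuracy on a sentiment task.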

Training our classifier

First of all, we need to train our first classifier! For that, we simply run the following command:

$ ./fasttext supervised -input tweets.train -output model_tweet

Read 13M words
Number of words:  578422
Number of labels: 2
Progress: 100.0% words/sec/thread:  472970 lr:  0.000000 loss:  0.288137 ETA:   0h 0m

At the end of training, which impressively takes only a few seconds, the file model_tweet.bin, containing the trained classifier, is created in the current directory. Note that we've only used the default parameters for training the model. We can additionally specify custom arguments at training time to try to improve the performance. For example, below we specify the learning rate (lr) of the training process and the number of times each example is seen (epoch):

$ ./fasttext supervised -input tweets.train -output model_tweet -epoch 30 -lr 0.1

Read 13M words
Number of words:  578422
Number of labels: 2
Progress: 100.0% words/sec/thread:  512276 lr:  0.000000 loss:  0.312297 ETA:   0h 0m

Testing our classifier

Testing our model is similar to training it; we just run the following command:

$ ./fasttext test model_tweet.bin tweets.valid
N	100000
P@1	0.765
R@1	0.765

The output shows the number of examples (N = 100000), the precision at one (P@1) and the recall at one (R@1). We could, of course, add a few more features during the training phase to improve our performance even further!
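If you're wondering why P@1 and R@1 are identical here, the sketch below computes both by hand on a few made-up predictions. The function precision_recall_at_1 is a hypothetical helper of mine, not part of fastText: precision divides the correct predictions by the number of predictions made, and recall divides them by the number of true labels. With exactly one true label and one prediction per tweet, the two denominators are the same.

```python
def precision_recall_at_1(gold_labels, predicted_labels):
    """gold_labels: one set of true labels per example.
    predicted_labels: the single top-ranked prediction per example."""
    correct = sum(1 for gold, pred in zip(gold_labels, predicted_labels) if pred in gold)
    n_predictions = len(predicted_labels)              # one prediction per example
    n_gold = sum(len(gold) for gold in gold_labels)    # total number of true labels
    return correct / n_predictions, correct / n_gold

# Made-up example: 4 tweets, 3 of them predicted correctly
gold = [{'__label__0'}, {'__label__4'}, {'__label__4'}, {'__label__0'}]
pred = ['__label__0', '__label__4', '__label__0', '__label__0']

precision, recall = precision_recall_at_1(gold, pred)
print(precision, recall)
# prints: 0.75 0.75
```

In this single-label setting both numbers are simply the accuracy of the classifier.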

Testing our classifier using the Python API

fastText also offers a Python API that we can use to interact with it. The steps described [here](https://github.com/facebookresearch/fastText/tree/master/python) show how to build it.

Once done, we're going to use it to import our previously saved model and test it against some fake tweets!

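from fastText import load_model

# Load the classifier trained earlier from the command line
classifier = load_model("model_tweet.bin")
texts = ['Ugghhh... Not happy at all! sorry', 'Happyyyyyyy', 'OH yeah! lets rock.']

# predict returns a tuple: the predicted labels and their probabilities
result = classifier.predict(texts)
print(result)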

The result of executing the code above looks acceptable:

([['__label__0'], ['__label__4'], ['__label__4']], array([[0.65397215],
       [0.80476588],
       [0.99866331]]))

The model has classified the first tweet as negative (label 0) and the other two as positive (label 4), and it reports the probability of each prediction. Not bad!
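To make this raw output easier to read, here is a small sketch that pairs each tweet with a readable sentiment name. The helper describe_predictions and the LABEL_NAMES mapping are my own illustrative names, not part of fastText, and the inputs are plain lists copied from the output above so that the snippet runs without the model.

```python
# Hypothetical mapping from the dataset's labels to readable names
LABEL_NAMES = {'__label__0': 'negative', '__label__4': 'positive'}

def describe_predictions(texts, labels, probabilities):
    """Pair each text with its top label (as a readable name) and its probability."""
    results = []
    for text, label_list, prob_list in zip(texts, labels, probabilities):
        top_label = label_list[0]
        name = LABEL_NAMES.get(top_label, top_label)  # fall back to the raw label
        results.append((text, name, float(prob_list[0])))
    return results

texts = ['Ugghhh... Not happy at all! sorry', 'Happyyyyyyy', 'OH yeah! lets rock.']
# Values copied from the output shown above
labels = [['__label__0'], ['__label__4'], ['__label__4']]
probabilities = [[0.65397215], [0.80476588], [0.99866331]]

for text, sentiment, probability in describe_predictions(texts, labels, probabilities):
    print(f'{sentiment:8s} {probability:.3f}  {text}')
```

With a loaded model, the same helper could be fed the two elements of the tuple returned by predict instead of the hard-coded lists.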