Natural Language Processing: How to Test, Evaluate, and Scale Your Approach

November 28, 2017 | Analytics, Optimization, Testing

How the Guardian Mobile Lab and MaassMedia efficiently analyzed qualitative data from thousands of survey responses with natural language processing.

In a recent article, “Analysis Without Benchmarks: An Approach for Measuring the Success of Mobile Innovation Projects,” we discussed how MaassMedia and the Guardian Mobile Innovation Lab worked together to develop survey questions that enabled the lab to analyze users’ reactions to new mobile news formats (such as an experiment that sent real-time updates on the 2016 US presidential election results to users’ lock screens). Here we’ll explain why we include open-ended questions in these feedback surveys, and how we developed a natural language processing algorithm to evaluate the sentiment of thousands of users’ free-form responses.

To measure the success of mobile experiments, the Guardian Mobile Lab focuses on positive sentiment around usefulness and on whether a new format was interesting to users. We find out whether people feel positively or negatively about an experiment by interpreting the responses to a feedback survey sent out after each experiment ends. In addition to multiple-choice questions, the lab’s feedback surveys always include an open-ended question at the end: “Is there anything else you would like to tell us about this experiment?” A free-form response lets users share additional thoughts in their own words, and it lets the team fill in analysis gaps left by multiple-choice questions, where the answers are limited to the options we thought to offer. Essentially, free-form responses give us insight into areas of concern or success that the survey didn’t cover.

Early on in collecting survey data from the mobile experiments, when the audiences and the number of free-form responses were small, the team at MaassMedia could read and score the sentiment of individual responses manually. However, when the audience grew and the number of responses reached several thousand, manual scoring became too time-consuming. Even if the work were divided among the team, each person might score the same response differently because of their own biases. It became clear that we needed a way to speed up the analysis and make the scoring more consistent. It would be even better if the new method could be reused to analyze future experiments.

Our solution was to develop a sentiment analysis algorithm with natural language processing (NLP). Brian Hood, a MaassMedia engineer and co-author of this post, took about 20 hours to research and build the solution. He also collaborated with the subject matter experts in the lab to guide and refine the eventual solution.  

Why Natural Language Processing?

Natural language processing is one of the more efficient ways to program computers to interpret qualitative data. With the right solution, you can train a computer to perform sentiment analysis on large quantities of text-based responses, quickly gleaning insights on respondents’ emotional reactions to, or opinions about, a particular subject or experience. Having a dependable NLP solution for textual analysis not only reduces the time it takes to read and score written or free-form survey responses, but can also reduce human error and bias in your analysis.

In our work with the lab we developed a solution that saved time, improved consistency, and enabled us to home in on responses from the experiment’s most and least satisfied users. As with a net promoter score, the most valuable responses to surveys are typically those on either end of the spectrum. By grouping responses as either highly positive or highly negative, we could begin to do text mining on the words most commonly used to express sentiment. These words could help tell a better story and provide context on how users responded to the experiments.
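The grouping step described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the project: the function name `group_by_extremes`, the score range of -1 to 1, and the 0.5 cutoffs are all assumptions chosen for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical cutoffs on a -1..1 sentiment score; tune them on your own data.
POSITIVE_CUTOFF = 0.5
NEGATIVE_CUTOFF = -0.5


def group_by_extremes(scored: List[Tuple[str, float]]) -> Dict[str, List[str]]:
    """Split (response, score) pairs into highly positive, highly negative, and middle groups."""
    groups: Dict[str, List[str]] = {"positive": [], "negative": [], "middle": []}
    for text, score in scored:
        if score >= POSITIVE_CUTOFF:
            groups["positive"].append(text)
        elif score <= NEGATIVE_CUTOFF:
            groups["negative"].append(text)
        else:
            groups["middle"].append(text)
    return groups


# Made-up responses and scores, for illustration only.
sample = [
    ("Loved the live updates", 0.8),
    ("Too many alerts, very annoying", -0.7),
    ("It was fine", 0.1),
]
groups = group_by_extremes(sample)
```

Keeping the middle group separate, rather than forcing every response into one of two buckets, leaves the two extreme groups cleaner for later text mining.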

Another team objective was to apply the NLP algorithm to survey data from other lab experiments with minimal tweaking. With these objectives in mind, we began testing existing algorithms and developing our own.

How we built the algorithm

There were several approaches we could have taken to build and apply a natural language processing algorithm for sentiment analysis. We decided to start by building our own model from scratch, and experiment with using different types of existing data sets to train the model. We also looked at the success rates for existing sentiment analysis models and used them as benchmarks against which to compare our work. Ultimately, we went through three iterations of models and data sets to arrive at the solution.

First iteration: Our own algorithm

To build our own algorithm, we reviewed a number of existing NLP Python packages to mine ideas for how to handle language nuances. For example, the algorithm needed to be able to understand the logic of negation words, such as “not,” in front of a positive word. Once we created our base algorithm, we followed this process to train it:



Once we had our base algorithm written, we started experimenting with various datasets to train it. First we tried a Twitter sentiment analysis dataset, but the resulting model reached only 57% accuracy on data from the election survey. To us, this highlighted a vocabulary mismatch between tweets and the survey responses for the elections experiment. For example, words related to notifications, such as “alerts” or “auto-updating,” appeared in the survey responses but were not covered in the tweets.
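A vocabulary gap like the one described above can be checked before any training. The sketch below is an assumption-laden illustration: the helper names (`tokenize`, `vocabulary`, `out_of_vocabulary`) and the sample sentences are invented, and the tokenizer is deliberately crude.

```python
import re
from typing import Iterable, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep words, including hyphenated ones such as 'auto-updating'."""
    return re.findall(r"[a-z][a-z'-]*", text.lower())


def vocabulary(texts: Iterable[str]) -> Set[str]:
    """Collect the set of distinct words across a collection of texts."""
    vocab: Set[str] = set()
    for text in texts:
        vocab.update(tokenize(text))
    return vocab


def out_of_vocabulary(train_texts: Iterable[str], target_texts: Iterable[str]) -> Set[str]:
    """Words that appear in the target texts but never in the training texts."""
    return vocabulary(target_texts) - vocabulary(train_texts)


# Made-up examples standing in for a tweet corpus and survey responses.
tweets = ["great game tonight", "this traffic is awful"]
survey = ["the alerts were great", "auto-updating was awful on my phone"]
missing = out_of_vocabulary(tweets, survey)
```

A long list of missing domain words is an early hint that a model trained on the first corpus may transfer poorly to the second.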

Next we tried training the model on our own data set from the election survey, and we obtained an accuracy score of 81%. Better! However, when we used the same model to analyze responses from an earlier experiment the lab ran for the Brexit vote, accuracy dropped three points, to 78%. While this method of training picked up more of the nuances and specifics of the lab’s experiments, the model’s scope was still limited to words closely associated with the election experiment.
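The comparison above, training on one survey and then measuring accuracy on another, can be sketched as follows. This is a toy word-counting model with invented data, not the model built for the lab; the names `train_word_scores`, `predict`, and `accuracy` are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Labeled = List[Tuple[str, str]]  # (response text, "pos" or "neg")


def train_word_scores(examples: Labeled) -> Counter:
    """Toy training: +1 per word occurrence in positive responses, -1 in negative ones."""
    scores: Counter = Counter()
    for text, label in examples:
        delta = 1 if label == "pos" else -1
        for word in text.lower().split():
            scores[word] += delta
    return scores


def predict(scores: Counter, text: str) -> str:
    """Sum the learned word scores; unseen words contribute 0."""
    total = sum(scores[word] for word in text.lower().split())
    return "pos" if total >= 0 else "neg"


def accuracy(scores: Counter, examples: Labeled) -> float:
    """Share of examples whose predicted label matches the hand-assigned label."""
    correct = sum(predict(scores, text) == label for text, label in examples)
    return correct / len(examples)


# Invented, hand-labeled examples for illustration only.
election = [
    ("loved the live results", "pos"),
    ("alerts were annoying", "neg"),
    ("useful and easy", "pos"),
    ("too many alerts", "neg"),
]
brexit = [
    ("easy to follow", "pos"),
    ("confusing and annoying", "neg"),
    ("hated the layout", "neg"),
]

model = train_word_scores(election)
in_domain = accuracy(model, election)   # scored on the training data itself
out_of_domain = accuracy(model, brexit)  # scored on a different experiment
```

In this toy run the model does worse on the second data set because words such as “confusing” and “hated” never appeared in training. In real use, the in-domain figure should come from a held-out split rather than from the training examples themselves.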

Second iteration: The VADER algorithm

To validate our own algorithm, we tested the same election survey data set with another algorithm called VADER (Valence Aware Dictionary and sEntiment Reasoner). VADER was created by researchers at Georgia Tech, and its lexicon was built through crowdsourcing: raters were asked to score a series of words, emoticons, slang, and acronyms. The lexicon includes over 7,000 words. We decided to switch to VADER instead of using our own base algorithm because it would allow us to accurately analyze a wider scope of words (not just election-based ones).
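To make the lexicon-based approach concrete, here is a minimal sketch of a scorer that looks words up in a lexicon and flips the sign when a negation word appears shortly before them. It is only an illustration of the general idea: the lexicon entries, weights, negation list, and two-word window are made up and are not VADER’s actual rules or values.

```python
from typing import Dict

# Tiny made-up lexicon; a real lexicon is far larger and carefully rated.
LEXICON: Dict[str, float] = {
    "good": 1.9,
    "great": 3.1,
    "useful": 1.9,
    "bad": -2.5,
    "annoying": -1.7,
}
NEGATIONS = {"not", "never", "no", "wasn't", "isn't", "don't"}
NEGATION_WINDOW = 2  # hypothetical: how many preceding words to check for a negation


def score(text: str) -> float:
    """Sum lexicon weights, flipping the sign of a word preceded by a nearby negation."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    total = 0.0
    for i, word in enumerate(words):
        valence = LEXICON.get(word)
        if valence is None:
            continue  # word carries no sentiment in this lexicon
        window = words[max(0, i - NEGATION_WINDOW):i]
        if any(w in NEGATIONS for w in window):
            valence = -valence
        total += valence
    return total


plain = score("The alerts were useful")
negated = score("The alerts were not useful")
```

Because the scorer relies on a general-purpose word list rather than on labeled survey data, it is not tied to the vocabulary of any single experiment.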

Third iteration: Adapting the VADER algorithm

To further improve the VADER algorithm’s accuracy, we added terms specific to the lab to its lexicon, based on the team’s input. For example, its lexicon did not include words such as “convenience” or “up-to-date,” which were important to add because they describe benefits of participating in the experiments.
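The effect of adding domain terms can be pictured with a small standalone sketch. Everything here is assumed for illustration: the base entries, the weights given to “convenience” and “up-to-date,” and the simple whitespace tokenizer. In the open-source `vaderSentiment` Python package, the analogous step would likely be updating the analyzer’s `lexicon` dictionary, but the sketch avoids that dependency so it runs on its own.

```python
from typing import Dict

# Stand-in for a general-purpose lexicon (made-up entries and weights).
base_lexicon: Dict[str, float] = {"easy": 1.9, "annoying": -1.7}

# Hypothetical lab-specific terms with weights chosen by the team.
lab_terms: Dict[str, float] = {"convenience": 2.0, "up-to-date": 1.5}


def score(text: str, lexicon: Dict[str, float]) -> float:
    """Sum the weights of known words; unknown words count as 0."""
    words = [w.strip(".,!?\"'") for w in text.lower().split()]
    return sum(lexicon.get(w, 0.0) for w in words)


response = "I liked the convenience of staying up-to-date."

before = score(response, base_lexicon)

# Merge the domain terms into a copy so the base lexicon stays unchanged.
adapted_lexicon = {**base_lexicon, **lab_terms}
after = score(response, adapted_lexicon)
```

With the base lexicon alone the response scores as neutral, because none of its words are known. After the merge, the same response registers as positive.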

Our results with the adapted VADER algorithm were the best of the three iterations. For the US election survey data, it had an 80% accuracy score. Although this score was a bit lower than the 81% we produced with our survey-trained model, the adapted VADER held up much better on the survey results from the Brexit experiment, where its accuracy score was 88%, compared with 78% for our own model. This suggested that VADER was flexible enough to accurately score sentiment for future experiments outside the scope of US politics coverage.

Although we had set out to use VADER to validate our own algorithm, we learned that adapting VADER with words specific to the lab’s surveys was the most efficient solution. VADER is the algorithm we now use in our analyses. However, as more survey responses come in from new experiments, we could feed them into our own survey-trained model, which could eventually perform better than VADER as the experiment-specific data set grows.

The Outcome

Through the use of natural language processing, we substantially reduced the time it took to tag and grade the sentiment of survey responses. If our team had read and manually tagged every one of the 1,400 election survey responses, the work would have taken about five hours. But with the algorithm, we could tag and grade the responses in less than five minutes.

In addition to reducing the time required for analysis, our algorithm allowed us to home in on some of the most commonly used words associated with the experiment, segmented by positive and negative sentiment. For example, positive reactions came from people who liked the convenience of the live updates. Their responses included words such as “easy,” “live,” “check,” and “updated.” These keywords gave us some hints about what users liked about the experiment.
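Counting the most common words within each sentiment segment takes only a few lines of Python. The sketch below uses invented responses and a tiny hand-picked stopword list; the function name `top_words` is hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Small hand-picked stopword list for the example; real lists are much longer.
STOPWORDS = {
    "the", "a", "an", "to", "and", "of", "it", "i",
    "was", "were", "is", "my", "on", "for", "so", "too",
}


def top_words(responses: Iterable[str], n: int = 5) -> List[Tuple[str, int]]:
    """Return the n most frequent non-stopword terms across the responses."""
    counts: Counter = Counter()
    for text in responses:
        words = re.findall(r"[a-z'-]+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)


# Made-up responses, already split by sentiment.
positive = [
    "So easy to check the live results",
    "Easy and always updated",
    "Live updates were great",
]
negative = [
    "Too many alerts",
    "The alerts drained my battery",
]

top_positive = top_words(positive, 2)
top_negative = top_words(negative, 1)
```

Running the same count on each segment separately makes it easy to compare the language of satisfied and dissatisfied users side by side.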

Developing an NLP solution is a valuable investment and requires a team with the appropriate skill set. Although it is time-consuming to build, it can significantly reduce the time required for analyses. We now use the modified VADER sentiment scoring algorithm to analyze free-form survey data from lab experiments.

Still, it’s important to note that developing NLP methods is an iterative process. To improve accuracy over time, it’s necessary to keep adding important keywords to the algorithm’s lexicon, and to let the algorithm evolve and adapt alongside the content included in the experiments.


One Comment

  • Jenny Elliott says:

    Interesting read – what is your perspective on the social sentiment tools out there to license? Given the details on how you fine tuned your process and validated along the way specific for your client, I’m less confident that buying a sentient analyzer OOB is worth it. Seems like building your own unique to a brand would make more sense (?).
