
Using Pre-trained VADER Models for NLTK Sentiment Analysis

29 May 2020 · CPOL · 4 min read
This article is the third in the Sentiment Analysis series that uses Python and the open-source Natural Language Toolkit. In this article, we'll look at techniques you can use to start doing the actual NLP analysis.
NLTK includes pre-trained models in addition to its text corpora. One of these, the VADER Sentiment Lexicon model, is aimed at sentiment analysis on social media. Let's see how it works.

If you’ve ever been asked to rate your experience with customer support on a scale from 1-10, you may have contributed to a Net Promoter Score (NPS). With this approach to customer experience, you generally are looking for promoters, those who rate their experience 9-10, because they are advocates for your brand and will keep buying, consuming, and telling others about their experience.

Within the context of NPS, detractors are anyone who rates their experience with a score from 0-6. They are unhappy and often spread their displeasure through negative word-of-mouth. These customers are typically a priority for outreach. A score of 7-8 is considered passive: satisfied and neutral, but not enthusiastic enough to promote you.

Sentiment analysis can give insights similar to NPS, but without requiring your audience to take a survey. It can help you find promoters and detractors simply by evaluating what people are saying about you on social media or in public forums.

In Finding Data for Natural Language Processing, we talked about textual datasets for NLP and techniques for creating a custom dataset by collecting posts and comments from Reddit discussions.

In this article, we'll look at techniques you can use to start doing the actual NLP analysis. We'll be building on the data collected in the previous article.

VADER Sentiment Analyzer

Developed in 2014, VADER (Valence Aware Dictionary and sEntiment Reasoner) is a pre-trained model that uses rule-based values tuned to sentiments from social media. It evaluates the text of a message and gives you an assessment of not just positive and negative, but the intensity of that emotion as well.

It uses a dictionary of terms that it can evaluate. The project's GitHub repository includes examples such as:

  • Negations - a modifier that reverses the meaning of a phrase ("not great").
  • Contractions - negations, but more complex ("wasn’t great").
  • Punctuation - increased intensity ("It’s great!!!").
  • Slang - variations of slang words such as "kinda", "sux", or "hella".

It's even able to understand acronyms ("lol") and emoji (❤).

The scoring is a ratio of the proportion of text that falls into each category. Language is rarely black and white, so it is rare to see a completely positive or completely negative score. Since this model has been pre-trained on social media text, it should be well suited to comments made by users on Reddit.

Let’s first look at an example from a comment retrieved previously from Reddit.

Python
comments[116].body     # Output: 'This is cool!'

# If you haven't already, download the lexicon
import nltk
nltk.download('vader_lexicon')

# Initialize the VADER sentiment analyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
analyzer.polarity_scores(comments[116].body)

The output of this analysis is:

Python
{'neg': 0.0, 'neu': 0.436, 'pos': 0.564, 'compound': 0.3802}

On Reddit, a post like "This is cool!" is high praise.

We’ve downloaded (nltk.download('vader_lexicon')) and imported (from nltk.sentiment.vader import SentimentIntensityAnalyzer) the VADER sentiment analyzer and used it to score a particular comment from the collection (analyzer.polarity_scores(comments[116].body)).

The result of polarity_scores gives us numerical values for the use of negative, neutral, and positive word choice. The compound value reflects the overall sentiment, ranging from -1 (very negative) to +1 (very positive).
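Note that the neg/neu/pos values are normalized proportions, so they sum to (approximately) 1.0 for any input. We can check that against the output above:

```python
# The scores returned for comments[116].body, as shown earlier
scores = {'neg': 0.0, 'neu': 0.436, 'pos': 0.564, 'compound': 0.3802}

# neg/neu/pos are proportions of the text in each category, so they sum to 1
total = scores['neg'] + scores['neu'] + scores['pos']
print(round(total, 3))  # → 1.0
```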

You can find more about NLTK sentiment usage in the API documentation: https://www.nltk.org/api/nltk.sentiment.html.

Sentiment for all Comments on a Reddit Post

Let’s look at the sentiment overall for this post instead of just a single comment. There were 119 comments to analyze and we’ll put them into buckets to keep count.

Python
len(comments)  # Output: 119

# Initialize a dictionary to tally the results
result = {'pos': 0, 'neg': 0, 'neu': 0}
for comment in comments:
    score = analyzer.polarity_scores(comment.body)
    if score['compound'] > 0.05:
        result['pos'] += 1
    elif score['compound'] < -0.05:
        result['neg'] += 1
    else:
        result['neu'] += 1

print(result)

The output is:

Python
{'pos': 65, 'neg': 25, 'neu': 29}

What we’ve learned is that the comments on this post were generally positive: 65 of the 119 scored as positive, compared to 25 negative and 29 neutral.

If you start analyzing your own posts using a model like this, you may want to tune the threshold up or down. For example, using a compound-score cutoff of ±0.5 instead of ±0.05 would highlight only the more extreme opinions.
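A sketch of how that tuning might look, using a hypothetical label_sentiment helper with an adjustable threshold (not part of NLTK):

```python
def label_sentiment(compound, threshold=0.05):
    """Bucket a VADER compound score as 'pos', 'neg', or 'neu'."""
    if compound > threshold:
        return 'pos'
    elif compound < -threshold:
        return 'neg'
    return 'neu'

# With the default threshold, a mildly positive score counts as positive...
print(label_sentiment(0.38))       # → 'pos'
# ...but raising the threshold to 0.5 treats it as neutral
print(label_sentiment(0.38, 0.5))  # → 'neu'
```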

What can you do with this information? If you were trying to prioritize how to engage with your community, you might look at the positive comments and give those commenters recognition as your supporters. If you were trying to win back detractors, you might focus on the negative scores and see whether their comments contain constructive feedback you can use to improve your offering, or make personal outreach to address specific customer concerns.
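For example, one way to surface the strongest detractors first is to sort scored comments by compound score. This sketch uses hypothetical (text, score) pairs standing in for real Reddit data:

```python
# Hypothetical comment texts and compound scores, for illustration only
scored = [
    ("This is cool!", 0.3802),
    ("Terrible experience, never again.", -0.6588),
    ("It works, I guess.", 0.0),
    ("Absolutely love it!!!", 0.7519),
]

# Keep only negative comments, most negative first: these are the
# detractors to prioritize for outreach
detractors = sorted((s for s in scored if s[1] < -0.05), key=lambda s: s[1])
for text, score in detractors:
    print(f"{score:+.4f}  {text}")
```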

Next Steps

As you've seen, we can take a text from a variety of sources and do a quick analysis to understand positive and negative sentiment. This is useful feedback to understand whether a product, service, or content is well-liked. It can also help prioritize community engagement.

As a next step, we can consider the Pros and Cons of NLTK Sentiment Analysis with VADER.

We can also take this analysis project further by leveraging machine learning approaches to understanding language and try to improve upon our results in NLTK and Machine Learning.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)