We’ll build a library to help us label Reddit comments and identify features in them, improving the accuracy of Natural Language Toolkit (NLTK) VADER sentiment analysis with a machine learning approach.
The goal of this series on Sentiment Analysis is to use Python and the open-source Natural Language Toolkit (NLTK) to build a library that scans replies to Reddit posts and detects if posters are using negative, hostile or otherwise unfriendly language.
The Natural Language Toolkit (NLTK) natural language processing (NLP) library for Python is a powerful tool for performing textual analysis of a corpus of data.
In the articles Using Pre-trained VADER Models for NLTK Sentiment Analysis and NLTK and Machine Learning for Sentiment Analysis, we used some pre-configured datasets and analysis tools to perform sentiment analysis on a body of data extracted from a Reddit discussion.
Fundamentally, movie reviews may not be a good corpus for training if we’re looking at a technical community like the r/learnpython subreddit. If we continued to iterate and improve the features we’re looking for, we might be able to get good results when moving to an evaluation set like data from the r/Movies or r/MovieDetails subreddits.
Overfitting occurs when you end up with too many features that are specifically tuned to training data and don’t generalize. We might be able to achieve greater accuracy in predicting movie reviews, but that wouldn’t be a good fit with our target use case.
We could use a dataset like the UCI Paper Reviews dataset, which may include technical and scientific terms that are predictive, but that dataset is in Spanish and so might not generalize to our English-language comments.
We could use a dataset like Reuters news articles, but the style of communicating facts and news doesn’t involve the same word choices a technical community might make, so it suffers the same problem as movie reviews.
Instead, let’s look at what a process of annotating our own dataset would entail.
Creating a Custom Dataset for Training
So that we don’t overload the Reddit API, we’ll temporarily cache the comment data in a local file while training.
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import json

# Score each comment with VADER and cache the results in a local JSON file
# so we don't re-query the Reddit API on every training run.
# `submission` and `comments` come from the PRAW calls covered in
# Finding Data for Natural Language Processing.
results = {}
analyzer = SentimentIntensityAnalyzer()

for comment in comments:
    score = analyzer.polarity_scores(comment.body)
    results[comment.id] = {
        'score': score,
        'ups': comment.ups,
        'downs': comment.downs,
        'created': comment.created,
        'text': comment.body
    }
    if comment.author:
        results[comment.id]['author'] = comment.author.name
    else:
        results[comment.id]['author'] = 'deleted'

filename = submission.id + '.json'
with open(filename, 'w') as file:
    json.dump(results, file, indent=4)
This gives us a data file with some values that might be useful for defining features.
Now it’s time to annotate. We can edit this file, or create a mapping, to add our own labels, such as the "label" key shown in the example below.
{
    "fmonppi": {
        "score": {
            "neg": 0.0,
            "neu": 0.981,
            "pos": 0.019,
            "compound": 0.1779
        },
        "ups": 122,
        "downs": 0,
        "created": 1586293145.0,
        "author": "...",
        "text": "To take a non programming approach, suppose you and I both have a $50 bill. They are of equal value but are not the same as their serial numbers differ.\n\nSo they are equivalent but not identical.\n\nIn Python, == checks for equivalency. For custom classes you can implement this yourself. In contrast, is and is not checks to see if two things are identical. Generally it's used to check if two variable names are aliases of each other and to check for None as there is only NoneType instance.",
        "label": "neutral"
    },
    ...
}
We can review each comment from this post and label it in a way we find informative. For example, I agree with the VADER analysis that this comment should be neutral.
Many sentiment analysis datasets designate only positive and negative, which forces neutral comments into one of those two categories; that may be a disservice when we're trying to identify the truly insightful comments.
We can build tools to help automate the annotation workflow. We’re not going to look in detail at these tools, but whether we use a CLI or a GUI, we can choose to brute-force a label onto every comment, or sample randomly across a number of posts and grow our dataset at regular intervals.
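A minimal command-line loop for this might look like the following sketch (the file names and the label prompt are assumptions, not part of any finished tool):

import json

def annotate(filename):
    # Read the cached comments, ask for a label on each, and save the
    # combined result for training.
    with open(filename, 'r') as file:
        comments = json.load(file)

    for id, comment in comments.items():
        if 'label' in comment:
            continue  # skip anything we've already annotated
        print(comment['text'])
        comment['label'] = input('Label (positive/neutral/negative): ').strip()

    with open('annotated.json', 'w') as file:
        json.dump(comments, file, indent=4)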
Using a Custom Annotated Dataset for Training
With our own annotated dataset, we can now substitute it for the off-the-shelf movie review corpus in our training attempt. As before, we'll pair some details about each comment with the label we chose for it. Unlike the movie reviews corpus, our comments aren't read in label order, but we shuffle them anyway in case the order of comments is somehow meaningful.
import json
import random

def get_labeled_dataset():
    # Load the hand-annotated comments and pair each one with its label.
    filename = 'annotated.json'
    with open(filename, 'r') as file:
        annotated = json.loads(file.read())

    dataset = []
    for id in annotated:
        dataset.append((annotated[id], annotated[id]['label']))

    # Shuffle in case the order of comments carries any meaning.
    random.shuffle(dataset)
    return dataset
We aren’t making any substantial changes to our get_features() method, other than passing the comment text to it directly, which we retrieve from comment['text'].
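As a rough sketch of how the pieces fit together (the word-presence features and the train/test split below are illustrative assumptions, and the real get_features() from the earlier article may differ), training against our annotated comments might look like this:

import nltk

def get_features(text):
    # Illustrative feature extractor: simple word-presence features.
    # A real tokenizer would handle punctuation better than split().
    features = {}
    for word in text.lower().split():
        features['contains({})'.format(word)] = True
    return features

dataset = get_labeled_dataset()
featuresets = [(get_features(comment['text']), label) for comment, label in dataset]

# Hold out a portion of the shuffled data for evaluation.
split = int(len(featuresets) * 0.8)
classifier = nltk.NaiveBayesClassifier.train(featuresets[:split])
print(nltk.classify.accuracy(classifier, featuresets[split:]))
classifier.show_most_informative_features()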
Considerations for Effective Feature Analysis
To improve our analysis, we'll need to build up a larger corpus. Annotating a single post isn't too bad, but our movie reviews dataset had 2,000 items, while a single Reddit submission may have only 100 comments. (We discussed collecting comments from Reddit in Finding Data for Natural Language Processing.)
Despite the small sample size, the effort of annotating our own data can help us understand what makes for better predictions.
The VADER analysis is still part of our feature set because it gives us a good baseline. VADER compound values range from -1 to +1, and on their own they would require us to choose thresholds above and below which positive and negative sentiment is assigned. By combining the scores with our machine learning approach, we let the classification model determine that threshold through the weights it learns.
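As a small illustration (the feature name and the coarse bucketing are assumptions, not a fixed recipe), we could pass the compound score through to the classifier rather than hard-coding a cutoff ourselves:

def add_vader_feature(features, text):
    # analyzer is the SentimentIntensityAnalyzer instance created earlier.
    # Bucket the compound score coarsely so the classifier can treat it as
    # a discrete feature and learn its own weighting, instead of us
    # hand-picking a positive/negative threshold.
    compound = analyzer.polarity_scores(text)['compound']
    features['vader_compound_bucket'] = round(compound, 1)
    return features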
The defining concept here was originally understanding customer sentiment in comments, and our data source was comments in the r/learnpython subreddit. What other features make sense for understanding sentiment in technical communities? (A sketch that turns a few of these into features follows the list.)
- The number of upvotes and downvotes is used by Reddit to identify controversial posts and comments, and that signal is useful for us as well: comments with lots of both upvotes and downvotes tend to be strongly positive or negative.
- With technical audiences, phrasings that prove somebody wrong about something can be inferred to be negative. These include phrases such as "is not correct," "a better explanation," "I would avoid it," "you got it backwards," "this behavior is esoteric," "strongly advise against," and so forth.
- Gratitude is a good indication of positive sentiment, including phrases such as "Thank you" and "This is cool."
- Beginning a sentence with "Well," or "Perhaps" often indicates a comment that can be construed as negative.
- Lexical approaches may miss other details specific to technical communities, such as when a comment includes source code or cites an external reference like a URL (which can act as an intensifier, be meant to help somebody, or point out a Python Enhancement Proposal (PEP) recommendation that wasn't followed).
- Understanding when rhetorical questions are being asked.
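As a rough sketch (the phrase lists, feature names, and the upvote threshold below are illustrative assumptions), a few of these observations could be codified as boolean features:

NEGATIVE_PHRASES = ['is not correct', 'a better explanation', 'strongly advise against']
GRATITUDE_PHRASES = ['thank you', 'this is cool']

def get_community_features(comment):
    # comment is one of the dictionaries from our annotated dataset.
    text = comment['text']
    lowered = text.lower()
    return {
        'has_negative_phrase': any(p in lowered for p in NEGATIVE_PHRASES),
        'has_gratitude': any(p in lowered for p in GRATITUDE_PHRASES),
        # "Well," or "Perhaps" at the start of a comment often reads as negative.
        'hedged_opening': lowered.startswith('well,') or lowered.startswith('perhaps'),
        # Reddit renders indented or backtick-fenced text as code.
        'has_code': '    ' in text or '`' in text,
        'has_url': 'http://' in lowered or 'https://' in lowered,
        # Lots of both upvotes and downvotes suggests a polarizing comment.
        'controversial': comment['ups'] > 10 and comment['downs'] > 10,
    }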
Ultimately, labels are subjective and reflect our own bias and interest in the data. To get the most accurate results, we should strive to be consistent both in how we label and in how we codify these observations as features.
Next Steps
If you've followed through our articles on NLTK and sentiment analysis, you should have a basic understanding of the main building blocks for natural language processing using both built-in NLTK tools for Python and your own machine learning models.
In Using Cloud AI for Sentiment Analysis, we'll take a look at some sophisticated NLP tools now available through cloud computing services such as Amazon Comprehend and Microsoft Azure Text Analytics.