We'll provide a brief overview of Natural Language Processing (NLP), introduce the NLTK library for Python, and explain how it can be used to solve complex NLP problems.
People with roles like Community Manager, Developer Advocate, and similar titles try to keep a finger on the pulse of their community by following what customers say on social media or discussion forums. As the community grows, the sheer number of people and conversations makes it challenging to understand the group's overall sentiment, yet that understanding becomes even more important for deciding where to focus engagement efforts and for discerning useful customer feedback.
So, what are people saying about your business? In this multi-part tutorial, I’m going to demonstrate how you can begin building your own tools with Python and Natural Language Processing (NLP) — a branch of machine learning — to analyze the sentiment of a group based on their comments in a public forum like Reddit.
We'll start with a primer on key NLP concepts and a first look at the Natural Language Toolkit (NLTK) Python library.
See the end of this article for links to additional modules on getting data for NLP analysis, using pre-trained NLP models, and creating your own NLP text classification and custom models.
What is Natural Language Processing?
Natural Language Processing is the interdisciplinary study of artificial intelligence and machine learning as it relates to doing useful things with text in human languages. NLP can be used in a wide range of applications like translation between languages, summarizing information, conversational bots, and search.
Language analysis and dialog processing require an understanding of complex topics like morphology, syntax and grammatical structures, semantics, and meaning derived from context.
Based on research in the field, Edward Loper, Steven Bird, and Ewan Klein created the Natural Language Toolkit (NLTK), a platform for building NLP applications in Python. The point of libraries like NLTK is that you don't need to be an expert on morphology, syntax, and semantics, nor on machine learning, to build applications that employ NLP.
Whether you want to understand the political leanings of a demographic, the popularity of a particular marketing campaign, investor confidence, or customer sentiment on social media, you can focus on the areas you know best and let tools like NLTK do the heavy lifting.
I’m not going to cover Python programming basics, installation, virtual environments, and so on in this tutorial. If you need help with any of that, please go check out The Hitchhiker’s Guide to Python first. The examples here use Python 3.7.5.
Getting Started with NLTK
NLTK is described as a platform rather than just another Python library because, in addition to a collection of modules, it includes a number of contributed datasets. These datasets are called corpora, from the Latin for "body", because each one is a body of text that captures how language is actually used.
NLTK is installed by default with the Anaconda distribution for data science and machine learning with Python.
If not already installed, you can install NLTK with Python's pip package manager by running:
pip install -U nltk
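To confirm the installation, you can print the library's version from the command line (the version number you see will depend on when you install):

python -c "import nltk; print(nltk.__version__)"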
The first step for projects using NLTK is to import the library. Whether you're writing Python in a code file, in the interactive shell, or in a tool like Jupyter Notebook or IPython, you can run something like the following:
import nltk
Note that, in these tutorial modules, we'll present the Python code as you would write it in a code editor, but the same code works in any of the environments mentioned above.
Next, use the download() method to extend the corpora and other resources available for your use. Called with no arguments, it opens a user interface where you can select individual packages or download the entire collection.
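For example:

import nltk
nltk.download()  # with no arguments, opens the interactive NLTK Downloader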
More commonly, you'll pass the name of a specific resource. Here we're downloading Punkt, a pre-trained tokenizer for English that divides text into lists of sentences and words. It was built with an unsupervised algorithm that models abbreviations, collocations, and words that start sentences.
nltk.download('punkt')
Once it has been downloaded, you’ll be able to make use of it. Here's an example of using the Punkt tokenizer to extract the components of a sentence.
import nltk

sentence = "A long time ago in a galaxy far, far away…"
# break the sentence into word and punctuation tokens
tokens = nltk.word_tokenize(sentence)
print(tokens)
The output from this code looks like this:
['A', 'long', 'time', 'ago', 'in', 'a', 'galaxy', 'far', ',', 'far', 'away', '...']
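Punkt also handles the sentence-splitting half of its job. Here's a minimal sketch using nltk.sent_tokenize on a made-up snippet of text:

import nltk

text = "NLTK is a platform, not just a library. It ships with dozens of corpora. Punkt decides where each sentence ends."
# sent_tokenize applies the Punkt model we downloaded earlier
sentences = nltk.sent_tokenize(text)
print(sentences)

Each sentence comes back as its own list item:

['NLTK is a platform, not just a library.', 'It ships with dozens of corpora.', 'Punkt decides where each sentence ends.']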
Returning to the word tokens: you might have been able to do something similar using Python's basic split() method, but to account for all the variations you'd end up writing far more code than the single nltk.word_tokenize(sentence) call, as the quick comparison below shows.
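Here's that comparison, reusing the sentence variable from above; split() only breaks on whitespace:

print(sentence.split())
# ['A', 'long', 'time', 'ago', 'in', 'a', 'galaxy', 'far,', 'far', 'away…']

Note how the comma stays attached to 'far,' and the ellipsis to 'away…'. Writing rules to peel apart every punctuation mark and edge case yourself is exactly the work NLTK saves you.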
Next Steps
In this article, we introduced some basic concepts in natural language processing and NLTK, one of the most popular Python libraries for NLP.
For building apps that are more substantial than just reading and writing strings, the next step is exploring how to analyze text.
For the next step in learning about NLP and NLTK, we recommend Finding Data for Natural Language Processing.
If you need a refresher on Python, see the series on Data Cleaning with Python and Pandas.