Introduction
This article describes three approaches to parsing the sentences out of a body of text. Three approaches are shown so that the pros and cons of each can be compared. The demonstration application also shows how to generate sentence count, word count, and character count statistics for a body of text.
Figure 1: The test application running.
The three approaches to parsing the sentences out of the body of text are:
- Parse Reasonable: Splits the text on typical sentence terminations and retains the termination.
- Parse Best: Splits the text using a regular expression and retains the termination.
- Parse Without Endings: Splits the text on typical sentence terminations and does not retain the terminations as part of the sentence.
The demonstration application contains some default text in a TextBox control, three buttons used to parse the text using one of the three approaches mentioned, and three label controls used to display the summary statistics generated for the body of text. Once the application is running, clicking any of the three buttons displays each of the parsed sentences in the ListBox control at the bottom of the form and displays the summary statistics in the three labels in the upper right-hand side of the form.
Getting Started
In order to get started, unzip the included project and open the solution in the Visual Studio 2008 environment. In the solution explorer, you should note these files (Figure 2):
Figure 2: Solution Explorer.
As you can see from Figure 2, there is a single WinForms project containing a single form. All code required by this application is included in this form's code.
The Main Form (Form1.vb)
The main form of the application, Form1, contains all of the code necessary. The form contains default text within a TextBox control; the three buttons are used to execute each of the three functions used to parse the body of text into a collection of strings, one per sentence. You may replace, remove, or add to the text contained in the TextBox control to run the methods against your own text. Three label controls are used to display summary statistics (sentence, word, and character counts) on the text contained in the TextBox control. These summary statistics are updated each time the text is parsed into sentences.
If you open the code view in the IDE, you will see that the code file begins with the following library imports:
Imports System
Imports System.Collections
Imports System.ComponentModel
Imports System.Data
Imports System.Drawing
Imports System.Text
Imports System.Windows.Forms
Imports System.Text.RegularExpressions
Note that the default imports have been extended to include System.Text.RegularExpressions, the regular expressions library.
Following the imports, the class and constructor are defined:
Public Class Form1
    Public Sub New()
        InitializeComponent()
    End Sub
Next up is a region entitled “Best Sentence Parser”. This region contains a function entitled SplitSentences, which accepts a string as an argument. This method tends to yield the best results in terms of parsing sentences but may produce inaccurate results if the text contains errors (for example, a sentence that does not begin with a capital letter). The region also contains a button click event handler used to invoke the SplitSentences function.
The code is annotated, and reading through the notes will explain what is going on within the function.
#Region "Best Sentence Parser"
    ' Splits the text into sentences with a regular expression and
    ' keeps the termination of each sentence
    Private Function SplitSentences(ByVal sSourceText As String) As ArrayList
        Dim sTemp As String = sSourceText
        Dim al As New ArrayList()

        ' Split on whitespace that follows a letter, digit, or quote plus
        ' a termination (. ! ?) and that precedes a capital letter
        Dim RegexSentenceParse As String() = _
            Regex.Split(sTemp, "(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])")

        ' Remove line breaks from each sentence, trim it, and collect it
        Dim i As Integer = 0
        For i = 0 To RegexSentenceParse.Length - 1
            Dim sSingleSentence As String = _
                RegexSentenceParse(i).Replace(Environment.NewLine, String.Empty)
            al.Add(sSingleSentence.Trim())
        Next

        ' Update the summary statistics
        lblCharCount.Text = "Character Count: " & _
            GenerateCharacterCount(sTemp).ToString()
        lblSentenceCount.Text = "Sentence Count: " & _
            GenerateSentenceCount(RegexSentenceParse).ToString()
        lblWordCount.Text = "Word Count: " & _
            GenerateWordCount(al).ToString()

        Return al
    End Function

    ' Parse the text and list each sentence in the list box
    Private Sub btnParseNoEnding_Click(ByVal sender As System.Object, _
        ByVal e As System.EventArgs) Handles btnParseNoEnding.Click
        lstSentences.Items.Clear()
        Dim al As ArrayList = SplitSentences(txtParagraphs.Text)
        Dim i As Integer
        For i = 0 To al.Count - 1
            lstSentences.Items.Add(al(i).ToString())
        Next
    End Sub
#End Region
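To see what the regular expression does on its own, the following is a minimal console sketch that applies the same pattern outside the form. The module name and the sample text are assumptions made for illustration; only the pattern comes from SplitSentences.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RegexSplitSketch
    Sub Main()
        ' Same pattern as SplitSentences: split on whitespace that follows
        ' a termination and precedes a capital letter.
        Dim pattern As String = "(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])"

        ' Illustrative sample text (not from the demo project).
        Dim sample As String = "Is it raining? It was dry at 9.30 today. Bring a coat!"

        Dim sentences As String() = Regex.Split(sample, pattern)
        For Each sentence As String In sentences
            Console.WriteLine(sentence)
        Next
        ' Prints:
        ' Is it raining?
        ' It was dry at 9.30 today.
        ' Bring a coat!
    End Sub
End Module
```

Because the lookbehind and lookahead consume no characters, each termination stays attached to its sentence, and the period inside 9.30 is left alone since no whitespace follows it. The same rule means an abbreviation followed by a capitalized word, such as "Dr. Smith", would be treated as a sentence boundary.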
Next up is a region entitled “Reasonable Sentence Parser”. This region contains a function entitled ReasonableParser, which accepts a string as an argument. This method tends to yield fair results in terms of parsing sentences but does not apply the proper sentence terminations if the input string contains duplicate sentences with different terminations (for example, "Really?" and "Really!"), because the termination is always looked up at the first occurrence of the sentence text. This issue could be resolved by using a recursive function to move through each instance of the duplicate sentence; however, it is less work to use the method shown in the previous code region. The region also contains a button click event handler used to invoke the ReasonableParser function.
The code is annotated, and reading through the notes will explain what is going on within the function.
#Region "Reasonable Sentence Parser"
    ' Splits the text on sentence terminations, then looks each piece up
    ' in the source text to recover the termination that followed it
    Private Function ReasonableParser(ByVal sTextToParse As String) As ArrayList
        Dim al As New ArrayList()

        ' Replace line breaks with spaces
        Dim sTemp As String = sTextToParse
        sTemp = sTemp.Replace(Environment.NewLine, " ")

        ' The characters used to end a sentence
        Dim arrSplitChars As Char() = {"."c, "?"c, "!"c}

        ' Split on the terminations; the split drops them from each piece
        Dim splitSentences As String() = sTemp.Split(arrSplitChars, _
            StringSplitOptions.RemoveEmptyEntries)

        Dim i As Integer
        For i = 0 To splitSentences.Length - 1
            ' Find the first occurrence of the piece in the source text;
            ' duplicate sentences always resolve to that first occurrence
            Dim pos As Integer = sTemp.IndexOf(splitSentences(i))
            Dim endPos As Integer = pos + splitSentences(i).Length
            Dim sSentence As String = splitSentences(i).Trim()

            ' The character after the piece is its termination; the check
            ' guards against text that does not end with a termination
            If endPos < sTemp.Length Then
                sSentence &= sTemp(endPos)
            End If
            al.Add(sSentence)
        Next

        ' Update the summary statistics
        lblCharCount.Text = "Character Count: " & _
            GenerateCharacterCount(sTemp).ToString()
        lblSentenceCount.Text = "Sentence Count: " & _
            GenerateSentenceCount(splitSentences).ToString()
        lblWordCount.Text = "Word Count: " & _
            GenerateWordCount(al).ToString()

        Return al
    End Function

    ' Parse the text and list each sentence in the list box
    Private Sub btnParseReasonable_Click(ByVal sender As System.Object, _
        ByVal e As System.EventArgs) Handles btnParseReasonable.Click
        lstSentences.Items.Clear()
        Dim al As ArrayList = ReasonableParser(txtParagraphs.Text)
        Dim i As Integer
        For i = 0 To al.Count - 1
            lstSentences.Items.Add(al(i).ToString())
        Next
    End Sub
#End Region
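The duplicate sentence limitation comes from searching the source text from its beginning every time. As an alternative to the recursive function mentioned above, one option is to pass a start index to IndexOf and advance it after each sentence. The sketch below is an assumption about how that might look, not code from the demo project; ParseWithOffsets and the module name are hypothetical.

```vbnet
Imports System
Imports System.Collections.Generic

Module OffsetParserSketch
    ' Hypothetical helper: like ReasonableParser, but each lookup starts
    ' where the previous sentence ended.
    Function ParseWithOffsets(ByVal source As String) As List(Of String)
        Dim result As New List(Of String)()
        Dim terminators As Char() = {"."c, "?"c, "!"c}
        Dim fragments As String() = _
            source.Split(terminators, StringSplitOptions.RemoveEmptyEntries)

        Dim searchFrom As Integer = 0
        For Each fragment As String In fragments
            ' Search only the part of the text that has not been consumed.
            Dim pos As Integer = _
                source.IndexOf(fragment, searchFrom, StringComparison.Ordinal)
            Dim endPos As Integer = pos + fragment.Length

            Dim sentence As String = fragment.Trim()
            If sentence.Length > 0 Then
                ' Re-attach the termination if the text has one here.
                If endPos < source.Length Then
                    sentence &= source(endPos)
                End If
                result.Add(sentence)
            End If
            searchFrom = endPos
        Next
        Return result
    End Function

    Sub Main()
        For Each s As String In ParseWithOffsets("Really? Really! Really.")
            Console.WriteLine(s)
        Next
        ' Prints:
        ' Really?
        ' Really!
        ' Really.
    End Sub
End Module
```

With the same input, a lookup that always starts at index 0 would find the second and third pieces at the same position and report "Really!" for both.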
Next up is a region entitled “Parse Without Sentence Terminations”. This region contains a function entitled IDontCareHowItEndsParser, which accepts a string as an argument. This method tends to yield good results in terms of parsing sentences but does not add the termination to the parsed sentences; this is a good approach to use if you don't care what termination is used at the end of the sentence. The region also contains a button click event handler used to invoke the IDontCareHowItEndsParser function.
The code is annotated, and reading through the notes will explain what is going on within the function.
#Region "Parse Without Sentence Terminations"
    ' Splits the text on sentence terminations and returns the pieces
    ' without their terminations
    Private Function IDontCareHowItEndsParser(ByVal sTextToParse As String) _
        As ArrayList
        ' Replace line breaks with spaces
        Dim sTemp As String = sTextToParse
        sTemp = sTemp.Replace(Environment.NewLine, " ")

        ' The characters used to end a sentence
        Dim arrSplitChars As Char() = {"."c, "?"c, "!"c}

        ' Split on the terminations; the split drops them from each piece
        Dim splitSentences As String() = sTemp.Split(arrSplitChars, _
            StringSplitOptions.RemoveEmptyEntries)

        ' Trim each piece and collect it
        Dim al As New ArrayList()
        Dim i As Integer
        For i = 0 To splitSentences.Length - 1
            splitSentences(i) = splitSentences(i).Trim()
            al.Add(splitSentences(i))
        Next

        ' Update the summary statistics
        lblCharCount.Text = "Character Count: " & _
            GenerateCharacterCount(sTemp).ToString()
        lblSentenceCount.Text = "Sentence Count: " & _
            GenerateSentenceCount(splitSentences).ToString()
        lblWordCount.Text = "Word Count: " & GenerateWordCount(al).ToString()

        Return al
    End Function

    ' Parse the text and list each sentence in the list box
    Private Sub btnParseBest_Click(ByVal sender As System.Object, _
        ByVal e As System.EventArgs) Handles btnParseBest.Click
        lstSentences.Items.Clear()
        Dim al As ArrayList = IDontCareHowItEndsParser(txtParagraphs.Text)
        Dim i As Integer
        For i = 0 To al.Count - 1
            lstSentences.Items.Add(al(i).ToString())
        Next
    End Sub
#End Region
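One consequence of splitting on the termination characters alone is that every period counts, including one inside a number. The following console sketch illustrates this with the same String.Split call; the module name and sample text are assumptions made for illustration.

```vbnet
Imports System

Module SplitWithoutEndingsSketch
    Sub Main()
        ' The same termination characters used by IDontCareHowItEndsParser.
        Dim terminators As Char() = {"."c, "?"c, "!"c}

        ' Illustrative sample text (not from the demo project).
        Dim sample As String = "Pi is about 3.14. Is that close enough? Yes!"

        Dim fragments As String() = _
            sample.Split(terminators, StringSplitOptions.RemoveEmptyEntries)
        For Each fragment As String In fragments
            Console.WriteLine("[" & fragment.Trim() & "]")
        Next
        ' Prints:
        ' [Pi is about 3]
        ' [14]
        ' [Is that close enough]
        ' [Yes]
    End Sub
End Module
```

The decimal point splits the first sentence in two, which also inflates the sentence count. The regular expression approach avoids this case because it only splits where whitespace follows the termination.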
The final region is entitled “Generate Statistics”. This region contains three functions which return the character count, word count, and sentence count for a body of text. Again, this section is annotated; read through the annotation to get a description of how each function works.
#Region "Generate Statistics"
    ' Counts the letters, digits, punctuation, and whitespace found
    ' within the sentences of the text
    Public Function GenerateCharacterCount(ByVal allText As String) As Integer
        Dim rtn As Integer = 0

        ' Remove line breaks and surrounding whitespace
        Dim sTemp As String = allText
        sTemp = sTemp.Replace(Environment.NewLine, String.Empty)
        sTemp = sTemp.Trim()

        ' Split into sentences; the whitespace between sentences is
        ' consumed by the split and is not counted
        Dim splitSentences As String() = _
            Regex.Split(sTemp, _
            "(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])")

        Dim cnt As Integer
        For cnt = 0 To splitSentences.Length - 1
            Dim sSentence As String = splitSentences(cnt).Trim()
            Dim sentence As Char() = sSentence.ToCharArray()

            ' Count each character that passes the filter
            Dim i As Integer
            For i = 0 To sentence.Length - 1
                If Char.IsLetterOrDigit(sentence(i)) OrElse _
                   Char.IsPunctuation(sentence(i)) OrElse _
                   Char.IsWhiteSpace(sentence(i)) Then
                    rtn += 1
                End If
            Next
        Next

        Return rtn
    End Function

    ' Counts the words in each sentence by splitting on spaces
    Public Function GenerateWordCount(ByVal allSentences As ArrayList) As Integer
        Dim rtn As Integer = 0
        Dim arrSplitChars As Char() = New Char() {" "c}
        Dim sSentence As String
        For Each sSentence In allSentences
            Dim arrWords As String() = sSentence.Split(arrSplitChars, _
                StringSplitOptions.RemoveEmptyEntries)
            rtn += arrWords.Length
        Next
        Return rtn
    End Function

    ' The sentence count is the number of parsed sentences
    Public Function GenerateSentenceCount(ByVal allSentences As String()) _
        As Integer
        Return allSentences.Length
    End Function
#End Region
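The character filter in GenerateCharacterCount means the result is not always the same as the length of the string. The sketch below isolates that filter in a console module to show the difference; IsCounted, CountCharacters, and the sample text are hypothetical names and data chosen for illustration.

```vbnet
Imports System

Module CharacterFilterSketch
    ' Mirrors the test applied to each character in GenerateCharacterCount.
    Function IsCounted(ByVal c As Char) As Boolean
        Return Char.IsLetterOrDigit(c) OrElse _
               Char.IsPunctuation(c) OrElse _
               Char.IsWhiteSpace(c)
    End Function

    ' Counts only the characters that pass the filter.
    Function CountCharacters(ByVal source As String) As Integer
        Dim count As Integer = 0
        For Each c As Char In source
            If IsCounted(c) Then
                count += 1
            End If
        Next
        Return count
    End Function

    Sub Main()
        Dim sample As String = "It costs $5."
        Console.WriteLine(sample.Length)            ' 12
        Console.WriteLine(CountCharacters(sample))  ' 11
    End Sub
End Module
```

The dollar sign is classified as a symbol rather than punctuation, so it is skipped. In the full function, the whitespace between sentences is also left out because the regular expression split consumes it.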
Summary
This article describes several approaches to parsing the sentences out of a body of text, along with three functions which may be used to generate summary statistics on that text. There are, of course, other ways to do each of these things. In general, the best approach to parsing out the sentences appears to be the use of a regular expression. Modifications to the regular expression may yield different results which might work better with the sort of text you are working with; however, I have found that this approach works well with even complicated bodies of text so long as the text is formatted into proper sentences.
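As one illustration of how a modified expression changes the results, the sketch below compares the pattern used in SplitSentences with a variant that also accepts a closing quote or parenthesis after the termination. The variant, the constant names, and the sample text are assumptions made for this example and have not been tested against real documents.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module PatternVariantSketch
    ' Pattern used by the SplitSentences function.
    Const OriginalPattern As String = _
        "(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])"

    ' Hypothetical variant: the termination may be followed by a closing
    ' quote or parenthesis, and the next sentence may open with one.
    Const QuoteAwarePattern As String = _
        "(?<=[\.\!\?]['""\)]?)\s+(?=[A-Z'""\(])"

    Sub Main()
        Dim sample As String = _
            "She said ""Go."" Then she left. Nobody followed."

        ' The original pattern does not split after the closing quote.
        Console.WriteLine(Regex.Split(sample, OriginalPattern).Length)    ' 2

        ' The variant treats the quoted ending as a sentence boundary.
        Console.WriteLine(Regex.Split(sample, QuoteAwarePattern).Length)  ' 3
    End Sub
End Module
```

The variant no longer requires a letter, digit, or quote before the termination, so it is also more willing to split after an ellipsis; whether that trade is worthwhile depends on the text being parsed.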
History
- 3rd June, 2008: Initial version