Download PitchContour.zip
Download snack2.2.zip
The reason for this article and the application accompanying it is an idea for an automated pronunciation test that I've been
flirting with for a few months now. Due to the difficulties I found, and some frustration, I thought about giving it up three or four
times, but in the end the inner "never give up" voice had the upper hand and eventually won. Fortunately, I ended up with a
solution, which I admit is not perfect and one that drifted far away from my initial "development track". But this is how it
often goes when you run into difficulties: you embrace whatever tools work for you.
The first problem was finding code or a component to generate the so-called "pitch contour" for the analysis.
The pitch contour is the melody of the human voice; more technically, the fluctuation in frequency that accompanies speech.
I tried hard to find that magical "open source .Net code" which included the pitch contour calculation, searching the internet,
but with no success. I found some open source solutions, but sadly they're not written in .Net (mostly in C++/Python).
Sadly, too, I'm no expert in C++ or Python, and the code is too large to be ported. Nor am I an expert in the mathematical
algorithms (such as the Fast Fourier Transform) needed to create a new library from scratch. So I ended up with a
"collaboration" between a server-side C# program and a console Python application. Not particularly pretty, since I initially
planned an all-client-side, managed-code solution, but it works and that's what matters. I hope some code hero in the .Net community
comes up with a more elegant solution for that.
The second problem was comparing the user's voice against the predefined exercise voice and providing a score. How can
I compare two pitch contours? I had no tools for such a task, so I had to come up with a new one. It took me many hours of work
and it's still not perfect, but it's the only one I have so far. As with the previous problem, a code hero would save the day here.
Make sure you follow these 3 steps:
1. The following software is needed to run the Pronunciation Test provided with this article:
2. You must also download the Python 2.2 software and make sure it is installed in the C:\Python22 folder. This is required because the
source code only works with the application stored in the C:\Python22 folder. Since the app was built against Python 2.2 only, I can't tell whether
it will work with other versions of Python.
3. Finally, you must download the snack2.2.zip file and copy its contents to the C:\Python22\tcl folder. Without this folder, the
application will not work. (A quick way to verify these prerequisites is sketched right after these steps.)
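If you want to fail fast when one of these prerequisites is missing, a simple server-side check does the trick. This is just a sketch of my own (it is not part of the downloadable source); it assumes the same "PythonFolder" app setting that the server code uses later in this article:

using System.Configuration;
using System.IO;

public static class PrerequisiteCheck
{
    // Sketch: verify the Python 2.2 + Snack setup before trying to extract a pitch contour.
    public static void EnsurePythonAndSnack()
    {
        var pythonFolder = ConfigurationManager.AppSettings["PythonFolder"]; // e.g. C:\Python22
        var pythonExe = Path.Combine(pythonFolder, "python.exe");
        var snackFolder = Path.Combine(pythonFolder, "tcl");

        if (!File.Exists(pythonExe))
            throw new FileNotFoundException("Python not found; install Python 2.2 to " + pythonFolder, pythonExe);

        if (!Directory.Exists(snackFolder))
            throw new DirectoryNotFoundException("Snack not found; unzip snack2.2.zip into " + snackFolder);
    }
}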
The user interface is 100% Silverlight. It's clean and, I must admit, somewhat inspired by the Metro design. The buttons perform
very basic functions: moving to the previous and next exercises, playing the sample voice and the user's voice, and recording the user's voice.
As with many XAML projects, this one makes use of the MVVM (Model-View-ViewModel) pattern.
In short, there is no code-behind "click event" for the buttons, nor any "object.property = new value" instruction
(actually, there are a couple of event handlers, but only in situations where using MVVM appeared to be impossible). Instead, our
buttons use MVVM-style Commanding, and the properties of the other visual elements are bound to the properties of the underlying
ViewModel class.
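For readers unfamiliar with Commanding, here is a minimal sketch of the kind of ICommand helper such a ViewModel typically relies on (this is a generic example, not necessarily the exact implementation used in the project):

using System;
using System.Windows.Input;

// Minimal ICommand helper; many MVVM projects define something like this.
public class RelayCommand : ICommand
{
    private readonly Action execute;
    private readonly Func<bool> canExecute;

    public RelayCommand(Action execute, Func<bool> canExecute)
    {
        this.execute = execute;
        this.canExecute = canExecute;
    }

    public event EventHandler CanExecuteChanged;

    // The bound button is enabled only when the command can execute.
    public bool CanExecute(object parameter)
    {
        return canExecute == null || canExecute();
    }

    public void Execute(object parameter)
    {
        execute();
    }
}

In XAML, a button is then wired up with Command="{Binding PlaySampleVoiceCommand}" (a hypothetical command name) instead of a Click handler.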
For this application, I included 2 speeches, taken from the Free Sound website, so there is no copyright
problem with those voices. I included just 2 sample files, in order to enable the next/previous functionality while keeping the
.zip source code as small as possible. Those files are sample01.wav and sample02.wav, located in the PitchContour.Web\Files folder.
You can change or add further sample audio files if you want, but be warned that there are some conditions that must be met:
- Files must have the .wav extension.
- Files must be mono.
These requirements are imposed by the tools I've chosen for the app. If you are interested in adding files that do not meet these conditions,
or even recording your own voice, then you might be interested in installing Audacity, an
excellent free tool for recording/editing audio.
But how do we actually play the Sample Voice in our application? First, we have a standard MediaElement control directly
in our XAML code, dedicated to the task of playing the sample voice:
<MediaElement x:Name="sampleVoiceMediaElement" Width="450" Height="250" Stretch="Fill" AutoPlay="True"
Position="{Binding SampleVoiceMediaPosition, Mode=TwoWay}" MediaOpened="sampleVoiceMediaElement_MediaOpened"/>
In the above snippet, we can notice that the MediaElement's Position property is bound to the
SampleVoiceMediaPosition property of the underlying ViewModel class. Also, the MediaOpened
event is handled by the sampleVoiceMediaElement_MediaOpened function in the code-behind class.
Let's take a look at the MediaOpened event first. Until the media is opened, we don't know (and
don't have access to) the media duration (which is needed for calculating and positioning the
playing cursor), so we must read this value and store it in the ViewModel instance:
private void sampleVoiceMediaElement_MediaOpened(object sender, RoutedEventArgs e)
{
    viewModel.SampleVoiceDuration = this.sampleVoiceMediaElement.NaturalDuration;
}
Now that the duration is known, we're able to calculate the MediaElement's Position as a percentage of
this duration and then draw a green rectangle representing the progress of the player. But before that, we must first
store the result of the calculation in another property (SampleVoiceMediaBorderWidth) of the ViewModel
instance:
public TimeSpan SampleVoiceMediaPosition
{
    get
    {
        return sampleVoiceMediaPosition;
    }
    set
    {
        sampleVoiceMediaPosition = value;
        NotifyPropertyChanged("SampleVoiceMediaPosition");
        if (sampleVoiceDuration.HasTimeSpan)
        {
            if (sampleVoiceDuration.TimeSpan.TotalMilliseconds > 0)
            {
                var x = (double)(value.TotalMilliseconds / sampleVoiceDuration.TimeSpan.TotalMilliseconds)
                    * CANVAS_WIDTH;
                SampleVoiceMediaBorderWidth = x;
            }
        }
    }
}
Now that SampleVoiceMediaBorderWidth is updated, we just need to pass that value to the width
of the rectangle that represents the progress bar in our view. Fortunately, since we are using MVVM, there is already
a Border element (our cursor, actually) whose Width property is wired to the
SampleVoiceMediaBorderWidth property:
public double SampleVoiceMediaBorderWidth
{
    get
    {
        return sampleVoiceMediaBorderWidth;
    }
    set
    {
        sampleVoiceMediaBorderWidth = value;
        NotifyPropertyChanged("SampleVoiceMediaBorderWidth");
    }
}
<Border x:Name="brdSampleVoiceCursor" BorderBrush="DarkGreen"
        BorderThickness="1" Height="100" Width="{Binding SampleVoiceMediaBorderWidth, Mode=TwoWay}"
        HorizontalAlignment="Left" VerticalAlignment="Center">
    <Border.Background>
        <LinearGradientBrush StartPoint="0,0" EndPoint="0,1">
            <GradientStop Offset="0" Color="#fff"/>
            <GradientStop Offset="0.5" Color="#8f8"/>
            <GradientStop Offset="1" Color="#8f8"/>
        </LinearGradientBrush>
    </Border.Background>
</Border>
In short: as the MediaElement plays, the position value is passed to the ViewModel, the
rectangle width is calculated, and the result is passed back to the Border (cursor) element.
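One detail the snippets above don't show: the Position of a playing MediaElement is not pushed through the binding continuously by itself, so something must feed the current position into the ViewModel at regular intervals. A common approach (shown here as a sketch, not necessarily how the downloadable source does it) is a DispatcherTimer in the code-behind:

using System;
using System.Windows.Threading;

// Sketch: poll the MediaElement a few times per second and push its current
// position into the ViewModel, whose setter recalculates the cursor width.
var positionTimer = new DispatcherTimer { Interval = TimeSpan.FromMilliseconds(100) };
positionTimer.Tick += (s, e) =>
{
    viewModel.SampleVoiceMediaPosition = sampleVoiceMediaElement.Position;
};
positionTimer.Start();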
There are some solutions involving audio recording and Silverlight on the web. I particularly liked the one proposed on
Ondrej Svacina's blog. I must say
that for the audio recording part I simply copied his code, but in the end there are some noticeable differences between our interfaces:
- Ondrej's code allows downloading the audio file locally (my code just uploads it to the server).
- He included a pair of buttons to start/stop the recorder. Mine has a single on/off recording button.
- His interface shows an analog counter to track the recording progress (mine shows none).
You just need to click the recorder button to start recording your voice, and then click it once again to stop recording:
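Under the hood, the single-button behavior boils down to a toggling method on the ViewModel. Here is a minimal sketch (the actual start/stop recording calls come from Ondrej's code, so they appear here only as placeholder method names):

private bool isRecording;

// Sketch: one button alternates between starting and stopping the recorder.
public void ToggleRecording()
{
    if (!isRecording)
    {
        StartRecording();      // placeholder for Ondrej's start logic
    }
    else
    {
        StopRecording();       // placeholder for Ondrej's stop logic
        UploadRecordedVoice(); // placeholder: kicks off the upload described next
    }
    isRecording = !isRecording;
}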
Once the voice is recorded, the application starts uploading it to the server. For this functionality, I initially had no code
of my own, so I had to resort to someone who had already done it. That's why I chose Michael Washington's great
Silverlight Simple Drag And
Drop / Or Browse View Model / MVVM File Upload Control article. Although Michael's article had a very different goal
from mine, fortunately it provided me with the Silverlight and web server plumbing needed to perform the voice upload
functionality.
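Stripped of the MVVM plumbing, the upload itself is just a client-side POST of the recorded bytes to a server-side handler. Here is a heavily simplified sketch of the general pattern (UploadHandler.ashx is a hypothetical endpoint name; Michael's article has the complete version):

using System;
using System.Net;

// Sketch: stream the recorded WAV bytes from Silverlight to the web server.
public void UploadVoice(byte[] wavBytes, string fileName)
{
    var client = new WebClient();
    var uri = new Uri("UploadHandler.ashx?file=" + fileName, UriKind.Relative);
    client.OpenWriteCompleted += (s, e) =>
    {
        using (var stream = e.Result)
        {
            stream.Write(wavBytes, 0, wavBytes.Length);
        }
    };
    client.OpenWriteAsync(uri);
}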
As I stated at the beginning of the article, unfortunately I didn't manage to find or write managed code for extracting the
pitch contour from the .wav voice file. Nevertheless, I came up with a solution by using Snack
and a small Python program invoked via the command line. In their own words:
"The Snack Sound Toolkit is designed to be used with a scripting language such as Tcl/Tk or Python. Using Snack you can create
powerful multi-platform audio applications with just a few lines of code. Snack has commands for basic sound handling, such as playback,
recording, file and socket I/O. Snack also provides primitives for sound visualization, e.g. waveforms and spectrograms. It was developed
mainly to handle digital recordings of speech, but is just as useful for general audio. Snack has also successfully been applied to other
one-dimensional signals. The combination of Snack and a scripting language makes it possible to create sound tools and applications with
a minimum of effort. This is due to the rapid development nature of scripting languages. As a bonus you get an application that is
cross-platform from start. It is also easy to integrate Snack based applications with existing sound analysis software."
The Pitch Contour Extraction is done by a script written in Python:
from Tkinter import *
import tkSnack
import pickle

class Speech:
    def Analyze(self, inputFile, outputFile):
        root = Tk()
        tkSnack.initializeSnack(root)
        mySound = tkSnack.Sound(load=inputFile)
        f = open(outputFile, "w")
        data = mySound.pitch()
        pickle.dump(data, f)
        f.close()
        return

speech = Speech()
speech.Analyze('{source}', "{destination-pitch}")
The above script is quite simple: first, it imports the required libraries (Tkinter, tkSnack, pickle). Then a new instance of the
Speech class is created, and the Analyze function is called, passing the source (.wav) file and
the destination (.txt) file that will contain the list of pitch values.
This destination file will contain a list of values representing the pitch variation, that is, the variation in frequency. As
expected, male voices will have lower average values than female voices. These values will later be read by the application and
displayed over the wave form. This is how the resulting .txt pitch file looks (each value is preceded by an 'F' letter, which is how Python's pickle module marks a float):
(F0.0
F0.0
F0.0
F0.0
F0.0
F0.0
F0.0
F0.0
F0.0
F216.0
F214.0
F212.0
F213.0
F212.0
F210.0
F204.0
F206.0
F202.0
F196.0
F190.0
F178.0
F160.0
F0.0
F0.0
F0.0
F0.0
F0.0
F0.0
F0.0
F222.0
.
.
.
F0.0
tp0
.
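The nice thing about this format is that the server doesn't need a real pickle parser to read the file back: every pitch value sits on its own line, prefixed with 'F'. Here is a minimal sketch of how such a file can be read into a list of doubles (the parsing details are my own illustration, not code taken verbatim from the source):

using System.Collections.Generic;
using System.Globalization;
using System.IO;

public static class PitchFileReader
{
    // Sketch: extract the pitch values from the pickle-formatted .txt file.
    public static List<double> ReadPitchFile(string path)
    {
        var values = new List<double>();
        foreach (var rawLine in File.ReadAllLines(path))
        {
            // The opening "(" of the pickled tuple is glued to the first value.
            var line = rawLine.TrimStart('(');
            // Pickle (protocol 0) writes one float per line as "F<value>";
            // bookkeeping lines such as "tp0" and "." are simply skipped.
            if (line.StartsWith("F"))
                values.Add(double.Parse(line.Substring(1), CultureInfo.InvariantCulture));
        }
        return values;
    }
}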
But as mentioned before, the Python script is not called directly by the .Net application. Instead, we instantiate a Process
class and invoke the Start method, passing the Python executable and the generated script (into which the source .wav path and the destination .txt path have already been substituted):
public void GeneratePitchFile()
{
    var pythonFolder = ConfigurationManager.AppSettings["PythonFolder"];
    var extractPitchProgram = ConfigurationManager.AppSettings["ExtractPitchProgram"];
    var pythonExe = System.IO.Path.Combine(pythonFolder, "python.exe");
    var extractPitchDestinationPath = System.IO.Path.Combine(pythonFolder,
        string.Format(@"lib\{0}", extractPitchProgram));
    var pitchResultPath = filePath.Replace(".wav", ".txt");
    var waveResultPath = filePath.Replace(".wav", "-wave.txt");
    using (var sr = new StreamReader(Path.Combine(appFolder, "ExtractPitch.py")))
    {
        using (var sw = new StreamWriter(extractPitchDestinationPath, false))
        {
            var fileString = sr.ReadToEnd()
                .Replace("{source}", filePath.Replace(@"\", @"\\"))
                .Replace("{destination-pitch}", pitchResultPath.Replace(@"\", @"\\"))
                .Replace("{destination-wave}", waveResultPath.Replace(@"\", @"\\"));
            sw.Write(fileString);
        }
    }
    Process process = Process.Start(pythonExe, extractPitchDestinationPath);
    process.EnableRaisingEvents = true;
    process.Exited += (sender, args) =>
    {
        process.Close();
    };
}
One might argue that the Python code could have been ported to IronPython to work directly with .Net code. In fact,
I tried that, but it doesn't work, because the Snack library in turn relies on native C++ libraries. This is a deal
breaker, so unfortunately the port can't be made.
The wave form is extracted directly from the .wav file. I used the code provided by user pj4533 in the Show Wave Form
article:
public ObservableCollection<int> GetPoints(double canvasWidth, double canvasHeight)
{
    Read();
    var points = new ObservableCollection<int>();
    short val = m_Data[0];
    int prevX = 0;
    canvasHeight = CANVASHEIGHT;
    int prevY = (int)(((val + 32768) * canvasHeight) / 65536);
    // Sample every 16th value to keep the point count manageable.
    for (int i = 0; i < m_Data.NumSamples; i += 16)
    {
        val = m_Data[i];
        // Scale the 16-bit sample (-32768..32767) to the canvas height.
        int scaledVal = (int)(((-val - 32768) * canvasHeight) / 65536);
        points.Add(scaledVal);
        prevX = i;
        prevY = scaledVal;
        // For stereo files, skip over the second channel's sample.
        if (m_Fmt.Channels == 2)
            i++;
    }
    return points;
}
Notice that both the pitch contour and the wave form are extracted only after the audio file has been uploaded.
We have a Path element dedicated to the pitch contour. This could be accomplished with a Canvas element,
but with the Path element you can define points and it automatically draws lines between them:
<Path x:Name="pthPitchCurve" Height="100" Width="500" Stroke="#f00" StrokeThickness="2" Data="{Binding SampleVoicePitchData}"
HorizontalAlignment="Left" Stretch="None"></Path>
The Path element's Data property sets a Geometry that specifies the shape to be drawn. It follows the
Path Markup Syntax, which is too extensive to explain
here, but if Data has a value of "Mx0,y0 x1,y1 x2,y2 x3,y3 ... xn,yn", that means we have a geometry made up of a sequence
of line segments defined by the points {x0,y0}, {x1,y1}, {x2,y2}, {x3,y3}, ... {xn,yn}. For example, Data="M0,80 50,20 100,60"
moves to {0,80} and then draws two connected line segments. The Data property is bound to
the SampleVoicePitchData property on the ViewModel side. As we saw before, this value was extracted from the
.wav file and requested via a web service, so it is already available to our Silverlight application (and if it is not yet
available, the else branch below schedules a retry with a DispatcherTimer):
private string GeneratePitchData(ArrayOfInt pitchValues, int offset, double xAdjustFactor)
{
    var sb = new StringBuilder();
    if (pitchValues.Count() > 0)
    {
        double minPoint = pitchValues.Min();
        double maxPoint = pitchValues.Max();
        double absMaxPoint = Math.Abs(minPoint) > maxPoint ?
            Math.Abs(minPoint) : maxPoint;
        double xScale = (CANVAS_WIDTH / pitchValues.Count()) * xAdjustFactor;
        double yScale = CANVAS_HEIGHT / (maxPoint - minPoint);
        yScale = PITCHDATAYSCALE;
        var lastYValue = 0;
        var x = 0;
        foreach (var pitch in pitchValues)
        {
            var yValue = pitch;
            var y = LINEBASE - yValue;
            if (yValue > 0)
            {
                // A new voiced stretch begins: start a new figure with "M".
                if (lastYValue == 0)
                {
                    var pointM = string.Format("M{0},{1} ", (int)(offset + x * xScale),
                        (int)(y * yScale));
                    sb.Append(pointM);
                }
                var pointL = string.Format("{0},{1} ", (int)(offset + x * xScale),
                    (int)(y * yScale));
                sb.Append(pointL);
            }
            lastYValue = yValue;
            x++;
        }
    }
    else
    {
        // The pitch data may not be ready on the server yet; retry in one second.
        DispatcherTimer pitchDataTimer = new DispatcherTimer();
        pitchDataTimer.Interval = TimeSpan.FromMilliseconds(1000);
        pitchDataTimer.Tick += (s, e) =>
        {
            pitchDataTimer.Stop();
            DoGetSampleVoicePitchData(false);
        };
        pitchDataTimer.Start();
    }
    return sb.ToString();
}
As a result, the pitch contour is displayed in the Path element accordingly:
The wave form is displayed in quite a similar way: we have another Path element for it:
<Path x:Name="pthWave" Height="100" Width="500" Stroke="#aaa" Data="{Binding SampleVoiceWavePath}"
HorizontalAlignment="Left" VerticalAlignment="Center" Stretch="None"></Path>
Its Data property is bound to the SampleVoiceWavePath property on the ViewModel side.
As we saw before, this value was extracted from the .wav file and requested via a web service:
private string GenerateWavePath(ArrayOfInt points)
{
    double minPoint = points.Min();
    double maxPoint = points.Max();
    double middlePoint = (maxPoint - minPoint) / 2;
    double absMaxPoint = Math.Abs(minPoint) > maxPoint ? Math.Abs(minPoint) : maxPoint;
    double xScale = CANVAS_WIDTH / points.Count();
    double yScale = CANVAS_HEIGHT / (maxPoint - minPoint);
    var sbUserVoiceWavePath = new StringBuilder();
    var yWave = points[0];
    // Start the path at the vertical center of the canvas.
    sbUserVoiceWavePath.AppendFormat("M{0},{1} ", 0, (int)(CANVAS_HEIGHT / 2));
    for (var xWave = 1; xWave < points.Count(); xWave++)
    {
        yWave = (int)(points[xWave]);
        // Use "." as the decimal separator regardless of the current culture,
        // since the Path Markup Syntax requires it.
        var x = string.Format("{0:0.00}", xWave * xScale).Replace(",", ".");
        var y = string.Format("{0:0.00}", (yWave - minPoint) * yScale).Replace(",", ".");
        sbUserVoiceWavePath.AppendFormat("L{0},{1} ", x, y);
    }
    return sbUserVoiceWavePath.ToString();
}
And this is how the wave form will look:
Now that we have all the data (the pitch contours and wave forms of both the sample voice and the user voice), it's up to us to calculate the score. Assuming
the score ranges from a minimum of 0 points to a maximum of 100 points (meaning perfect pronunciation), we must define how to measure this scale.
As stated before, I have no background in audio analysis, so I invented a way of taking the pitch contour in its individual segments and calculating the
slope of each segment. That is, a segment can go up or down. The entire pitch contour of the sample speech will then have a collection of slopes, for
example "down-up-down-down-up-down-up", while the user speech will have another set of slopes, for example "down-down-up-down-down-up-up-down". We then
compare these sets against each other and provide a score from 0 to 100 points, where 0 means no matches and 100 means all segment slopes matched.
You can see which slopes are going down or up through the red and blue arrows in the image below:
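To make the idea concrete before diving into the real code: if each slope is reduced to an up/down direction, the heart of the comparison is just counting positional matches. Here is a deliberately simplified sketch of that idea (the real GenerateGrade below also penalizes differences in segment count and skips invalid segments):

using System;
using System.Collections.Generic;

public static class SlopeScoring
{
    // Sketch: score = percentage of positions where the sample slope and the
    // user slope point the same way (both up, or both down).
    public static int ScoreSlopes(IList<double> sampleSlopes, IList<double> userSlopes)
    {
        int comparable = Math.Min(sampleSlopes.Count, userSlopes.Count);
        if (comparable == 0)
            return 0;

        int matches = 0;
        for (int i = 0; i < comparable; i++)
        {
            if (Math.Sign(sampleSlopes[i]) == Math.Sign(userSlopes[i]))
                matches++;
        }
        return (int)(matches * 100.0 / comparable);
    }
}

For instance, "down-up-down" against "down-down-down" matches at the first and third positions, which would score 66 under this simplified rule.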
And below is the main code for calculating the grade from the pitch contour comparison:
private void GenerateGrade()
{
    var segmentSlopeScore = 0.0;
    var samplePitchValuesLength = GetLastX(this.sampleVoicePitchValues) -
        GetFirstX(this.sampleVoicePitchValues);
    var userPitchValuesLength = GetLastX(this.userVoicePitchValues) -
        GetFirstX(this.userVoicePitchValues);
    var pitchValuesLengthError = (double)Math.Abs(userPitchValuesLength -
        samplePitchValuesLength) / samplePitchValuesLength;
    var sampleSegments = GetPitchSegmentLengthList(this.sampleVoicePitchValues).Where(v => v > 0).ToList();
    var userSegments = GetPitchSegmentLengthList(this.userVoicePitchValues).Where(v => v > 0).ToList();
    var segmentIndex = 0;
    var validSegmentCount = 0;
    RemoveNaNSegments(userSlicedSlopes, userSegments);
    if (sampleSegments.Count() > userSegments.Count())
    {
        RemoveInconsistentSegments(sampleSlicedSlopes, userSlicedSlopes, sampleSegments, userSegments);
    }
    else if (userSegments.Count() > sampleSegments.Count())
    {
        RemoveInconsistentSegments(userSlicedSlopes, sampleSlicedSlopes, userSegments, sampleSegments);
    }
    foreach (var sampleSegment in sampleSegments)
    {
        if (sampleSegment > 0)
        {
            if (userSlicedSlopes.Count() > segmentIndex)
            {
                var currentSampleSlope = sampleSlicedSlopes[segmentIndex];
                var currentUserSlope = userSlicedSlopes[segmentIndex];
                if (!double.IsNaN(currentSampleSlope) && !double.IsNaN(currentUserSlope))
                {
                    if (CheckSlopes(currentSampleSlope, currentUserSlope))
                        segmentSlopeScore++;
                }
                segmentIndex++;
                validSegmentCount++;
            }
        }
    }
    var sampleSegmentCount = GetPitchSegmentLengthList(this.sampleVoicePitchValues).Count();
    var userSegmentCount = GetPitchSegmentLengthList(this.userVoicePitchValues).Count();
    var segmentCountError = (double)Math.Abs(userSegments.Count() - sampleSegments.Count())
        / sampleSegmentCount;
    Grade = (int)((segmentSlopeScore / validSegmentCount) * 100.0 * (1.0 - segmentCountError));
}
As we did before with the pitch contours and wave forms, the score is displayed by binding a visual element on the XAML side to a property
on the ViewModel class:
<TextBlock x:Name="txtGrade" Text="{Binding Grade}" Foreground="Green" FontSize="45" TextAlignment="Center" VerticalAlignment="Center">
</TextBlock>
The Grade property's getter and setter are defined like this:
public int Grade
{
    get
    {
        return grade;
    }
    set
    {
        grade = value;
        NotifyPropertyChanged("Grade");
    }
}
I hope you have enjoyed the article, and that it can be useful for you. As you can see, there is a lot of room for improvement,
and if you have something to say, please leave a comment below.
- 2012-04-29: Initial version.