Introduction
HeadTexter is a simple app that converts head movement into English text. It uses the Intel Perceptual Computing SDK for head tracking and was a contest entry in the recently concluded Perceptual Computing Challenge.
Background
Before you read the rest of the article, you should know the reason behind this madness of trying to convert head movement to text: it is research towards a communication aid for patients with Alzheimer's disease.
In its advanced stages, Alzheimer's disease can leave patients unable to move any part of their body other than the head. The head movement also becomes limited with time, and eventually the only way they can convey a message is through their eyes. So I originally wanted to build a system that converts eye movement to text. However, the limitations of the beta SDK in face tracking, and even more so in eye tracking, made me shift the design towards head tracking.
The interaction may not be entirely realistic for such patients, but the work had to start somewhere. So I decided to build the basic framework with head movement (which turned out to be more complicated than I thought it would be) and to move the work to eye tracking once the SDK issues are fixed. That should require only simple changes in the transform part and come easy.
My decision to write a tutorial on a head-tracking-based system rather than on hand gestures is easy to explain: it works with any normal web cam. Yes, that is right. All of you can install the SDK and get on with the programming without a Creative gesture camera. My motive in introducing the Intel Perceptual Computing SDK to the community (especially those crazy C#'ers) was to give everybody an opportunity and a simple walkthrough with the SDK which they can carry forward.
So what exactly are we learning in this tutorial?
a) Working with Perceptual Computing SDK
b) Head Tracking system with Perceptual Computing SDK
c) Doing something funny (or meaningful) with the tracked data.
d) Learning how to use a 175-year-old coding concept effectively.
Let's not waste any more digital bytes clarifying my motives or the article's objective, and start with what we do best: code.
Using the code
First, download the SDK from the Perceptual Computing SDK Download Page.
Starting with the SDK:
The SDK is mainly written in C++, as you might expect due to speed constraints, and what you get for C# is a managed DLL. So start a project and add a reference to libpxcclr.dll, located in the sdk/bin/x64 or sdk/bin/x86 folder. If you really are inclined towards a 64-bit application, then you must also change the project platform to x64. If you select the x86 DLL, don't forget to change the project platform to x86; "Any CPU" will not work. Unlike some other solutions, such as Microsoft's Ink technology, which works only on x86 and fails on x64, there are no such worries here: select x86 and it runs well on both architectures.
All of us who are well versed with OpenCV and the C++ style of coding use an infinite loop for acquiring data. However, as this had to be C#, I wanted the solution to be BackgroundWorker based. So instead of the for(;;) style of OpenCV, we capture frames in DoWork and perform the processing in ProgressChanged. The UI and SDK threads run on entirely different stacks, so we will use a delegate to update the UI with results from the SDK thread.
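To make that structure concrete, here is a minimal sketch of the wiring. The handler names, the worker settings and the 50 ms pacing are my own scaffolding, not taken from the project source:

using System.ComponentModel;   // BackgroundWorker

BackgroundWorker worker = new BackgroundWorker();

void SetupWorker()
{
    worker.WorkerReportsProgress = true;        // enables ReportProgress
    worker.WorkerSupportsCancellation = true;   // enables CancellationPending
    worker.DoWork += worker_DoWork;                     // SDK thread: grab frames
    worker.ProgressChanged += worker_ProgressChanged;   // UI thread: process and draw
    worker.RunWorkerAsync();
}

void worker_DoWork(object sender, DoWorkEventArgs e)
{
    while (!worker.CancellationPending)
    {
        // capture one frame here (shown below), then hand it to the UI thread
        worker.ReportProgress(0);
        System.Threading.Thread.Sleep(50);   // ~20 fps pacing
    }
}

void worker_ProgressChanged(object sender, ProgressChangedEventArgs e)
{
    // face detection, drawing and gesture translation happen here
}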
The first thing you have to do in any PerC work is create a session with the SDK, so you need an object of PXCMSession. Let us call it session.
Any operation with the SDK returns a status of type pxcmStatus.
Let us call that object sts. It carries the success and error codes and is essentially an enum. Once a session is created, you want to capture frames from the camera. The capture class is called UtilMCapture; let us have an object of it called capture. I want to mention some interesting facts about this capture class before moving ahead with the code. The SDK supports several types of data capture, including depth maps, RGB data, voice and more. You can find out which types of streams the SDK is capable of by checking the capture_viewer demo located in the bin folder. So once a session is created, you need a profile, or a pipeline, to get the correct stream.
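As a hedged sketch of those declarations (the CreateInstance factory and the UtilMCapture constructor are as I remember them from the SDK's bundled samples; verify them against your SDK drop):

PXCMSession session;
pxcmStatus sts = PXCMSession.CreateInstance(out session);   // create the SDK session
if (sts < pxcmStatus.PXCM_STATUS_NO_ERROR)
    return;   // no session, nothing more to do

UtilMCapture capture = new UtilMCapture(session);   // utility capture pipeline
PXCMBase fanalysis;   // will receive the face analysis module below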
With the session and capture in place, we create the face analysis implementation in the following style:
sts = session.CreateImpl(PXCMFaceAnalysis.CUID, out fanalysis);
This is the common pattern for creating any module in the SDK. The first argument must specify the algorithm for which you want to create the implementation; in our case it is PXCMFaceAnalysis, so we pass the CUID of PXCMFaceAnalysis as the first argument. The C++ implementation takes a pointer as input and puts the value through a reference; a similar approach is adopted for C#. The method returns a PXCMBase object through the out parameter. PXCMBase is the class from which all the Perceptual Computing algorithm classes, like PXCMFaceAnalysis and the gesture classes, are derived. So fanalysis is just an object of PXCMBase type.
If the module is created successfully (a more correct way of saying it: if your code gets ownership of the desired camera), then sts will be NO_ERROR. Now you need to dynamically cast the fanalysis object to a PXCMFaceAnalysis object.
It is simple.
fa = (PXCMFaceAnalysis)fanalysis.DynamicCast(PXCMFaceAnalysis.CUID);
Now be happy, because this approach is generic across the whole SDK: you can dynamically cast any PXCMBase object created through the session implementation to the respective analysis class just by passing the CUID of the desired class and, of course, using an explicit typecast.
The next step is building a ProfileInfo, which will hold the result of any query to the FaceAnalysis module.
PXCMFaceAnalysis.ProfileInfo pf = new PXCMFaceAnalysis.ProfileInfo();
fa.QueryProfile(0, out pf);
The meaning of '0' is still unclear to me; however, Intel specifies that multiple profiles may be supported in the future, and '0' represents just the default profile.
Now the resultant profile should be passed by reference to the FaceAnalysis module.
fa.SetProfile(ref pf);
FaceAnalysis is the basic module that in turn supports both face detection and face attributes. Detection deals with identifying the face location.
Attribute deals with extracting postures and traits from the face data, such as gender, smile, openness of the eyes, and so on.
Though this is not a serious requirement here, as we are not trying to detect facial expressions, I wanted to introduce the analysis part, which should be helpful if you want to build a real game on top of face tracking.
detection = (PXCMFaceAnalysis.Detection)fa.DynamicCast(PXCMFaceAnalysis.Detection.CUID);
face_attribute = (PXCMFaceAnalysis.Attribute)fa.DynamicCast(PXCMFaceAnalysis.Attribute.CUID);
Now our little program is ready both for detecting the face and eye locations and for getting facial attributes.
Just as you queried the profile for PXCMFaceAnalysis, you need to do the same for both the detection and attribute objects.
dinfo = new PXCMFaceAnalysis.Detection.ProfileInfo();
attribute_dinfo = new PXCMFaceAnalysis.Attribute.ProfileInfo();
// detection: query the default profile and apply it
detection.QueryProfile(0, out dinfo);
detection.SetProfile(ref dinfo);
// attributes: query and apply one profile per attribute label
face_attribute.QueryProfile(PXCMFaceAnalysis.Attribute.Label.LABEL_EMOTION, out attribute_dinfo);
face_attribute.SetProfile(PXCMFaceAnalysis.Attribute.Label.LABEL_EMOTION, ref attribute_dinfo);
face_attribute.QueryProfile(PXCMFaceAnalysis.Attribute.Label.LABEL_GENDER, out attribute_dinfo);
face_attribute.SetProfile(PXCMFaceAnalysis.Attribute.Label.LABEL_GENDER, ref attribute_dinfo);
face_attribute.QueryProfile(PXCMFaceAnalysis.Attribute.Label.LABEL_EYE_CLOSED, out attribute_dinfo);
face_attribute.SetProfile(PXCMFaceAnalysis.Attribute.Label.LABEL_EYE_CLOSED, ref attribute_dinfo);
There is one more module, called Landmark: two eye points, the nostril points, and a point on the lip constitute the face landmarks. Again, I am using it here mainly to introduce you to the landmark data.
landmark = (PXCMFaceAnalysis.Landmark)fa.DynamicCast(PXCMFaceAnalysis.Landmark.CUID);
landmark.QueryProfile(1, out lpi);
landmark.SetProfile(ref lpi);
Don't crack your head too much over the naming convention used here. I had only a month to try the several ideas I wanted to play with in the SDK, and the C# samples were not well constructed, so I really had to convert the C++ version of the code to C#. The names are almost as they appear in the SDK samples (I have tried to make them a little more meaningful), but that is all.
With every module in place and the camera started, we now unleash our capturing code. We run a BackgroundWorker asynchronously and capture frames in DoWork.
sts = capture.ReadStreamAsync(images, out sps[0]);
The first argument is an array of PXCMImage, declared as
PXCMImage[] images = new PXCMImage[1];
For face analysis the array length is always 1. You may wonder why the SDK expects the argument to be an array. Firstly, it tells the low-level code that only one data stream is expected (some SDK modules deliver a mix of RGB, depth map and so on), and secondly, the reference to the captured image can be passed directly into the array.
But you cannot use any stream returned from the SDK unless you synchronize the asynchronous operations. Well, to be honest, I have no deep idea what this is all about; I lifted the sentence almost verbatim from the SDK manual.
sts = fa.ProcessImageAsync(new PXCMImage[] { images[0] }, out sps[1]);
PXCMScheduler.SyncPoint.SynchronizeEx(sps);
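Putting those calls together, the body of DoWork looks roughly like the sketch below. The loop, the ReportProgress hand-off and the per-frame Sleep are my glue-code assumptions; the three SDK calls are exactly the ones shown above:

PXCMImage[] images = new PXCMImage[1];
PXCMScheduler.SyncPoint[] sps = new PXCMScheduler.SyncPoint[2];
while (!worker.CancellationPending)
{
    // grab a frame and run face analysis on it, both asynchronously
    sts = capture.ReadStreamAsync(images, out sps[0]);
    sts = fa.ProcessImageAsync(new PXCMImage[] { images[0] }, out sps[1]);
    // wait until both operations have completed before touching the data
    PXCMScheduler.SyncPoint.SynchronizeEx(sps);
    worker.ReportProgress(0);            // process and draw in ProgressChanged
    System.Threading.Thread.Sleep(50);   // the 50 ms cadence mentioned later
}

In the real application you would also release the sync points and the image on each iteration so SDK resources are not leaked.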
We also want to see our faces in a window while working with a web-cam-based project, don't we? It adds a fun factor and tells us the camera is functioning the way it should. What do we have in Windows Forms for displaying an image? A PictureBox. And what can it display? An Image or a Bitmap. But is our image really a bitmap? No, it is not. That is exactly why you need a conversion from PXCMImage to Bitmap.
images[0].QueryBitmap(session, out bmp);   // QueryBitmap allocates and fills the Bitmap for us
Yes, that is the Bitmap code, and we are back on familiar C# ground. But wait! A little SDK-based processing still remains. Remember, we need to detect the location of the head?
For display, and for detecting the face location, we move straight to the ProgressChanged method.
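Displaying the converted frame there is a one-liner (pictureBox1 is a hypothetical PictureBox name, not from the project source):

// ProgressChanged already runs on the UI thread, so direct assignment is safe
pictureBox1.Image = (Bitmap)bmp.Clone();   // clone so the worker can keep reusing bmp

The detection part of ProgressChanged then looks like this: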
PXCMFaceAnalysis.Detection.Data face_data;
detection.QueryData(fid, out face_data);
if (face_data.rectangle.h > 0)   // a face was found in this frame
{
    try
    {
        Bitmap localBitmap = (Bitmap)bmp.Clone();   // keep a copy for display
        Graphics g = Graphics.FromImage(bmp);
        this.Invoke((MethodInvoker)delegate
        {
            // draw the bounding rectangle of the detected face
            g.DrawRectangle(new Pen(new SolidBrush(Color.Red), 6),
                new Rectangle(new Point((int)face_data.rectangle.x, (int)face_data.rectangle.y),
                              new Size((int)face_data.rectangle.w, (int)face_data.rectangle.h)));
        });
        // remember the face position for the movement detection below
        x = (int)face_data.rectangle.x;
        y = (int)face_data.rectangle.y;
    }
    catch
    {
        // ignore occasional frames where the detection data is not usable
    }
}
face_data now holds the result of detection, which is essentially the rectangle bounding the face; we take the x and y coordinates of that rectangle.
Alright, now we want to store the x and y information in temporary variables and compare them with the last position to derive the direction of movement.
Before we take on that simple part, here is a quick look at Morse code. All of you must have heard of it: it is an encoding that assigns a sequence of '.' and '-' symbols to every letter and digit. What amazes me is that it was developed around 1836. It was the backbone of what was once the lifeline of quick communication, the telegraph. Sadly, it was used more in wartime communication than in computation.
The advantage of Morse code is that it is a variable-length code: a letter needs at most four symbols (a digit needs five). Like a binary stream, it uses a space, or rather the absence of a symbol, to mark the end of a character. In a fixed-length binary system, however, we would need about 8 bits to represent a character. If you wanted to say 'a' through head movement, that would be 8 head movements, and the product would then have to be named Headache rather than anything else; in Morse, 'a' is just '.-', two movements.
What we need to understand is that unlike a hand, which you can close, open, wave, or use to show one finger or all of them, the head has predominantly four movements: UP, DOWN, LEFT and RIGHT. Diagonal movements are too hard to perform deliberately with the head. Therefore we can afford only two movements for the texting symbols themselves; the other two have to be used for triggering a conversion and for deleting a misinterpreted symbol.
To work with the system effectively, we use a conversion routine from characters to Morse code (though it is not actually used here) and one from Morse code to characters. Then we assign a symbol to the UP and DOWN movements and trigger the conversion. This is pure lookup-table stuff.
#region Morse code related part
private Char[] Letters = new Char[] {'a', 'b', 'c', 'd', 'e', 'f', 'g',
'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u',
'v', 'w', 'x', 'y', 'z', '0', '1', '2', '3', '4', '5', '6', '7', '8',
'9', ' '};
private String[] MorseCode = new String[] {".-", "-...", "-.-.",
"-..", ".", "..-.", "--.", "....", "..", ".---", "-.-", ".-..",
"--", "-.", "---", ".--.", "--.-", ".-.", "...", "-", "..-",
"...-", ".--", "-..-", "-.--", "--..", "-----", ".----", "..---",
"...--", "....-", ".....", "-....", "--...", "---..", "----.", " "};
public String ConvertMorseToText(String text)
{
    // wrap every code in '@' delimiters so only whole codes are replaced
    text = "@" + text.Replace(" ", "@@") + "@";
    int index = -1;
    foreach (Char c in Letters)
    {
        index = Array.IndexOf(Letters, c);
        text = text.Replace("@" + MorseCode[index] + "@", "@" + c.ToString() + "@");
    }
    // "@@@@" marks a word gap (two consecutive spaces); then drop the delimiters
    return text.Replace("@@@@", " ").Replace("@", "");
}
public String ConvertTextToMorse(String text)
{
    text = text.ToLower();
    String result = "";
    int index = -1;
    for (int i = 0; i <= text.Length - 1; i++)
    {
        index = Array.IndexOf(Letters, text[i]);
        if (index != -1)
            result += MorseCode[index] + " ";   // separate codes with a space
    }
    return result;
}
string MorseSymbol2Gesture(char symbol)
{
    // maps each Morse symbol to the head movement that produces it
    switch (symbol)
    {
        case '.':
            return "UP ";
        case '-':
            return "DOWN ";
        case ' ':
            return "RIGHT";
        default:
            return "LEFT";
    }
}
#endregion
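As a quick sanity check of the lookup tables, the routines behave like this:

// ".... .." decodes to "hi"; encoding appends a separator space after each character
string decoded = ConvertMorseToText(".... ..");   // "hi"
string encoded = ConvertTextToMorse("sos");       // "... --- ... "
string gesture = MorseSymbol2Gesture('.');        // "UP "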
If you have worked with the hand gesture and tracking part of the Perceptual SDK, you get a sort of event-handler delegate that is triggered for any valid gesture, like THUMB_UP; you subscribe to the event through a local method. However, there are no such events for head movement, so we have to create that event ourselves from the raw tracking data.
Before you dare to think it is quite simple, I suggest you have a good, close look at the video. The SDK's face tracking is quite slow. Moreover, you cannot physically move your head down without also moving it a little left or right, and the same goes for the UP movement. We humans are not robots who can drop the face down and lift it on a simple axis; the shoulders try to balance the head, so you immediately get a sideways movement as well, sometimes as significant as the up or down movement itself. Another important aspect is that after every LEFT movement the head must come back to centre, triggering exactly the inverse movement.
This problem almost killed the application when it was nearly ready.
To solve this problem, we use state-based programming. Conceptually it is like a probabilistic model, such as a Hidden Markov Model, based on the probability of a certain movement given the movement associated with it. I first made a state diagram from the use case, which had five states, and figured out that UP and DOWN movements would be much rarer than sideways movements; users can perform LEFT and RIGHT far more comfortably than UP and DOWN. So we first check for vertical movement: if there is even a little, we translate it and skip the horizontal translation. If the vertical movement is not significant, we translate the horizontal movement. Once a particular movement is detected, we pause so the user can bring the head back to a neutral position. But a Thread.Sleep() would freeze the capture and lose intermediate frames, as we are in the ProgressChanged method. Hence, instead of Thread.Sleep, we use a plain counter variable. Once a movement is detected, we assign the variable a value, say 15. The main thread already sleeps 50 ms between captured frames, so on every subsequent entry into ProgressChanged we decrement the counter while it is non-zero, and normal processing resumes only after it is back to zero.
As we are not robots, the head cannot stay as steady as a candle; it will drift. Therefore we cannot fix a centre point: movement must be derived from the difference between the current position and the previous one.
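The code below also relies on a few fields that are implied but never declared in the article. Here is a plausible set of declarations matching the names that follow; treat them as my assumptions:

// Assumed fields; not shown in the original source
enum HeadState { NONE, UP, DOWN, LEFT, RIGHT }
HeadState head = HeadState.NONE;   // movement detected in the previous frame
int x, y;                          // face position in the current frame
int x1, y1;                        // face position in the previous frame
int nn = 0;                        // cooldown counter, decremented once per frame
float variation;                   // distance moved since the previous frame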
// distance moved since the previous frame
variation = (float)Math.Sqrt((double)((x - x1) * (x - x1) + (y - y1) * (y - y1)));
// a movement was handled last frame: reset the state and skip this frame
if (!head.Equals(HeadState.NONE))
{
    head = HeadState.NONE;
    x1 = x;
    y1 = y;
    return;
}
// cooldown: ignore frames while the head returns to its rest position
if (nn > 0)
    nn--;
if ((variation > 9) && (nn == 0))
{
    int xvar = x - x1;
    int yvar = y - y1;
    // vertical movement wins over horizontal; an exactly diagonal move is ignored
    if (Math.Abs(yvar) > Math.Abs(xvar))
    {
        if (Math.Abs(yvar) > 9)
            head = (yvar < 0) ? HeadState.UP : HeadState.DOWN;
    }
    else if (Math.Abs(xvar) > Math.Abs(yvar))
    {
        if (Math.Abs(xvar) > 9)
            head = (xvar < 0) ? HeadState.LEFT : HeadState.RIGHT;
    }
    this.Invoke((MethodInvoker)delegate
    {
        switch (head)
        {
            case HeadState.UP:      // '.' symbol
                txtCurrentSequence.Text += ".";
                nn = 8;
                break;
            case HeadState.DOWN:    // '-' symbol
                txtCurrentSequence.Text += "-";
                nn = 8;
                break;
            case HeadState.RIGHT:   // end of character: convert the sequence
                txtRecognition.Text += ConvertMorseToText(txtCurrentSequence.Text + " ");
                txtCurrentSequence.Text = "";
                nn = 8;
                break;
            case HeadState.LEFT:    // delete the last (misread) symbol
                if (txtCurrentSequence.Text.Length > 0)
                    txtCurrentSequence.Text =
                        txtCurrentSequence.Text.Substring(0, txtCurrentSequence.Text.Length - 1);
                nn = 8;
                break;
        }
    });
}
x1 = x;
y1 = y;
Wow! That is all you need. Give your creativity wings: put it in games, use it to create a password manager, and do all the nice things you can do with this code. I hope the SDK improves soon so that I can fulfil my goal of making the system suitable for real patients.
Points of Interest
I see a recent trend in contests and new-technology showcases: people are more inclined towards visuals. XNA, Unity and 3D rendering take the major share of apps built on any cool technology, be it Ultrabook or Perceptual Computing. My goal is not to create apps that suit such development, but to integrate these features into existing problems and try to solve crucial ones. Microsoft's recent unclear stance on technologies like Silverlight and WPF has made life even more awful. However, C# has a power that few current languages have. Actions must speak louder than words. So I invite my beloved community to explore the SDK and build amazing stuff. Let MS see that you cannot survive if you keep dumping your loyal hard-core programmers for kids' languages.