Overview
Over the past thirty years, speech recognition technology has made significant progress, moving from the lab to the market. It is becoming an important part of people's lives, found in our jobs, homes, cars, hospitals, and many other places, and it is counted among the top 10 emerging technologies in the world.
As a result of these years of development, the core algorithm of speech recognition technology has shifted from the GMM (Gaussian Mixture Model) and HMM-GMM (Hidden Markov Model-Gaussian Mixture Model) to the DNN (Deep Neural Network). A DNN works in a way loosely similar to the human brain: it is a very complicated model, with heavy calculation and a huge amount of data behind it. Thanks to the internet, we only need a smartphone and never have to think about the huge number of servers in remote data centers that make it all happen. Without the internet, though, the speech recognition service on your mobile device is nearly useless; only occasionally can it understand what you said.
Is it possible to move the DNN calculation process from the server to the mobile device itself? Phones? Tablets? The answer is YES.
With support for the SSSE3 instruction set on Intel CPUs, we can easily run a DNN-based speech recognition application without the internet. In our tests the accuracy is over 80%, very close to the results in online mode. Direct SSSE3 support thus creates a good user experience on mobile devices. In this article I will explain what a DNN is and how the Intel® SSSE3 instruction set helps to accelerate the DNN calculation process.
Introduction
DNN is the abbreviation for Deep Neural Network: a feed-forward network with many hidden layers. DNNs have been a hot spot in the field of machine learning in recent years and have produced a wide range of applications. A DNN has a deep structure with tens of millions of parameters to learn, so training one is very time consuming.
Speech recognition is a typical application of DNNs. To put it simply, a speech recognition application consists of an acoustic model, a language model, and a decoding process. The acoustic model models the probability distribution of pronunciations. The language model models the relationships between words. The decoding stage uses these two models to translate sound into text. A neural network can approximate any distribution, and a deep neural network has stronger expressive power than a shallow one: it imitates the deep structure of the human brain and can "understand" the characteristics of things more accurately. Compared with other methods, then, deep neural networks can model acoustics and language more accurately.
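To make the decoding step concrete: given acoustic features X, the decoder searches for the word sequence W that maximizes the combined score of the two models, following the standard Bayes decision rule for speech recognition (written here in the same plain notation used for the formulas below):

W* = argmax_W P(X|W) * P(W)

where P(X|W) is supplied by the acoustic model and P(W) by the language model.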
Figure 1. DNN Application Field
Typical DNN Chart
A typical DNN generally contains multiple alternating linear and non-linear layers, as shown below:
Figure 2. A DNN acoustic model with 4 hidden layers
In Figure 2, each linear layer is fully connected, and its mapping from input to output can be described by this formula:

Y^T = X^T * W^T + B

Here X^T is the row vector of inputs to the layer. In a speech recognition application, we generally compute 4 frames of data together, forming a 4xM input matrix. W^T and B are the layer's linear transformation matrix and offset (bias) vector; the matrix is usually of huge dimension, and square.
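To make the formula concrete, here is a minimal sketch of one such layer in plain C (the function name linear_layer and the dimension parameters n and m are ours, for illustration): each of the 4 batched frames is multiplied by the weight matrix and offset by the bias vector.

#include <stddef.h>

/* Minimal sketch of one linear (fully connected) layer:
 * Y = X * W + B for each of the 4 batched frames.
 * The dimensions n (inputs) and m (outputs) are illustrative. */
void linear_layer(const float *x,  /* 4 x n input matrix, row-major  */
                  const float *w,  /* n x m weight matrix, row-major */
                  const float *b,  /* m-element bias vector          */
                  float *y,        /* 4 x m output matrix, row-major */
                  size_t n, size_t m)
{
    for (size_t f = 0; f < 4; f++) {            /* frame in the batch */
        for (size_t j = 0; j < m; j++) {        /* output neuron      */
            float acc = b[j];
            for (size_t i = 0; i < n; i++)      /* input dimension    */
                acc += x[f * n + i] * w[i * m + j];
            y[f * m + j] = acc;
        }
    }
}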
Intel® SSSE3 Instruction Set
Supplemental Streaming SIMD Extensions 3, or SSSE3 for short, is Intel's extension of the SSE3 instruction set. The SSSE3 instruction set is part of the SIMD technology integrated into Intel CPUs, and it helps improve the ability to do multimedia processing, coding/decoding, and general calculation. Using the SSSE3 instruction set, we can process multiple data items with a single instruction in one clock cycle, greatly improving a program's efficiency. It is particularly effective for matrix calculations.
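Because not every x86 CPU supports SSSE3, it is worth checking at runtime before taking the optimized code path. Here is a minimal sketch using the __builtin_cpu_supports built-in available in GCC and Clang (other compilers expose the CPUID information differently):

#include <stdio.h>

int main(void)
{
    /* GCC/Clang built-in: queries CPUID for the SSSE3 feature flag. */
    if (__builtin_cpu_supports("ssse3"))
        printf("SSSE3 available: take the intrinsic path\n");
    else
        printf("SSSE3 not available: fall back to plain C\n");
    return 0;
}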
To use the SSSE3 instruction set, we should first declare and include the SIMD header files:
#include <mmintrin.h>  //MMX
#include <xmmintrin.h> //SSE(include mmintrin.h)
#include <emmintrin.h> //SSE2(include xmmintrin.h)
#include <pmmintrin.h> //SSE3(include emmintrin.h)
#include <tmmintrin.h> //SSSE3(include pmmintrin.h)
#include <smmintrin.h> //SSE4.1(include tmmintrin.h)
#include <nmmintrin.h> //SSE4.2(include smmintrin.h)
#include <wmmintrin.h> //AES(include nmmintrin.h)
#include <immintrin.h> //AVX(include wmmintrin.h)
#include <intrin.h>    //(include immintrin.h)
The header file "tmmintrin.h" is the one for SSSE3; the functions it declares are listed below:
extern __m128i _mm_hadd_epi16 (__m128i a, __m128i b);
extern __m128i _mm_hadd_epi32 (__m128i a, __m128i b);
extern __m128i _mm_hadds_epi16 (__m128i a, __m128i b);
extern __m64 _mm_hadd_pi16 (__m64 a, __m64 b);
extern __m64 _mm_hadd_pi32 (__m64 a, __m64 b);
extern __m64 _mm_hadds_pi16 (__m64 a, __m64 b);
extern __m128i _mm_hsub_epi16 (__m128i a, __m128i b);
extern __m128i _mm_hsub_epi32 (__m128i a, __m128i b);
extern __m128i _mm_hsubs_epi16 (__m128i a, __m128i b);
extern __m64 _mm_hsub_pi16 (__m64 a, __m64 b);
extern __m64 _mm_hsub_pi32 (__m64 a, __m64 b);
extern __m64 _mm_hsubs_pi16 (__m64 a, __m64 b);
extern __m128i _mm_maddubs_epi16 (__m128i a, __m128i b);
extern __m64 _mm_maddubs_pi16 (__m64 a, __m64 b);
extern __m128i _mm_mulhrs_epi16 (__m128i a, __m128i b);
extern __m64 _mm_mulhrs_pi16 (__m64 a, __m64 b);
extern __m128i _mm_shuffle_epi8 (__m128i a, __m128i b);
extern __m64 _mm_shuffle_pi8 (__m64 a, __m64 b);
extern __m128i _mm_sign_epi8 (__m128i a, __m128i b);
extern __m128i _mm_sign_epi16 (__m128i a, __m128i b);
extern __m128i _mm_sign_epi32 (__m128i a, __m128i b);
extern __m64 _mm_sign_pi8 (__m64 a, __m64 b);
extern __m64 _mm_sign_pi16 (__m64 a, __m64 b);
extern __m64 _mm_sign_pi32 (__m64 a, __m64 b);
extern __m128i _mm_alignr_epi8 (__m128i a, __m128i b, int n);
extern __m64 _mm_alignr_pi8 (__m64 a, __m64 b, int n);
extern __m128i _mm_abs_epi8 (__m128i a);
extern __m128i _mm_abs_epi16 (__m128i a);
extern __m128i _mm_abs_epi32 (__m128i a);
extern __m64 _mm_abs_pi8 (__m64 a);
extern __m64 _mm_abs_pi16 (__m64 a);
extern __m64 _mm_abs_pi32 (__m64 a);
The data structures __m64 and __m128 are defined in the MMX header file "mmintrin.h" and the SSE header file "xmmintrin.h", respectively.
__m64:
typedef union __declspec(intrin_type) _CRT_ALIGN(8) __m64
{
unsigned __int64 m64_u64;
float m64_f32[2];
__int8 m64_i8[8];
__int16 m64_i16[4];
__int32 m64_i32[2];
__int64 m64_i64;
unsigned __int8 m64_u8[8];
unsigned __int16 m64_u16[4];
unsigned __int32 m64_u32[2];
} __m64;
__m128:
typedef union __declspec(intrin_type) _CRT_ALIGN(16) __m128 {
float m128_f32[4];
unsigned __int64 m128_u64[2];
__int8 m128_i8[16];
__int16 m128_i16[8];
__int32 m128_i32[4];
__int64 m128_i64[2];
unsigned __int8 m128_u8[16];
unsigned __int16 m128_u16[8];
unsigned __int32 m128_u32[4];
} __m128;
Case study: using SSSE3 functions to accelerate DNN calculation
In this section, we take two functions as examples to describe how SSSE3 can be used to accelerate the DNN calculation process.
__m128i _mm_maddubs_epi16 (__m128i a, __m128i b) Saturating Multiply-Add Operation

This function is critical for the matrix calculations in a DNN. Parameter a is a 128-bit register holding 16 unsigned 8-bit integers; parameter b holds 16 signed 8-bit integers. The result holds 8 signed 16-bit integers, each the saturated sum of two adjacent products, which fits the needs of matrix calculation perfectly:
r0 := SATURATE_16((a0*b0) + (a1*b1))
r1 := SATURATE_16((a2*b2) + (a3*b3))
…
r7 := SATURATE_16((a14*b14) + (a15*b15))
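For reference, here is a scalar sketch of what this intrinsic computes (the helper names maddubs_scalar and saturate16 are ours, for illustration only):

#include <stdint.h>

/* Scalar model of _mm_maddubs_epi16: multiply unsigned a[i] by signed
 * b[i], add adjacent pairs of products, and saturate to 16 bits. */
static int16_t saturate16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

void maddubs_scalar(const uint8_t a[16], const int8_t b[16], int16_t r[8])
{
    for (int i = 0; i < 8; i++)
        r[i] = saturate16((int32_t)a[2 * i]     * b[2 * i] +
                          (int32_t)a[2 * i + 1] * b[2 * i + 1]);
}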
__m128i _mm_hadd_epi32 (__m128i a, __m128i b) Adjacent-Element Add Operation

This function can be called a pair-wise (horizontal) add. Parameters a and b are both 128-bit registers, each storing 4 signed 32-bit integers. Unlike the normal element-wise addition of two vectors, it adds the adjacent elements within each input vector:
r0 := a0 + a1
r1 := a2 + a3
r2 := b0 + b1
r3 := b2 + b3
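Again as a scalar sketch (the name hadd_scalar is ours):

#include <stdint.h>

/* Scalar model of _mm_hadd_epi32: add adjacent pairs within each input. */
void hadd_scalar(const int32_t a[4], const int32_t b[4], int32_t r[4])
{
    r[0] = a[0] + a[1];
    r[1] = a[2] + a[3];
    r[2] = b[0] + b[1];
    r[3] = b[2] + b[3];
}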
Now suppose we have the following vector calculation task from the DNN process:

Q: There are five vectors a1, b1, b2, b3, b4. Vector a1 holds 16 unsigned 8-bit integers (unsigned char), while b1, b2, b3, and b4 each hold 16 signed 8-bit integers (signed char). We need the inner products a1·b1, a1·b2, a1·b3, a1·b4, each stored in a signed 32-bit int.

If we implemented this in a straightforward way in plain C, the code would be as follows:
unsigned char a1[16];
signed char b1[16], b2[16], b3[16], b4[16];
int c[4] = {0, 0, 0, 0}, i;
// a1 and b1..b4 are assumed to be filled with input data
for (i = 0; i < 16; i++) {
    c[0] += (short)a1[i] * (short)b1[i];
    c[1] += (short)a1[i] * (short)b2[i];
    c[2] += (short)a1[i] * (short)b3[i];
    c[3] += (short)a1[i] * (short)b4[i];
}
Assuming one multiplication and one addition per clock cycle, this loop takes about 64 clock cycles (16 iterations, each with 4 multiply-accumulate operations).
Now let's implement it with the SSSE3 instruction set instead:
register __m128i a1, b1, b2, b3, b4, c, d1, d2, d3, d4;
// a1 and b1..b4 are assumed to already hold the five vectors,
// loaded beforehand (e.g., with _mm_loadu_si128).
d1 = _mm_maddubs_epi16(a1, b1); // 16 u8*s8 products, pair-summed to 8 x s16
// sign-extend the 8 x 16-bit partial sums to 32 bits and add the two halves
d1 = _mm_add_epi32(_mm_srai_epi32(_mm_unpacklo_epi16(d1, d1), 16),
                   _mm_srai_epi32(_mm_unpackhi_epi16(d1, d1), 16));
d2 = _mm_maddubs_epi16(a1, b2);
d2 = _mm_add_epi32(_mm_srai_epi32(_mm_unpacklo_epi16(d2, d2), 16),
                   _mm_srai_epi32(_mm_unpackhi_epi16(d2, d2), 16));
d3 = _mm_hadd_epi32(d1, d2);    // partial sums for a1·b1 and a1·b2
d1 = _mm_maddubs_epi16(a1, b3);
d1 = _mm_add_epi32(_mm_srai_epi32(_mm_unpacklo_epi16(d1, d1), 16),
                   _mm_srai_epi32(_mm_unpackhi_epi16(d1, d1), 16));
d2 = _mm_maddubs_epi16(a1, b4);
d2 = _mm_add_epi32(_mm_srai_epi32(_mm_unpacklo_epi16(d2, d2), 16),
                   _mm_srai_epi32(_mm_unpackhi_epi16(d2, d2), 16));
d4 = _mm_hadd_epi32(d1, d2);    // partial sums for a1·b3 and a1·b4
c = _mm_hadd_epi32(d3, d4);     // c = { a1·b1, a1·b2, a1·b3, a1·b4 }
The four 32-bit results are packed into the 128-bit register c. Taking the pipeline into consideration, this sequence may cost 12 or 13 clock cycles. So the potential results we could get from this task are:
| Implementation              | CPU Clock Cycles | Promotion |
|-----------------------------|------------------|-----------|
| Normal C Coding             | 64               | -         |
| Using SSSE3 Instruction Set | 13               | ~ 500%    |
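For completeness, here is a self-contained harness (a sketch; the test values are arbitrary and kept small enough that the saturating _mm_maddubs_epi16 cannot clip) that runs both versions and compares the results. Compile it with SSSE3 enabled, e.g. gcc -mssse3:

#include <stdio.h>
#include <string.h>
#include <tmmintrin.h>  /* SSSE3 */

int main(void)
{
    unsigned char a1[16];
    signed char   b[4][16];
    int c_ref[4] = {0, 0, 0, 0}, c_simd[4];

    /* Arbitrary small test values so no 16-bit saturation occurs. */
    for (int i = 0; i < 16; i++) {
        a1[i] = (unsigned char)(i + 1);
        for (int k = 0; k < 4; k++)
            b[k][i] = (signed char)(i - 8 + k);
    }

    /* Reference: plain C inner products. */
    for (int i = 0; i < 16; i++)
        for (int k = 0; k < 4; k++)
            c_ref[k] += (short)a1[i] * (short)b[k][i];

    /* SSSE3 version, as in the article. */
    __m128i va  = _mm_loadu_si128((const __m128i *)a1);
    __m128i vb1 = _mm_loadu_si128((const __m128i *)b[0]);
    __m128i vb2 = _mm_loadu_si128((const __m128i *)b[1]);
    __m128i vb3 = _mm_loadu_si128((const __m128i *)b[2]);
    __m128i vb4 = _mm_loadu_si128((const __m128i *)b[3]);

    __m128i d1 = _mm_maddubs_epi16(va, vb1);
    d1 = _mm_add_epi32(_mm_srai_epi32(_mm_unpacklo_epi16(d1, d1), 16),
                       _mm_srai_epi32(_mm_unpackhi_epi16(d1, d1), 16));
    __m128i d2 = _mm_maddubs_epi16(va, vb2);
    d2 = _mm_add_epi32(_mm_srai_epi32(_mm_unpacklo_epi16(d2, d2), 16),
                       _mm_srai_epi32(_mm_unpackhi_epi16(d2, d2), 16));
    __m128i d3 = _mm_hadd_epi32(d1, d2);

    d1 = _mm_maddubs_epi16(va, vb3);
    d1 = _mm_add_epi32(_mm_srai_epi32(_mm_unpacklo_epi16(d1, d1), 16),
                       _mm_srai_epi32(_mm_unpackhi_epi16(d1, d1), 16));
    d2 = _mm_maddubs_epi16(va, vb4);
    d2 = _mm_add_epi32(_mm_srai_epi32(_mm_unpacklo_epi16(d2, d2), 16),
                       _mm_srai_epi32(_mm_unpackhi_epi16(d2, d2), 16));
    __m128i d4 = _mm_hadd_epi32(d1, d2);

    __m128i vc = _mm_hadd_epi32(d3, d4);
    _mm_storeu_si128((__m128i *)c_simd, vc);

    for (int k = 0; k < 4; k++)
        printf("a1.b%d: C=%d SSSE3=%d\n", k + 1, c_ref[k], c_simd[k]);
    return memcmp(c_ref, c_simd, sizeof c_ref) != 0;
}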
As we know, there are many matrix calculations in the DNN-based speech recognition process; if we optimize each of them in our code like this, we can achieve much better performance on the IA platform than before. We have cooperated with the ISV Unisound, which provides speech recognition services in China. Unisound applied this approach to its DNN process and gained a performance improvement of over 10% compared with ARM devices.
Summary
DNN is becoming the dominant algorithm in speech recognition. It has been adopted by Google Now, Baidu Voice, Tencent WeChat, iFlytek Speech Service, Unisound Speech Service, and many others. At the same time, the SSSE3 instruction set can help optimize the speech recognition process; if all of these applications start using it, I believe speech services will give us a better experience and drive increased usage of the IA platform.
About the Author
Li Alven graduated from Huazhong University of Science and Technology in 2007, where he majored in Computer Science and Information Security. He joined Intel in 2013 as a senior application engineer on the Developer Relations Division Mobile Enabling Team. He focuses on differentiation and innovative enabling on the IA platform, speech recognition technology, performance tuning, and related areas.