|
Great, thanks to you.
You may be interested in another neural network application; see Sharky Neural Network.
This is free software for experimenting with neural network classification (for Windows XP/Vista).
You can watch the network's results during learning like a movie, in a live view.
You may also be interested in another CodeProject article: Neural Network Classifier[^]
Regards,
SharkTime.com
|
|
|
|
|
Hello!
I understand that your method depends on MNIST with Mr. Mike O'Neill's weights,
but I cannot get the accuracy that Mr. Mike O'Neill gets;
only about one in ten digits is recognized correctly.
Could you email me your recognition results on the MNIST test set?
Thanks!
Email: hanxiaoxue724@hotmail.com
|
|
|
|
|
I think it is possible to train the convolutional neural network in CUDA; it only needs more shared memory.
We can compute every delta value separately and finally sum them up in a kernel function, one kernel function per layer, or some other way. Does anyone else agree with me?
As for performance, I think CUDA is the better architecture for neural network computation, more like a brain than a PC.
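To make the idea concrete, here is a rough sketch of what a per-layer delta kernel could look like for a fully connected layer, one thread per neuron. This is only my illustration of the approach, not code from this project; the array names, the weight layout, and the tanh activation are assumptions.

__global__ void computeLayerDeltas(const float *deltasNext,   // deltas of the next layer
                                   const float *weightsNext,  // weights from this layer to the next
                                   const float *outputs,      // this layer's activations
                                   float *deltas,             // result: this layer's deltas
                                   int numNeurons, int numNeuronsNext)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= numNeurons) return;

    // Sum the error contributions propagated back from every neuron of the next layer.
    float sum = 0.0f;
    for (int k = 0; k < numNeuronsNext; k++)
        sum += deltasNext[k] * weightsNext[k * numNeurons + n];

    // Multiply by the derivative of tanh at this neuron's output.
    deltas[n] = sum * (1.0f - outputs[n] * outputs[n]);
}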
|
|
|
|
|
Where can I get Layer_1.neu, Layer_2.neu, ...?
I can't get it to work. I am using a 9600 GT and the SDK has been set up.
Would someone kindly tell me how I can get it to work?
My English is poor; I hope my words will be understood.
|
|
|
|
|
I get it.
Layer_1.neu, Layer_2.neu, ... are created by nn.exe.
But nn.exe can't be executed on my system.
I modified the main function in nn.cu and linked nn.cu, and then it worked, but I don't know why:
int main(int argc, char** argv)
{
    NeuralNetwork();
    //CUT_EXIT(argc, argv);
    return 0;
}
|
|
|
|
|
I hope it can be used for training the network.
|
|
|
|
|
You are missing the cudart.dll file. Download it; you can find it via Google.
|
|
|
|
|
Hi,
I downloaded NN.cu and NN_kernel.cu.
I modified NN.cu so that it prints out the output as follows:
for (int a = 0; a < 10; a++)
{
    printf("output[%d]=%f\n", a, Layer5_Neurons_CPU[a]);
    outputLayer[a] = (double)Layer5_Neurons_CPU[a];
}
output(outputLayer);
However, upon execution (I'm using an 8600GTS), I get the following results:
output[0]=nan
output[1]=nan
output[2]=nan
output[3]=nan
output[4]=nan
output[5]=nan
output[6]=nan
output[7]=nan
output[8]=nan
output[9]=nan
I'm not sure what the correct results are. Are they what the previous poster (AIgpu) got?
I noticed you provided what I assume to be intermediate outputs, e.g. layer_1.neu, layer_2.neu, layer_3.neu, layer_4.neu.
Would it be possible to upload your copies of those files so I can diff against what I got and debug them as necessary?
Thank you,
George
|
|
|
|
|
My bad, turns out I needed the files included with the GUI source code (e.g. lw1.wei to lw4.wei and in.neu). Now I get the same results as AIGPU.
I noticed that in your implementation you used at most 1250 threads. Since the 8800GTX can support over 12k and the GTX260 and GTX280 many more, would it be possible to further parallelize your implementation? Perhaps an easy way would be to recognize multiple digits/characters at once. I was wondering if you've done any further work on that front?
I would be very interested if you had such an implementation, since it would directly help with the research that I am doing.
|
|
|
|
|
Each neural node is handled by a single thread in our program, so for the digit recognition program I guess there is no need to use more threads, and I don't know how to.
Neural network computation is not a completely parallel procedure: you have to feed the result of the first layer into the second layer as its input.
If you can think of a way that finishes all the computation within only one layer, then maybe you can make use of more threads.
Or maybe you could try a more complex neural network, i.e. more layers and larger feature maps, but will the accuracy increase? I doubt it.
|
|
|
|
|
I would imagine one possible way to do it is to just replicate all the layers so that it works on multiple digits at once, perhaps something along the lines of the sketch below.
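Just to illustrate what I mean (this is only a rough sketch, not code from this project; the batched kernel name and the buffer layout are made up), the image index could go in the second grid dimension:

int numImages = 32;                  // hypothetical batch size

dim3 Layer1_Block(6, numImages);     // blockIdx.x = feature map, blockIdx.y = image
dim3 Layer1_Thread(13, 13);
executeFirstLayerBatched<<<Layer1_Block, Layer1_Thread>>>(
    Layer1_Neurons_GPU,              // numImages * 29 * 29 input pixels, one image after another
    Layer1_Weights_GPU,              // the weights are shared by all images
    Layer2_Neurons_GPU);             // numImages * 6 * 13 * 13 outputs

// Inside such a batched kernel, each thread would first offset its pointers by the image index:
//     in  = Layer1_Neurons_GPU + blockIdx.y * 29 * 29;
//     out = Layer2_Neurons_GPU + blockIdx.y * 6 * 13 * 13;
// and otherwise compute exactly what the single-image kernel computes.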
This leads to my next question:
I'm not quite sure how to interpret the output of the NN kernel.
I assume your input digit is specified by Layer1_Neurons_CPU[]:
float Layer1_Neurons_CPU[29*29]={
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,0,0,0,1,1,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,0,0,0,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1};
I assume this is a '2'. The outputs of the program (e.g. the output[] array) all seem to be close to either positive or negative one.
Does this have something to do with the 'certainty' of the program in recognizing the digit?
In that case, how can I know what the program actually thinks the number is?
|
|
|
|
|
The output is a set of 10 numbers; the index of the largest value indicates the result.
You should read the reference article that I mentioned at the top.
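In other words, something like this minimal snippet (my own illustration, reusing the outputLayer array from the post above) picks the recognized digit:

int recognizedDigit = 0;
for (int a = 1; a < 10; a++)
{
    // keep the index of the largest of the 10 output activations
    if (outputLayer[a] > outputLayer[recognizedDigit])
        recognizedDigit = a;
}
printf("recognized digit: %d\n", recognizedDigit);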
|
|
|
|
|
Oh, okay. Thanks.
|
|
|
|
|
I've tried to get your code up and running. I encountered some Visual Studio project problems...
Anyway... the code is running, but...
when I measure the processing time, I get something like 0.665652 ms for one test, which is 10 times your time. (The time was measured by uncommenting some lines in the code and looping the kernel calls 1000 times:
CUT_SAFE_CALL(cutCreateTimer(&timer));
CUT_SAFE_CALL(cutStartTimer(timer));
for (int i = 0; i < 1000; i++) {
    dim3 Layer1_Block(6,1);
    dim3 Layer1_Thread(13,13);
    executeFirstLayer<<<Layer1_Block,Layer1_Thread>>>(Layer1_Neurons_GPU,Layer1_Weights_GPU,Layer2_Neurons_GPU);
    dim3 Layer2_Block(50,1);
    dim3 Layer2_Thread(5,5);
    executeSecondLayer<<<Layer2_Block,Layer2_Thread>>>(Layer2_Neurons_GPU,Layer2_Weights_GPU,Layer3_Neurons_GPU);
    dim3 Layer3_Block(100,1);
    dim3 Layer3_Thread(1,1);
    executeThirdLayer<<<Layer3_Block,Layer3_Thread>>>(Layer3_Neurons_GPU,Layer3_Weights_GPU,Layer4_Neurons_GPU);
    dim3 Layer4_Block(10,1);
    dim3 Layer4_Thread(1,1);
    executeFourthLayer<<<Layer4_Block,Layer4_Thread>>>(Layer4_Neurons_GPU,Layer4_Weights_GPU,Layer5_Neurons_GPU);
    CUT_CHECK_ERROR("Kernel execution failed");
}
CUT_SAFE_CALL(cutStopTimer(timer));
totaltime = cutGetTimerValue(timer);
)
I have to mention that I'm using a GTX280 on a Core2Duo 2.4GHz.
Do you have any idea about my problem?
|
|
|
|
|
I forgot how I measured the running time.
Normally, I would loop the code for more than 1 second. I forgot why, but this is how people measure the running time in lots of projects that I have seen.
What problem did you encounter? Are you sure the code is running properly, i.e. recognizing the digits?
This project only works under VS2005.
|
|
|
|
|
I had to import your files (nn.cu and nn_kernel.cu) into the template project (which comes with the NVIDIA SDK 2.0 Beta). The problems were with the include files.
I'm using VS2005 Express Edition. I increased the number of loops to 10000, but I got the same result.
Yes, it seems to run properly since the output values are:
output[0] = -0.988
output[1] = -1.000
output[2] = 0.924
output[3] = -0.978
output[4] = -1.135
output[5] = -0.966
output[6] = -0.973
output[7] = -0.943
output[8] = -1.116
output[9] = -0.875
(The input digit is always the same; to be more precise, it is the 2 hardcoded in Layer1_Neurons_CPU.)
|
|
|
|
|
"i forgot how i measured the running time.
normally, i would loop the code for more than 1 sec. i forgot why, but this is how people measure the running time in lots of projects that i have seen.
what problem did you encounter? are you sure the code running properly, i.e recognize the digits?
this project only works under vs2005."
After more tests, I realized that if I measure the time without the loop (exactly your code) I obtain approximately your time (0.045ms). If I loop the test 10000 times (the code from the previous post) I get 0.65ms for one iteration, which is more than ten times your time.
So, is it possible that you measured the time with only one iteration, which gives incorrect results?
I also measured the EmuRelease time, and it is 23ms/image. This is useful for comparing my CPU speed with your Xeon CPU speed (40ms/image).
You state that the CPU implementation took 16ms/image. I assume that this is Mike's implementation, which is quite slow because of its extensive use of classes and vectors. Anyway, on my CPU Mike's version took 2.7ms/image (maybe your 16ms/image is in debug mode?). My CPU implementation, which uses C arrays instead of classes, took 1.26ms/image (including all IO operations). I have offered you all this data hoping that you can figure out what the problem is.
I want to implement the training algorithm on the GPU, and so far all our tests show that the CPU implementation is very fast even without optimizations. More exactly, our CPU implementation of testing (forward propagation) takes 12.6s on the 10000 images from the MNIST test file. If your GPU implementation takes 0.65ms on one image, then on all images from the test set it will take 6.5 seconds, plus additional time needed for IO operations and so on. Consequently, the GPU version (GTX 280) would be only about two times faster than the single-threaded CPU version on a Core2Duo 2.4GHz.
Considering all these aspects, can you please verify that your time was correctly measured (looping at least 1000 times)? I would appreciate it very much if you would take the time to redo the experiments, because I invested in the GTX280 and the ATI Radeon 4850, and on both of them the corresponding GPU implementations (CUDA and Brook) of the forward propagation algorithm are quite disappointing (the Brook version is even slower than the CPU version, but it has very few optimizations).
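For what it's worth, one likely explanation (my assumption, not something confirmed by the author) is that CUDA kernel launches are asynchronous, so stopping the timer right after a single launch measures mostly the launch overhead rather than the kernel itself. A minimal sketch of a more reliable measurement, using the same cutil timers already used in this thread and synchronizing the device before stopping the timer:

unsigned int timer = 0;
CUT_SAFE_CALL(cutCreateTimer(&timer));
CUT_SAFE_CALL(cutStartTimer(timer));
for (int i = 0; i < 1000; i++)
{
    // ... the four layer kernel launches from the loop above ...
}
cudaThreadSynchronize();                       // wait until all queued kernels have finished
CUT_SAFE_CALL(cutStopTimer(timer));
float msPerImage = cutGetTimerValue(timer) / 1000.0f;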
|
|
|
|
|
Maybe you are right; the way I measured the running time was not right.
Thank you.
|
|
|
|
|
Hi, what version of Visual Studio are you using? I can't open the project.
Do you happen to have a link for the emulator?
Also, is the entire project done in C#, with calls from C# to the GPU? So you are able to call CUDA functions and the CPU from C#?
|
|
|
|
|
I was using Visual Studio 2005 for CUDA.
The core was done in C++; I didn't call CUDA from C#.
The UI, however, is in C#, because we were too lazy to write a UI in C++.
|
|
|
|
|
Hi,
Will you post the code for the NN CUDA part? I'd like to make a new system that doesn't depend
on the MNIST data and the 29x29 character array, pretty please?
|
|
|
|
|
The presented CUDA code is just for evaluating the output value of the neural network. It is not general NN code, and the training procedure is not implemented. If your input data is not 29x29, you just need to modify the executeFirstLayer function in matrixMul_kernel.cu.
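To give an idea of what would need to change, here is a sketch of what such a first-layer kernel could look like, assuming the architecture of the referenced article (6 feature maps of 13x13 neurons, each a 5x5 convolution with stride 2 plus a bias, tanh activation) launched as <<<dim3(6,1), dim3(13,13)>>>. This is my own reconstruction, not the author's actual kernel, and the weight layout is an assumption; for a different input size, the 29, 13, and 5 constants and the launch configuration would change accordingly.

__global__ void executeFirstLayerSketch(float *Layer1_Neurons_GPU,
                                        float *Layer1_Weights_GPU,
                                        float *Layer2_Neurons_GPU)
{
    int map = blockIdx.x;          // feature map index, 0..5
    int x   = threadIdx.x;         // neuron column, 0..12
    int y   = threadIdx.y;         // neuron row, 0..12

    // Each feature map shares 26 weights: 1 bias + 25 convolution kernel weights.
    float *w = Layer1_Weights_GPU + map * 26;
    float sum = w[0];              // bias

    // 5x5 receptive field over the 29x29 input image, stride 2.
    for (int ky = 0; ky < 5; ky++)
        for (int kx = 0; kx < 5; kx++)
            sum += w[1 + ky * 5 + kx] *
                   Layer1_Neurons_GPU[(2 * y + ky) * 29 + (2 * x + kx)];

    Layer2_Neurons_GPU[(map * 13 + y) * 13 + x] = tanhf(sum);
}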
|
|
|
|
|
sorry, i don't quite understand your question.
were you asking for the NN CUDA source code which should be downloadable at the top?
|
|
|
|
|
Yes, sorry, now I realize the CUDA source was there in the kernel file. Thank you.
I will be designing a NN for machine-printed data, but I need to modify the array size.
I will use Mike's code to train. Thanks for a great and groundbreaking piece of software.
|
|
|
|
|
Good article... What was the speed-up during training?
Mike O'Neill wrote that for him, it took 40 minutes per epoch, which I believe included a few optimizations. Thus a training session of 25 epochs took 16.7 hours. Was your CUDA version faster? Did it include any of Mike's optimizations?
-----
Update: Ah, never mind. I just read that this implementation doesn't include training. It uses Mike's code to train, then transfers the weights via files. It would have been great to see a significant speed-up achieved for training.
|
|
|
|
|