My previous article on this blog discussed measuring image similarity with BoF in a large database. It was extracted from the forum of an article posted on CodeProject, "Bag-of-Features Descriptor on SIFT Features with OpenCV (BoF-SIFT)". This article is also extracted from the comments section of the same CodeProject article. As I described in my previous article, many people who use visual features do not have a proper understanding of the feature extraction and description algorithms, because these algorithms involve a lot of mathematical procedures that are difficult to follow with an average mathematical background. The question discussed in this article illustrates this point, and such a lack of understanding may cause users to limit their use of these features in their studies and applications.
Let's begin the discussion.
Q. I just wanted to ask why the minHessian value is 400, the number of octaves is 4, and the number of octave layers is 2. What would be the effect if I change these values? I'm just starting to learn about this and it is quite confusing. Also, how do you determine how many bags there should be? Why did you choose 200 for your code? I'm trying to extract the SURF features for more than 50 images, cluster them so I only have 1 matrix for each image (did I understand it correctly?), and then use the data to train SVM using Weka.
A. First of all, it would be really useful to read the original papers on SIFT (by Lowe) and on SURF.
For your first question, the SURF features are detected by thresholding the determinant of the Hessian matrix of unit patches. In simple words, we first calculate the determinant of the Hessian for every patch in the image and then threshold it to find the robust feature points. The minHessian value controls this threshold: if you increase it, you get fewer feature points, and if you decrease it, you get more. One of the most important properties of a feature is its repeatability (the tendency of the same feature to be detected again in another image of the same scene taken from a different camera angle). If you set the threshold too low, you get a lot of weak feature points with poor repeatability. If you set it too high, there will not be enough features to describe the image. You can keep 400 for minHessian, as it gives enough feature points for natural images. In special cases, such as the medical domain, you need to fine-tune this value experimentally.
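To make the effect of the threshold concrete, here is a minimal sketch in plain Python. It is not SURF itself: it approximates the Hessian with simple finite differences instead of SURF's box filters, and the toy image, the function names hessian_determinant and count_keypoints, and the threshold values are all made up for illustration.

```python
def hessian_determinant(img):
    """Determinant of the Hessian at each interior pixel (finite differences).

    Assumption: this is a simplified stand-in for SURF's box-filter
    approximation, used only to show how thresholding behaves.
    """
    h, w = len(img), len(img[0])
    det = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            dxx = img[y][x + 1] - 2 * img[y][x] + img[y][x - 1]
            dyy = img[y + 1][x] - 2 * img[y][x] + img[y - 1][x]
            dxy = (img[y + 1][x + 1] - img[y + 1][x - 1]
                   - img[y - 1][x + 1] + img[y - 1][x - 1]) / 4.0
            det[y][x] = dxx * dyy - dxy * dxy
    return det


def count_keypoints(det, min_hessian):
    """Count pixels whose response is strictly above the threshold."""
    return sum(1 for row in det for value in row if value > min_hessian)


# Toy 9x9 image: one strong blob and one weak blob on a flat background.
image = [[0.0] * 9 for _ in range(9)]
image[2][2] = 30.0   # strong blob, response 4 * 30^2 = 3600
image[6][6] = 10.0   # weak blob, response 4 * 10^2 = 400

det = hessian_determinant(image)
for threshold in (100, 1000, 5000):
    print(threshold, count_keypoints(det, threshold))
# prints:
# 100 2
# 1000 1
# 5000 0
```

Raising the threshold from 100 to 1000 drops the weak point and keeps the strong one. At 5000 nothing survives, which is the "not enough features" situation described above.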
For the second question, an octave represents a series of filter response maps obtained by convolving the same input image with filters of increasing size. Unlike other algorithms such as SIFT, in SURF we don't need to rescale the image to detect features of different sizes; instead, we use filters of different sizes. If we say 4 octaves and 2 octave layers, it means:
- First, we filter the image with a filter of size 9x9 and then 15x15 (these are the two octave layers of the first octave)
- Second, we filter the image with a filter of size 15x15 and then 27x27 (these are the two octave layers of the second octave)
- Third, we filter the image with a filter of size 27x27 and then 51x51 (these are the two octave layers of the third octave)
- Finally, we filter the image with a filter of size 51x51 and then 99x99 (these are the two octave layers of the fourth octave)
You can see that the step between filter sizes doubles from one octave to the next:
9 + (6*1) = 15
15 + (6*2) = 27
27 + (6*4) = 51
51 + (6*8) = 99
The value 6 is chosen because it guarantees that the filter size stays odd, so the filter always has a central pixel.
Finally, the algorithm selects features from the 2 x 4 = 8 response maps.
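The size progression can be reproduced with a few lines of Python. This is only a sketch of the scheme described in this article (start at 9, step 6 doubling per octave, each octave starting at the second filter of the previous one). The function name surf_filter_sizes and its parameters are hypothetical, and a real library implementation may compute its sizes differently.

```python
def surf_filter_sizes(n_octaves, n_layers, base=9, step=6):
    """Filter sizes per octave, following the scheme described in the text.

    Assumption: each octave starts at the second filter size of the
    previous octave, and the step doubles with every octave.
    """
    sizes = []
    start = base
    for octave in range(n_octaves):
        increment = step * (2 ** octave)          # 6, 12, 24, 48, ...
        sizes.append([start + increment * layer for layer in range(n_layers)])
        start += increment                        # second filter of this octave
    return sizes


for octave, layer_sizes in enumerate(surf_filter_sizes(4, 2), start=1):
    print(octave, layer_sizes)
# prints:
# 1 [9, 15]
# 2 [15, 27]
# 3 [27, 51]
# 4 [51, 99]
```

Every size is odd because the base is odd and the step is even, which matches the remark about the filter always having a center.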
Increasing the number of octaves gives you the ability to detect both smaller and larger features in the image. Increasing the number of octave layers gives you the ability to detect features of many different sizes in the range between the smallest and the largest. For example, assume that your image contains a cat, an elephant, a human and a pig. The following table shows which of them we detect with different values of the parameters.
Octaves | Octave Layers | Who is detected
1 | 1 | cat
2 | 1 | cat, pig
1 | 2 | cat, pig
2 | 2 | cat, pig, human
3 | 1 | cat, pig, human
3 | 2 | cat, pig, human, elephant
The downside is that more octaves increase the running time of the algorithm.
The number of bags should be determined experimentally. There is a publication reporting that 200 bags performed well. If you are doing research, you have to find the best number of bags by assessing the retrieval performance while varying the number of bags.
For the third question, it will be easier if you push all the features into a single Mat object, because you can then use the OpenCV function directly to cluster them. Otherwise, you have to cluster them manually and find the cluster centers to use as the vocabulary.
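The idea of pooling all features, clustering them, and then describing each image against the resulting vocabulary can be sketched in plain Python. This is a toy illustration under several assumptions: the descriptors are made-up 2-D points rather than real SURF descriptors, the kmeans function is a naive hand-written one (in practice the OpenCV clustering function would take its place), and all names here are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centres, point):
    """Index of the centre closest to the point."""
    return min(range(len(centres)),
               key=lambda i: squared_distance(centres[i], point))


def kmeans(points, k, iterations=10):
    """Naive k-means with a deterministic start (the first k points)."""
    centres = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centres, p)].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old centre if a cluster is empty
                centres[i] = [sum(c) / len(group) for c in zip(*group)]
    return centres


def bof_histogram(descriptors, vocabulary):
    """Count how many descriptors of one image fall into each bag."""
    hist = [0] * len(vocabulary)
    for d in descriptors:
        hist[nearest(vocabulary, d)] += 1
    return hist


# Made-up descriptors for two images (real SURF descriptors are much longer).
image_a = [[0, 0], [0, 1], [10, 10]]
image_b = [[10, 11], [11, 10], [1, 0]]

# Step 1: push all descriptors into one collection and cluster them.
all_descriptors = image_a + image_b
vocabulary = kmeans(all_descriptors, k=2)

# Step 2: describe each image by one histogram over the vocabulary.
print(bof_histogram(image_a, vocabulary))
print(bof_histogram(image_b, vocabulary))
```

Note that the clustering runs once over the descriptors of all images together, and each image then gets a single fixed-length histogram. Those histograms are the per-image rows that can be handed to a classifier such as an SVM.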