I'm trying to play around with some music clustering algorithms, and I thought that a feature vector consisting of basically a discretized FFT (i.e. binned frequencies) would make a good similarity measure. Would this even be useful? Does anyone know what some good audio similarity measures might be?
First of all, you need to decide whether you want fingerprinting (i.e. identity except for some distortion) or similarity (but not identity!) measures.
Also have a look at MFCCs, Bark scales and so on. There is plenty of literature out there. Go to Amazon and grab a dedicated book on this topic.
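As a rough illustration of the MFCC route, here is a minimal sketch in Python using librosa and SciPy; the file names, the number of coefficients and the choice of cosine distance are assumptions for the example, not a prescription:

```python
import librosa
from scipy.spatial.distance import cosine

def mfcc_signature(path, n_mfcc=13):
    # Load the track and summarise it as the time-average of its MFCCs,
    # giving one fixed-length vector per track.
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

# Smaller cosine distance = more similar (a very coarse notion of similarity).
a = mfcc_signature("track_a.wav")
b = mfcc_signature("track_b.wav")
print("cosine distance:", cosine(a, b))
```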
You can use a hierarchical structure such as a kd-tree or a Hilbert curve before you discretize. Such a structure reduces the dimensional complexity and changes the order of the input, whereas an FFT just transforms it into waves.
In my thesis project I need to implement the Monte Carlo Localisation algorithm (it's based on Markov Localisation). I have exactly one month to understand and implement the algorithm. I understand the basics of probability and Bayes' theorem. Which topics should I get familiar with to understand the Markov algorithm? I have read a couple of research papers 3-4 times and I still fail to understand everything.
I tried Googling whichever terms I didn't understand, but I couldn't get the essence of the algorithm. I want to understand it systematically. I know what it does, but I don't fully understand how or why it does it.
For example, in one of the research papers it was written that the Markov algorithm can be used for global indoor positioning, or when you have a multi-modal Gaussian distribution, whereas the Kalman filter cannot be used in those cases. I completely failed to understand this.
Second example: the Markov algorithm assumes the map is static and makes the Markov assumption that measurements are independent and don't depend on previous measurements. But when the environment is dynamic (objects are moving), the Markov assumption is not valid and we need to modify the Markov algorithm to handle the dynamic environment. I don't understand why.
It would be great if someone could point out which topics I should learn to understand the algorithm. Please keep in mind that I have only one month.
A particle filter is what you are looking for to localize a robot.
To implement a particle filter, you need an understanding of basic probability (mostly Bayes' theorem) and of Gaussian distributions in 2D.
Watch these course slides and videos, which are really good.
For example, in one of the research papers it was written that the Markov algorithm can be used for global indoor positioning, or when you have a multi-modal Gaussian distribution, whereas the Kalman filter cannot be used in those cases. I completely failed to understand this.
The Kalman filter, or extended Kalman filter, works with a unimodal distribution, and the initial estimate must also be good enough for it to track.
A particle filter can handle multi-modal distributions and doesn't need an initial guess, but it needs more particles (samples) to converge to a better estimate.
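To make the predict-update-resample loop concrete, here is a minimal 1D particle filter sketch in Python; the robot-on-a-line setup, the landmark position and the noise levels are all invented for illustration and are not taken from the papers mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)
n_particles = 1000
landmark = 10.0                 # known landmark position (made up)
motion_std, meas_std = 0.2, 0.5

# No initial guess needed: start with particles spread uniformly over the map.
particles = rng.uniform(0.0, 20.0, n_particles)

def step(particles, control, measurement):
    # 1. Predict: move every particle by the control input plus motion noise.
    particles = particles + control + rng.normal(0.0, motion_std, n_particles)
    # 2. Update: weight particles by how well they explain the range measurement
    #    (this is the Bayes' theorem part).
    expected = np.abs(landmark - particles)
    weights = np.exp(-0.5 * ((measurement - expected) / meas_std) ** 2)
    weights /= weights.sum()
    # 3. Resample: keep particles in proportion to their weights.
    idx = rng.choice(n_particles, n_particles, p=weights)
    return particles[idx]

# One step: the robot moved +1.0 and measured a range of 7.5 to the landmark.
particles = step(particles, control=1.0, measurement=7.5)
print("position estimate:", particles.mean())
```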
Second example: the Markov algorithm assumes the map is static and makes the Markov assumption that measurements are independent and don't depend on previous measurements. But when the environment is dynamic (objects are moving), the Markov assumption is not valid and we need to modify the Markov algorithm to handle the dynamic environment. I don't understand why.
If the objects are humans, it is not difficult to localize even in a dynamic environment (unless the robot is completely surrounded by humans and cannot see any part of the environment). A simple modification is to consider only the laser rays that are in agreement with the map. The paper below explains this.
Check this paper: Markov Localization for Mobile Robots in Dynamic Environments.
I am currently looking into hand-pose estimation in Unity without using any expensive plugins! At the moment, I have implemented a simple hand-tracking system by extracting the contours of the hand, like the link below:
https://www.youtube.com/watch?v=4QE5FcUK5ZA
However, it doesn't work brilliantly in all environments and tends not to recognise the hand when other objects are in the frame (like a face!). Does anyone have any more complex algorithms for hand-pose estimation? I've looked at using neural nets, but they tend to use a lot of CPU and/or GPU power, and I need this to be lightweight and not lag in Unity.
Anyone have any suggestions?
A multi-layered random forest is a good lightweight method for real-time hand pose estimation: https://ieeexplore.ieee.org/document/7789644/.
It uses an ensemble of regressors that are specialised on different areas of the angle space, and the first layer learns how to weight the output of each of these specialised regressors.
It achieves state-of-the-art results on hand pose estimation and has been used by the author in real-time AR applications.
The model uses contour features like the ones you have extracted.
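As a toy sketch of the layered idea only (not the paper's actual method), one way to wire it up is a first-layer forest that picks a coarse pose bin and per-bin regressors for the joint angles; every array, bin boundary and forest size below is a made-up placeholder:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Hypothetical data: 32 contour features per frame, 15 joint angles to regress.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 32))
y_angles = rng.uniform(-1, 1, size=(500, 15))
bins = np.digitize(y_angles[:, 0], [-0.33, 0.33])   # 3 coarse bins on one angle

# First layer: a forest that decides which specialised expert to trust.
gate = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, bins)
# Second layer: one regressor specialised per region of the angle space.
experts = {b: RandomForestRegressor(n_estimators=50, random_state=0)
              .fit(X[bins == b], y_angles[bins == b])
           for b in np.unique(bins)}

def predict(x):
    b = gate.predict(x.reshape(1, -1))[0]            # pick the expert
    return experts[b].predict(x.reshape(1, -1))[0]   # regress the 15 angles

print(predict(X[0]).shape)   # -> (15,)
```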
I am in a serious bind right now: I need to compare images of flowers (carnations) using a genetic algorithm, and the program must determine which variety the flower belongs to (so far I am using 15 different varieties). The trouble is that I am having difficulty constructing the chromosome. Right now I am only analysing the HSV of each image: I take every channel and calculate its mean (n=255), and after that I calculate the correlations between H-S, H-V and S-V.
I expected that the means would be enough to place any new flower next to the cluster of flowers of the variety it belongs to (by the way, I have a database of all the flowers used for training), by calculating the distance between the flower's mean and the centroid of each cluster, and probably using the correlations for adjustment. But that distance is usually much smaller to a different variety than to the one it should belong to.
Is there a way to classify these flowers using ONLY colours (I've read about applications that use texture, but that's way out of my league), especially using a genetic algorithm? (I know neural networks are more appropriate for this kind of analysis, but that's what the teacher asked for.) Thank you very much. By the way, I am working with OpenCV; I don't know if that's relevant.
PS: Excuse my English if I made any mistakes; it's not my native language.
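For reference, a bare-bones version of the colour-only pipeline described above (per-channel HSV means plus a nearest-centroid lookup) might look like the sketch below; the file names and centroid values are placeholders, and note that OpenCV's hue channel wraps around at 180, so a plain mean of H can distort the distances:

```python
import cv2
import numpy as np

def hsv_mean_features(path):
    # Mean of each HSV channel over the whole image (beware: H is circular,
    # so averaging it naively can be misleading for red-ish flowers).
    img = cv2.imread(path)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    return hsv.reshape(-1, 3).mean(axis=0)

# One centroid per variety, computed from the training database (values made up).
centroids = {
    "variety_a": np.array([20.0, 150.0, 180.0]),
    "variety_b": np.array([35.0, 120.0, 200.0]),
}

def classify(path):
    f = hsv_mean_features(path)
    return min(centroids, key=lambda v: np.linalg.norm(f - centroids[v]))

# print(classify("carnation_001.jpg"))
```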
I am trying out vlfeat, got a huge number of features from an image database, and I am testing against the ground truth for mean average precision (MAp). Overall, I got roughly 40%. I see that some papers got a higher MAp while using techniques very similar to mine: the standard bag of words.
I am currently looking for a way to obtain a higher MAp with the standard bag-of-words technique. While I see that there are other implementations such as SURF and whatnot, let's stick to standard Lowe's SIFT and the standard bag of words in this question.
So the thing is this: I see that vl_sift has thresholding that allows you to be stricter about feature selection. Currently, I understand that going for a higher threshold might net you a smaller and more meaningful list of "good" features, and possibly reduce some noisy features. "Good" features means that, given the same image with different variations, very similar features are also detected in the other images.
However, how high should we go with this thresholding? Sometimes I see that an image returns no features at all with a higher threshold. At first, I was thinking of adjusting the threshold until I get a better MAp. But then again, I think it's a bad idea to keep adjusting just to find the best MAp for the respective database. So my questions are:
While adjusting the threshold may decrease the number of features, does increasing the threshold always return fewer yet better features?
Are there better approaches to obtain the good features?
What are other factors that can increase the rate of obtaining good features?
Have a look at some of the papers put out in response to the Pascal challenge in recent years. The impression they give me is that standard 'feature detection' methods don't work very well with the Bag of Words technique. This makes sense when you think about it - BoW works by pulling together lots of weak, often unrelated features. It's less about detecting a specific object and more about recognizing classes of objects and scenes. As such, putting too much emphasis on normal 'key features' can harm more than help.
As such, we see folks using dense grids and even random points as their features. From experience, using one of these methods over Harris corners, LoG, SIFT, MSER, or any of the like, has a great positive impact on performance.
To answer your questions directly:
Yes. From the SIFT api:
Keypoints are further refined by eliminating those that are likely to be unstable, either because they are selected nearby an image edge, rather than an image blob, or are found on image structures with low contrast. Filtering is controlled by the following:
Peak threshold. This is the minimum amount of contrast to accept a keypoint. It is set by configuring the SIFT filter object by vl_sift_set_peak_thresh().
Edge threshold. This is the edge rejection threshold. It is set by configuring the SIFT filter object by vl_sift_set_edge_thresh().
You can see examples of the two thresholds in action in the 'Detector parameters' section here.
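If you end up experimenting outside vlfeat, OpenCV's SIFT exposes analogous knobs (contrastThreshold for the peak threshold, edgeThreshold for edge rejection); the values below are arbitrary and only meant to show the effect on the keypoint count:

```python
import cv2

# Stricter settings: higher contrast requirement, lower edge-ratio tolerance.
strict = cv2.SIFT_create(contrastThreshold=0.08, edgeThreshold=5)
# Looser settings: accept weaker, more edge-like keypoints.
loose = cv2.SIFT_create(contrastThreshold=0.02, edgeThreshold=15)

img = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder image
print(len(strict.detect(img, None)), "keypoints (strict) vs",
      len(loose.detect(img, None)), "keypoints (loose)")
```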
Research suggests that features densely selected from the scene yield more descriptive 'words' than those selected using more 'intelligent' methods (e.g. SIFT, Harris, MSER). Try your Bag of Words pipeline with vl_feat's DSIFT or PHOW implementation. You should see a great improvement in performance (assuming your 'word' selection and classification steps are tuned well).
After a dense set of feature points, the biggest breakthrough in this field seems to have been the 'Spatial Pyramid' approach. This increases the number of words produced for an image, but gives the features a location aspect - something inherently lacking in Bag of Words. After that, make sure your parameters are well tuned (which feature descriptor you're using (SIFT, HOG, SURF, etc.), how many words are in your vocabulary, which classifier you're using, etc.). Then... you're in active research land. Enjoy =)
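For a rough idea of what dense sampling plus bag of words looks like in code, here is a sketch in Python with OpenCV and scikit-learn rather than vlfeat's DSIFT/PHOW; the grid step, vocabulary size and image paths are placeholders:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

sift = cv2.SIFT_create()
step = 8   # grid spacing in pixels (placeholder)

def dense_descriptors(gray):
    # Compute SIFT descriptors on a fixed grid instead of detected keypoints.
    kps = [cv2.KeyPoint(float(x), float(y), float(step))
           for y in range(step, gray.shape[0] - step, step)
           for x in range(step, gray.shape[1] - step, step)]
    _, desc = sift.compute(gray, kps)
    return desc

# Build the visual vocabulary from training images, then histogram each image.
train = [cv2.imread(p, cv2.IMREAD_GRAYSCALE) for p in ["img1.jpg", "img2.jpg"]]
all_desc = np.vstack([dense_descriptors(g) for g in train])
vocab = KMeans(n_clusters=200, n_init=4, random_state=0).fit(all_desc)

def bow_histogram(gray):
    words = vocab.predict(dense_descriptors(gray))
    hist, _ = np.histogram(words, bins=np.arange(201))
    return hist / hist.sum()   # L1-normalised bag-of-words vector
```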
For my final thesis I am trying to build a 3D face recognition system by combining colour and depth information. The first step was to realign the data head to a given model head using the iterative closest point algorithm. For the detection step I was thinking about using libsvm, but I don't understand how to combine the depth and the colour information into one feature vector. They are dependent pieces of information (each point consists of colour (RGB), depth information and also scan quality). What do you suggest doing? Something like weighting?
Edit:
Last night I read an article about SURF/SIFT features and I would like to use them! Could it work? The concept would be the following: extract these features from the colour image and from the depth image (range image), and use each feature as a single feature vector for the SVM?
Concatenation is indeed a possibility. However, as you are working on 3d face recognition you should have some strategy as to how you go about it. Rotation and translation of faces will be hard to recognize using a "straightforward" approach.
You should decide whether you attempt to perform a detection of the face as a whole, or of sub-features. You could attempt to detect rotation by finding some core features (eyes, nose, etc).
Also, remember that SVMs are inherently binary (i.e. they separate between two classes). Depending on your exact application, you will very likely have to employ some multi-class strategy (one-against-all or one-against-one).
I would recommend doing some literature research to see how others have attacked the problem (a google search will be a good start).
It sounds simple, but you can simply concatenate the two vectors into one. Many researchers do this.
What you have arrived at is an important open problem. Yes, there are some ways to handle it, as mentioned here by Eamorr. For example, you can concatenate and then do PCA (or some non-linear dimensionality reduction method). But it is hard to defend the practicality of doing so, considering that PCA takes O(n^3) time in the number of features. That alone might be unreasonable for vision data, which can have thousands of features.
As mentioned by others, the easiest approach is to simply combine the two sets of features into one.
SVM is characterized by the normal to the maximum-margin hyperplane, where its components specify the weights/importance of the features, such that higher absolute values have a larger impact on the decision function. Thus SVM assigns weights to each feature all on its own.
In order for this to work, you would obviously have to normalize all the attributes to the same scale (say, transform all features to be in the range [-1, 1] or [0, 1]).
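A minimal sketch of the combine, rescale, then classify route with scikit-learn (instead of calling libsvm directly); the feature arrays and labels below are random placeholders standing in for the real colour and depth descriptors:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_faces = 200
color_feats = rng.random((n_faces, 128))   # placeholder colour descriptors
depth_feats = rng.random((n_faces, 128))   # placeholder depth/range descriptors
labels = rng.integers(0, 10, n_faces)      # placeholder subject identities

# Simple concatenation into one feature vector per face.
X = np.hstack([color_feats, depth_feats])

# MinMaxScaler puts every attribute into [0, 1]; SVC handles the multi-class
# case internally (one-against-one), so no manual strategy is needed here.
clf = make_pipeline(MinMaxScaler(), SVC(kernel="rbf"))
clf.fit(X, labels)
print("training accuracy:", clf.score(X, labels))
```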