Can anybody help me, please? In an article I found this text: "The similarity between the network feature maps was calculated using the Euclidean distance. Afterward, the Top-k candidates were chosen to generate a ranked list of relevant image candidates according to the mAP, where k={5, 10, 25, 50, 100}", which describes how the pattern spotting stage was evaluated. My question is this. In the word spotting task, we have a set of query images, each described by a feature vector. To carry out retrieval, we compare each query vector with every candidate in the dataset and then rank the matches. To evaluate the system's accuracy, we compute, for example, the mAP, which compares the ranked list with the ground-truth file. But the paragraph cited above uses another kind of evaluation, "Top-k", which I do not understand!
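One possible reading of the quoted passage is that mAP is simply computed on the ranked list truncated to its first k results, once for each k. Below is a minimal Python sketch of that reading. The function names, the toy query ids, and the normalisation by min(k, number of relevant items) are my assumptions; papers differ on the normalisation, so check the article's own definition.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k):
    """Average precision over the first k results of one ranked list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Assumed normalisation: best achievable number of hits within k.
    return precision_sum / min(k, len(relevant))


def mean_average_precision_at_k(rankings, ground_truth, k):
    """Mean of AP@k over all queries; both arguments are dicts keyed by query id."""
    scores = [average_precision_at_k(rankings[q], ground_truth[q], k) for q in rankings]
    return sum(scores) / len(scores)


# Toy data (hypothetical): two queries with their ranked candidates.
rankings = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y", "z", "w"]}
ground_truth = {"q1": ["a", "c"], "q2": ["w"]}

for k in (1, 2, 4):
    print(k, mean_average_precision_at_k(rankings, ground_truth, k))
```

With k={5, 10, 25, 50, 100} you would report one mAP value per k, which shows how quality changes as the list gets longer.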
I'm implementing a cache for virtual reality applications: given an input image query, return the result associated with the most visually similar cached image (that is, a previously processed query) if the distance between the query representation and the cached image representation is lower than a certain threshold. Our cache is relatively small and contains 10k image representations.
We use VLAD codes [1] as image representation since they are very compact and incredibly fast to compute (around 1 ms).
However, it has been shown in [2] that the distance between the query code and the images in the dataset (the cache in this case) varies a lot from query to query, so it's not trivial to find an absolute threshold. The same work proposes a method for object detection applications, which is not relevant in this context (we return just the most similar image, not all and only the images containing the query subject).
[3] offers a very precise method, but at the same time it's very expensive and returns short lists. It's based on re-ranking by spatial feature matching; if you want more details, the quoted section is at the end of this question. I'm not an expert in computer vision, but this step sounds to me a lot like running a feature matcher on the short-list of the top-k elements according to the image representation and re-ranking them by the number of features matched. My first question is: is that correct?
In our case the cost of this approach is not a problem, since most of the time the top-10 most similar VLAD codes contain the query subject, so we would need to run this spatial matching step on only 10 images.
However, at this point I have a second question: we had the problem of deciding an absolute threshold for image representations (such as VLAD codes); does this problem still persist with this approach? In the first case, the threshold was on "the L2 distance between the query VLAD code and the closest VLAD code"; here, instead, the threshold would be on "the number of features matched between the query image and the closest image according to the VLAD codes".
Of course, my second question only makes sense if the answer to the first one is yes.
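To make the two questions concrete, here is a minimal Python sketch of the pipeline as I understand it: short-list by L2 distance on the codes, then re-rank the short-list by match count and apply a threshold on that count. This is only my reading of the idea, not the exact method of [3]. `lookup`, `count_matches` and `min_matches` are hypothetical names, and the match counts in the toy data are made up; a real `count_matches` would run a feature matcher with geometric verification.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length codes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def lookup(query_code, cache, count_matches, k=10, min_matches=15):
    """Return the id of the best cached image, or None on a cache miss.

    cache: dict mapping image id -> compact code (e.g. a VLAD code).
    count_matches: hypothetical callable, image id -> number of matched
    features against the query image.
    """
    # Step 1: cheap short-list of the k closest codes.
    shortlist = sorted(cache, key=lambda i: l2(query_code, cache[i]))[:k]
    if not shortlist:
        return None
    # Step 2: expensive re-ranking, only on the short-list.
    best_id = max(shortlist, key=count_matches)
    # Step 3: the threshold is now on the match count, not on the L2 distance.
    return best_id if count_matches(best_id) >= min_matches else None


# Toy data (made up): 2-D codes and fake match counts.
cache = {"img1": [0.0, 0.0], "img2": [1.0, 0.0], "img3": [5.0, 5.0]}
fake_matches = {"img1": 8, "img2": 40, "img3": 90}

print(lookup([0.1, 0.0], cache, fake_matches.get, k=2, min_matches=15))
print(lookup([0.1, 0.0], cache, fake_matches.get, k=2, min_matches=50))
```

Note that `img3` has the most matches but never reaches step 2 because it is not in the short-list, which is why the quality of the top-k matters.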
The approach of [3]:
Geometrical Re-ranking verifies the global geometrical consistency
between matches (Lowe 2004; Philbin et al. 2007) for a short-list of
database images returned by the image search system. Here we implement
the approach of Lowe (2004) and apply it to a short-list of 200
images. We first obtain a set of matches, i.e., each descriptor of the
query image is matched to the 10 closest ones in all the short-list
images. We then estimate an affine 2D transformation in two steps.
First, a Hough scheme estimates a transformation with 4 degrees of
freedom. Each pair of matching regions generates a set of parameters
that “vote” in a 4D histogram. In a second step, the sets of matches
from the largest bins are used to estimate a finer 2D affine
transform. The images for which the geometrical estimation succeeds
are returned in first positions and ranked with a score based on the
number of inliers. The images for which the estimation failed are
appended to the geometrically matched ones, with their order
unchanged.
Most of the recommendation algorithms in Mahout require user-item preferences. But I want to find similar items for a given item, and my system has no user input. For example, for any movie the following attributes could be used to compute a similarity coefficient:
Genre
Director
Actor
The attribute list may be modified in the future to build a more effective system. But to compute item similarity with a Mahout DataModel, a user preference for each item is required, whereas these movies could instead be clustered and the closest items in the cluster returned for a given item.
Later, once user-based recommendation is introduced, the above result can be used to boost its results.
If a product attribute has a fixed set of values, like Genre, do I have to convert those values to numbers? If yes, how will the system calculate the distance between two items when genre-1 and genre-2 have no numeric relation?
Edit:
I have found a few examples that use the command line, but I want to do it in Java and save the pre-computed values for later use.
I think that for feature vectors like these, the best similarity measures are those based on exact matches, such as Jaccard similarity.
In Jaccard, the similarity between two item vectors is calculated as:
(number of features in the intersection) / (number of features in the union).
So converting the genre to a numerical value will not make a difference, since the exact match (which is used to find the intersection) works the same on non-numerical values.
Take a look at this question for how to do it in Mahout:
Does Mahout provide a way to determine similarity between content (for content-based recommendations)?
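For illustration, here is a minimal Python sketch of Jaccard similarity over categorical attributes. It is not Mahout code; the movie ids and the attribute values are made up, and prefixing each value with its attribute name is just one way to keep a director and an actor with the same name apart.

```python
def jaccard(a, b):
    """|intersection| / |union| of two feature collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(item, catalog, n=2):
    """Top-n other items in the catalog, ranked by Jaccard similarity to `item`."""
    scores = [(other, jaccard(catalog[item], feats))
              for other, feats in catalog.items() if other != item]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]


# Made-up catalog: no numeric encoding of genres is needed.
movies = {
    "m1": {"genre:comedy", "director:smith", "actor:lee"},
    "m2": {"genre:comedy", "director:jones", "actor:lee"},
    "m3": {"genre:horror", "director:jones", "actor:kim"},
}

print(most_similar("m2", movies))
```

The pairwise scores could be computed once and stored, which is the "pre-computed values" idea from the question.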
It sounds like Mahout's spark-rowsimilarity algorithm, available since version 0.10.0, would be the perfect solution to your problem. It compares the rows of a given matrix (i.e., row vectors representing movies and their properties), looking for co-occurrences of values across those rows, or in your case co-occurrences of genres, directors, and actors. No user history or item interaction is needed. The end result is another matrix mapping each of your movies to the top n most similar other movies in your collection, based on co-occurrence of genre, director, or actor.
The Apache Mahout site has a great write-up regarding how to do this from the command line, but if you want a deeper understanding of what's going on under the covers, read Pat Ferrel's machine learning blog Occam's Machete. He calls this type of similarity content or metadata similarity.
I am working on a word representation algorithm, similar to Word2Vec and GloVe. I have been asked to make it more dynamic, so that new words can be added to the vocabulary and new documents can be submitted to the program even after the representations (vectors) have been created.
The problem is, how do I know if my representation works? How do I know if it actually captures the meaning of each word? How do I compare my representation with other existing vector space models?
As of now, I am doing the following tests to check the quality of my word vectors:
Distance test:
Does the cosine distance between vectors reflect the semantic distance between words?
Analogy test:
Can the representation be used to solve problems like "King is to queen what man is to ________"? (The answer should be "woman".)
Picking the odd one out:
Can the vectors be used to pick the odd word out of a given list of words? If the input is {"cat","dog","phone"}, the output should be "phone".
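The three tests above can be sketched in a few lines of Python. This is only an illustration on hand-made toy vectors, not a benchmark: the vectors in `toy` are invented so that the expected answers come out, the analogy uses the usual vector-offset trick (b - a + c), and "odd one out" is taken to be the word with the lowest mean cosine similarity to the rest, which is one common choice among several.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(vectors, a, b, c):
    """Solve 'a is to b what c is to ?' with the offset b - a + c."""
    target = [y - x + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def odd_one_out(vectors, words):
    """Word with the lowest mean similarity to the other words in the list."""
    def mean_sim(w):
        others = [o for o in words if o != w]
        return sum(cosine(vectors[w], vectors[o]) for o in others) / len(others)
    return min(words, key=mean_sim)


# Invented toy vectors; a real test would load the trained embeddings.
toy = {
    "king":  [1.0, 1.0, 0.0, 0.0, 0.0],
    "queen": [1.0, 0.0, 1.0, 0.0, 0.0],
    "man":   [0.0, 1.0, 0.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0, 0.0, 0.0],
    "cat":   [0.0, 0.0, 0.0, 1.0, 0.1],
    "dog":   [0.0, 0.0, 0.0, 1.0, 0.0],
    "phone": [0.0, 0.0, 0.0, 0.0, 1.0],
}

print(cosine(toy["cat"], toy["dog"]))          # distance test
print(analogy(toy, "king", "queen", "man"))    # analogy test
print(odd_one_out(toy, ["cat", "dog", "phone"]))
```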
What are the other tests that I should do to check the quality of the vectors? What other tasks are word vectors expected to be capable of doing? Is there a benchmark for vector space models?
Your tests sound very reasonable — they are the usual evaluation tasks that are used in research papers to test the quality of word embeddings.
In addition, the website www.wordvectors.org can give you a good idea of how your vectors measure up. It allows you to upload your embeddings, generates plots, gives correlations with word pair similarity rankings, and compares your embeddings with pre-trained vectors from previous research. You can find a more detailed description in the accompanying paper.
Example:
I have m sets of ~1000 text documents, ~10 are predictive of a binary result, roughly 990 aren't.
I want to train a classifier to take a set of documents and predict the binary result.
Assume for discussion that the documents each map the text to 100 features.
How is this modeled in terms of training examples and features? Do I merge all the text together and map it to a fixed set of features? Do I have 100 features per document * ~1000 documents (100,000 features) and one training example per set of documents? Do I classify each document separately and analyze the resulting set of confidences as they relate to the final binary prediction?
The most common way to handle text documents is with a bag-of-words model. The class proportions are irrelevant. Each word gets mapped to a unique index, and the value at that index is set to the number of times that token occurs (there are smarter things to do). The number of features (the dimension) is then the number of unique tokens/words in your corpus. There are many issues with this, and some of them are discussed here. But it works well enough for many things.
I would want to approach it as a two stage problem.
Stage 1: predict the relevancy of a document from the set of 1000. For best combination with stage 2, use something probabilistic (logistic regression is a good start).
Stage 2: Define features on the output of stage 1 to determine the answer to the ultimate question. These could be things like the counts of words for the n most relevant docs from stage 1, the probability of the most probable document, the 99th percentile of those probabilities, variances in probabilities, etc. Whatever you think will get you the correct answer (experiment!)
The reason for this is as follows: concatenating documents together will drown you in irrelevant information. You'll spend ages trying to figure out which words/features allow actual separation between the classes.
On the other hand, if you concatenate feature vectors together, you'll run into an exchangeability problem. By that I mean, word 1 in document 1 will be in position 1, word 1 in document 2 will be in position 1001, in document 3 it will be in position 2001, etc., and there will be no way to know that the features are all related. Furthermore, presenting the documents in a different order would change the positions in the feature vector, and your learning algorithm won't be aware of this. Equally valid orderings of the documents will lead to completely different results in an entirely arbitrary and unsatisfying way (unless you spend a long time designing a custom classifier that's not afflicted with this problem, which might ultimately be necessary, but it's not the thing I'd start with).
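As a rough sketch of stage 2 in Python: assume stage 1 has already produced one relevance probability per document in the set (the numbers below are made up). The features are then summary statistics of that list, so they do not depend on the order of the documents, which sidesteps the exchangeability problem. `set_features` and the particular statistics are my own choice for illustration.

```python
import math
import statistics


def set_features(doc_probs, top_n=3):
    """Order-invariant summary of the stage-1 relevance probabilities of one set."""
    ascending = sorted(doc_probs)
    top = ascending[-top_n:]
    # Nearest-rank 99th percentile.
    p99_index = max(0, math.ceil(0.99 * len(ascending)) - 1)
    return {
        "max_prob": ascending[-1],
        "mean_top_n": sum(top) / len(top),
        "p99": ascending[p99_index],
        "variance": statistics.pvariance(doc_probs),
        "n_above_half": sum(p > 0.5 for p in doc_probs),
    }


# Made-up stage-1 output for one (tiny) set of documents.
probs = [0.9, 0.1, 0.2, 0.8, 0.05]
print(set_features(probs))
```

Each set of documents becomes one training example for the stage-2 classifier, with this small fixed-length feature dictionary as its input.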
I have a question regarding the particular Naive Bayes algorithm that is used in document classification. The following is what I understand:
1. construct a probability for each word in the training set, for each known classification
2. given a document, extract all the words that it contains
3. multiply together the probabilities of those words being present in a classification
4. perform (3) for each classification
5. compare the results of (4) and choose the classification with the highest posterior
What I am confused about is the part where we calculate the probability of each word given the training set. For example, the word "banana" appears in 100 documents in classification A, there are 200 documents in A in total, and 1000 words in total appear in A. To get the probability of "banana" appearing under classification A, do I use 100/200=0.5 or 100/1000=0.1?
I believe your model will classify more accurately if you count the number of documents the word appears in, not the number of times the word appears in total. In other words
Classify "Mentions Fruit":
"I like Bananas."
should be weighed no more or less than
"Bananas! Bananas! Bananas! I like them."
So the answer to your question would be 100/200 = 0.5.
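To see the difference between the two counts, here is a small Python sketch on a made-up class A of three documents. The function names are mine, the tokenizer is deliberately crude, and no smoothing is applied (real implementations usually add Laplace smoothing so that unseen words do not get probability zero).

```python
import re


def tokenize(text):
    """Lowercase and keep alphabetic tokens only (deliberately crude)."""
    return re.findall(r"[a-z]+", text.lower())


def document_frequency_estimate(word, docs):
    """Fraction of documents in the class that contain the word at least once."""
    containing = sum(1 for d in docs if word in tokenize(d))
    return containing / len(docs)


def token_frequency_estimate(word, docs):
    """Fraction of all tokens in the class that are this word."""
    tokens = [t for d in docs for t in tokenize(d)]
    return tokens.count(word) / len(tokens)


# Made-up training documents for class A ("Mentions Fruit").
class_a = [
    "I like Bananas.",
    "Bananas! Bananas! Bananas! I like them.",
    "I like apples.",
]

print(document_frequency_estimate("bananas", class_a))  # 2 of 3 documents
print(token_frequency_estimate("bananas", class_a))     # 4 of 12 tokens
```

The first estimate treats the two banana documents equally; the second lets the repeated "Bananas!" count three times.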
The description of Document Classification on Wikipedia also supports my conclusion
Then the probability that a given document D contains all of the words w_i, given a class C, is $p(D \mid C) = \prod_i p(w_i \mid C)$
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
In other words, the document classification algorithm Wikipedia describes tests how many of the list of classifying words a given document contains.
By the way, more advanced classification algorithms will examine sequences of N-words, not just each word individually, where N can be set based on the amount of CPU resources you are willing to dedicate to the calculation.
UPDATE
My direct experience is based on short documents. I would like to highlight research that #BenAllison points out in the comments that suggests my answer is invalid for longer documents. Specifically
One weakness is that by considering only the presence or absence of terms, the BIM ignores information inherent in the frequency of terms. For instance, all things being equal, we would expect that if 1 occurrence of a word is a good clue that a document belongs in a class, then 5 occurrences should be even more predictive.
A related problem concerns document length. As a document gets longer, the number of distinct words used, and thus the number of values of x(j) that equal 1 in the BIM, will in general increase.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.46.1529