I am trying to develop a music-focused search engine for my final year project.I have been doing some research on Latent Semantic Analysis and how it works on the Internet. I am having trouble understanding where LSI sits exactly in the whole system of search engines.
Should it be used after a web crawler has finished looking up web pages?
I don't know much about music retrieval, but in text retrieval, LSA is only relevant if the search engine is making use of the vector space model of information retrieval. Most common search engines, such as Lucene, break each document up into words (tokens), remove stop words and put the rest of them into the index, each usually associated with a term weight indicating the importance of the term within the document.
Now the list of (token,weight) pairs can be viewed as a vector representing the document. If you combine all of these vectors into a huge matrix and apply the LSA algorithm to that (after crawling and tokenising, but before indexing), you can use the result of the LSA algorithm to transform the vectors of all documents before indexing them.
Note that in the original vectors, the tokens represented the dimensions of the vector space. LSA will give you a new set of dimensions, and you'll have to index those (e.g. in the form of auto-generated integers) instead of the tokens.
Furthermore, you will have to transform the query into a vector of (token,weight) pairs, too, and then apply the LSA-based transformation to that vector as well.
I am unsure if anybody actually does all of this in any real-life text retrieval engine. One problem is that performing the LSA algorithm on the matrix of all document vectors consumes a lot of time and memory. Another problem is handling updates, i.e. when a new document is added, or an existing one changes. Ideally, you'd recompute the matrix, re-run LSA, and then modify all existing document vectors and re-generate the entire index. Not exactly scalable.
Related
Is there a common online algorithm to classify news dynamically? I have a huge data set of news classified by topics. I consider each of that topics a cluster. Now I need to classify breaking news. Probably, I will need to generate new topics, or new clusters, dynamically.
The algorithm I'm using is the following:
1) I go through a group of feeds from news sites and I recognize news links.
2) For each new link, I extract the content using dragnet, and then I tokenize it.
3) I find the vector representation of all the old news and the last one using TfidfVectorizer from sklearn.
4) I find the nearest neighbor in my dataset computing euclidean distance from the last news vector representation and all the vector representations of the old news.
5) If that distance is smaller than a threshold, I put it in the cluster that the neighbor belongs. Otherwise, I create a new cluster, with the breaking news.
Each time a news arrive, I re-fit all the data using a TfidfVectorizer, because new dimensions can be founded. I can't wait to re-fit once per day, because I need to detect breaking events, which can be related to unknown topics. Is there a common approach more efficient than the one I am using?
If you build the vectorization yourself, adding new data will be much easier.
You can trivially add new words as new columns that are simply 0 for all earlier documents.
Don't apply the idf weights, but use them as dynamic weights only.
There are well known, and very fast, implementations of this.
For example Apache Lucene. It can add new documents online, and it uses a variant of tfidf for search.
I have a large amount of documents of equal size. For each of those documents I'm building a bag of words model (BOW). Number of possible words in all documents is limited and large (2^16 for example). Generally speaking, I have N histograms of size K, where N is a number of documents and K is histogram width. I can calculate distance between any two histograms.
First optimization opportunity. Documents usually uses only small subset of words (usually less then 5%, most of them less then 0.5%).
Second optimization opportunity Subset of used words is varying from document to document much so I can use bits instead of word counts.
Query by content
Query is a document as well. I need to find k most similar documents.
Naive approach
Calculate BOW model from query.
For each document in dataset:
Calculate it's BOW model.
Find distance between query and document.
Obviously, some data structure should be used to track top-ranked documents (priority queue for example).
I need some sort of index to get rid of full database scan. KD-tree comes to mind but dimensionality and size of the dataset is very high. One can suggest to use some subset of possible words as features but I don't have separate training phase and can't extract this features beforehand.
I've thought about using MinHash algorithm to prune search space but I can't design an appropriate hash functions for this task.
k-d-tree and similar indexes are for dense and continuous data.
Your data most likely is sparse.
A good index for finding the nearest neighbors on sparse data is inverted lists. Essentially the same way search engines like Google work.
I have a fast set of multi dimensional timebased data which i suspect contain patterns. I simplified the dataset to create a custom visualization.
Humans see patterns in the visualization but the result of the pattern cannot be explained by the visualization. This is because of the simplification step, it hides data which is important.
I cannot put all my data in my visualization cause than humans cannot see the possible patterns anymore because too much data and dimensions are visualized.
Is there a technique that can detect hidden unknown patterns in a data set? (without using visualization, and without me learning the technique patterns) .
One optional extra would be that the technique should somehow be able to "explain the patterns" to me so that i can check if they make sense.
[edit] i can give the technique a collection of small sized datasets (extracted from the big dataset; still very multi dimensional) that i know that contain patterns (by using my visualization). The technique then needs to analyze under what conditions a pattern produces result a or result b.
First of, how did you "simplify" the data? If you did it without any heuristics, you might go ahead and perform PCA. The very idea of PCA is to solve your problem: Not losing "important" data while having a dimensional reduction. You can visualize your principal components so that patterns can be detected by the human eye as well as algorithms.
To your 2nd question: Yes, there are techniques that can detect hidden unknown patterns in data. However, this is a huge field (Machine Learning) and what algorithm you'd use, would depend on your problem structure, so it's impossible to give a specific model name at this point. From what you specified, neural networks in general seem fit to do the job. After you trained a network, you can visualize the activations or weights (Hinton Diagram) to perform an analysis on which input data is treated "similarly".
This may sound very naive but i just wanted to be sure that when talking in Machine Learning terminology, features in Document Clustering is words which are chosen from a document, if some are discarded after stemming or as stop-words.
I am trying to use LibSvm library and it says that there are different approaches for different types of { no_of_instances, no_of_features }.
Like if no_of_instances is much lower than no_of_features, linear kernels would do. if both are large, linear would be fast. However, if no_of_features is small, non-linear kernels is better.
So for my document clustering/classification, i have small number of documents like 100 each may have words around 2000. So i fall in small no_of_instances and large no_of_features category depending upon what i consider a feature is.
I would like to use tf-idf for the document.
So Is no_of_features is the size of the vector that i get from tf-idf ?
What you are talking about here is just one of possibilities, in fact the most trivial way of defining features for documents. In machine learning terminology feature is any mapping from the input space (in this particular example - from space of documents) into some abstract space, which is suited for a particular machine learning model. Most of the ML models (like neural networks, support vector machines, etc.) work on the numerical vectors, so features has to be mappings from documents to (constant size) vectors of numbers. This is a reason for sometimes choosing a representation of bag of owrds, where we have a words' count vector as a document representation. This limitation can be overcomed by using specific models, like for example Naive Bayes (or a custom kernel for SVM, which enables them to work with nonnumeric data), which can work on any objects, as long as we can define perticular conditional probabilities - here, the most basic approach is treating document containing a particular word or not as a "feature". In general this is not the only possibility, there are dozens of methods that use statistical features, semantic features (based on some ontologies like wordnet) etc.
To sum up - this is only one, most simple representation of document for the machine learning model. Good to start with, good to understand the basics, but far from being a "feature definition".
EDIT
no_of_features is size of the vector you use for your documents' representation, so if you use tf-idf, then size of resulting vecor is a no_of_featuers.
I want to cluster text. I kinda understand the concept of clustering text-only content from Mahout in Action:
make a mapping (int -> term) of all terms in the input and store into a dictionary
convert all input documents into a normalized sparse vector
do clustering
I want to cluster text as well as other information like date-time, location, people I was with. For example, I want documents made in a 10-day visit to a distant place to be placed into a distinct cluster.
I know I must write my own tool for making vectors from date-time, location, tags and (natural) text. How do I approach this? Should I use built-in tools to vectorize text and then integrate that output to my own vectors? What about weighing the dimensions?
I cant give you full implementation details, as im not sure, but i can help you out with a piece of the puzzle. You will almost certainly need some context analysis to extract entities (such as location, time/date, person names)
For this take a look at OpenNLP.
http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html
in particular look at POS tagger, and namefinder.
Once you have extracted out the relevant entities, - you 'may' be able to do something with them using Mahout classification, (once you have extracted enough entities to train your model), but this i am not sure.
good luck