RapidMiner: TF-IDF from a CSV dataset

I have to calculate TF-IDF for two columns of a CSV file.
Do I have to convert the rows into text files, or is there a method to calculate TF-IDF directly from the CSV?
How can I calculate TF-IDF for the columns of a CSV file?

There is an operator called Generate TFIDF that might be able to do what you want. The Help for that operator includes a sample process showing how it is used, which might be enough for your specific needs.
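For reference, outside RapidMiner the same computation is straightforward in Python with pandas and scikit-learn. A minimal sketch, assuming a hypothetical data.csv whose two text columns are named col1 and col2 (the file name and column names are placeholders):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical CSV with two text columns; file and column names are assumptions.
df = pd.read_csv("data.csv")

# One vectorizer per column so each column keeps its own vocabulary.
tfidf_col1 = TfidfVectorizer()
tfidf_col2 = TfidfVectorizer()

X1 = tfidf_col1.fit_transform(df["col1"].fillna(""))  # sparse matrix: rows x terms of col1
X2 = tfidf_col2.fit_transform(df["col2"].fillna(""))  # sparse matrix: rows x terms of col2

print(X1.shape, X2.shape)
```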

Related

How to apply tf-idf on multiple predictors, don't want to concatenate into a single column

I have two predictors and want to vectorize each of them using TF-IDF (I don't want to concatenate them, since we need a separate vocabulary for each). Should I apply a TF-IDF vectorizer to each and then join the features?
For example, if I apply TF-IDF to predictor1, I get 100 features from it and 200 from predictor2. My features for the training data would simply be 300 (100 + 200). Am I thinking correctly here?
I will get two matrices from this (one for each predictor); can I concatenate these using numpy functions and use them as features?
Your suggestion on getting this done is correct. The most common way of using two vectors like this is to concatenate them into a longer vector and then feed it to the model.
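As a minimal sketch of that route with scikit-learn (the toy data and variable names are placeholders): fit one TfidfVectorizer per predictor, stack the two sparse matrices column-wise, and feed the result to the model.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data; in practice these are your two text predictors.
text1 = ["red apple", "green pear", "yellow banana"]
text2 = ["sweet fruit", "sour fruit", "soft fruit"]
y = [0, 1, 0]

# Separate vectorizer per predictor so each keeps its own vocabulary.
vec1 = TfidfVectorizer()
vec2 = TfidfVectorizer()
X1 = vec1.fit_transform(text1)   # e.g. n_samples x 100 features
X2 = vec2.fit_transform(text2)   # e.g. n_samples x 200 features

# Concatenate column-wise: n_samples x (100 + 200) features.
X = hstack([X1, X2]).tocsr()

model = LogisticRegression().fit(X, y)

# At prediction time, transform (not fit) with the same vectorizers and stack again.
X_new = hstack([vec1.transform(["red pear"]), vec2.transform(["sweet fruit"])])
print(model.predict(X_new))
```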
If, for some reason, this doesn't work out for you, we can explore alternatives based on what your constraints are.
For example, if your constraint is total dimension size, one way to solve this would be to create a multilayered MLP autoencoder.
We can train it with the combined vectors as both input and output until the encoder is trained.
Subsequently, we can use any intermediate layer's activations as input to our model.
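A rough sketch of that autoencoder idea, assuming Keras, a 300-dimensional concatenated TF-IDF input, and arbitrary layer sizes (all of these are illustrative choices, not prescriptions):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

input_dim = 300          # concatenated TF-IDF dimension (100 + 200 in the example above)
bottleneck_dim = 32      # arbitrary compressed size

inputs = tf.keras.Input(shape=(input_dim,))
h = layers.Dense(128, activation="relu")(inputs)
code = layers.Dense(bottleneck_dim, activation="relu", name="code")(h)
h = layers.Dense(128, activation="relu")(code)
outputs = layers.Dense(input_dim, activation="linear")(h)

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

# Train with the combined vectors as both input and target.
X = np.random.rand(1000, input_dim).astype("float32")  # stand-in for real TF-IDF features
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)

# Use the bottleneck (or any intermediate) layer's activations as model features.
encoder = Model(inputs, autoencoder.get_layer("code").output)
X_compressed = encoder.predict(X)
print(X_compressed.shape)  # (1000, 32)
```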
It would be easier to suggest a solution if you can describe your constraints in the question.

Storing TF-IDF values in an inverted index

I'm creating a search engine to search a list of roughly 20k English phrases, each one being a few words long.
I've looked into ways to create the search engine, and currently I am using a TfidfVectorizer from sklearn and cosine similarity to compute the ranking scores.
From what I understand, in information retrieval you have retrieval and ranking phases; however, I'm confused about how you could use a data structure like an inverted index to speed up the search before using TfidfVectorizer. It seems like TfidfVectorizer creates a term-document matrix, which is different from an index. Could you just store TF and IDF values in an inverted index and use cosine similarity at run time? Ideally I want autocomplete of phrases, so I need to store edge n-grams as well, and a boolean model isn't useful here.
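One way to picture the structure being asked about: store, for each term, a postings list of (document id, TF-IDF weight) pairs, and at query time accumulate cosine scores only over documents that share at least one term with the query. A toy sketch, purely illustrative (the weighting formula is just one of several common TF-IDF variants, and this is not how sklearn stores its matrix internally):

```python
import math
from collections import defaultdict

docs = ["cheap flights to london", "cheap hotels in london", "flights to paris"]

# Document frequencies, then an inverted index of (doc_id, tf-idf weight) postings.
df = defaultdict(int)
tokenized = [d.split() for d in docs]
for tokens in tokenized:
    for term in set(tokens):
        df[term] += 1

N = len(docs)
index = defaultdict(list)           # term -> [(doc_id, weight), ...]
doc_norms = defaultdict(float)      # doc_id -> vector norm, for cosine normalisation
for doc_id, tokens in enumerate(tokenized):
    counts = defaultdict(int)
    for t in tokens:
        counts[t] += 1
    for term, tf in counts.items():
        w = tf * math.log(N / df[term])    # one common tf-idf formulation
        index[term].append((doc_id, w))
        doc_norms[doc_id] += w * w
doc_norms = {d: math.sqrt(n) for d, n in doc_norms.items()}

def search(query, k=3):
    """Score only documents that share at least one term with the query."""
    scores = defaultdict(float)
    for term in query.split():
        for doc_id, w in index.get(term, []):
            scores[doc_id] += w            # query weights taken as 1 for simplicity
    # Normalise by document norm; the query norm is constant per query, so it doesn't change the ranking.
    ranked = sorted(((s / doc_norms[d], d) for d, s in scores.items() if doc_norms[d] > 0),
                    reverse=True)
    return ranked[:k]

print(search("cheap flights"))
```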

Universal sentence encoder for big document similarity

I need to create a 'search engine' experience: from a short query (a few words), I need to find the relevant documents in a corpus of thousands of documents.
After analyzing a few approaches, I got very good results with the Universal Sentence Encoder from Google.
The problem is that my documents can be very long. For these very long texts it looks like performance decreases, so my idea was to cut the text into sentences/paragraphs.
So I ended up with a list of vectors for each document (one vector for each part of the document).
My question is: is there a state-of-the-art algorithm/methodology to compute a score from a list of vectors? I don't really want to merge them into one, as it would create the same effect as before (the relevant part would be diluted in the document). Are there any scoring algorithms to sum up the multiple cosine similarities between the query and the different parts of the text?
Important information: I can have short and long texts, so I can have from 1 up to 10 vectors per document.
One way of doing this is to embed all sentences of all documents (typically storing them in an index such as FAISS or Elasticsearch). Store the document identifier of each sentence: in Elasticsearch this can be metadata, but in FAISS it needs to be held in an external mapping.
Then:
Embed the query.
Calculate the cosine similarity between the query and all sentence embeddings.
For the top-k results, group by document identifier and take the sum. (This step is optional depending on whether you're looking for the most similar document or the most similar sentence; here I assume you are looking for the most similar document, thereby boosting documents with higher overall similarity.)
You should then have an ordered list of relevant document identifiers.
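A minimal sketch of that pipeline with FAISS; the embed() function below is a placeholder for whatever sentence encoder you use (e.g. the Universal Sentence Encoder), and the example data is made up:

```python
import numpy as np
import faiss

dim = 512

def embed(texts):
    # Placeholder: replace with your encoder; it must return float32 vectors of shape (len(texts), dim).
    rng = np.random.default_rng(0)
    return rng.random((len(texts), dim), dtype=np.float32)

# One entry per sentence/paragraph, plus the external mapping row -> document id.
sentences = ["sentence one of doc A", "sentence two of doc A", "only sentence of doc B"]
sentence_doc_ids = ["docA", "docA", "docB"]

emb = embed(sentences)
faiss.normalize_L2(emb)                 # normalised vectors => inner product equals cosine similarity
index = faiss.IndexFlatIP(dim)
index.add(emb)

def search(query, k=3):
    q = embed([query])
    faiss.normalize_L2(q)
    sims, rows = index.search(q, k)
    # Group the top-k sentence hits by document id and sum their similarities.
    scores = {}
    for sim, row in zip(sims[0], rows[0]):
        if row == -1:
            continue
        doc = sentence_doc_ids[row]
        scores[doc] = scores.get(doc, 0.0) + float(sim)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(search("doc A topic"))
```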

Fast k-NN search over bag-of-words models

I have a large number of documents of equal size. For each of those documents I'm building a bag-of-words (BOW) model. The number of possible words across all documents is limited but large (2^16, for example). Generally speaking, I have N histograms of size K, where N is the number of documents and K is the histogram width. I can calculate the distance between any two histograms.
First optimization opportunity: documents usually use only a small subset of the words (usually less than 5%, most of them less than 0.5%).
Second optimization opportunity: the subset of used words varies a lot from document to document, so I can use bits instead of word counts.
Query by content
A query is a document as well. I need to find the k most similar documents.
Naive approach
Calculate BOW model from query.
For each document in dataset:
Calculate its BOW model.
Find the distance between the query and the document.
Obviously, some data structure should be used to track the top-ranked documents (a priority queue, for example).
I need some sort of index to get rid of the full database scan. A k-d tree comes to mind, but the dimensionality and the size of the dataset are very high. One could suggest using some subset of the possible words as features, but I don't have a separate training phase and can't extract these features beforehand.
I've thought about using the MinHash algorithm to prune the search space, but I can't design appropriate hash functions for this task.
k-d trees and similar indexes are for dense, continuous data.
Your data is most likely sparse.
A good index for finding nearest neighbors in sparse data is inverted lists, essentially the same way search engines like Google work.
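A toy sketch of that approach: each document reduced to a set of word ids, inverted lists used for candidate generation, and a small heap to keep the k best matches (the Jaccard similarity here stands in for whatever distance you actually use):

```python
import heapq
from collections import defaultdict

# Each document reduced to its set of word ids (the "bits instead of counts" variant).
documents = {
    0: {1, 5, 9, 42},
    1: {5, 9, 77},
    2: {3, 4, 100, 200},
}

# Inverted lists: word id -> ids of documents containing it.
inverted = defaultdict(list)
for doc_id, words in documents.items():
    for w in words:
        inverted[w].append(doc_id)

def knn(query_words, k=2):
    # Candidate generation: only documents sharing at least one word with the query.
    overlap = defaultdict(int)
    for w in query_words:
        for doc_id in inverted.get(w, []):
            overlap[doc_id] += 1
    # Rank candidates by Jaccard similarity; a bounded heap keeps the top k.
    heap = []
    for doc_id, inter in overlap.items():
        union = len(documents[doc_id]) + len(query_words) - inter
        heapq.heappush(heap, (inter / union, doc_id))
        if len(heap) > k:
            heapq.heappop(heap)
    return sorted(heap, reverse=True)

print(knn({5, 9, 42}))
```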

SVM: How to calculate TF-IDF of test documents in document classification?

In my SVM, I am using TF-IDF on the documents for feature extraction. These TF-IDF values are calculated over the whole set of training documents.
Now, when I get a test document that I want to classify, how do I generate the vector for it?
I used stemming before calculating TF-IDF; I can perform that on the test document too. I have count_of_words for the training documents.
Should I add the test document's words to count_of_words when calculating its TF-IDF, or should I use the training counts directly?
Calculate them the same way as during training, but use the IDF based on the training documents and the TF from the test documents. If you have many new documents coming in, just update the training data from time to time and retrain your model.
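With scikit-learn this is exactly the fit/transform split: fitting the vectorizer on the training documents fixes the vocabulary and the IDF values, and transform applies them to new documents so only the TF comes from the test data. A minimal sketch with toy data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_docs = ["spam spam offer", "meeting agenda today", "cheap offer now"]
train_labels = [1, 0, 1]
test_docs = ["offer for the meeting"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)   # learns vocabulary and IDF from training data
X_test = vectorizer.transform(test_docs)         # reuses training IDF; TF comes from the test doc

clf = LinearSVC().fit(X_train, train_labels)
print(clf.predict(X_test))
```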
