How to map vectors to words in a text file

I have a text file containing 100 sentences, and a 10-dimensional vector for each word is stored in another text file. I want to make a new text file where, for each sentence, the vectors of its words are printed as a matrix. Please help me.

Here is one way to do it with Python and Gensim.
from gensim.models import KeyedVectors
# Change this to your own path.
pathToBinVectors = '/data/GoogleNews-vectors-negative300.bin'
model1 = KeyedVectors.load_word2vec_format(pathToBinVectors, binary=True)
print(model1['resume'])  # This will print the vector of the word "resume".
Source: https://bitbucket.org/yunazzang/aiwiththebest_byor
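To address the original question directly, here is a minimal sketch that writes one block of vectors per sentence. The file names, the whitespace-separated vector format, and the simple split() tokenization are all assumptions for illustration:
# Assumed formats: 'vectors.txt' has one word followed by its 10 vector components per line;
# 'sentences.txt' has one sentence per line.
word_vectors = {}
with open('vectors.txt') as f:
    for line in f:
        parts = line.split()
        word_vectors[parts[0]] = [float(x) for x in parts[1:]]

with open('sentences.txt') as fin, open('sentence_matrices.txt', 'w') as fout:
    for sentence in fin:
        for word in sentence.split():
            if word in word_vectors:
                fout.write(word + ' ' + ' '.join(str(x) for x in word_vectors[word]) + '\n')
        fout.write('\n')  # blank line separates one sentence's matrix from the next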

Related

I created TF-IDF code to analyze an annual report, and I want to know the importance of specific keywords

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

with open(r'C:\Users\maxim\PycharmProjects\THESIS\data\santander2020_1.txt', 'r') as file:
    data = file.read()

dataset = [data]
tfIdfVectorizer = TfidfVectorizer(use_idf=True, stop_words="english",
                                  lowercase=True, max_features=100, ngram_range=(1, 3))
tfIdf = tfIdfVectorizer.fit_transform(dataset)
df = pd.DataFrame(tfIdf[0].T.todense(),
                  index=tfIdfVectorizer.get_feature_names(),
                  columns=["TF-IDF"])
df = df.sort_values('TF-IDF', ascending=False)
print(df.head(25))
The above code is what I've created to do a TF-IDF analysis on an annual report. Currently it gives me the values of the most important words in the report, but I only need the TF-IDF values for the keywords
["digital", "hardware", "innovation", "software", "analytics", "data", "digitalisation", "technology"]. Is there a way I can tell it to only look up the TF-IDF values of these terms?
I'm very new to programming with little experience; I'm doing this for my thesis.
Any help is greatly appreciated.
You have defined tfIdf as tfIdf = tfIdfVectorizer.fit_transform(dataset).
So tfIdf.toarray() would be a 2-D array, where each row refers to a document and each element in the row is the TF-IDF score of the corresponding word. To know which word each element represents, you can use the .get_feature_names() method, which returns the list of feature names in the same order. Then you can use this information to create a mapping (dict) from words to scores, like this:
wordScores = dict(zip(tfIdfVectorizer.get_feature_names(), tfIdf.toarray()[0]))
Now suppose your document contains the word "digital" and you want to know its TF-IDF score, you could simply print the value of wordScores["digital"].
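For example, to restrict the output to the keywords listed in the question (a minimal sketch built on the wordScores dict above; keywords missing from the fitted vocabulary fall back to a score of 0.0):
keywords = ["digital", "hardware", "innovation", "software", "analytics",
            "data", "digitalisation", "technology"]
# Look up each keyword's TF-IDF score; unseen keywords default to 0.0.
keyword_scores = {w: wordScores.get(w, 0.0) for w in keywords}
print(keyword_scores)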

String classification, how to encode character-by-character and train?

I am trying to build a classifier to classify some files into 150 categories based on the name of those files. Here are some examples of file names in my dataset (~700k files):
104932489 - urgent - contract validation for xyz limited.msg
treatment - an I l - contract n°4934283 received by partner.pdf
- invoice_8843238_1_europe services_business 8592342sid paris.xls
140159498736656.txt
140159498736843.txt
fsk_000000001090296_sdiselacrusefeyre_2000912.xls
fsk_000000001091293_lidlsnd1753mdeas_2009316.xls
You can see that the filenames can really be anything, but there is always some pattern that is respected within the same category. It can be in the numbers (which are sometimes close), in the special characters (spaces, -, °), sometimes in the length, etc.
Extracting all those patterns one by one would take ages because I have approximately 700k documents. Also, I am not interested in 100% accuracy; 70% would be good enough.
The real problem is that I don't know how to encode this data. I have tried many methods:
Tokenizing character by character and feeding them to an LSTM model with an embedding layer. However, I wasn't able to implement it and got dimension errors.
Adapting Word2Vec to convert the characters into vectors. However, this automatically drops all punctuation and space characters, and I also lose the numeric data. Another problem is that it creates more useless dimensions: if the size is 20, my data will be in 20 dimensions, but if I look closely there are always the same 150 vectors in those 20 dimensions, so it's really useless. I could use a size of 2 dimensions, but I still need the numeric data and the special characters.
Generating n-grams from each path, in the range 1-4, then using a CountVectorizer to compute the frequencies. I checked and the special characters were not dropped, but it gave me around 400,000 features! I am running a dimensionality reduction using UMAP (n_components=5, metric='hellinger'), but the reduction runs for 2 hours and then the kernel crashes.
Any ideas?
I am currently also working on a character-level LSTM, and it works exactly the same as when you use words. You need a vocabulary, for example a-z, and then you just take the index of each letter as its integer representation. For example:
"bad" -> "b", "a", "d" -> [1, 0, 3]
Now you can create an embedding lookup table (for example using PyTorch's nn.Embedding module). You just have to create a random vector for every index of your vocab. For example:
"a" -> 0 -> [-0.93, 0.024, -0.73, ..., -0.12]
You said that you tried this but encountered dimension errors? Maybe show us the code!
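A minimal sketch of such a lookup table with PyTorch's nn.Embedding (the vocabulary and the embedding size of 32 here are arbitrary choices for illustration):
import torch
import torch.nn as nn

vocab = "abcdefghijklmnopqrstuvwxyz"
char_to_idx = {ch: i for i, ch in enumerate(vocab)}

# One randomly initialised 32-dimensional vector per character in the vocab.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=32)

indices = torch.tensor([char_to_idx[ch] for ch in "bad"])  # tensor([1, 0, 3])
vectors = embedding(indices)                               # shape: (3, 32)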
Or you could create non-random embeddings with word2vec using the Gensim library:
from gensim.models import Word2Vec

# 'total_words' is a list of lists: every word of your dataset split into its characters,
# e.g. [['b', 'a', 'd'], ['g', 'o', 'o', 'd'], ...]
total_words = [...]

save_model_file = "char_w2v.model"  # path where the trained model is stored
model = Word2Vec(total_words, min_count=1, size=32)
model.save(save_model_file)

# let's test it for the character 'a'
embedder = Word2Vec.load(save_model_file)
v = embedder.wv["a"]
# v will now be the embedding vector of 'a' with size 32
I hope I could make clear how to create embeddings for characters.
You can treat characters in single-word-classification the exact same way you would treat words in sentence-classification.

How to train a word embedding representation with gensim fasttext wrapper?

I would like to train my own word embeddings with fastText. However, after following the tutorial I cannot manage to do it properly. So far I have tried:
In:
from gensim.models.fasttext import FastText as FT_gensim
# Set file names for train and test data
corpus = df['sentences'].values.tolist()
model_gensim = FT_gensim(size=100)
# build the vocabulary
model_gensim.build_vocab(sentences=corpus)
model_gensim
Out:
<gensim.models.fasttext.FastText at 0x7f6087cc70f0>
In:
# train the model
model_gensim.train(
    sentences=corpus,
    epochs=model_gensim.epochs,
    total_examples=model_gensim.corpus_count,
    total_words=model_gensim.corpus_total_words
)
print(model_gensim)
Out:
FastText(vocab=107, size=100, alpha=0.025)
However, when I try to look up a word in the vocabulary:
print('return' in model_gensim.wv.vocab)
I get False, even though the word is present in the sentences I am passing to the fastText model. Also, when I check the most similar words to "return", I am getting characters:
model_gensim.most_similar("return")
[('R', 0.15871645510196686),
('2', 0.08545402437448502),
('i', 0.08142799884080887),
('b', 0.07969795912504196),
('a', 0.05666942521929741),
('w', 0.03705815598368645),
('c', 0.032348938286304474),
('y', 0.0319858118891716),
('o', 0.027745068073272705),
('p', 0.026891689747571945)]
What is the correct way of using gensim's fasttext wrapper?
The gensim FastText class doesn't take plain strings as its training texts. It expects lists-of-words, instead. If you pass plain strings, they will look like lists-of-single-characters, and you'll get a stunted vocabulary like you're seeing.
Tokenize each item of your corpus into a list-of-word-tokens and you'll get closer-to-expected results. One super-simple way to do this might just be:
corpus = [s.split() for s in corpus]
But, usually you'd want to do other things to properly tokenize plain-text as well – perhaps case-flatten, or do something else with punctuation, etc.
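As an illustration, here is a minimal sketch of a slightly more thorough tokenizer (the lower-casing and the choice to keep only alphanumeric tokens are assumptions, not requirements):
import re

def simple_tokenize(text):
    # lower-case, then keep only runs of letters and digits as tokens
    return re.findall(r"[a-z0-9]+", text.lower())

corpus = [simple_tokenize(s) for s in corpus]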
To inspect the vocabulary words, you can write them to a text file and look at that file. This could be helpful for you:
with open("vocab.txt", "w", encoding="utf8") as vocab_out:
    for word in model_gensim.wv.vocab:
        vocab_out.write(word + "\n")

How to construct a character based seq2seq model in tensorflow

What changes are required to the existing seq2seq model in TensorFlow so that I can use character units rather than the existing word units for the seq2seq task? And would this be a good configuration for a predictive text application?
The following function signatures may need modification for this task:
def embedding_rnn_seq2seq(encoder_inputs, decoder_inputs, cell,
                          num_encoder_symbols, num_decoder_symbols,
                          output_projection=None, feed_previous=False,
                          dtype=dtypes.float32, scope=None):
Apart from the reduced input/output vocabulary, what other parameter changes would be required to implement such a character-level seq2seq model?
I think you could use the existing seq2seq model in tensorflow without any code changes for character-based units if you prepare your input data files by whitespace-separating the characters of your training examples, like this:
The quick brown fox.
Becomes:
T h e _SPACE_ q u i c k _SPACE_ b r o w n _SPACE_ f o x .
Then your vocabulary naturally becomes characters not words.
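A minimal sketch of that preprocessing step (the _SPACE_ marker token is just one possible convention, taken from the example above):
def to_char_tokens(line, space_token="_SPACE_"):
    # Replace real spaces with a marker token and put whitespace between all characters.
    return " ".join(space_token if ch == " " else ch for ch in line.strip())

print(to_char_tokens("The quick brown fox."))
# T h e _SPACE_ q u i c k _SPACE_ b r o w n _SPACE_ f o x .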
You can experiment with vocab sizes, embedding size, eliminating the embedding layer, etc. to see what works best for your data.

How to predict using scikit-learn?

I have trained an estimator, called clf, using the fit method and saved the model to disk. The next time the program runs, it will load clf from disk.
My problem is:
How do I predict a sample that is saved on disk? I mean, how do I load it and predict?
How do I get the sample's label instead of the label integer after predict?
How do I predict a sample that is saved on disk? I mean, how do I load it and predict?
You have to use the same array representation for the new samples as the one used for the samples passed to the fit method. If you want to predict a single sample, the input must be a 2D numpy array with shape (1, n_features).
The way to read your original file on the HDD and convert it into a numpy array representation suitable for the classifier is a domain-specific issue: it depends on whether you are trying to classify text files, jpeg files, frames in a video file, rows in a database, log lines for syslog-monitored services...
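For instance, if the fitted estimator was saved with joblib and the new samples were saved as a NumPy array, the loading and prediction step might look like this (the file names and formats here are assumptions for illustration):
import numpy as np
from joblib import load

clf = load('clf.joblib')                  # estimator previously saved with joblib.dump(clf, 'clf.joblib')
new_samples = np.load('new_samples.npy')  # 2D array with shape (n_samples, n_features)
print(clf.predict(new_samples))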
How do I get the sample's label instead of the label integer after predict?
Just keep a list of label names and ensure that the integers used as target values when fitting are in the range [0, n_classes). For instance, with ['spam', 'ham'], if your predictions are in the range [0, 1] you can do:
new_samples = ...  # 2D array with shape (n_samples, n_features)
label_names = ['ham', 'spam']
predictions = [label_names[pred] for pred in clf.predict(new_samples)]

Resources