Running SVM on positional embeddings using Keras for text classification - machine-learning

How can I run an SVM on a large fake-news text classification dataset of 400 thousand entries that uses Keras positional encoding for its embeddings, with a maximum sentence length of 15 and padding, without using TF-IDF or word2vec (which tokenize into words)? I have tried running it on the free version of Google Colab, but it takes too long and keeps getting disconnected because of the large sparse matrix. I would like to keep the sentence embeddings, since they carry important information for the analysis. Are there any solutions or suggestions for this issue?
If there are resources or notebooks based on this, please do provide a link.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences

sent_length = 15
embedded_docs = pad_sequences(onehot_repr, padding='post', maxlen=sent_length)
embedded_docs_test = pad_sequences(onehot_repr_test, padding='post', maxlen=sent_length)

## Creating the model
embedding_vector_features = 300  ## size of the feature representation
model = Sequential()
model.add(Embedding(voc_size, embedding_vector_features, input_length=sent_length))
model.compile('adam', 'mse')
After getting the embedding matrix, I am providing it as input to the SVM.
I tried using the positional embeddings to run the SVM, but due to the sparsity it is not able to run.
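One way this could be made tractable (a rough sketch, not a tested solution) is to take the dense output of the Embedding layer, flatten it to one vector per document, and train a linear SVM incrementally with scikit-learn's SGDClassifier(loss='hinge'), so the full 400k-row matrix never has to sit in Colab's memory at once. Here model and embedded_docs are assumed to be the objects from the code above, and y_train is a hypothetical label array.
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical label array for the training documents.
# y_train = ...

def embed_in_batches(model, docs, batch_size=4096):
    """Yield dense, flattened embeddings batch by batch (shape: batch x sent_length*dim)."""
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        emb = model.predict(batch, verbose=0)   # (batch, sent_length, dim)
        yield emb.reshape(len(batch), -1)

svm = SGDClassifier(loss='hinge', alpha=1e-4, random_state=0)  # linear SVM, trained incrementally
classes = np.unique(y_train)
offset = 0
for X_batch in embed_in_batches(model, embedded_docs):
    y_batch = y_train[offset:offset + len(X_batch)]
    svm.partial_fit(X_batch, y_batch, classes=classes)
    offset += len(X_batch)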

Related

Predicting over data that has categorical, numerical and text

I am trying to build a classifier for my dataset. Each observation in the data has categorical and numerical values, as well as a more general description in free text. I understand how to build a boosting algorithm to handle the categorical and numerical values, and I have already trained a neural network that predicts over the text quite successfully. What I can't wrap my head around is how to integrate the two approaches.
Embed your free text using a language model (e.g. averaging fastText word embeddings, or using google-universal-sentence-encoder) into an N-dim vector of floats. One-hot encode the categorical stuff. Concatenate [embedding, one_hot_encoding, numericals] and badabing badaboom, you've got yourself one vector representing your datapoint.
TensorFlow Hub's KerasLayer + https://tfhub.dev/google/universal-sentence-encoder/4 is definitely a good starting point. If you need to train something yourself, you could look into tf.keras.layers.Embedding.
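A minimal sketch of that concatenation, assuming a pandas DataFrame with hypothetical columns text, category and amount, and the Universal Sentence Encoder module linked above:
import numpy as np
import pandas as pd
import tensorflow_hub as hub
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical example rows with a free-text, a categorical and a numerical field.
df = pd.DataFrame({
    "text": ["late delivery, item arrived damaged", "great support, quick answer"],
    "category": ["shipping", "support"],
    "amount": [42.0, 7.5],
})

# 512-dim sentence embeddings from the Universal Sentence Encoder.
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
text_emb = use(df["text"].tolist()).numpy()

# One-hot encode the categorical column, scale the numerical one.
cat = OneHotEncoder().fit_transform(df[["category"]]).toarray()
num = StandardScaler().fit_transform(df[["amount"]])

# One dense vector per row: [embedding | one_hot_encoding | numericals]
X = np.concatenate([text_emb, cat, num], axis=1)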

How to determine which words have high predictive power in Sentiment Analysis?

I am working on a classification problem with Twitter data. User-labeled tweets (relevant, not relevant) are used to train a machine learning classifier to predict whether an unseen tweet is relevant to the user or not.
I use simple preprocessing techniques such as stopword removal and stemming, and a scikit-learn TfidfVectorizer to convert the words into numbers before feeding them into a classifier, e.g. SVM, kernel SVM, or Naïve Bayes.
I would like to determine which words (features) have the highest predictive power. What is the best way to do so?
I have tried a word cloud, but it just shows the words with the highest frequency in the sample.
UPDATE:
The following approach, along with sklearn's feature_selection, seems to provide the best answer to my problem so far: top features. Any other suggestions?
Have you tried using tf-idf? It creates a weighted matrix that gives greater weight to the more semantically meaningful words of each text by comparing the individual text (in this case a tweet) to all of the texts (all of the tweets). It is much more helpful than using raw term counts for classification and other tasks. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
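To make the "top features" idea concrete, here is a hedged sketch: fit TfidfVectorizer plus a linear classifier and rank the vocabulary by the learned coefficients. tweets and labels are placeholder names for the labeled data.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Placeholder data; replace with the user-labeled tweets (1 = relevant, 0 = not relevant).
tweets = ["very relevant product update", "relevant news for the team", "random spammy giveaway"]
labels = [1, 1, 0]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(tweets)
clf = LinearSVC().fit(X, labels)

# Words with the largest positive weights push towards "relevant",
# the most negative ones push towards "not relevant".
words = np.array(vec.get_feature_names_out())
order = np.argsort(clf.coef_[0])
print("most 'not relevant':", words[order[:5]])
print("most 'relevant':", words[order[-5:]])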

Character-Word Embeddings from lm_1b in Keras

I would like to use some pre-trained word embeddings in a Keras NN model, which were published by Google in a very well-known article. They have provided the code to train a new model, as well as the embeddings, here.
However, it is not clear from the documentation how to retrieve an embedding vector for a given string of characters (a word) with a simple Python function call. Much of the documentation seems to center on dumping vectors to a file for an entire sentence, presumably for sentiment analysis.
So far, I have seen that you can feed in pretrained embeddings with the following syntax:
embedding_layer = Embedding(number_of_words??,
                            output_dim=128??,
                            weights=[pre_trained_matrix_here],
                            input_length=60??,
                            trainable=False)
However, it is not quite clear to me how to convert the different files and their structures into pre_trained_matrix_here.
They have several softmax outputs, so I am uncertain which one to use, and furthermore how to align the words in my input with the dictionary of words for which they provide embeddings.
Is there a simple way to use these word/character embeddings in Keras, and/or to construct the character/word embedding portion of the model in Keras so that further layers can be added for other NLP tasks?
The Embedding layer only looks up embeddings (rows of its weight matrix) for the integer indices of the input words; it does not know anything about the strings. This means you first need to convert your input sequence of words to a sequence of indices, using the same vocabulary as was used in the model you take the embeddings from.
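A rough sketch of that flow, with vocab and pretrained_matrix as placeholders for whatever the published lm_1b files actually contain (index i of the vocabulary must map to row i of the matrix):
import numpy as np
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Placeholder vocabulary and weight matrix; in practice both come from the
# published embedding files, and their row order must match.
vocab = {"<pad>": 0, "the": 1, "cat": 2, "sat": 3}
pretrained_matrix = np.random.rand(len(vocab), 128)      # (vocab_size, dim)

# Words -> integer indices, using the same vocabulary as the pretrained model.
sentence = ["the", "cat", "sat"]
indices = pad_sequences([[vocab.get(w, 0) for w in sentence]],
                        maxlen=60, padding="post")

# Frozen Embedding layer initialised with the pretrained weights.
embedding_layer = Embedding(input_dim=len(vocab),
                            output_dim=128,
                            weights=[pretrained_matrix],
                            input_length=60,
                            trainable=False)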
For NLP applications related to word or text encoding I would use CountVectorizer or TfidfVectorizer. Both are presented and briefly described for Python in the following reference: http://www.bogotobogo.com/python/scikit-learn/files/Python_Machine_Learning_Sebastian_Raschka.pdf
CountVectorizer can be used for simple applications such as a spam/ham detector, while TfidfVectorizer gives a deeper insight into how relevant each term (word) is, based on its frequency in a document and the number of documents in which it appears; this yields an interesting measure of how discriminative the terms are. These text feature extractors can be combined with stop-word removal and lemmatization to improve the feature representation.

Addressing synonyms in Supervised Learning for Text Classification

I am using a scikit-learn supervised learning method for text classification. I have a training dataset with input text fields and the categories they belong to. I use a tf-idf + SVM classifier pipeline to create the model. The solution works well for normal test cases, but if a new text is entered that contains synonyms of words in the training set, the solution fails to classify it correctly.
For example, the word 'run' might be in the training data, but if I use the word 'sprint' to test, the solution fails to classify correctly.
What is the best approach here? Adding all synonyms for all words in the training dataset doesn't look like a scalable approach to me.
You should look into word vectors and dense document embeddings. Right now you are passing scikit-learn a matrix X, where each row is a numerical representation of a document in your dataset. You are getting this representation with tf-idf, but as you noticed this doesn't capture word similarities, and you are also having issues with out-of-vocabulary words.
A possible improvement is to represent each word with a dense vector of, let's say, dimension 300, in such a way that words with similar meanings are close in this 300-dimensional space. Fortunately you don't need to build these vectors from scratch (look up gensim word2vec and spaCy). Another good thing is that by using word embeddings pre-trained on a very large corpus like Wikipedia, you are incorporating a lot of linguistic information about the world into your algorithm that you couldn't infer from your corpus otherwise (like the fact that sprint and run are synonyms).
Once you have a good, semantic numeric representation for words, you need a vector representation for each document. The simplest way is to average the word vectors of all the words in the sentence.
Example pseudocode to get you started:
>>> import spacy
>>> nlp = spacy.load('en')
>>> doc1 = nlp('I had a good run')
>>> doc1.vector
array([ 6.17495403e-02, 2.07064897e-02, -1.56451517e-03,
1.02607915e-02, -1.30429687e-02, 1.60102192e-02, ...
Now let's try a different document:
>>> doc2 = nlp('I had a great sprint')
>>> doc2.vector
array([ 0.02453461, -0.00261007, 0.01455955, -0.01595449, -0.01795897,
-0.02184369, -0.01654281, 0.01735667, 0.00054854, ...
>>> doc2.similarity(doc1)
0.8820845113100807
Note how the vectors are similar (in the sense of cosine similarity) even when the words are different. Because the vectors are similar, a scikit-learn classifier will learn to assign them to the same category. With a tf-idf representation this would not be the case.
This is how you can use these vectors in scikit-learn:
X = [nlp(text).vector for text in corpus]
clf.fit(X, y)

LSTM neural network for a chemical process?

I have the following dataset for a chemical process in a refinery. It consists of 5x5 input vectors, where each vector is sampled every minute. The output is the result of the whole process and is sampled every 5 minutes.
I concluded that the output (yellow) depends heavily on past input vectors over time, and I recently started looking at LSTMs, trying to learn a bit about them in Python and Torch.
However, I have no idea how to prepare my dataset so that my LSTM can process it and produce future predictions when tested with new input vectors.
Is there a straightforward way to preprocess my dataset accordingly?
EDIT 1: I actually found this great blog post about training LSTMs for natural language processing: http://karpathy.github.io/2015/05/21/rnn-effectiveness/. Long story short, an LSTM takes a character as input and tries to generate the next character; eventually it can be trained on Shakespeare poems to generate new Shakespeare poems, though GPU acceleration is recommended.
EDIT 2: Based on EDIT 1, the best way to format my dataset seems to be to convert my Excel file to a text file with tab-separated columns. I'll post the results of the LSTM prediction on my numbers dataset above as soon as possible.
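For reference, a minimal sketch of the windowing an LSTM would need, under the assumption that each minute contributes one 5-feature row stacked into an array of shape (n_minutes, 5) and that each output value corresponds to the preceding 5 minutes (Keras is used here purely for illustration; the same shaping applies in Torch):
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Assumed shapes: one 5-feature vector per minute, one target value every 5 minutes.
n_minutes, n_features, window = 1000, 5, 5
X_minutes = np.random.rand(n_minutes, n_features)    # placeholder for the real readings
y = np.random.rand(n_minutes // window)              # placeholder process outputs

# Group every 5 consecutive minute-vectors into one sequence sample.
X_seq = X_minutes[: len(y) * window].reshape(len(y), window, n_features)

model = Sequential([
    LSTM(32, input_shape=(window, n_features)),
    Dense(1)                                         # regression on the process output
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_seq, y, epochs=10, batch_size=32, verbose=0)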

Resources