How can we use dependency parser output for text embeddings or feature extraction from text? - machine-learning

Knowing the dependencies between the various parts of a sentence
can add information beyond what we get from raw text. The question is how to use this output to build a good feature representation that can be fed into a classifier such as logistic regression or an SVM, just as TfidfVectorizer gives us a vector representation for text documents. I'd like to know what methods exist for building this kind of representation from the output of a dependency parser.
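For example, I could imagine turning each token's dependency triple into a string and counting those strings like word n-grams (a rough sketch below, assuming spaCy's en_core_web_sm model and scikit-learn), but I don't know whether this is a reasonable approach or what the alternatives are:

import spacy
from sklearn.feature_extraction.text import CountVectorizer

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def dependency_features(text):
    # One space-separated string of dep(head_lemma,child_lemma) triples per document.
    doc = nlp(text)
    return " ".join(f"{tok.dep_}({tok.head.lemma_},{tok.lemma_})" for tok in doc)

docs = ["The cat chased the mouse.", "The mouse was chased by the cat."]
dep_docs = [dependency_features(d) for d in docs]

# Each distinct triple becomes one feature column, usable by logistic regression, an SVM, etc.
vectorizer = CountVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform(dep_docs)
print(X.shape, vectorizer.get_feature_names_out()[:5])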

Related

Get sentence vector for a K-means clustering task

I am working on a project that groups jobs posted on various job portals into clusters based on the job descriptions, using K-means.
I found the word vectors using Word2Vec, but I guess this will not serve the purpose, as I need a vector for the whole job description.
I know that I can average the word vectors of a sentence to get a sentence vector, but I am worried about the accuracy, since this loses the ordering of the words.
Is there any other way I can get the vectors?
The most common approaches to text vectorization:
Pure TF-IDF, which can still be useful, especially with n-grams.
Using Word2Vec to get vectors for the words, and representing the whole text by the mean of all its word vectors.
Combining the first two methods: a weighted mean of all the word vectors in the text, using the TF-IDF coefficients as the weights (see the sketch after this list).
I would suggest trying each and picking whichever performs better in your case. The results can differ slightly depending on the nature of the data.
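A rough sketch of the third option, assuming gensim 4.x and scikit-learn (the toy corpus and hyperparameters are only illustrative):

import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["senior python developer remote", "data engineer spark airflow", "junior java developer onsite"]
tokenized = [d.split() for d in docs]

w2v = Word2Vec(sentences=tokenized, vector_size=50, min_count=1, epochs=50)
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(docs)
vocab = tfidf.vocabulary_  # word -> column index in the TF-IDF matrix

def doc_vector(doc_idx, tokens):
    # Weighted mean of word vectors, with weights taken from the document's TF-IDF row.
    vecs, weights = [], []
    for tok in tokens:
        if tok in w2v.wv and tok in vocab:
            vecs.append(w2v.wv[tok])
            weights.append(tfidf_matrix[doc_idx, vocab[tok]])
    if not vecs:
        return np.zeros(w2v.vector_size)
    return np.average(vecs, axis=0, weights=weights)

X = np.vstack([doc_vector(i, toks) for i, toks in enumerate(tokenized)])
print(X.shape)  # (n_documents, 50); ready for K-means or a classifier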
You can leverage transfer learning via very useful sentence embedding methods such as bert-as-service, SentenceBERT, or the Universal Sentence Encoder. All of them are easy to use, and there are plenty of tutorials on the web. They will work better than TF-IDF in most cases.
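For example, with the sentence-transformers package (a minimal sketch; the model name is just one commonly used checkpoint, not a specific recommendation):

from sentence_transformers import SentenceTransformer

# Assumes: pip install sentence-transformers
model = SentenceTransformer("all-MiniLM-L6-v2")
job_descriptions = [
    "Senior Python developer, remote, backend services.",
    "Data engineer with Spark and Airflow experience.",
]
embeddings = model.encode(job_descriptions)  # shape: (n_texts, embedding_dim)
print(embeddings.shape)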
You can also try doc2vec, an extension of word2vec that builds representations of a whole document. There is an implementation in gensim available:
https://radimrehurek.com/gensim/models/doc2vec.html
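A minimal sketch with gensim's Doc2Vec (toy corpus and untuned hyperparameters):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    "senior python developer remote backend",
    "data engineer spark airflow pipelines",
    "junior java developer onsite",
]
tagged = [TaggedDocument(words=doc.split(), tags=[i]) for i, doc in enumerate(corpus)]

# Train a small model; vector_size, window and epochs are placeholders.
model = Doc2Vec(tagged, vector_size=50, window=3, min_count=1, epochs=40)

# Vector for a new, unseen job description; these vectors can go straight into K-means.
new_vec = model.infer_vector("remote python data engineer".split())
print(new_vec.shape)  # (50,)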

Data Preprocessing for NLP Pre-training Models (e.g. ELMo, Bert)

I plan to train an ELMo or BERT model from scratch based on the data (notes typed by people) I have on hand. The data I have now was all typed by different people, and there are problems with spelling, formatting, and inconsistencies in the sentences. After reading the ELMo and BERT papers, I know that both models use a lot of sentences, for example from Wikipedia. I haven't been able to find any processed training samples or any preprocessing tutorial for the ELMo or BERT models. My questions are:
Do the BERT and ELMo models have standard data preprocessing steps or standard processed data formats?
Given my existing dirty data, is there any way to preprocess it so that the resulting word representations are more accurate?
BERT uses WordPiece embeddings, which helps somewhat with dirty data.
https://github.com/google/sentencepiece
Google Research also provides the data preprocessing code they use:
https://github.com/google-research/bert/blob/master/tokenization.py
The default ELMo implementation takes tokens as input (if you provide an untokenized string, it will split it on spaces). Thus spelling correction, deduplication, lemmatization (e.g. with spaCy, https://spacy.io/api/lemmatizer), separating tokens from punctuation, and other standard preprocessing methods may help.
You can check standard ways to preprocess text in the NLTK package:
https://www.nltk.org/api/nltk.tokenize.html (for example the Twitter tokenizer). Be aware that NLTK itself is slow. Many machine learning libraries also provide their own basic preprocessing (https://github.com/facebookresearch/pytext, https://keras.io/preprocessing/text/).
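A small sketch of such a pipeline, assuming NLTK and spaCy with the en_core_web_sm model are installed (the example note and the exact steps are only illustrative):

import spacy
from nltk.tokenize import TweetTokenizer

nlp = spacy.load("en_core_web_sm")
tweet_tok = TweetTokenizer()

note = "Pt c/o headache,took 2x ibuprofen...feeling better now!!"  # made-up noisy note

# 1. Tokenize, separating punctuation from words (TweetTokenizer copes well with informal text).
tokens = tweet_tok.tokenize(note)

# 2. Lemmatize and lowercase with spaCy to reduce inflection and casing variation.
doc = nlp(" ".join(tokens))
lemmas = [tok.lemma_.lower() for tok in doc if not tok.is_space]
print(lemmas)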
You can also experiment with providing BPE encodings or character n-grams as the input.
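For example, with the sentencepiece package (a minimal sketch; the file names and vocabulary size are placeholders):

import sentencepiece as spm

# Train a BPE subword model on raw text, one sentence per line ("notes.txt" is a placeholder).
spm.SentencePieceTrainer.train(
    input="notes.txt", model_prefix="notes_bpe",
    vocab_size=8000, model_type="bpe"
)

# Encode new text into subword pieces; misspellings still map to plausible pieces.
sp = spm.SentencePieceProcessor(model_file="notes_bpe.model")
print(sp.encode("pateint complaned of headdache", out_type=str))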
It also depends on how much data you have; the more data you have, the smaller the benefit of preprocessing (in my opinion). Given that you want to train ELMo or BERT from scratch, you should have a lot of data.

Using Text Sentiment as a Feature in a Machine Learning Model?

I am researching which features I will have for my machine learning model, given the data I have. My data contains a lot of text, so I was wondering how to extract valuable features from it. Contrary to my previous belief, this often consists of a representation such as bag-of-words or word2vec (http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction).
Because my understanding of the subject is limited, I don't understand why I can't analyze the text first to get numeric values (for example: TextBlob's sentiment, https://textblob.readthedocs.io/en/dev/, or Google Cloud Natural Language, https://cloud.google.com/natural-language/).
Are there problems with this, or could I use these values as features for my machine learning model?
Thanks in advance for all the help!
Of course, you can convert text input to a single number with sentiment analysis and then use this number as a feature in your machine learning model. There is nothing wrong with this approach.
The question is what kind of information you want to extract from the text data. Sentiment analysis converts the text into a number between -1 and 1 that represents how positive or negative the text is. For example, you may want sentiment information from customers' comments about a restaurant to measure their satisfaction. In this case, it is fine to use sentiment analysis to preprocess the text data.
But again, sentiment analysis only gives an idea of how positive or negative a text is. If you instead want to cluster text data, sentiment information is not useful, since it says nothing about the similarity of texts. For such tasks, other approaches such as word2vec or bag-of-words are used to represent the text, because those algorithms provide a vector representation of each text instead of a single number.
In conclusion, the approach depends on what kind of information you need to extract from the data for your specific task.
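For the restaurant example, a minimal sketch with TextBlob (the reviews and the extra numeric column are made up):

import numpy as np
from textblob import TextBlob

# Assumes: pip install textblob
reviews = ["The food was amazing!", "Terrible service, never again."]
other_numeric_feature = np.array([[4.5], [1.0]])  # e.g. a star rating, placeholder values

# polarity lies in [-1, 1]: negative to positive.
sentiment = np.array([[TextBlob(r).sentiment.polarity] for r in reviews])

# Stack the sentiment score next to the other numeric features.
X = np.hstack([other_numeric_feature, sentiment])
print(X)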

Character-Word Embeddings from lm_1b in Keras

I would like to use some pre-trained word embeddings in a Keras NN model; they were published by Google in a very well known article. Google has provided the code to train a new model, as well as the embeddings, here.
However, it is not clear from the documentation how to retrieve an embedding vector for a given string of characters (a word) with a simple Python function call. Much of the documentation centers on dumping vectors to a file for an entire sentence, presumably for sentiment analysis.
So far, I have seen that you can feed in pretrained embeddings with the following syntax:
embedding_layer = Embedding(number_of_words??,
                            output_dim=128??,
                            weights=[pre_trained_matrix_here],
                            input_length=60??,
                            trainable=False)
However, converting the different files and their structures to pre_trained_matrix_here is not quite clear to me.
They have several softmax outputs, so I am uncertain which one to use, and, furthermore, how to align the words in my input with the dictionary of words they provide.
Is there a simple way to use these word/character embeddings in Keras, and/or to construct the character/word embedding portion of the model in Keras so that further layers can be added for other NLP tasks?
The Embedding layer only looks up embeddings (rows of the weight matrix) by the integer indices of the input words; it does not know anything about the strings. This means you first need to convert your input sequence of words to a sequence of indices, using the same vocabulary that was used in the model you take the embeddings from.
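A minimal sketch of that conversion and of plugging the matrix into a frozen tf.keras-style Embedding layer; here pretrained_matrix and vocab_words stand in for whatever you actually load from the published lm_1b files:

import numpy as np
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_words = ["<PAD>", "<UNK>", "the", "cat", "sat"]        # placeholder vocabulary
pretrained_matrix = np.random.rand(len(vocab_words), 128)    # placeholder weights
word_to_index = {w: i for i, w in enumerate(vocab_words)}

def encode(sentence, maxlen=60):
    # Map each word to its row index in the embedding matrix, then pad to a fixed length.
    ids = [word_to_index.get(w, word_to_index["<UNK>"]) for w in sentence.lower().split()]
    return pad_sequences([ids], maxlen=maxlen, padding="post")[0]

x = encode("The cat sat")   # integer indices, aligned with the embedding rows

embedding_layer = Embedding(input_dim=len(vocab_words),
                            output_dim=pretrained_matrix.shape[1],
                            weights=[pretrained_matrix],
                            input_length=60,
                            trainable=False)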
For NLP applications related to word or text encoding, I would use CountVectorizer or TfidfVectorizer. Both are described briefly, with Python examples, in the following reference: http://www.bogotobogo.com/python/scikit-learn/files/Python_Machine_Learning_Sebastian_Raschka.pdf
CountVectorizer can be used for a simple application such as a spam/ham detector, while TfidfVectorizer gives deeper insight into how relevant each term (word) is, based on its frequency in the document and the number of documents in which it appears; this yields an interesting measure of how discriminative the terms are. These text feature extractors can be combined with stop-word removal and lemmatization to improve the feature representations.
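A minimal sketch of both vectorizers on a toy corpus, assuming scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "free prize click now",            # spam-like
    "meeting moved to friday",         # ham-like
    "click the link for your prize",
]

counts = CountVectorizer(stop_words="english").fit_transform(corpus)
tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)

print(counts.toarray())  # raw term counts per document
print(tfidf.toarray())   # counts down-weighted by how common each term is across documents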

How do I combine text and numerical features in training set for machine learning?

I am trying to predict the number of likes on a post in a social network based on both numerical features and text features. I now have a dataframe with the required features, but I don't know what to do with the posts' text data. Should I vectorize it, or do something else, in order to get a suitable training matrix? I am going to use LinearSVC from sklearn for the analysis.
There are a lot of different ways you can transform your text features into numerical ones.
One of the most common is the bag-of-words approach, where you transform your text into an array with the occurrence counts of each word.
If you are using scikit-learn, I recommend reading their Text Feature Extraction User Guide.
Also look at the NLTK toolkit for more complex ways to process your text data.
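A minimal sketch that combines a text column with numeric columns in one scikit-learn pipeline (the column names and toy data are placeholders; if you predict a raw like count rather than a class, swap LinearSVC for a regressor such as LinearSVR):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

df = pd.DataFrame({
    "text": ["great new product launch", "boring update", "huge giveaway today"],
    "followers": [1200, 300, 5000],
    "hour_posted": [18, 3, 12],
    "liked_a_lot": [1, 0, 1],   # placeholder target, e.g. likes above some threshold
})

preprocess = ColumnTransformer([
    ("text", TfidfVectorizer(), "text"),                       # sparse text features
    ("num", StandardScaler(), ["followers", "hour_posted"]),   # scaled numeric features
])

model = Pipeline([("features", preprocess), ("clf", LinearSVC())])
model.fit(df[["text", "followers", "hour_posted"]], df["liked_a_lot"])
print(model.predict(df[["text", "followers", "hour_posted"]]))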
