I am doing my first project in the NLP domain, which is sentiment analysis of a dataset with ~250 tagged English data points/sentences. The dataset consists of reviews of a pharmaceutical product with positive, negative or neutral tags. I have worked with numeric data in supervised learning for 3 years, but NLP is uncharted territory for me. So I want to know the pre-processing techniques and steps that are best suited to my problem. A guideline from an NLP expert would be much appreciated!
Based on your comment on mohammad karami's answer, what you haven't understood is the paragraph or sentence representation (you said "converting to numeric is the real question"). With numerical data, suppose you have a table with 2 columns (features) and a label, something like "work experience" and "age" as features and "salary" as the label (to predict a salary based on age and work experience). In NLP, features are most of the time on the word level (they can sometimes be on the character level or subword level too). These features are called tokens, and the columns are now replaced with these tokens. The simplest way to build a paragraph representation is bag of words: after preprocessing, every unique word is mapped to a column. So suppose we have training data with 2 rows as follows:
"I help you and you should help me"
"you and I"
The unique words become the columns, so the table might look like:
I | help | you | and | should | me
Now the two samples would have the following values:
[1, 2, 2, 1, 1, 1]
[1, 0, 1, 1, 0, 0]
Notice that the first element of each array is 1, because both samples contain the word "I" exactly once. The second element is 2 in the first row and 0 in the second row, because the word "help" occurs twice in the first sample and never in the second. The logic a model learns from this would be something like "if word A, word B... exist and word H, word I... don't exist, then the label is positive".
Bag of words works most of the time, but it has problems. One is dimensionality (imagine there are four billion unique words: the features are too many). It also doesn't take the order of words into account, it does not capture similarity between words (each word is just its own column), and there are many more. At the time of writing, the state of the art for NLP is BERT; learn that if you want to use what's best.
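As a minimal sketch of the bag-of-words table above, using only the Python standard library and plain whitespace tokenization (the function name bag_of_words is made up for this illustration; a real project would more likely use a library vectorizer):

```python
from collections import Counter

def bag_of_words(sentences):
    """Return (vocabulary, count vectors) for whitespace-tokenized sentences."""
    tokenized = [s.split() for s in sentences]
    # Vocabulary in order of first appearance, one column per unique word.
    vocabulary = []
    for tokens in tokenized:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)
    # One row per sentence: how often each vocabulary word occurs in it.
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = bag_of_words([
    "I help you and you should help me",
    "you and I",
])
print(vocabulary)  # ['I', 'help', 'you', 'and', 'should', 'me']
print(vectors)     # [[1, 2, 2, 1, 1, 1], [1, 0, 1, 1, 0, 0]]
```

Word order is discarded here, which is exactly the limitation described above.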
First of all, you have to specify what features you want to have and then do the pre-processing. However, you can:
1- Remove HTML tags
2- Remove extra whitespaces
3- Convert accented characters to ASCII characters
4- Expand contractions
5- Remove special characters
6- Lowercase all text
7- Convert number words to numeric form
8- Remove numbers
9- Remove stopwords
10- Lemmatization
Adapt these steps to your own data. I suggest looking at the NLTK package for NLP. NLTK also has a sentiment analysis function, which may help your work.
Then extract your features with tf-idf or any other feature extraction or feature selection algorithm, and, after scaling, feed them to the machine learning algorithm.
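Several of the cleaning steps listed above can be sketched with the standard library alone. This is only an illustration: the CONTRACTIONS and STOPWORDS tables are tiny made-up samples, and number-word conversion and lemmatization are left out because they need a fuller resource such as NLTK.

```python
import re
import unicodedata

# Tiny illustrative tables (assumptions); real projects would use fuller lists.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}
STOPWORDS = {"the", "a", "an", "is", "it", "this", "and", "to", "of"}

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)               # remove HTML tags
    text = unicodedata.normalize("NFKD", text)          # split accents from letters
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII marks
    text = text.lower()                                 # lowercase
    for short, full in CONTRACTIONS.items():            # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text)               # remove special chars and numbers
    tokens = text.split()                               # also collapses extra whitespace
    return [t for t in tokens if t not in STOPWORDS]    # remove stopwords

cleaned = preprocess("<p>This  café drug DON'T work, it's 2 bad!</p>")
print(cleaned)  # ['cafe', 'drug', 'do', 'not', 'work', 'bad']
```

Note that "not" is deliberately kept out of the stopword list here, since negation usually matters for sentiment.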
I'm working on a project that generates mnemonics, and I have a problem with my model.
My question is: how do I make sure my model generates meaningful sentences using a loss function?
Aim of the project
To generate mnemonics for a list of words. Given a list of words the user wants to remember, the model will output a meaningful, simple and easy-to-remember sentence whose words start with the first one or two letters of the words the user wants to remember. My model will receive only the first two letters of each word the user wants to remember, since they carry all the information needed for the mnemonic to be generated.
Dataset
I'm using Kaggle's 55000+ song lyrics dataset. The sentences I take from those lyrics contain 5 to 10 words, and the mnemonics I want to generate contain the same number of words.
Input/Output Preprocessing.
I iterate through all the sentences, remove punctuation and numbers, extract the first 2 letters of each word, and assign a unique number to each pair of letters using a predefined dictionary whose keys are letter pairs and whose values are unique numbers.
The list of these unique numbers acts as the input, and the GloVe vectors of the corresponding words act as the output. At each time step, the LSTM model takes the unique number assigned to a word and outputs the corresponding word's GloVe vector.
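A rough sketch of this input encoding, under stated assumptions: the helper names are hypothetical, and the dictionary here is filled on the fly (ids starting at 1), whereas the description above uses a predefined one.

```python
import re

def letter_pairs(sentence):
    """Strip punctuation and digits, then keep the first two letters of each word."""
    words = re.sub(r"[^A-Za-z\s]", "", sentence).lower().split()
    return [word[:2] for word in words]

def encode(sentence, pair_to_id):
    """Map each letter pair to its integer id, registering unseen pairs."""
    ids = []
    for pair in letter_pairs(sentence):
        if pair not in pair_to_id:
            pair_to_id[pair] = len(pair_to_id) + 1
        ids.append(pair_to_id[pair])
    return ids

pair_to_id = {}
pairs = letter_pairs("Every good boy does fine!")
ids = encode("Every good boy does fine!", pair_to_id)
print(pairs)  # ['ev', 'go', 'bo', 'do', 'fi']
print(ids)    # [1, 2, 3, 4, 5]
```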
Model Architecture
I'm using LSTMs with 10 time steps.
At each time step the unique number associated with the pair of letters will be fed and the output will be the GloVe vector of the corresponding word.
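from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dropout, TimeDistributed, Dense
from keras.optimizers import RMSprop

# Written against older Keras; newer versions use learning_rate= and loss='cosine_similarity'.
optimizer = RMSprop(lr=0.0008)
model = Sequential()
# Embed the letter-pair ids (733 possible ids, sequences of length 12).
model.add(Embedding(input_dim=733, output_dim=40, input_length=12))
model.add(Bidirectional(LSTM(250, return_sequences=True, activation='relu'), merge_mode='concat'))
model.add(Dropout(0.4))  # dropout has to be added via model.add to take effect
model.add(Bidirectional(LSTM(350, return_sequences=True, activation='relu'), merge_mode='concat'))
model.add(Dropout(0.4))
# One output vector per time step.
model.add(TimeDistributed(Dense(332, activation='tanh')))
model.compile(loss='cosine_proximity', optimizer=optimizer, metrics=['acc'])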
Results:
My model is outputting Mnemonics which match the first two letters of each word in the input. But the mnemonic generated carries little to no meaning.
I have realized this problem is caused by the way I'm training. During training, the letter pairs are already in an order that forms a sentence. This is not the case at test time: the order in which I feed the letter pairs may not have a high probability of forming a sentence.
So I built a bigram model from my data and fed the permutation with the highest probability of sentence formation into my mnemonic generator model. Though there were some improvements, the sentence as a whole still didn't make any sense.
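One way such a bigram-based reordering could look is sketched below. This is an assumption about the approach, not the asker's actual code: it scores each permutation by summed bigram counts rather than probabilities, the corpus is a made-up toy, and trying all permutations only scales to short lists.

```python
from collections import Counter
from itertools import permutations

def bigram_counts(sentences):
    """Count adjacent token pairs over a corpus of token sequences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts

def best_order(items, counts):
    """Return the permutation whose adjacent pairs are most frequent in the corpus."""
    def score(order):
        return sum(counts[pair] for pair in zip(order, order[1:]))
    return max(permutations(items), key=score)

corpus = ["th ca sa", "th ca ra", "a ca sa"]  # hypothetical letter-pair sequences
counts = bigram_counts(corpus)
ordered = best_order(["sa", "th", "ca"], counts)
print(ordered)  # ('th', 'ca', 'sa')
```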
I’m stuck at this point.
My question is: how do I make sure my model generates meaningful sentences using a loss function?
First, I have a couple of unrelated suggestions. I do not think you should output the GloVe vector of each word. Why? Word2Vec-style approaches are meant to encapsulate word meanings and would probably not contain information about their spelling. However, the meaning is also helpful for producing a meaningful sentence. Thus, I would instead have the LSTM produce its own hidden state after reading the first two letters of each word (just as you currently do). I would then have that sequence be unrolled (as you currently do) into sequences of dimension one (indexes into an index-to-word map). I would then take that output, process it through an embedding layer that maps the word indexes to their GloVe embeddings, and run that through another output LSTM to produce more indexes. You can stack this as much as you want, but 2 or 3 levels will probably be good enough.
Even with these changes, it is unlikely you will see any success in generating easy-to-remember sentences. For that main issue, I think there are generally two ways you can go. The first is to augment your loss with some measure of whether the resulting sentence is a 'valid English sentence'. You can do this programmatically, with some accuracy, by POS tagging the output sentence and adding loss relative to whether it follows a standard sentence structure (subject, predicate, adverbs, direct objects, etc.). Though this approach might be easier than the following alternative, it might not yield truly natural results.
The second way, which I would recommend, is to use a GAN, in addition to training your model in its current fashion, to judge whether the output sentences are natural sentences. There are many resources on Keras GANs, so I do not think you need specific code in this answer. However, here is an outline of how your model should train logically:
Augment your current training with two additional phases.
First, train the discriminator to judge whether or not the output sentence is natural. You can do this by having an LSTM model read sentences and give a sigmoid output (0/1) for whether or not they are 'natural'. You can then train this model on some dataset of real sentences with 1 labels and your generated sentences with 0 labels, at roughly a 50/50 split.
Then, in addition to the current loss function for actually generating the mnemonics, add a loss term that is the binary cross-entropy score for your generated sentences against 1 (true) labels. Be sure to freeze the discriminator model while doing this.
Continue iterating over these two steps (training each for 1 epoch at a time) until you start to see more reasonable results. You may need to play with how much each loss term is weighted in the generator (your model) in order to get the right trade-off between a correct mnemonic and an easy-to-remember sentence.
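The two loss terms in this outline can be written down framework-free as a sketch. Assumptions: the discriminator outputs a probability in (0, 1) per sentence, mnemonic_loss stands for whatever the generator already minimizes, and adv_weight is a hypothetical name for the trade-off weight mentioned above. In a real setup these would be Keras losses on tensors, not Python floats.

```python
import math

def binary_cross_entropy(prob, label):
    """BCE for one predicted probability against a 0/1 label."""
    eps = 1e-7
    prob = min(max(prob, eps), 1 - eps)  # avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

def discriminator_loss(real_probs, fake_probs):
    """Phase 1: real sentences are labelled 1, generated sentences 0."""
    losses = [binary_cross_entropy(p, 1) for p in real_probs]
    losses += [binary_cross_entropy(p, 0) for p in fake_probs]
    return sum(losses) / len(losses)

def generator_loss(mnemonic_loss, fake_probs, adv_weight=0.5):
    """Phase 2: mnemonic loss plus weighted BCE of generated sentences against label 1."""
    adversarial = sum(binary_cross_entropy(p, 1) for p in fake_probs) / len(fake_probs)
    return mnemonic_loss + adv_weight * adversarial

# The generator is rewarded when the (frozen) discriminator rates its output as natural.
print(generator_loss(0.2, [0.9, 0.8]) < generator_loss(0.2, [0.1, 0.2]))  # True
```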
I am trying to build my own corpus for particular categories such as Engineering, Business, Math, Science, etc. This will be for automatic web page categorization. Let's say I manually collect 100 websites that are related to Math. Can these 100 websites be considered a corpus for Math?
Another related question: how does this differ from a lexicon, which, instead of a list of websites, is a list of words with weights such as 0 or 1 for particular categories? An example would be a sentiment lexicon with words that have weights for positive and negative, but with categories such as Math and Science used instead of positive and negative.
You say you want to do web page categorization, so the problem you're facing is a supervised learning problem. The data you get are web pages, so I guess you actually extract their content as text; you work with textual input data. Since you want to categorize them, each of your inputs has one or more corresponding labels, which are the outputs you want to predict. If a page can carry several labels at once, you want to do multi-label classification (if each page belongs to exactly one category, it is multi-class classification).
To tackle this problem, since most machine learning algorithms work with numerical vectors, you need to transform your corpus of texts into vectors (or into one matrix). To do so, you can use the bag-of-words technique, which first builds a dictionary or lexicon and then counts the occurrences of each word of the dictionary in each text. Actually, you can transform your output labels in the same way, attributing an index of your output vector to each category.
The final pipeline would be something like this:
[input_text] --bag_of_word--> [input_vector] --prediction--> [output_vector] --label_matching--> [labels]
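The two ends of this pipeline that deal with labels can be sketched as follows. The category list comes from the question; the 0.5 threshold and the function names are assumptions made for the illustration, and the prediction step in the middle is left to whichever classifier is used.

```python
CATEGORIES = ["Engineering", "Business", "Math", "Science"]

def labels_to_vector(labels):
    """Multi-hot encoding: one index of the output vector per category."""
    return [1 if category in labels else 0 for category in CATEGORIES]

def vector_to_labels(scores, threshold=0.5):
    """Label matching: keep every category whose score reaches the threshold."""
    return [c for c, s in zip(CATEGORIES, scores) if s >= threshold]

print(labels_to_vector({"Math", "Science"}))   # [0, 0, 1, 1]
print(vector_to_labels([0.1, 0.2, 0.9, 0.6]))  # ['Math', 'Science']
```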
What ML algorithms can I use to learn to predict action phrases for a given sentence?
Sentence1: I want to play cricket
Label1: play cricket
Sentence2: Need to wash my clothes
Label2: wash clothes
I have data consisting of ~2k sentences and their corresponding action phrases (labels), and I need to predict the action phrases for another batch of sentences based on them. Can someone guide me on how to do this using NLP/ML? Which algorithms should I use (preferably in Python)?
Here's the process of sentence classification:
1) Normalize the text - bring all text to lower case
2) Remove all stop words - ensures that only relevant features are left
3) Tokenize the sentences to unigram tokens
4) Apply a stemming technique - try out different stemmers/lemmatizers to bring the words to their base form, and see which one works best for your case. For example: play, played, plays will be converted to the base word "play". This step reduces the number of features.
5) Create a Term Document Matrix for all the sentences. Each row of the TDM corresponds to a sentence and each column of the TDM corresponds to a token. (There's another way of representing text in matrix form called TF-IDF.)
6) Now this term document matrix contains tokens as columns. You already have the labels in place. You can start training the ML models now. I'm assuming you know how to do this part.
Take a look at NLTK's Naive Bayes Classifier; it's multiclass and you can feed it the sentence/label pairs directly. NaiveBayesClassifier.train() will want training features. I would start with the features simply being the words in each sentence. You can modify the feature selection with more complex methods until you get the results you want.
You can use nltk.classify.util.accuracy to evaluate results. Remember to split your sentences into training and test data.
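A minimal sketch of that NLTK workflow, assuming NLTK is installed. The first two sentence/label pairs come from the question; the other sentences are made up for the illustration, and the features helper is just the "words in the sentence" starting point described above. With real data, the split into training and test sets would be done on the ~2k labelled sentences.

```python
from nltk import NaiveBayesClassifier
from nltk.classify.util import accuracy

def features(sentence):
    """Featureset expected by NLTK: a dict, here one boolean entry per word."""
    return {word: True for word in sentence.lower().split()}

labelled = [
    ("I want to play cricket", "play cricket"),
    ("Need to wash my clothes", "wash clothes"),
    ("We should play cricket today", "play cricket"),  # made-up example
    ("Please wash the clothes", "wash clothes"),        # made-up example
]
train_set = [(features(sentence), label) for sentence, label in labelled]
test_set = [(features("Let us play cricket"), "play cricket")]

classifier = NaiveBayesClassifier.train(train_set)
predicted = classifier.classify(features("I need to wash clothes"))
print(predicted)
print(accuracy(classifier, test_set))
```

This treats each distinct action phrase as a class, so it can only predict phrases that already appear as labels in the training data.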
I have a Naive Bayes classifier (implemented with WEKA) that looks for uppercase letters.
contains_A
contains_B
...
contains_Z
For a certain class the word LCD appears in almost every instance of the training data. When I get the probability for "LCD" to belong to that class it is something like 0.988. win.
When I get the probability for "L" I get a plain 0 and for "LC" I get 0.002. Since features are naive, shouldn't the L, C and D contribute to overall probability independently, and as a result "L" have some probability, "LC" some more and "LCD" even more?
At the same time, when I run the same experiment with an MLP, instead of the above behavior it gives probabilities of 0.006, 0.5 and 0.8.
So the MLP does what I would expect Naive Bayes to do, and vice versa. Am I missing something? Can anyone explain these results?
I am not familiar with the internals of WEKA, so please correct me if you think that I am not right.
When using a text as a "feature", the text is transformed into a vector of binary values. Each value corresponds to one concrete word. The length of the vector is equal to the size of the dictionary.
If your dictionary contains 4 words: LCD, VHS, HELLO, WORLD
then, for example, the text HELLO LCD will be transformed to [1,0,1,0].
I do not know how WEKA builds its dictionary, but I think it might go over all the words present in the examples. Unless "L" is present in the dictionary (and therefore present in the examples), its probability is logically 0. Actually, it should not even be considered a feature.
Actually, you cannot reason over the probabilities of the features and you cannot add them together; I think there is no such relationship between the features.
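The point can be illustrated with a small sketch in Python (not Weka's actual implementation, which is unknown here; the dictionary is the four-word example above and the text is split on whitespace):

```python
DICTIONARY = ["LCD", "VHS", "HELLO", "WORLD"]

def to_binary_vector(text):
    """1 if the dictionary word occurs as a whole token in the text, else 0."""
    tokens = set(text.split())
    return [1 if word in tokens else 0 for word in DICTIONARY]

print(to_binary_vector("HELLO LCD"))  # [1, 0, 1, 0]
print(to_binary_vector("L"))          # [0, 0, 0, 0]
print(to_binary_vector("LC"))         # [0, 0, 0, 0]
```

Under this word-level representation, "L" and "LC" are simply unknown tokens, so they share nothing with the feature for "LCD".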
Beware that in text mining, words (letters in your case) may be given weights different from their actual counts if you are using any sort of term weighting and normalization, e.g. tf.idf. In the case of tf.idf, for example, character counts are converted to a logarithmic scale, and characters that appear in every single instance may be penalized by the idf normalization.
I am not sure what options you are using to convert your data into Weka features, but you can see here that Weka has parameters for such weighting and normalization options:
http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html
-T
Transform the word frequencies into log(1+fij)
where fij is the frequency of word i in jth document(instance).
-I
Transform each word frequency into:
fij*log(num of Documents/num of documents containing word i)
where fij is the frequency of word i in jth document(instance)
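To see what these two options do to a raw count, the formulas quoted above can be evaluated directly. This is a sketch of the formulas only, not of Weka's code, and it assumes the natural logarithm; the function names are made up.

```python
import math

def log_tf(f_ij):
    """-T option: log(1 + f_ij)."""
    return math.log(1 + f_ij)

def tf_idf(f_ij, num_docs, docs_with_word):
    """-I option: f_ij * log(num_docs / docs_with_word)."""
    return f_ij * math.log(num_docs / docs_with_word)

print(log_tf(3))            # a count of 3 is dampened to log(4)
print(tf_idf(3, 100, 100))  # 0.0: a word present in every document gets zero weight
print(tf_idf(3, 100, 10))   # a rarer word keeps a positive weight
```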
I checked the Weka documentation and I didn't see support for extracting letters as features. This implies the Weka function may need a space or punctuation to delimit each feature from those adjacent to it. If so, then "L", "C" and "D" would only be found if they appeared as three separate one-letter words, which would explain why they were not found.
If you think this is it, you could try splitting the text into single characters delimited by \n or space, prior to ingestion.
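Such a splitting step could be done before the data reaches Weka, for example with a short Python helper (a sketch; the function name is made up and whitespace in the original text is dropped):

```python
def split_into_characters(text, delimiter=" "):
    """Put a delimiter between every non-whitespace character."""
    return delimiter.join(ch for ch in text if not ch.isspace())

print(split_into_characters("LCD"))     # L C D
print(split_into_characters("LCD TV"))  # L C D T V
```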
I want to classify sentences with Weka. My features are the sentence terms (words) and the part-of-speech tag of each term. I don't know how to define the attributes: if each term is presented as one feature, the number of features differs between instances (sentences). And if all the words in a sentence are presented as one feature, how do I relate the words to their POS tags?
Any ideas how I should proceed?
If I understand the question correctly, the answer is as follows: It is most common to treat words independently of their position in the sentence and represent a sentence in the feature space by the number of times each of the known words occurs in that sentence. I.e. there is usually a separate numerical feature for each word present in the training data. Or, if you're willing to use n-grams, a separate feature for every n-gram in the training data (possibly with some frequency threshold).
As for the POS tags, it might make sense to use them as separate features, but only if the classification you're interested in has to do with sentence structure (syntax). Otherwise you might want to just append the POS tag to the word, which would partly disambiguate those words that can represent different parts of speech.
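The "append the POS tag to the word" idea can be sketched as follows, assuming the sentence has already been tagged by some tagger (the tags below are hand-written Penn Treebank style examples, and the function name is made up). Every distinct word_TAG string then becomes one numerical feature, so all sentences share the same fixed set of attributes.

```python
def word_pos_features(tagged_sentence):
    """Count 'word_TAG' tokens so each (word, tag) pair becomes one feature."""
    counts = {}
    for word, tag in tagged_sentence:
        key = f"{word.lower()}_{tag}"
        counts[key] = counts.get(key, 0) + 1
    return counts

# "play" as a verb and "play" as a noun end up as two different features.
tagged = [("I", "PRP"), ("play", "VBP"), ("the", "DT"), ("play", "NN")]
print(word_pos_features(tagged))
# {'i_PRP': 1, 'play_VBP': 1, 'the_DT': 1, 'play_NN': 1}
```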