I am looking for a few best practices to clean up Dutch text.
What I have done so far:
1. Used regex to remove all special characters, digits, etc.
2. spaCy NL model for lemmatization of the words
3. NLTK stopwords for Dutch
4. Collecting adjectives for sentiment.
Feature vector: count vectors
But the text is not getting cleaned as expected, and there is no clear separation between positive and negative.
I am looking for some guidance or a solution for NLP problems in Dutch.
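For reference, here is a minimal sketch of the pipeline described above, assuming the spaCy nl_core_news_sm model and the NLTK stopword corpus are installed; the two review strings are only placeholders:

    import re
    import spacy
    from nltk.corpus import stopwords
    from sklearn.feature_extraction.text import CountVectorizer

    nlp = spacy.load("nl_core_news_sm")              # spaCy Dutch model
    dutch_stopwords = set(stopwords.words("dutch"))

    def clean(text):
        # keep letters only (drops digits and special characters)
        text = re.sub(r"[^a-zA-ZÀ-ÿ\s]", " ", text.lower())
        # lemmatize with spaCy, then drop Dutch stopwords
        doc = nlp(text)
        return " ".join(tok.lemma_ for tok in doc
                        if tok.lemma_ not in dutch_stopwords and not tok.is_space)

    reviews = ["Dit product is echt geweldig!",
               "Slechte kwaliteit, erg teleurgesteld."]   # placeholder data
    cleaned = [clean(r) for r in reviews]

    vectorizer = CountVectorizer()                   # count-vector features
    X = vectorizer.fit_transform(cleaned)
    print(vectorizer.get_feature_names_out())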
Related
I am doing my first project in the NLP domain, which is sentiment analysis of a dataset with ~250 tagged English data points/sentences. The dataset consists of reviews of a pharmaceutical product with positive, negative, or neutral tags. I have worked with numeric data in supervised learning for 3 years, but NLP is uncharted territory for me. So I want to know the best pre-processing techniques and the steps best suited to my problem. A guideline from an NLP expert would be much appreciated!
Based on your comment on mohammad karami's answer, what you haven't understood is the paragraph or sentence representation (you said "converting to numeric is the real question"). With numerical data, suppose you have a table with 2 feature columns and a label, maybe something like "work experience", "age", and a label "salary" (to predict salary from age and work experience). In NLP, features are usually on the word level (they can sometimes be on the character or subword level too). These features are called tokens, and the columns are replaced with these tokens. The simplest way to build a paragraph representation is a bag of words: after preprocessing, every unique word is mapped to a column. So suppose we have a training set with 2 rows as follows:
"I help you and you should help me"
"you and I"
The unique words become the columns, so the table might look like:
I | help | you | and | should | me
Now the two samples would have values as follows:
[1, 2, 2, 1, 1, 1]
[1, 0, 1, 1, 0, 0]
Notice that the first element of each array is 1, because both samples contain the word I exactly once. Now look at the second element: it is 2 in the first row and 0 in the second, because the word help occurs twice in the first sample and never in the second. The logic behind this would be something like "if word A, word B, ... exist and word H, word I, ... don't exist, then the label is positive".
Bag of words works most of the time, but it has problems such as dimensionality (imagine there are four billion unique words; that is far too many features). Also notice that it doesn't take word order into account, that every word is just an independent column so similarity between words is not captured, and there are many more issues. The current state of the art for NLP is called BERT; learn that if you want to use what's best.
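As a small, hedged illustration of the idea above, scikit-learn's CountVectorizer produces exactly this kind of table; the token pattern is widened so one-letter words such as "I" are kept, and the vocabulary is sorted alphabetically, so the column order differs from the table in the answer:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["I help you and you should help me", "you and I"]

    vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
    X = vectorizer.fit_transform(docs)

    print(vectorizer.get_feature_names_out())
    # ['and' 'help' 'i' 'me' 'should' 'you']
    print(X.toarray())
    # [[1 2 1 1 1 2]
    #  [1 0 1 0 0 1]]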
First of all, you have to specify what features you want and then do the pre-processing. For example, you can (a rough sketch follows the list):
1- Remove HTML tags
2- Remove extra whitespace
3- Convert accented characters to ASCII characters
4- Expand contractions
5- Remove special characters
6- Lowercase all text
7- Convert number words to numeric form
8- Remove numbers
9- Remove stopwords
10- Lemmatization
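A rough sketch of most of these steps (not the exact code behind this answer); it assumes the unidecode package and the NLTK corpora are available, uses a tiny illustrative contraction dictionary, and skips step 7:

    import re
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from unidecode import unidecode

    CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}  # tiny demo map
    STOPWORDS = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    def preprocess(text):
        text = re.sub(r"<[^>]+>", " ", text)        # 1. remove HTML tags
        text = re.sub(r"\s+", " ", text).strip()    # 2. remove extra whitespace
        text = unidecode(text)                      # 3. accented chars -> ASCII
        text = text.lower()                         # 6. lowercase
        for c, full in CONTRACTIONS.items():        # 4. expand contractions
            text = text.replace(c, full)
        text = re.sub(r"[^a-z\s]", " ", text)       # 5./8. special chars and numbers
        tokens = [t for t in text.split() if t not in STOPWORDS]   # 9. stopwords
        return " ".join(lemmatizer.lemmatize(t) for t in tokens)   # 10. lemmatization

    print(preprocess("<p>He doesn't like the 2 résumés, don't ask!</p>"))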
Work with your own data. I suggest looking at the NLTK package for NLP; NLTK has a sentiment analysis function that may help with your work.
Then extract your features with tf-idf or any other feature extraction or feature selection algorithm, and feed them to the machine learning algorithm after scaling (a sketch follows).
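A minimal sketch of that last step, using a tf-idf plus logistic regression pipeline from scikit-learn (the texts and labels are made-up placeholders; tf-idf features are already on a comparable scale, so an extra scaling step is usually optional):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = ["great product, works well", "terrible side effects",
             "did nothing at all", "works great, no complaints"]      # placeholder reviews
    labels = ["positive", "negative", "neutral", "positive"]          # placeholder tags

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          LogisticRegression(max_iter=1000))
    model.fit(texts, labels)
    print(model.predict(["no side effects and works well"]))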
In a bag-of-words model, I know we should remove stopwords and punctuation before training. But in an RNN model, if I want to do text classification, should I remove stopwords too?
This depends on what your model classifies. If you're doing something in which the classification is aided by stop words -- some level of syntax understanding, for instance -- then you need to either leave in the stop words or alter your stop list so that you don't lose that information. For instance, cutting out all verbs of being (is, are, should be, ...) can mess up an NN that depends somewhat on sentence structure.
However, if your classification is topic-based (as suggested by your bag-of-words reference), then treat the input the same way: remove those pesky stop words before they burn valuable training time.
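If you go the "alter your stop list" route, one small sketch (using NLTK's English list as a starting point, which is only one possible choice) is to subtract the words that carry the structure or sentiment you care about:

    from nltk.corpus import stopwords

    # keep negations and verbs of being, since they can carry sentence structure
    keep = {"not", "no", "nor", "is", "are", "was", "were", "be", "been", "being", "should"}
    stop_list = set(stopwords.words("english")) - keep

    tokens = "this is not what it should be".split()
    print([t for t in tokens if t not in stop_list])
    # ['is', 'not', 'should', 'be']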
- Do not remove stop words when they add information (context awareness) to the sentence (viz., text summarization, machine/language translation, language modeling, question answering).
- Remove stop words if we want only the general idea of the sentence (viz., sentiment analysis, language/text classification, spam filtering, caption generation, auto-tag generation, topic/document classification).
Sentiment analysis helps us gauge the sentiment of tweets; however, many of the tweets we get from the API might not really be 'classifiable' into any sentiment.
Does anyone know of any API/literature that talks about pre-processing a tweet before running any kind of classifier over it (e.g. remove '#', remove #names, etc.)?
Also, what topics/APIs/literature can I look up if I want to determine whether it even makes sense to run sentiment analysis on a tweet (say, as a movie review) before I run a sentiment analyzer over it?
Maybe you should read:
The Role of Pre-processing in Twitter Sentiment Analysis by Yanwei Bao, Changqin Quan, Lijuan Wang and Fuji Ren
Preprocessing the Informal Text for efficient Sentiment Analysis by I. Hemalatha, G.P. Saradhi Varma and A. Govardhan
(Then in Python, tweet = re.sub(old_pattern, new_pattern, tweet) for each modification to perform.)
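For instance, a rough cleaning pass (illustrative patterns, not taken from the papers above) could look like this:

    import re

    def clean_tweet(tweet):
        tweet = re.sub(r"http\S+|www\.\S+", "", tweet)   # remove URLs
        tweet = re.sub(r"@\w+", "", tweet)               # remove @mentions
        tweet = re.sub(r"#", "", tweet)                  # keep the hashtag text, drop '#'
        return re.sub(r"\s+", " ", tweet).strip()        # collapse whitespace

    print(clean_tweet("@critic42 The movie was #awesome! http://t.co/xyz"))
    # "The movie was awesome!"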
I am using the TextBlob library for classifying my dataset.
TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
Features:
- Noun phrase extraction
- Part-of-speech tagging
- Sentiment analysis
- Classification (Naive Bayes, Decision Tree)
- Language translation and detection powered by Google Translate
- Tokenization (splitting text into words and sentences)
- Word and phrase frequencies
- Parsing
- n-grams
- Word inflection (pluralization and singularization) and lemmatization
- Spelling correction
- Add new models or languages through extensions
- WordNet integration
Get it now:
$ pip install -U textblob
$ python -m textblob.download_corpora
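A quick usage sketch (the sentiment property returns a polarity/subjectivity pair; the example sentence is arbitrary):

    from textblob import TextBlob

    blob = TextBlob("The interface is great, but the battery life is awful.")
    print(blob.sentiment)            # Sentiment(polarity=..., subjectivity=...)
    print(blob.sentiment.polarity)   # > 0 leans positive, < 0 leans negative
    print(blob.noun_phrases)         # noun phrase extraction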
Reference: https://textblob.readthedocs.org/en/dev/
*** I cannot tell you the results because this is part of my thesis and I am still working on it.
Actually, you'd better do the dirty work yourself. Regular expressions make it easy to remove '#', '@', or URLs. Punctuation marks and emojis are quite important for sentiment analysis. I recommend using the part-of-speech tagger trained by the CMU NLP group (http://www.cs.cmu.edu/~ark/TweetNLP/) to represent these characters.
For basic features like bag-of-words and tf-idf scores, I'd use scikit-learn (http://scikit-learn.org/stable/).
For single-word sentiment, you can use the Stanford NLP sentiment analysis tools (http://nlp.stanford.edu/sentiment/).
I am working on a word representation algorithm, similar to Word2Vec and GloVe. I have been asked to make it more dynamic, such that new words can be added to the vocabulary, and new documents can be submitted to the program even after the representations (vectors) have been created.
The problem is, how do I know if my representation works? How do I know if it actually captures the meaning of each word? How do I compare my representation with other existing vector space models?
As of now, I am doing the following tests to check the quality of my word vectors:
Distance test:
Does the cosine distance between vectors reflect the semantic distance between words?
Analogy test:
Can the representation be used to solve problems like "King is to queen what man is to ________" (the answer should be woman)?
Picking the odd one out:
Can the vectors be used to pick the odd word out of a given list of words? If the input is {"cat", "dog", "phone"}, the output should be "phone".
What are the other tests that I should do to check the quality of the vectors? What other tasks are word vectors expected to be capable of doing? Is there a benchmark for vector space models?
Your tests sound very reasonable — they are the usual evaluation tasks that are used in research papers to test the quality of word embeddings.
In addition, the website www.wordvectors.org can give you a good idea of how your vectors measure up. It allows you to upload your embeddings, generates plots, gives correlations with word pair similarity rankings, and compares your embeddings with pre-trained vectors from previous research. You can find a more detailed description in the accompanying paper.
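For reference, the three checks from the question can be run directly with gensim's KeyedVectors; "my_vectors.kv" is a hypothetical file holding your own embeddings saved in a gensim-readable format:

    from gensim.models import KeyedVectors

    wv = KeyedVectors.load("my_vectors.kv")   # hypothetical path to your embeddings

    # distance test: cosine similarity between related words
    print(wv.similarity("king", "queen"))

    # analogy test: "king is to queen what man is to ?" -> expect "woman"
    print(wv.most_similar(positive=["queen", "man"], negative=["king"], topn=1))

    # odd-one-out test
    print(wv.doesnt_match(["cat", "dog", "phone"]))   # expect "phone"

gensim also provides evaluate_word_pairs and evaluate_word_analogies helpers if you want scores against standard benchmark files.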
Usually one wants to get features from a text by using the bag-of-words approach, counting the words and calculating different measures, for example tf-idf values, like this: How to include words as numerical feature in classification
But my problem is different, I want to extract a feature vector from a single word. I want to know for example that potatoes and french fries are close to each other in the vector space, since they are both made of potatoes. I want to know that milk and cream also are close, hot and warm, stone and hard and so on.
What is this problem called? Can I learn the similarities and features of words by just looking at a large number documents?
I will not make the implementation in English, so I can't use existing (English) databases.
Hmm, feature extraction (e.g. tf-idf) on text data is based on statistics. On the other hand, you are looking for sense (semantics). Therefore a method like tf-idf will not work for you.
In NLP there are 3 basic levels:
morphological analysis
syntactic analysis
semantic analysis
(a higher number represents a bigger problem :)). Morphology is known for the majority of languages. Syntactic analysis is a bigger problem (it deals with things like identifying the verb and the noun in a sentence, ...). Semantic analysis has the most challenges, since it deals with meaning, which is quite difficult to represent in machines, has many exceptions, and is language-specific.
As far as I understand, you want to know some relationships between words. This can be done via so-called dependency treebanks (or just treebanks): http://en.wikipedia.org/wiki/Treebank . A treebank is a database/graph of sentences where a word can be considered a node and a relationship an arc. There is a good treebank for the Czech language, and for English there are some as well, but for many 'less-covered' languages it can be a problem to find one ...
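This is not the treebank approach itself, but as a small illustration of the "word as node, relationship as arc" idea, a dependency parser such as spaCy's (assuming the en_core_web_sm model is installed) makes those arcs explicit:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("French fries are made of potatoes")

    for token in doc:
        # each token points to its syntactic head via a labelled arc
        print(f"{token.text:10} --{token.dep_:10}--> {token.head.text}")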
user1506145,
Here is a simple idea that I have used in the past. Collect a large number of short documents like Wikipedia articles. Do a word count on each document. For the ith document and the jth word let
I = the number of documents,
J = the number of words,
x_ij = the number of times the jth word appears in the ith document, and
y_ij = ln( 1+ x_ij).
Let Y be the IxJ matrix with entries y_ij, and let [U, D, V] = svd(Y) be its singular value decomposition, so Y = U*D*transpose(V), where U is IxI, D is diagonal IxJ, and V is JxJ.
You can use (V_j1, V_j2, V_j3, V_j4), the first four entries of the jth row of V, as a feature vector in R^4 for the jth word.
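A toy numeric version of this recipe with NumPy, just to make the shapes concrete (the counts are made up, and here J = 4, so R^4 uses every component):

    import numpy as np

    # toy term counts x_ij: I = 3 documents (rows), J = 4 words (columns)
    X = np.array([[2, 0, 1, 0],
                  [0, 3, 0, 1],
                  [1, 1, 0, 2]], dtype=float)

    Y = np.log1p(X)                          # y_ij = ln(1 + x_ij)
    U, s, Vt = np.linalg.svd(Y, full_matrices=True)

    word_vectors = Vt.T[:, :4]               # row j = feature vector for word j
    print(word_vectors.shape)                # (4, 4): J words, 4 components each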
I am surprised the previous answers haven't mentioned word embeddings. Word embedding algorithms can produce word vectors for each word in a given dataset. These algorithms infer word vectors from context. For instance, by looking at the context of the following sentences we can say that "clever" and "smart" are somehow related, because the context is almost the same.
He is a clever guy
He is a smart guy
A co-occurrence matrix can be constructed to do this, but it is too inefficient. A famous technique designed for this purpose is called Word2Vec, which can be studied in the following papers:
https://arxiv.org/pdf/1411.2738.pdf
https://arxiv.org/pdf/1402.3722.pdf
I have been using it for Swedish. It is quite effective in detecting similar words and is completely unsupervised.
Implementations can be found in gensim and TensorFlow.
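A minimal gensim sketch (argument names assume gensim 4.x; with a real corpus you would feed many more tokenized sentences, so the numbers from this toy corpus are not meaningful):

    from gensim.models import Word2Vec

    sentences = [
        ["he", "is", "a", "clever", "guy"],
        ["he", "is", "a", "smart", "guy"],
        ["she", "is", "a", "clever", "person"],
    ]

    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=100)

    print(model.wv.most_similar("clever", topn=3))   # nearest words in the vector space
    print(model.wv.similarity("clever", "smart"))    # cosine similarity of the pair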