I'm working on a binary classification task to predict whether a user will default on his or her loan. The data I have consists of about 380,000 short messages from about 9,100 users, including the text of each message and when it was received. These messages include repayment tips from financial institutions, advertisements, and communications between users and their friends. Unlike standard text classification, this is more like classifying collections of texts (since each user has multiple messages). I don't know how to extract features from these messages effectively.
I tried concatenating all the short messages of each user into one long text, so that the problem became ordinary text classification. I then used tf-idf to extract features from these long texts and LightGBM for the binary classification. However, the score on the test set was very low: the AUC was only about 0.53, which is no better than random guessing. Do you have any suggestions?
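For reference, here is a minimal sketch of what I tried. `df` and `labels` are stand-ins for my actual data: assume `df` is a pandas DataFrame with columns `user_id` and `text`, and `labels` is a Series of 0/1 default labels indexed by user id.

```python
import lightgbm as lgb
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Concatenate all short messages of each user into one long text.
docs = df.groupby("user_id")["text"].apply(" ".join)

# Extract tf-idf features from the concatenated texts.
X = TfidfVectorizer(max_features=50_000).fit_transform(docs)
y = labels.loc[docs.index]

# Train LightGBM and evaluate with AUC on a held-out split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = lgb.LGBMClassifier().fit(X_tr, y_tr)
print(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))  # ~0.53 for me
```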
Related
https://aclanthology.info/pdf/S/S17/S17-2132.pdf
This paper describes one of the systems used in the SemEval-2017 shared task; it explains how an SVM classifier was implemented for a Twitter sentiment analysis task.
I am trying to imitate this system, but there are some parts that, as a beginner, I don't understand.
It lists a bunch of features that describe a single tweet. To implement a feature matrix X (where a single row represents a single tweet and the columns contain the values of all the features listed in the paper), am I supposed to compute a vector for each feature and concatenate them at the end?
One of the features is the tf-idf vectorizer. As far as I know, tf-idf gives us the weight of a word per document. But in this case, what would be the document? Wouldn't using tf-idf as one of the features cause the final X to have rows with different numbers of columns? I just want to know how tf-idf would work here as one of the feature vectors.
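To make the first question concrete, here is how I currently picture assembling X, treating each individual tweet as a "document" (the `handcrafted` matrix is just a made-up stand-in for the paper's other per-tweet features):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["i love this movie", "worst film ever", "not bad at all"]

# If the vectorizer is fit on the whole tweet corpus, every tweet gets a
# vector of the same length (one column per vocabulary term).
tfidf = TfidfVectorizer().fit_transform(tweets)

# Stand-in for the paper's other features: one row per tweet, fixed width
# (e.g. lexicon scores, counts of elongated words, ...).
handcrafted = np.array([[0.9, 3.0], [-0.8, 2.0], [0.1, 4.0]])

# Concatenate column-wise, so each row of X describes a single tweet.
X = hstack([tfidf, csr_matrix(handcrafted)])
print(X.shape)  # (3 tweets, vocabulary size + 2 columns)
```

Is this roughly the intended construction?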
What does the Topic and Hashtag feature in section 2.8 indicate? I don't quite understand what kind of value it is describing.
Any kind of advice would be great! Please help!
I am researching what features I'll have for my machine learning model, given the data I have. My data contains a lot of text data, so I was wondering how to extract valuable features from it. Contrary to my previous belief, this often consists of a representation such as bag-of-words or word2vec (http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction).
Because my understanding of the subject is limited, I don't understand why I can't analyze the text first to get numeric values (for example: TextBlob's sentiment, https://textblob.readthedocs.io/en/dev/, or Google Cloud's Natural Language API, https://cloud.google.com/natural-language/).
Are there problems with this, or could I use these values as features for my machine learning model?
Thanks in advance for all the help!
Of course, you can convert text input into a single number with sentiment analysis and then use this number as a feature in your machine learning model. There is nothing wrong with this approach.
The question is what kind of information you want to extract from the text data. Sentiment analysis converts text input into a number between -1 and 1, and that number represents how positive or negative the text is. For example, you may want sentiment information from customers' comments about a restaurant to measure their satisfaction. In this case, it is fine to use sentiment analysis to preprocess the text data.
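For instance, with TextBlob (one of the tools mentioned in the question), the preprocessing could be a sketch like this:

```python
from textblob import TextBlob

comments = [
    "The food was amazing and the staff were friendly.",
    "Terrible service, I will never come back.",
]

# One number per comment: polarity in [-1, 1], directly usable as a feature.
sentiment = [TextBlob(c).sentiment.polarity for c in comments]
print(sentiment)  # first value positive, second negative
```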
But again, sentiment analysis only gives an idea of how positive or negative a text is. You may instead want to cluster text data, and sentiment information is not useful there, since it does not provide any information about the similarity of texts. Thus, other approaches such as word2vec or bag-of-words are used to represent the text data in those tasks, because those algorithms provide a vector representation of the text instead of a single number.
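As a rough sketch of the vector-representation idea (bag-of-words here; word2vec would give dense vectors instead, but the point is the same):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "the pasta was delicious",
    "delicious pasta and great wine",
    "the delivery arrived two hours late",
]

# Each text becomes a vector, so texts can be compared with each other,
# which a single sentiment score cannot do.
X = CountVectorizer().fit_transform(texts)
print(cosine_similarity(X[0], X[1]))  # higher: both are about pasta
print(cosine_similarity(X[0], X[2]))  # lower: unrelated topics
```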
In conclusion, the approach depends on what kind of information you need to extract from the data for your specific task.
I need to perform text classification on a set of emails, but the words in my texts are very sparse, i.e. the frequency of each word across all the documents is very low, and words rarely repeat. I therefore think a document-term matrix with raw frequencies as weights is not suitable for training the classifiers. Can you please suggest what other kinds of methods I could use?
Thanks
The real problem is that if your words are that sparse, a learned classifier will not generalise to real-world data. However, there are several solutions:
1.) Use more data. This is kind of a no-brainer. However, you can not only add labeled data, you can also use unlabelled data in a semi-supervised learning setting.
2.) Use more data (part b). You can look into the transfer learning setting: there you build a classifier on a large data set with similar characteristics (this might be Twitter streams) and then adapt this classifier to your domain.
3.) Get your processing pipeline right. Your problem might originate from a suboptimal processing pipeline. Are you doing stemming? In the email, the word "stemming" should be mapped onto "stem". This can be pushed even further by using synonym matching with a dictionary. A sketch of the stemming step follows below.
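For point 3, a sketch of the stemming step with NLTK's Porter stemmer:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# All inflected variants collapse onto one stem, merging their
# otherwise sparse counts into a single feature.
print([stemmer.stem(w) for w in ["stemming", "stemmed", "stems"]])
# ['stem', 'stem', 'stem']
```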
I am trying to do document classification, but I am really confused about feature selection versus tf-idf. Are they the same thing, or two different ways of doing classification?
I hope somebody can clear this up; I am not really sure that my question will make sense to you guys.
Yes, you are confusing a lot of things.
Feature selection is the general term for choosing features (each feature is either kept or dropped). Stopword removal can be seen as feature selection.
TF is one method of extracting features from text: counting words.
IDF is one method of assigning weights to features.
Neither of them is classification... they are popular for text classification, but they are even more popular for information retrieval, which is not classification...
However, many classifiers work on numeric data, so the common process is:
1. Extract features (e.g. TF).
2. Select features (e.g. remove stopwords).
3. Weight features (e.g. IDF).
4. Train a classifier on the resulting numerical vectors.
5. Predict the classes of new/unlabeled documents.
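A minimal sketch of those five steps with scikit-learn (the tiny corpus, the stopword-based selection and the Naive Bayes classifier are only illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["cheap pills buy now", "meeting moved to friday", "buy cheap now"]
train_labels = ["spam", "ham", "spam"]

# 1 + 2: extract TF features and select (here: drop English stopwords).
counts = CountVectorizer(stop_words="english")
X_counts = counts.fit_transform(train_docs)

# 3: weight the features with IDF.
tfidf = TfidfTransformer()
X = tfidf.fit_transform(X_counts)

# 4: train a classifier on the resulting numerical vectors.
clf = MultinomialNB().fit(X, train_labels)

# 5: predict the class of a new, unlabeled document.
new_doc = tfidf.transform(counts.transform(["buy pills now"]))
print(clf.predict(new_doc))  # ['spam']
```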
Taking a look at this explanation may help a lot when it comes to understanding text classifiers.
TF-IDF is a good way to find a document that answers a given query, but it does not by itself assign classes to documents.
Examples that may be helpful:
1) You have a bunch of documents covering subjects such as politics, economics, computer science and the arts. The documents belonging to each subject are separated into the appropriate directory for that subject (you have a labeled dataset). Now you receive a new document whose subject you do not know. In which directory should it be stored? A classifier can answer this question based on the documents that are already labeled.
2) Now suppose you receive a query regarding computer science, for instance "Good methods for finding textual similarity". Which document in the computer science directory can provide the best response to that query? TF-IDF would be a good approach to figure that out.
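A sketch of example 2, scoring documents against the query with tf-idf plus cosine similarity (toy documents, purely illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "cosine similarity and other measures of textual similarity",
    "stock markets and fiscal policy",
    "impressionist painting in the nineteenth century",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
q = vec.transform(["Good methods for finding textual similarity"])

# The document with the highest score best answers the query.
scores = cosine_similarity(q, X).ravel()
print(scores.argmax())  # 0: the document about textual similarity
```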
So, when you are classifying documents, you are trying to make a decision about whether a document is a member of a particular class (like, say, 'about birds' or 'not about birds').
Classifiers predict the value of the class given a set of features. A good set of features will be highly discriminative - they will tell you a lot about whether the document is of one class or another.
Tf-idf (term frequency inverse document frequency) is a particular feature that seems to be discriminative for document classification tasks. There are others, like word counts (tf or term frequency) or whether a regexp matches the text or what have you.
Feature selection is the task of selecting good (discriminative) features. Tf-idf is probably a good feature to select.
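For instance, scikit-learn's chi-squared selector keeps the k columns of a tf-idf matrix that are most dependent on the class (a sketch, not a prescription):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["birds have feathers", "birds can fly",
        "cars have wheels", "cars can drive"]
labels = [1, 1, 0, 0]  # 1 = about birds, 0 = not about birds

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# Keep the 4 most discriminative terms; "have" and "can" score lowest
# because they appear equally in both classes.
selector = SelectKBest(chi2, k=4).fit(X, labels)
print(vec.get_feature_names_out()[selector.get_support()])
```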
I am working on topic modeling, where the given text corpus has a lot of noise in the form of supporting words that remain after stop-word removal. These words have high term frequency but, unlike other high-frequency words that are useful, do not help LDA form topic terms. How can this noise be removed?
LDA algorithms don't take tf-idf weights as input, but bag-of-words counts. However, you could first filter words from your corpus based on their tf-idf scores and then feed the filtered texts to your LDA program.
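A sketch of that filtering step with gensim (the threshold is illustrative and would need tuning on a real corpus):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, TfidfModel

texts = [
    ["loan", "payment", "due", "kindly", "note"],
    ["payment", "received", "thanks", "kindly", "note"],
    ["offer", "discount", "sale", "kindly", "note"],
]

dictionary = Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

# "Supporting" words occurring in every document ("kindly", "note")
# receive a tf-idf weight of 0 and fall below the threshold.
tfidf = TfidfModel(bow_corpus)
threshold = 0.1  # illustrative
filtered_corpus = [
    [(tid, cnt) for tid, cnt in doc
     if dict(tfidf[doc]).get(tid, 0.0) >= threshold]
    for doc in bow_corpus
]

# LDA still gets plain bag-of-words counts, just without the noise terms.
lda = LdaModel(filtered_corpus, num_topics=2, id2word=dictionary)
print(lda.print_topics())
```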
The basic approach is to compute TF-IDF and clean on the scores; if that still doesn't help, you can create a domain-specific custom stopword list. Suppose I'm in a jobs domain: the word "job" is not a regular stopword, but in the jobs domain it is, and a company name can also be a stopword, since it repeats across many documents. So building a custom stopword list is another way to go.
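A custom list can simply extend a standard one, as in this sketch (the domain words are made-up examples for a jobs corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Regular English stopwords plus terms that behave like stopwords
# only in this domain (including a company name).
custom_stopwords = list(ENGLISH_STOP_WORDS) + ["job", "jobs", "acme"]

docs = ["acme is hiring a data engineer job",
        "remote python developer jobs at acme"]
vec = CountVectorizer(stop_words=custom_stopwords)
vec.fit(docs)
print(vec.get_feature_names_out())
# ['data' 'developer' 'engineer' 'hiring' 'python' 'remote']
```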