Use classes based on multiple columns in text classification - machine-learning

I am working on a text classification problem for which I can't think of or find a solution. Essentially, I am classifying a private complaint database that has custom categories per municipality, because some municipalities have different issues than others.
Example:
Mun.       Issue                   Class
London     Street lights are off   Street-lighting
New York   Street lights are off   lighting
As you can see, I want to classify the issue based on the municipality: using the first column, select only the specific categories of that municipality, and then choose the one that matches the issue. I previously created superclasses containing similar classes, but now I want to be more specific. I have a big dataset, and every municipality has around 10 classes.

You can use a normal classification algorithm with a neural net. The steps would be:
1. Convert the corpus into one-hot vectors
2. Train the neural network as a multiclass classifier
I think any normal neural network with a sufficient number of neurons can provide good results.
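The two steps above can be sketched with scikit-learn (assumed available). The complaint texts, municipalities, and classes below are toy stand-ins; the municipality is one-hot encoded alongside the text, so the network can learn municipality-specific classes:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import OneHotEncoder

# Toy complaints; real data would have many rows per municipality/class
texts = ["Street lights are off", "Pothole on main road", "Street lights are off"]
municipalities = [["London"], ["London"], ["New York"]]
classes = ["Street-lighting", "Roads", "lighting"]

# Step 1: turn the corpus into one-hot (binary bag-of-words) vectors
text_vec = CountVectorizer(binary=True)
X_text = text_vec.fit_transform(texts)

# Encode the municipality as an extra one-hot feature so the network
# can learn municipality-specific classes
mun_enc = OneHotEncoder()
X_mun = mun_enc.fit_transform(municipalities)
X = hstack([X_text, X_mun])

# Step 2: train a feed-forward network as a multiclass classifier
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X, classes)

# Same complaint text, different municipality -> possibly different class
x_new = hstack([text_vec.transform(["Street lights are off"]),
                mun_enc.transform([["New York"]])])
print(clf.predict(x_new))
```

Concatenating the municipality feature lets a single model cover all municipalities; an alternative would be one model per municipality, at the cost of splitting the training data.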

Related

Is it possible to have a class feature with several values?

I have a dataset in which the class has several values. For example, in a face-recognition dataset the class could be a tuple (man, old, Chinese).
Is it possible to have such data, and if so, what ML classifier should I use?
I believe this question should be moved to another platform, such as https://datascience.stackexchange.com/
What you are asking for is called multi-label classification.
In multi-label classification tasks, the model is trained to provide the probabilities or likelihoods of more than one label for a given sample.
You can either use multi-label classification, or you can use multiple binary classifiers, one to predict each attribute: one binary classifier for Man vs. Woman, another for Old vs. Young, and so on. But you must be cautious that your labels are semantically mutually exclusive. If you have labels like "sky" and "outdoor", the binary classifiers might be noisy if the labels are not carefully made: if a sample has the "sky" label but not the "outdoor" label, that will introduce noise during training.
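Both options can be sketched with scikit-learn (assumed available); the face-attribute data below is randomly generated purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))          # stand-in for extracted face features
Y = rng.integers(0, 2, size=(200, 3))  # label columns: man?, old?, Chinese?

# Option 1: one multi-label model predicting all three labels at once
multi = MultiOutputClassifier(LogisticRegression()).fit(X, Y)
pred = multi.predict(X[:1])            # shape (1, 3), one 0/1 per label

# Option 2: one independent binary classifier per label
per_label = [LogisticRegression().fit(X, Y[:, j]) for j in range(3)]
preds = [clf.predict(X[:1])[0] for clf in per_label]
```

With independent labels the two options behave the same; a multi-label model mainly pays off when the labels are correlated or when you want one shared model.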

ZERO SHOT LEARNING

I understand that in zero-shot learning, the classes are divided into seen and unseen categories. We then train the network on, for example, 50 classes and test on the other 50 that the network has not seen. I also understand that the network uses attributes of the unseen classes (though I am not sure how they are used). My question is: how does the network classify the unseen classes? Does it actually label each class by its name? For example, if I am doing zero-shot action recognition and the unseen classes are biking, swimming, and football, does the network actually name these classes? How does it know their labels?
The network uses the seen classes to learn a relation between images and attributes or other side information, such as human gaze, word embeddings, or anything else that links classes and images. What the network learns can then be mapped to new objects and attributes.
Say your classifier has images of pigs, dogs, horses, and cats, along with their attributes, at training time, and has to classify a zebra at test time. During training it learns the relation between image pixels and attributes such as 'stripes', 'tail', 'black', 'white', and so on.
At test time, given an image and the attributes of a zebra, you use the classifier to figure out whether they are related. Of course, you could also be given an image of a horse, which looks like a zebra, so your classifier must learn to generalize well.

multi-label text classification with zero or more labels

I need to classify website text with zero or more categories/labels (5 labels such as finance, tech, etc). My problem is handling text that isn't one of these labels.
I tried ML libraries (maxent, naive bayes), but they match "other" text incorrectly with one of the labels. How do I train a model to handle the "other" text? The "other" label is so broad and it's not possible to pick a representative sample.
Since I have no ML background and don't have much time to build a good training set, I'd prefer a simpler approach like a term frequency count, using a predefined list of terms to match for each label. But with the counts, how do I determine a relevancy score, i.e. if the text is actually that label? I don't have a corpus and can't use tf-idf, etc.
Another idea is to use a neural network with a softmax output, which gives you a probability for every class. When the network is very confident about a class, it assigns it a high probability and lower probabilities to the other classes; when it is unsure, the differences between the probabilities are small and none of them is very high. You could then define a threshold, for example: if the probability of every class is less than 70%, predict "other".
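A minimal sketch of this thresholding idea, using logistic regression as a stand-in for a softmax output layer (the training texts and the 70% threshold are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = ["stock market earnings", "bank interest rates",
               "new smartphone chip", "software update release"]
train_labels = ["finance", "finance", "tech", "tech"]

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(train_texts), train_labels)

def predict_with_other(text, threshold=0.7):
    """Return the top label, or 'other' if no class is confident enough."""
    probs = clf.predict_proba(vec.transform([text]))[0]
    if probs.max() < threshold:
        return "other"
    return clf.classes_[probs.argmax()]

# A text sharing no vocabulary with the training data gets near-uniform
# probabilities, so it falls below the threshold
print(predict_with_other("cooking recipes for pasta"))  # -> other
```

The threshold is a tuning knob: raise it and more borderline texts fall into "other", lower it and the model commits to a label more often.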
Whew! Classic ML algorithms don't combine both multi-classification and "in/out" at the same time. Perhaps what you could do would be to train five models, one for each class, with a one-against-the-world training. Then use an uber-model to look for any of those five claiming the input; if none claim it, it's "other".
Another possibility is to reverse the order of evaluation: train one model as a binary classifier on your entire data set. Train a second one as a 5-class SVM (for instance) within those five. The first model finds "other"; everything else gets passed to the second.
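A sketch of that reversed two-stage setup, assuming you can collect at least some "other" examples for the gate (all data and model choices below are invented, with LinearSVC standing in for the 5-class SVM):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

in_texts = ["stock earnings report", "bank interest rates",
            "new smartphone chip", "software update release"]
in_labels = ["finance", "finance", "tech", "tech"]
other_texts = ["pasta cooking recipe", "holiday travel tips"]

# Stage 1: binary gate over the whole dataset ("in-domain" vs "other")
gate = make_pipeline(TfidfVectorizer(), LinearSVC())
gate.fit(in_texts + other_texts,
         ["in"] * len(in_texts) + ["other"] * len(other_texts))

# Stage 2: fine-grained classifier trained only on the in-domain labels
fine = make_pipeline(TfidfVectorizer(), LinearSVC())
fine.fit(in_texts, in_labels)

def classify(text):
    if gate.predict([text])[0] == "other":
        return "other"
    return fine.predict([text])[0]
```

The gate absorbs the "other" problem so the fine-grained model only ever sees in-domain text; its weakness is exactly the one raised in the question, namely finding a representative "other" sample.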
What about creating histograms? You could use a bag-of-words approach based on significant indicators of, e.g., Tech and Finance. You could try to identify such indicators by analyzing a website's tags and articles, or just browse the web for such indicators:
http://finance.yahoo.com/news/most-common-words-tech-finance-205911943.html
Let's say your input vector X has n dimensions, where n is the number of indicators. Then, for example, Xi holds the count of occurrences of the word "asset", and Xi+k the count of the phrase "big data" in the current article.
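That indicator-count vector can be built, for example, with a fixed vocabulary; the indicator terms here are placeholders, not a curated list:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical indicator terms; a real list would come from sources
# like the article linked above
indicators = ["asset", "interest", "big data", "cloud", "startup"]

# ngram_range=(1, 2) lets the two-word indicator "big data" be counted
vec = CountVectorizer(vocabulary=indicators, ngram_range=(1, 2))

article = "The asset manager moved big data workloads to the cloud."
x = vec.transform([article]).toarray()[0]
print(dict(zip(indicators, x)))
```

Fixing the vocabulary means no corpus is needed to fit the vectorizer, which matches the "no corpus, no tf-idf" constraint in the question.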
Instead of defining 5 labels, define 6. Your last category would be something like a "catch-all" category. That's actually your zero-match category.
If you must match the zero-or-more requirement, train a model that returns a probability score per label/class (such as a neural net, as Luis Leal suggested). You could then rank your output by that score and say that every class with a score higher than some threshold t is a matching category.
Try this NBayes implementation.
For identifying the "Other" category, don't bother much. Just train on your required categories, which clearly identifies them, and introduce a threshold in the classifier.
If no label's score crosses the threshold, the classifier assigns the "Other" label.
It's all in the training data.
AWS Elasticsearch percolate would be ideal, but we can't use it due to the HTTP overhead of percolating documents individually.
Classifier4J appears to be the best solution for our needs because the model looks easy to train and it doesn't require training on non-matches.
http://classifier4j.sourceforge.net/usage.html

Training and Testing Data set for classification text file

Suppose we have 10,000 text files that we would like to classify as political, health, weather, sports, science, education, etc.
I need a training dataset for classifying text documents, and I am using the Naive Bayes classification algorithm. Can anyone help me find datasets?
OR
Is there another way to get the classification done? I am new to machine learning, so please explain your answer completely.
Example:
**Sentence**                                        **Output**
1) Obama won election.                              political
2) India won by 10 wickets                          sports
3) Tobacco is more dangerous                        health
4) Newton's laws of motion can be applied to cars   science
Is there any way to classify these sentences into their respective categories?
Have you tried to Google it? There are tons of datasets for text categorization. The classical one is Reuters-21578 (https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection); another famous one, mentioned in almost every ML book, is 20 Newsgroups: http://web.ist.utl.pt/acardoso/datasets/
There are lots of others, one Google query away. Just load them, adjust slightly if needed, and train your classifier on those datasets.
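As an illustration of the Naive Bayes pipeline, here is a toy model trained on sentences like the ones in the question; for the real thing you would substitute one of the datasets above (e.g. via scikit-learn's fetch_20newsgroups loader):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training set modeled on the question's examples; a real model
# needs far more labeled text per category
train_texts = ["Obama won the election", "India won by 10 wickets",
               "Tobacco is more dangerous", "Newton's laws of motion apply to cars"]
train_labels = ["political", "sports", "health", "science"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["The election results were announced"]))  # -> ['political']
```

The same two-line fit/predict pattern carries over unchanged once you swap in a real dataset.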

Methods to ignore missing word features on test data

I'm working on a text classification problem, and I have problems with missing values on some features.
I'm calculating class probabilities of words from labeled training data.
For example:
Suppose the word foo belongs to class A 100 times and to class B 200 times. In this case, I compute the class probability vector as [0.33, 0.67] and give it, along with the word itself, to the classifier.
Problem is that, in the test set, there are some words that have not been seen in training data, so they have no probability vectors.
What could i do for this problem?
I've tried giving the average class probability vector of all words for missing values, but it did not improve accuracy.
Is there a way to make the classifier ignore some features during evaluation, just for the specific instances that do not have a value for a given feature?
Regards
There are several ways to achieve that:
Create and train classifiers for each subset of features you have. You can train each of them on its feature subset using the same data as the training of the main classifier. For each sample, just look at the features it has and use the classifier that fits it best. Don't try to do boosting with those classifiers.
Alternatively, create a special class for samples that can't be classified, or for which results are too poor with so few features. Sometimes even humans can't successfully classify samples; in many cases, samples that can't be classified should simply be ignored. The problem is not in the classifier but in the input, or can be explained by the context.
From an NLP point of view, many words have a meaning/usage that is very similar across many applications, so you can use stemming/lemmatization to build classes of words. You can also use syntactic corrections, synonyms, and translations (does the word come from another part of the world?).
If this problem is important enough to you, you will end up with a combination of the three previous points.
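The stemming point can be sketched as a back-off scheme for unseen words; the probability table reuses the toy numbers from the question, and the suffix stemmer is a deliberately naive stand-in for a real stemmer/lemmatizer:

```python
def stem(word):
    """Deliberately naive suffix stripper, for illustration only."""
    for suf in ("ing", "ed", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

# Class-probability vectors [P(A), P(B)] estimated from training counts,
# as in the question's foo example
word_probs = {"foo": [0.33, 0.67], "walk": [0.8, 0.2]}
stem_probs = {stem(w): p for w, p in word_probs.items()}
avg = [0.5, 0.5]  # last-resort fallback: average class distribution

def probs_for(word):
    if word in word_probs:
        return word_probs[word]
    s = stem(word)
    if s in stem_probs:   # back off to the stem for unseen surface forms
        return stem_probs[s]
    return avg

print(probs_for("walking"))  # unseen word, backs off to "walk" -> [0.8, 0.2]
```

Compared to always substituting the global average (which the question says did not help), this only falls back to the average when even the stem is unknown.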
