Classification with no classifiers in the training data - machine-learning

I have news data spanning several years, and I want to train a model such that whenever I give it a piece of test news it returns the industry to which the news is related, for example 'manufacturing' or 'finance'. This could be done with a classification algorithm, but unfortunately I do not have the labels needed to train on. My data looks like this:
ID | News
1 | News1
2 | News2
3 | News3
If the data were in the following form, I could apply classification algorithms to classify the industry:
ID | News | Industry Related to
1 | News1 | Manufacturing
2 | News2 | Finance
3 | News3 | e-commerce
But news APIs do not provide the industry a news item relates to. How can I train my model in this case?

There are different ways to achieve this, and each has advantages and disadvantages. The problem you describe isn't an easy one.
I can't give a general and correct answer to this question, as it depends heavily on what you are trying to achieve.
What you are trying to do is called unsupervised learning. Generally, the search term you could use is "classify unlabeled data".
The Wikipedia article on this topic has a very good overview of techniques you may use. Since machine-learning problems often aren't clear-cut, and the chosen algorithms vary a lot per project (size of the dataset, processing power, cost of misclassification, ...), no one will be able to give you a generally correct answer without knowing your data and problem in detail.
Personally, from just reading your post, my first approach would be to use a clustering algorithm (such as k-means clustering with the cosine similarity of the texts) to generate different clusters of news, then look through these clusters, label them manually, and use the result as training data - or automatically generate labels using tf*idf (see the respective Wikipedia articles; I cannot post more than two links).
However, the results of this may be very good, very bad or anything in between.

Recent advances in zero-shot and few-shot learning let you build a classifier with little training data (100-200 examples) or none at all. Such a classifier still retains the benefits of a supervised classifier and gives you control over your choice of categories.
I have built one such system and you can try out the demo on your own categories and data to see the system in action.
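The system behind that demo isn't shown here, but the core zero-shot idea can be sketched without any training data: describe each category in words, and assign a document to the category whose description it is most similar to. Everything below (category names, seed keywords) is made up for illustration.

```python
# Zero-shot-style classification with no labeled training set: a human-written
# keyword description per category plays the role of the training data.
from collections import Counter
import math

CATEGORY_SEEDS = {
    "manufacturing": "factory plant production assembly machinery output",
    "finance": "bank interest rates stocks bonds lending investors",
}

def bow(text):
    """Bag-of-words counts for a text."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(text):
    scores = {c: cosine(bow(text), bow(seed)) for c, seed in CATEGORY_SEEDS.items()}
    return max(scores, key=scores.get)

print(classify("The bank raised interest rates"))  # -> finance
```

Real zero-shot systems use learned embeddings or entailment models instead of raw keyword overlap, but the category-description interface is the same.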

Related

AutoML NL - Training model based on ICD10-CM - Amount of text required

We are currently working on integrating ICD10-CM for our medical company, to be used for patient diagnosis. ICD10-CM is a coding system for diagnoses.
I tried to import ICD10-CM data in description-code pairs but, obviously, it didn't work, since AutoML needed more text for that code (label). I found a dataset on Kaggle, but it only contained hrefs to an ICD10 website. I did find that the website contains multiple texts and descriptions associated with codes that can be used to train our desired model.
Kaggle Dataset:
https://www.kaggle.com/shamssam/icd10datacom
Sample of a page from ICD10data.com:
https://www.icd10data.com/ICD10CM/Codes/A00-B99/A15-A19/A17-/A17.0
Most notable fields are:
- Approximate Synonyms
- Clinical Information
- Diagnosis Index
If I make a dataset from the sentences found on these pages and assign them to their codes (labels), will it be enough for AutoML training? Each label would then finally have 2 or more texts instead of just one, but still definitely a lot less than 100 per code, unlike the demos/tutorials.
From what I can see here, the disease code has a tree-like structure where, for instance, all L00-L99 codes refer to "Diseases of the skin and subcutaneous tissue". At the same time L00-L08 codes refer to "Infections of the skin and subcutaneous tissue", and so on.
What I mean is that the problem is not 90,000 examples for 90,000 different independent labels, but a decision tree (you take several decisions, each depending on the previous one: the first step would be choosing which of the roughly 15 most general categories fits best, then choosing which of the subcategories, and so on).
In this sense, AutoML is probably not the best product, given that you cannot implement a specially designed decision-tree model that takes all of this into account.
Another way of using AutoML would be to train a separate model for each decision and then combine the different models. This would easily work for the first layer of decisions, but would be exponentially time-consuming: the number of models you need in order to predict more precisely grows exponentially with the level of precision (by precise I mean affirming it is L00-L08 instead of L00-L99).
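A rough sketch of that per-decision training, using scikit-learn instead of AutoML; the texts and codes below are invented stand-ins (real ICD-10-CM descriptions would replace them):

```python
# Two-level hierarchical classification: one model picks the coarse chapter,
# then a chapter-specific model picks the finer block.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# (text, chapter, block) triples - hypothetical stand-ins for ICD-10-CM data
data = [
    ("infection of the skin with abscess", "L00-L99", "L00-L08"),
    ("bacterial skin infection and cellulitis", "L00-L99", "L00-L08"),
    ("chronic dermatitis and eczema of the hand", "L00-L99", "L20-L30"),
    ("contact dermatitis due to detergents", "L00-L99", "L20-L30"),
    ("tuberculosis of lung confirmed by culture", "A00-B99", "A15-A19"),
    ("respiratory tuberculosis with meningitis", "A00-B99", "A15-A19"),
    ("cholera due to vibrio cholerae", "A00-B99", "A00-A09"),
    ("intestinal infection with severe diarrhea", "A00-B99", "A00-A09"),
]
texts = [t for t, _, _ in data]

# Level 1: a single model over all chapters.
top = make_pipeline(TfidfVectorizer(stop_words="english"),
                    LogisticRegression()).fit(texts, [c for _, c, _ in data])

# Level 2: one model per chapter, trained only on that chapter's texts.
sub = {}
for chapter in {c for _, c, _ in data}:
    rows = [(t, b) for t, c, b in data if c == chapter]
    sub[chapter] = make_pipeline(TfidfVectorizer(stop_words="english"),
                                 LogisticRegression()).fit(
        [t for t, _ in rows], [b for _, b in rows])

def predict(text):
    chapter = top.predict([text])[0]
    return chapter, sub[chapter].predict([text])[0]
```

With real data you would add a third level for the individual codes; the per-chapter models are what makes the label space tractable.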
I hope this helps you understand better the problem and the different approaches you can give to it!

Can different summary metrics of a single feature be used as a features for k-means clustering?

I have a scenario where I want to understand customers' behavior patterns and group them into different segments/clusters for an e-commerce platform. I chose an unsupervised machine-learning algorithm, k-means clustering, to accomplish this task.
I have purchase_orders data available to me.
While preparing my data set, a question came up: can different summary metrics (Sum, Avg, Min, Max, Standard Deviation) of one feature be used as different features, or should I take only one summary metric (say, the sum of a customer's transaction amounts over multiple orders)?
Will this affect how the k-means algorithm works?
Which of the two data formats mentioned below would be optimal to feed to my algorithm to derive good results:
Format-1:
Customer ID | Total.TransactionAmount | Min.TransactionAmount | Max.TransactionAmount | Avg.TransactionAmount | StdDev.TransactionAmount | TotalNo.ofTransactions and so on...
Format-2:
Customer ID | Total.TransactionAmount | TotalNo.ofTransactions and so on...
(Note: Consider "|" as feature separator)
(Note: Customer ID is not fed as input to the algo)
Yes you can, but whether this is a good idea is all but clear.
These values will be correlated and hence this will distort the results. It will likely make all the problems you already have (such as the values not being linear, of the same importance and hence need weighting, and of similar magnitude) worse.
With features such as "transaction amount" and "number of transactions" you already have some pretty bad scaling issues to solve, so why add more?
It's straightforward to write down your objective function. Put your features into the equation, and try to understand what you are optimizing - is this really what you need? Or do you just want some random result?
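The correlation concern is easy to see numerically. In this invented example, each customer's summary metrics are derived from the same underlying transaction draws, so the correlation matrix has large off-diagonal entries:

```python
# Summary metrics of one underlying quantity are strongly correlated, so
# feeding all of them to k-means effectively multiplies that quantity's weight.
import numpy as np

rng = np.random.default_rng(0)
features = []
for _ in range(200):                       # 200 hypothetical customers
    n = rng.integers(1, 30)                # number of transactions
    amounts = rng.gamma(2.0, 50.0, n)      # transaction amounts
    features.append([amounts.sum(), amounts.mean(), amounts.max(), n])
X = np.array(features)                     # columns: sum, avg, max, count

corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))                   # note the off-diagonal entries

# Standardizing fixes the magnitude problem but NOT the correlation:
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
```

The sum/count correlation in particular is close to 1 because the sum is essentially count times a roughly constant average, which is exactly the redundancy the answer warns about.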

Different format for training and testing data [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
How can I use different format of data for training and testing?
Currently I am working on a classification problem where the data format differs between training and testing, so the model is not able to classify properly. But my use case is such that I have to use it that way. Below is my format.
I have the following structure for training now:
--------------------------------------------------------------
| Attribute_Names | Attribute_Values | Category |
--------------------------------------------------------------
| Brand | Samsung, Nokia, OnePlus | Mobile |
| RAM, Memory | 2 GB, 4 GB, 3 GB, 6GB | Mobile |
| Color,Colour | Black, Golden, White | Mobile |
--------------------------------------------------------------
| Fabric, Material | Cloth, Synthetic, Silk | Ethnic Wear |
| Pattern, Design | Digital, floral print | Ethnic Wear |
--------------------------------------------------------------
Testing: 'Samsung Galaxy On Nxt 3 GB RAM 16 GB ROM Expandable Upto 256 GB 5.5 inch Full HD Display'
I have also posted this problem with explanation in data-science group:
product-classification-in-e-commerce-using-attribute-keywords
Any help is greatly appreciated.
As a very general principle, the format of the training & test data must be the same. There are not simple workarounds for this. There can be cases of course where the information contained in the attributes of the said datasets is essentially the same, only formatted differently; such cases are handled with appropriate data preprocessing.
If your data-gathering process changes at some point in time, collecting different attributes, there are some choices available, highly dependent on the particular case: if you happen to collect more attributes than the ones with which you trained your initial classifier, you can simply ignore them, with the downside of possibly throwing away useful information. But if you happen to collect fewer attributes, there is not much you can do, other than possibly treat them as missing values in your prediction pipeline (provided that your classifier can indeed handle such missing values without a significant decrease in performance); the other choice would be to re-train your classifier, dropping from your initial training set the attributes not collected in your new data-gathering process.
Machine learning is not magic; on the contrary, it is subject to very real engineering constraints, the compatibility of the training and test data being one of the most fundamental ones. If you look around for real-world cases of deployed ML models, you'll see that the issue of possible periodic re-training of the deployed models pops up almost everywhere...
So, based on your description, I think what you have is a bunch of data where some items are labeled and most are not, which leaves you in a semi-supervised kind of setting.
Now, testing on different data than what you trained on is definitely a no-go. What you should do is mix up the data so that your labels are distributed across both the training and test sets - that is, if you want to use a supervised methodology, of course. Given that, you could use techniques for label propagation, where you e.g. cluster similar data and assign the most frequent label inside each group; this can be as easy as simple k-means clustering or as involved as creating your own graph structure and implementing a custom algorithm. Assuming you now have fully labeled data, you can train your model on a mixture of real and assigned labels and test it the same way.
This is the only way you get a reliable performance evaluation. You could obviously also try to train only on the genuinely labeled data and evaluate on a mixture of real and assigned labels, but I'd assume this would lead to over-fitting.
Another related approach would be to create some kind of word vectors (google word2vec, e.g.), treat your labeled data as clusters (one per category), and assign new unlabeled data to the most similar cluster (there are many variants and ways to do this).
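A compact sketch of that last idea, with TF-IDF vectors standing in for word2vec embeddings; the products and categories are invented to mirror the question's table:

```python
# Build one centroid per labeled category, then assign each unlabeled text
# to the most similar centroid (cosine similarity).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

labeled = {
    "Mobile": ["samsung phone 4 gb ram black", "nokia mobile 2 gb memory white"],
    "Ethnic Wear": ["silk fabric floral print", "synthetic cloth digital design"],
}
unlabeled = ["oneplus 6 gb ram golden mobile phone"]

vec = TfidfVectorizer().fit([t for ts in labeled.values() for t in ts] + unlabeled)
centroids = {c: np.asarray(vec.transform(ts).mean(axis=0))
             for c, ts in labeled.items()}

def propagate(text):
    v = vec.transform([text]).toarray()
    sims = {c: cosine_similarity(v, m)[0, 0] for c, m in centroids.items()}
    return max(sims, key=sims.get)

print(propagate(unlabeled[0]))  # -> Mobile
```

With word2vec or similar embeddings the mechanics are identical; only the vectorization step changes.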
There are many ways to tackle such a problem that are not some standard algorithms in a textbook but rather things you have to implement yourself or at least you need to do some serious preprocessing to make your data fit a standard algorithm.
I hope this helps a bit.

unsupervised learning on sentences

I have data that represents comments from the operator on various activities performed on an industrial device. A comment could reflect either a routine maintenance/replacement activity, or that some damage occurred and had to be repaired.
I have a set of 200,000 sentences that needs to be classified into two buckets: Repair or Scheduled Maintenance (or undetermined). These have no labels, hence I am looking for an unsupervised-learning-based solution.
Some sample data is as shown below:
"Motor coil damaged .Replaced motor"
"Belt cracks seen. Installed new belt"
"Occasional start up issues. Replaced switches"
"Replaced belts"
"Oiling and cleaning done".
"Did a preventive maintainence schedule"
The first three sentences should be labeled Repair, and the last three Scheduled Maintenance.
What would be a good approach to this problem? Though I have some exposure to machine learning, I am new to NLP-based machine learning.
I see many papers related to this https://pdfs.semanticscholar.org/a408/d3b5b37caefb93629273fa3d0c192668d63c.pdf
https://arxiv.org/abs/1611.07897
but wanted to understand if there is any standard approach to such problems
It seems like you could use some reliable keywords (verbs, it seems, in this case) to create training samples for an NLP classifier. Or you could use KMeans or KMedoids clustering with K = 2, which would do a pretty good job of separating the set. If you want to get really involved, you could use something like Latent Dirichlet Allocation, a form of unsupervised topic modeling. However, for a problem like this, on the small amount of data you have, the fancier you get the more frustrated with the results you will become, IMO.
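For instance, the K = 2 clustering idea applied to the question's own sample sentences (scikit-learn; no labels needed):

```python
# K-means with K=2 on TF-IDF features of the sample sentences.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

sentences = [
    "Motor coil damaged. Replaced motor",
    "Belt cracks seen. Installed new belt",
    "Occasional start up issues. Replaced switches",
    "Replaced belts",
    "Oiling and cleaning done",
    "Did a preventive maintenance schedule",
]
X = TfidfVectorizer().fit_transform(sentences)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```

On text this sparse the two clusters may split on surface words like "Replaced" rather than on repair vs. maintenance, which is exactly the frustration this answer warns about; keyword-seeded training data usually behaves better.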
Both OpenNLP and StanfordNLP have text classifiers for this, so I recommend the following if you want to go the classification route:
- Use key word searches to produce a few thousand examples of your two categories
- Put those sentences in a file with a label based on the OpenNLP format (label |space| sentence | newline )
- Train a classifier with the OpenNLP DocumentClassifier, and I recommend stemming for one of your feature generators
- After you have the model, use it from Java to classify each sentence.
- Keep track of the scores, and quarantine low scores (you will have ambiguous classes I'm sure)
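Steps 1-2 above can be sketched as a small script that keyword-matches raw sentences and writes them out one "label sentence" pair per line, as the format above describes (the keyword list here is a made-up starting point):

```python
# Produce a doccat-style training file: "<label> <sentence>" per line.
import re

sentences = [
    "Motor coil damaged. Replaced motor",
    "Oiling and cleaning done",
]
repair_keywords = {"damaged", "cracks", "issues"}  # hypothetical seed list

lines = []
for s in sentences:
    tokens = set(re.findall(r"[a-z]+", s.lower()))
    label = "Repair" if repair_keywords & tokens else "Maintenance"
    lines.append(f"{label} {s}")

with open("doccat.train", "w") as f:
    f.write("\n".join(lines) + "\n")
```

You would then point the OpenNLP DocumentClassifier trainer at this file.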
If you don't want to go that route, I recommend using a text-indexing technology du jour like Solr or Elasticsearch, or your favorite RDBMS's text indexing, to perform a "More Like This"-type query, so you don't have to play the continuous model-updating game.

Good training data for text classification by LDA?

I'm classifying content based on LDA into generic topics such as Music, Technology, Arts, Science
This is the process I'm using:
9 topics -> Music, Technology, Arts, Science, etc.
9 documents -> Music.txt, Technology.txt, Arts.txt, Science.txt, etc.
I've filled each document (.txt file) with about 10,000 lines of what I think is "pure" categorical content.
I then classify a test document to see how well the classifier is trained.
My questions are:
a.) Is this an efficient way to classify text (using the above steps)?
b.) Where should I be looking for "pure" topical content to fill each of these files? I need sources which are not too large (no more than ~1 GB of text data).
Classification is only on "generic" topics such as the above.
a) The method you describe sounds fine, but everything will depend on the implementation of labeled LDA that you're using. One of the best implementations I know is the Stanford Topic Modeling Toolbox. It is not actively developed anymore, but it worked great when I used it.
b) You can look for topical content on DBPedia, which has a structured ontology of topics/entities, and links to Wikipedia articles on those topics/entities.
I suggest you use a bag-of-words (BoW) representation for each class you are targeting - that is, vectors where each column is the frequency of important keywords related to the target class.
Regarding dictionaries, you have DBpedia, as yves mentioned, or WordNet.
a.) The simplest solution is surely the k-nearest neighbors algorithm (kNN). It will classify new texts with categorical content using an overlap metric.
You can find resources here: https://github.com/search?utf8=✓&q=knn+text&type=Repositories&ref=searchresults
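A minimal version of that kNN-with-overlap idea, sketched with scikit-learn rather than the linked repositories (the tiny corpus is invented):

```python
# Nearest labeled documents vote on a new text's topic.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

texts = [
    "guitar chords and live concert recordings",
    "new album from the jazz band",
    "smartphone chips and operating systems",
    "cloud servers and software updates",
]
topics = ["Music", "Music", "Technology", "Technology"]

knn = make_pipeline(TfidfVectorizer(),
                    KNeighborsClassifier(n_neighbors=1, metric="cosine"))
knn.fit(texts, topics)
print(knn.predict(["a live jazz concert album"]))  # -> ['Music']
```

With real topical documents you would raise `n_neighbors` and tune it on held-out data.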
Dataset issue:
If you are dealing with classifying live user feeds, then I guess no single dataset will satisfy your requirement, because if a new movie X is released, your classifier might not catch it, as the training dataset is now obsolete for it.
To stay updated with the latest data, use Twitter training datasets: develop a dynamic pipeline which updates the classifier with the latest tweets. You could select the top 15-20 hashtags for each category of your choice to get the most relevant dataset per category.
Classifier:
Most classifiers use a bag-of-words model; you can try out various classifiers and see which gives the best result. See:
http://www.nltk.org/howto/classify.html
http://scikit-learn.org/stable/supervised_learning.html
