Identifying specific parts of a document using CRF - machine-learning

My goal is given a set of documents (mostly in financial domain), we need to identify specific parts of it like Company Name or Type of the document, etc.
The training is assumed to be done on acouple of 100's of documents. Obviously I would have a skewed class distribution (with None dominating around 99.9% of the examples).
I plan to use CRF (CRFsuite on Sklearn) and have gone through the necessary literature . I needed some advice on the following fronts :
Will the dataset be sufficient to train the CRF? Considering each document can be split into around 100 tokens (each token being a training instance) , we would get 10000 instances in total.
Will the data set be too skewed for training a CRF? For ex: for 100 documents I would have around 400 instances of given class and around 8000 instances of None

Nobody knows that, you have to try it on your dataset, check resulting quality, maybe inspect the CRF model (e.g. https://github.com/TeamHG-Memex/eli5 has sklearn-crfsuite support - a shameless plug), try to come up with better features or decide to annotate more examples, etc. This is just a general data science work. Dataset size looks on a lower side, but depending on how structured is the data and how good are features a few hundred documents may be enough to get started. As the dataset is small, you may have to invest more time in feature engineering.
I don't think class imbalance would be a problem, at least it is unlikely to be your main problem.

Related

AutoML NL - Training model based on ICD10-CM - Amount of text required

We are currently working on integrating ICD10-CM for our medical company, to be used for patient diagnosis. ICD10-CM is a coding system for diagnoses.
I tried to import ICD10-CM data in description-code pairs but obviously, it didn't work since AutoML needed more text for that code(label). I found a dataset on Kaggle but it only contained hrefs to an ICD10 website. I did find out that the website contains multiple texts and descriptions associated with codes that can be used to train our desired model.
Kaggle Dataset:
https://www.kaggle.com/shamssam/icd10datacom
Sample of a page from ICD10data.com:
https://www.icd10data.com/ICD10CM/Codes/A00-B99/A15-A19/A17-/A17.0
Most notable fields are:
- Approximate Synonyms
- Clinical Information
- Diagnosis Index
If I made a dataset from the sentences found in these pages and assigned them to their code(labels), will it be enough for AutoML dataset training? since each label will have 2 or more texts finally instead of just one, but definitely still a lot less than a 100 for each code unlike those in demos/tutorials.
From what I can see here, the disease code has a tree-like structure where, for instance, all L00-L99 codes refer to "Diseases of the skin and subcutaneous tissue". At the same time L00-L08 codes refer to "Infections of the skin and subcutaneous tissue", and so on.
What I mean is that the problem is not 90000 examples for 90000 different independent labels, but a decision tree (you take several decisions in function of the previous decision: the first step would be choosing which of the about 15 most general categories fits best, then choosing which of the subcategories etc.)
In this sense, probably autoML is not the best product, given that you cannot implement a specially designed decision tree model that takes into account all of this.
Another way of using autoML would be training separately for each of the decisions and then combine the different models. This would easily work for the first layer of decision but would be exponentially time consuming (the number of models to train in order to be able to predict more accurately grows exponentially with the level of accuracy, by accurate I mean afirminng it is L00-L08 instad of L00-L99).
I hope this helps you understand better the problem and the different approaches you can give to it!

Different performance by different ML classifiers, what can I deduce?

I have used a ML approach to my research using python scikit-learn. I found that SVM and logistic regression classifiers work best (eg: 85% accuracy), decision trees works markedly worse (65%), and then Naive Bayes works markedly worse (40%).
I will write up the conclusion to illustrate the obvious that some ML classifiers worked better than the others by a large margin, but what else can I say about my learning task or data structure based on these observations?
Edition:
The data set involved 500,000 rows, and I have 15 features but some of the features are various combination of substrings of certain text, so it naturally expands to tens of thousands of columns as a sparse matrix. I am using people's name to predict some binary class (eg: Gender), though I feature engineer a lot from the name entity like the length of the name, the substrings of the name, etc.
I recommend you to visit this awesome map on choosing the right estimator by the scikit-learn team http://scikit-learn.org/stable/tutorial/machine_learning_map
As describing the specifics of your own case would be an enormous task (I totally understand you didn't do it!) I encourage you to ask yourself several questions. Thus, I think the map on 'choosing the right estimator' is a good start.
Literally, go to the 'start' node in the map and follow the path:
is my number of samples > 50?
And so on. In the end you might end at some point and see if your results match with the recommendations in the map (i.e. did I end up in a SVM, which gives me better results?). If so, go deeper into the documentation and ask yourself why is that one classifier performing better on text data or whatever insight you get.
As I told you, we don't know the specifics of your data, but you should be able to ask such questions: what type of data do I have (text, binary, ...), how many samples, how many classes to predict, ... So ideally your data is going to give you some hints about the context of your problem, therefore why some estimators perform better than others.
But yeah, your question is really broad to grasp in a single answer (and specially without knowing the type of problem you are dealing with). You could also check if there might by any of those approaches more inclined to overfit, for example.
The list of recommendations could be endless, this is why I encourage you to start defining the type of problem you are dealing with and your data (plus to the number of samples, is it normalized? Is it disperse? Are you representing text in sparse matrix, are your inputs floats from 0.11 to 0.99).
Anyway, if you want to share some specifics on your data we might be able to answer more precisely. Hope this helped a little bit, though ;)

Association Rule Mining on Categorical Data with a High Number of Values for Each Attribute

I am struggling with association rule mining for a data set, the data set has a lot of binary attributes but also has a lot of categorical attributes. Converting the categorical to binary is theoretically possible but not practical. I am searching for a technique to overcome this issue.
Example of data for a car specifications, to execute association rule mining, the car color attribute should be a binary, and in the case of colors, we have a a lot of colors to be transferred to binary (My data set is insurance claims and its much worse than this example).
Association rule mining doesn't use "attributes". It processes market basket type of data.
It does not make sense to preprocess it to binary attributes. Because you would need to convert the binary attributes into items again (worst case, you would then tranlate your "color=blue" item into "color_red=0, color_black=0, ... color_blue=1" if you are also looking for negative rules.
Different algorithms - and different implementations of the theoretically same algorithm, unfortunately - will scale very differently.
APRIORI is designed to scale well with the number of transactions, but not very well with the number of different items that have minimum support; in particular if you are expecting short itemsets to be frequent only. Other algorithms such as Eclat and FP-Growth may be much better there. But YMMV.
First, try to convert the data set into a market basket format, in a way that you consider every item to be relevant. Discard everything else. Then start with a high minimum support, until you start getting results. Running with a too low minimum support may just run out of memory, or may take a long time.
Also, make sure to get a good implementation. A lot of things that claim to be APRIORI are only half of it, and are incredibly slow.

Incremental Adding of New Categories

I was wondering if there is any algorithm for incrementally adding new classes to existing classifier system. For e.g. if I have trained a system with 50 categories, and I want to add another 10 categories to the system, what methods should I look into? There are wide range of algorithms that allow incrementally updating system with additional training samples from existing categories, but I am not aware of methods that will allow adding more categories. Theoretically, I think Nearest Neighbor like algorithms can be applied to this task, but are there other algorithms that are suitable for large scale tasks (say updating a system trained with 500 categories with 50 additional categories? May be in the domain of incremental decision trees? Algorithms like incremental SVM do not scale very well for large number of categories. If there is any paper/code I would appreciate pointers to it.
If I understand your question correctly, you're asking about divisive clustering (you have a given set of data and want to re-cluster them with a larger number of groups than before).
Most algorithms I'm familiar with would require re-building the clustering basically from scratch. However, you might want to look at the BIRCH algorithm. Since it stores only a summary of the classes (without explicit data references), it is a) suitable for Big Data™, and b) it features a kind of distance measure that might tell you which category you should split next (in case you want to dynamically generate additional 50 "most distinct" categories).

what should I do when training set contains some error data in supervised classification?

I am working on a project which performs text auto-classification, I have a lot of data set like as below:
Text | CategoryName
xxxxx... | AA
yyyyy... | BB
zzzzz... | AA
then, I will use the above data set to generate a classifier, once new text coming, the classifier can label new text with correct CategoryName
(text is natural language, size between 10-10000)
Now, the problem is, the original data set contains some incorrect data, (E.g. AAA should be labeled as Category AA, but it is labeled as Category BB accidentally ) because these data are classified manually. And I don't know which label is wrong and how many percentages are wrong because I can't review all data manually...
So my question is, what should I do?
Can I find the wrong labels via some automatic way?
How to increase precision and recall when new data coming?
How to evaluate the impact of wrong data? (since I don't know how many percentage data is wrong)
Any other suggestions?
Obviously, there is no easy way to solve your problem - after all, why build a classifier if you already have a system that can detect wrong classifications.
Do you know how much the erroneous classifications affect your learning? If there are only a small percentage of them, they should not hurt the performance much. (Edit. Ah, apparently you don't. Anyway, I suggest you try it out - at least if you can identify a false result when you see one.)
Of course, you could always first train your system and then have it suggest classifications for the training data. This might help you identify (and correct) your faulty training data. This obviously depends on how much training data you have, and if it is sufficiently broad to allow your system to learn correct classification despite the faulty data.
Can you review any of the data manually to find some mislabeled examples? If so, you might be able to train a second classifier to identify mislabeled data, assuming there is some kind of pattern to the mislabeling. It would be useful for you to know if mislabeling is a purely random process (it is just noise in the training data) or if mislabeling correlates with particular features of the data.
You can't evaluate the impact of mislabeled data on your specific data set if you have no estimate regarding what fraction of your training set is actually mislabeled. You mention in a comment that you have ~5M records. If you can correctly manually label a few hundred, you could train your classifier on that data set, then see how the classifier performs after introducing random mislabeling. You could do this multiple times with varying percentages of mislabeled data to see the impact on your classifier.
Qualitatively, having a significant quantity of mislabeled samples will increase the impact of overfitting so it is even more important that you do not overfit your classifier to the data set. If you have a test data set (assuming it also suffers from mislabling), then you might consider training your classifier to less-than-maximal classification accuracy on the test data set.
People usually deal with the problem you a describing by having multiple annotators and computing their agreement (e.g. Fleiss' kappa). This is often seen as the upper bound on the performance of any classifier. If three people give you three different answers, you know the task is quite hard and your classifier stands no chance.
As a side note:
If you do not know how many of your records have been labelled incorrectly, you do not understand one of the key properties of the problem. Select 1000 records at random and spend the day reviewing their labels to get an idea. It really is time well spent. For example, I found I can easily review 500 labelled tweets per hour. Health warning: it is very tedious, but a morning spent reviewing gives me a good idea of how distracted my annotators were. If 5% of the records are incorrect, it is not such a problem. If 50 are incorrect, you should go back you your boss and tell them it can't be done.
As another side note:
Someone mentioned active learning. I think it is worth looking into options from the literature, keeping in mind labels might have to change. You said that it hard.

Resources