AutoML NL - Training model based on ICD10-CM - Amount of text required

We are currently working on integrating ICD10-CM for our medical company, to be used for patient diagnosis. ICD10-CM is a coding system for diagnoses.
I tried to import ICD10-CM data as description-code pairs, but it didn't work since AutoML needed more text for each code (label). I found a dataset on Kaggle, but it only contained hrefs to an ICD10 website. I did find out that the website contains multiple texts and descriptions associated with each code that could be used to train our desired model.
Kaggle Dataset:
https://www.kaggle.com/shamssam/icd10datacom
Sample of a page from ICD10data.com:
https://www.icd10data.com/ICD10CM/Codes/A00-B99/A15-A19/A17-/A17.0
Most notable fields are:
- Approximate Synonyms
- Clinical Information
- Diagnosis Index
If I made a dataset from the sentences found on these pages and assigned them to their codes (labels), would that be enough for AutoML training? Each label would then have two or more texts instead of just one, but still far fewer than the 100 per label shown in the demos/tutorials.

From what I can see here, the disease codes have a tree-like structure where, for instance, all L00-L99 codes refer to "Diseases of the skin and subcutaneous tissue", while within that range the L00-L08 codes refer to "Infections of the skin and subcutaneous tissue", and so on.
What I mean is that the problem is not 90,000 examples for 90,000 independent labels, but rather a decision tree: you make several decisions, each depending on the previous one. The first step would be choosing which of the roughly 15 most general categories fits best, then choosing which of its subcategories, and so on.
In this sense, AutoML is probably not the best product for this, given that you cannot implement a specially designed hierarchical model that takes all of this structure into account.
Another way of using AutoML would be to train a separate model for each decision and then combine the different models. This would easily work for the first layer of decisions, but it becomes exponentially time-consuming: the number of models you need to train grows exponentially with the level of detail you want to reach (by detail I mean affirming that a code is in L00-L08 rather than just in L00-L99).
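To make the layered idea concrete, here is a minimal sketch (not AutoML, just scikit-learn with hypothetical column names) of a two-level hierarchical classifier: one model picks the top-level chapter (e.g. L00-L99), then a per-chapter model picks the block (e.g. L00-L08) within it. Adapt the data loading to whatever you scrape from icd10data.com.

```python
# Hypothetical sketch of a two-level hierarchical text classifier.
# Assumes a DataFrame with columns "text", "chapter" (e.g. "L00-L99")
# and "block" (e.g. "L00-L08"); adjust to your scraped dataset.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

df = pd.read_csv("icd10_sentences.csv")  # hypothetical file name

# Level 1: predict the chapter from the text.
chapter_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
chapter_clf.fit(df["text"], df["chapter"])

# Level 2: one classifier per chapter, predicting the block within that chapter.
block_clfs = {}
for chapter, group in df.groupby("chapter"):
    if group["block"].nunique() > 1:
        clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        clf.fit(group["text"], group["block"])
        block_clfs[chapter] = clf

def predict(text):
    """Return (chapter, block) for one input text."""
    chapter = chapter_clf.predict([text])[0]
    clf = block_clfs.get(chapter)
    block = clf.predict([text])[0] if clf is not None else None
    return chapter, block
```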
I hope this helps you understand the problem better, along with the different approaches you can take to it!

Related

NLP Paragraph-Level Predictions vs Document-Level Predictions? What Strategy to Deploy

I currently want to sanity-check the modelling approach I am using. I have a TF-IDF NLP model that reads in the paragraphs of a document and makes a prediction based on how many paragraphs were scored with a 1 label.
I am not sure if that is the correct logic. Should I just go with a document-level model? What are the benefits and trade-offs of predicting at the paragraph level and rolling that up into a total prediction for the document versus just classifying the document itself?
Any Thoughts?
Thanks!
It depends on what problem you are trying to solve and the nature of your data.
If different parts of one document can be classified differently, it's better to make predictions per paragraph or even per sentence. For example, quite often a customer is happy with one part of the product/item (the first sentence is positive) but dissatisfied with another part of it (the second sentence has a negative sentiment).
Or, if the document is entirely related to a specific topic, you can make a prediction using the entire text.
In the end, these are just assumptions. Hold out a test subset and validate your model for both cases.
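As an illustration of the roll-up strategy, here is a minimal sketch (toy data, scikit-learn) that classifies each paragraph with a TF-IDF model and then labels the document by the share of paragraphs scored 1; you can compare this against a single document-level model on your held-out test set.

```python
# Toy sketch: paragraph-level TF-IDF classifier rolled up into a
# document-level decision by the share of positive paragraphs.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy paragraph-level training data (label 1 = positive/relevant paragraph).
train_paragraphs = [
    "great build quality and fast delivery",
    "battery died after a week",
    "the screen is bright and sharp",
    "support never answered my emails",
]
train_labels = [1, 0, 1, 0]

para_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
para_clf.fit(train_paragraphs, train_labels)

def predict_document(paragraphs, threshold=0.5):
    """Label the document 1 if more than `threshold` of its paragraphs score 1."""
    para_preds = para_clf.predict(paragraphs)
    return int(np.mean(para_preds) > threshold)

doc = ["the screen is bright", "battery died after a week", "fast delivery"]
print(predict_document(doc))
```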

Overfitting Despite Missing Value Handling, Tree-Based Learning

My fellow students and I are working on an educational machine learning project, and we are stuck with an overfitting problem, as we are quite inexperienced in data mining.
Our business case is retail banking: we aim to identify customer target groups for specific products, i.e. to recommend products to customers based on the products they have already bought, such as stock shares, funds, deposits, etc.
We received a data set with about 400 features and 150,000 records. We built our workflows in KNIME. Our workflow includes the following steps:
We explored the data and defined the target variables
We used a Missing Value Column Filter to eliminate all columns consisting mostly of missing values
We also applied the Tree Ensemble Workflow to reduce the dimensions
All in all, we cleaned up the data and reduced it from 400 variables down to about 50.
For modelling we use a simple decision tree, and here the problem appears: the tree always reports an accuracy of 100%, so we assume it is heavily overfitted.
Is there anything we are doing wrong? What should we focus on?
We hope the community could help us with some hints or tips.
Edit:
Are there any sources, papers, etc. on how to apply cross-selling/up-selling in a data mining tool such as KNIME? We googled it already but so far we've been unsuccessful.
One of the problems with decision trees is that they are prone to overfitting. You can apply pruning, which reduces the complexity of the model and thereby improves predictive accuracy by reducing overfitting. Also try tuning the minimum samples per leaf and the maximum tree depth.
Agree with the previous comment: the main weakness of decision trees is their tendency to overfit.
Try to make the decision tree simpler (reduce its depth, at least).
Use ensemble methods (Random Forests or even XGBoost). They are the next generation of DT.
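Although your workflow is in KNIME, the underlying idea can be sketched in a few lines of scikit-learn on synthetic data: constrain the tree's complexity (depth, leaf size, pruning strength) and choose those settings by cross-validation rather than trusting training accuracy. The parameter grid below is only an example.

```python
# Illustrative sketch on synthetic data: limit a decision tree's complexity
# and pick the settings by cross-validation instead of training accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 10, 50],
    "ccp_alpha": [0.0, 0.001, 0.01],  # cost-complexity pruning strength
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
print("held-out accuracy:", round(search.score(X_test, y_test), 3))
```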

Identifying specific parts of a document using CRF

My goal is, given a set of documents (mostly in the financial domain), to identify specific parts of them, such as the company name or the type of the document.
The training is assumed to be done on a couple of hundred documents. Obviously I would have a skewed class distribution (with None dominating around 99.9% of the examples).
I plan to use a CRF (CRFsuite via its sklearn wrapper) and have gone through the necessary literature. I need some advice on the following fronts:
Will the dataset be sufficient to train the CRF? Considering each document can be split into around 100 tokens (each token being a training instance), we would get 10,000 instances in total.
Will the data set be too skewed for training a CRF? For example, for 100 documents I would have around 400 instances of a given class and around 8,000 instances of None.
Nobody knows that; you have to try it on your dataset, check the resulting quality, maybe inspect the CRF model (e.g. https://github.com/TeamHG-Memex/eli5 has sklearn-crfsuite support - a shameless plug), try to come up with better features, or decide to annotate more examples, etc. This is just regular data science work. The dataset size looks on the low side, but depending on how structured the data is and how good the features are, a few hundred documents may be enough to get started. As the dataset is small, you may have to invest more time in feature engineering.
I don't think class imbalance would be a problem, at least it is unlikely to be your main problem.
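For a quick first estimate of quality, a minimal sklearn-crfsuite setup looks roughly like the sketch below. The token features (casing, shape, neighbours) and the label names are only examples; the input format is a list of per-token feature-dict sequences with a parallel list of label sequences.

```python
# Minimal sklearn-crfsuite sketch. X is a list of documents, each a list of
# per-token feature dicts; y is the parallel list of label sequences
# (e.g. "COMPANY", "DOC_TYPE", "O" for None). Feature choices are examples only.
import sklearn_crfsuite
from sklearn_crfsuite import metrics

def token_features(tokens, i):
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_upper": word.isupper(),
        "is_title": word.istitle(),
        "is_digit": word.isdigit(),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# docs: list of token lists; labels: list of label lists with the same shapes.
docs = [["Acme", "Corp", "annual", "report"]]
labels = [["COMPANY", "COMPANY", "O", "DOC_TYPE"]]

X = [[token_features(doc, i) for i in range(len(doc))] for doc in docs]
y = labels

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, y)

# Evaluate per-label F1, ignoring the dominant "O"/None class.
pred = crf.predict(X)
print(metrics.flat_f1_score(y, pred, average="weighted",
                            labels=[l for l in crf.classes_ if l != "O"]))
```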

metric learning for information retrieval in semi-structured text?

I am interested in parsing semi-structured text. Assuming I have a text with labels of the kind: year_field, year_value, identity_field, identity_value, ..., address_field, address_value and so on.
These fields and their associated values can appear anywhere in the text, but usually they are near each other; more generally, the text is organized in a (very) rough matrix, though quite often the value comes right after its associated field, possibly with some uninteresting information in between.
There can be up to several dozen different formats, and they are not that rigid (do not count on spacing; moreover, some information can be added or removed).
I am looking toward machine learning techniques to extract all those (field, value) pairs of interest.
I think metric learning and/or conditional random fields (CRFs) could be of great help, but I have no practical experience with them.
Has anyone already encountered a similar problem?
Any suggestion or literature on this topic?
Your task, if I understand correctly, is to extract all pre-defined entities from a text. What you describe here is exactly named entity recognition.
Stanford has a Stanford Named Entity Recognizer that you can download and use (python/java and more)
Regarding the models you are considering (a CRF, for example): the hard part here is getting the training data - sentences with the entities already labeled. This is why you should consider using a pre-trained model, or using someone else's data to train your model (again, the model will only recognize the entities it saw during training).
A good choice for an already trained model in Python is NLTK's information extraction module.
Hope this sums it up
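For reference, the out-of-the-box NLTK pipeline mentioned above looks roughly like the sketch below. Note that its pre-trained chunker only knows generic entity types (PERSON, ORGANIZATION, GPE, ...), so domain-specific fields like year_value or address_field would still require a custom-trained model such as a CRF.

```python
# Out-of-the-box NER with NLTK's pre-trained models. These recognize only
# generic entity types (PERSON, ORGANIZATION, GPE, ...), not custom fields.
import nltk

for pkg in ("punkt", "averaged_perceptron_tagger",
            "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

text = "John Smith moved to 221B Baker Street, London in 2019."
tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)

# Print the recognized entities and their types.
for subtree in tree:
    if hasattr(subtree, "label"):
        entity = " ".join(token for token, tag in subtree.leaves())
        print(subtree.label(), "->", entity)
```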

Using decision tree in Recommender Systems

I have a decision tree trained on the columns (Age, Sex, Time, Day, Views, Clicks) that classifies into two classes - Yes or No - representing the buying decision for an item X.
Using these values, I'm trying to predict the probability for 1000 samples (customers) that look like:
('12', 'Male', '9:30', 'Monday', '10', '3'),
('50', 'Female', '10:40', 'Sunday', '50', '6'),
...
I want to get an individual probability or score that helps me recognize which customers are most likely to buy item X, so that I can sort the predictions and show the item to only the 5 customers most likely to buy it.
How can I achieve this ?
Will a decision tree serve the purpose?
Is there any other method?
I'm new to ML so please forgive me for any vocabulary errors.
Using a decision tree with a small sample set, you will definitely run into overfitting problems, especially at the lower levels of the tree, where you have exponentially less data to train your decision boundaries. Your data set should have far more samples than the number of categories, and enough samples for each category.
Speaking of decision boundaries, make sure you understand how you are handling the data type of each dimension. For example, 'sex' is categorical data, whereas 'age', 'time of day', etc. are real-valued inputs (discrete/continuous), so different parts of your tree will need different formulations. Otherwise, your model might end up handling 9:30, 9:31, 9:32... as separate classes.
Try some other algorithms, starting with simple ones like k-nearest neighbours (KNN). Keep a validation set to test each algorithm. Use MATLAB (or similar software) where you can use libraries to quickly try different methods and see which one works best. There is not enough information here to recommend something very specific.
Plus, I suggest you try KNN because it is able to capture affinity in the data. Say product X is bought by people around age 20, during evenings, after about 5 clicks on the product page; KNN will tell you how close each new customer is to the customers who bought the item, and based on this you can just pick the top 5. It is very easy to implement and works well as a benchmark for more complex methods.
(Assuming Views and Clicks mean the number of views and clicks by each customer for product X.)
A decision tree is a classifier, and in general it is not suitable as the basis for a recommender system. But given that you are only predicting the likelihood of buying one item, not tens of thousands, it does make sense to use a classifier here.
You simply score all of your customers and retain the 5 whose probability of buying X is highest, yes. Is there any more to the question?
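A minimal sketch of that scoring step in scikit-learn (toy data, hypothetical column names): one-hot encode the categorical columns, fit the tree, rank the customers by predict_proba, and keep the top 5.

```python
# Hypothetical sketch: one-hot encode categorical columns, fit a decision tree,
# then rank customers by their predicted probability of buying X and keep the top 5.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy training data; "hour" is Time converted to a number, "bought" is the target.
train = pd.DataFrame({
    "age": [12, 50, 23, 34, 45, 19],
    "sex": ["M", "F", "F", "M", "F", "M"],
    "hour": [9.5, 10.7, 20.0, 18.5, 9.0, 21.0],
    "day": ["Mon", "Sun", "Sat", "Mon", "Tue", "Sun"],
    "views": [10, 50, 5, 8, 30, 12],
    "clicks": [3, 6, 1, 2, 5, 4],
    "bought": [0, 1, 0, 0, 1, 1],
})

features = ["age", "sex", "hour", "day", "views", "clicks"]
pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["sex", "day"])],
    remainder="passthrough",
)
model = Pipeline([("pre", pre),
                  ("tree", DecisionTreeClassifier(max_depth=3, random_state=0))])
model.fit(train[features], train["bought"])

# Score new customers and keep the 5 most likely buyers.
customers = train[features].copy()  # replace with your 1000 new customers
scores = model.predict_proba(customers)[:, 1]
top5 = customers.assign(score=scores).nlargest(5, "score")
print(top5)
```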
