I am building an RNN for a time series model which has a categorical output.
For example, if the previous pattern is "A", "B", "A", "B", the model predicts the next is "A".
There's also a numerical level associated with each category.
For example, A is 100 and B is 50,
so A(100), B(50), A(100), B(50).
I have the model framework to predict the next is "A"; it would be nice to predict the (100) at the same time.
For a real-life example, take national weather data.
You are predicting the next few days' weather type (sunny, windy, raining, etc.); at the same time, it would be nice if the model also predicted the temperature.
Or, for Amazon, analyzing a customer's transaction patterns.
Customer A shopped categories
electronics($100), household($10), ... ...
Predict which category this customer is likely to shop next, and at the same time predict what the amount of that transaction would be.
I researched a bit and have not found any relevant research on similar topics.
What is stopping you from adding an extra output to your model? You could have one categorical output and one numerical output next to each other. Every neural network library out there supports multiple outputs.
You will need to normalise your output data, though. Categories should be normalised with one-hot encoding, and numerical values should be normalised by dividing by some maximal value.
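For instance, here is a minimal sketch with the Keras functional API (the framework choice and all sizes are illustrative assumptions, not from the question): one shared LSTM body with a softmax head for the next category and a linear head for its numerical level.

from tensorflow import keras
from tensorflow.keras import layers

seq_len, n_features, n_categories = 4, 2, 2  # assumed shapes

inputs = keras.Input(shape=(seq_len, n_features))
x = layers.LSTM(32)(inputs)  # shared body

# two heads: categorical (one-hot targets) and numerical (scaled targets)
category = layers.Dense(n_categories, activation="softmax", name="category")(x)
level = layers.Dense(1, name="level")(x)

model = keras.Model(inputs, [category, level])
model.compile(
    optimizer="adam",
    loss={"category": "categorical_crossentropy", "level": "mse"},
    loss_weights={"category": 1.0, "level": 1.0},  # tune the balance between losses
)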
I researched a bit and have not found any relevant research on similar topics.
Because this is not really a 'topic'. This is something completely normal, and it does not require some special kind of network.
Background
I am trying to create a model that can predict Type 2 diabetes in a patient based on MRI scans of their thigh muscle. Previous literature has shown that fat deposition in the muscle of the femur is linked to Type 2 diabetes, so there is some valid relationship here.
I have a dataset comprised of several hundred patients. I am analyzing radiomics features of their MRI scans, which are basically quantitative imaging features (think things like texture, intensity, variance of texture in a specific direction, etc.). The kicker here is that an MRI scan is a three-dimensional object, but I have radiomics features for each of the 2D slices, not radiomics of the entire 3D thigh muscle. So this dataset has repeated rows for each patient, "multiple records for one observation." My objective is to output a binary Yes/No classification of T2DM for a single patient.
Problem Description
Based on some initial exploratory data analysis, I think the key here is that some slices are more informative than others. For example, one thing I tried was to group the slices by patient, and then analyze each slice in feature hyperspace. I selected the slice with the furthest distance from the center of all the other slices in feature hyperspace and used only that slice for the patient.
I have also tried just aggregating all the features, so that each patient is reduced to a single row but has many more features. For example, there might be a feature called "median intensity." But now the patient will have 5 features, called "median intensity__mean", "median intensity__median", "median intensity__max", and so forth. These aggregations are across all the slices that belong to that patient. This did not work well and yielded an AUC of 0.5.
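For reference, the aggregation I mean looks roughly like this in pandas (column names are made up):

import pandas as pd

# toy stand-in for the radiomics table: one row per (patient, slice)
slices_df = pd.DataFrame({
    "patient_id": [1, 1, 2, 2],
    "median_intensity": [0.4, 0.6, 0.9, 1.1],
})

agg = slices_df.groupby("patient_id").agg(["mean", "median", "max"])
agg.columns = ["__".join(col) for col in agg.columns]  # e.g. "median_intensity__mean"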
I'm trying to find a way to select the most informative slices for each patient that will then be used for the classification; or an informative way of reducing all the records for a single observation down to a single record.
Solution Thoughts
One thing I'm thinking is that it would probably be best to train some sort of neural net to learn which slices to pick before feeding those slices to another classifier. Effectively, the neural net would learn a linear transformation applied to the matrix of (slices, features) for each patient, so some slices would be upweighted while others would be downweighted. Then I could compute the mean along the slice axis and use that as input to the final classifier. I would appreciate examples of code for how this would work; in particular, I'm not sure how you would hook up the loss function from the final classifier (in my case, an LGBMClassifier) to the neural net so that backpropagation occurs from the final classification all the way through the ensemble model.
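To illustrate the weighting idea, here is a sketch of attention-based pooling (a common multiple-instance-learning formulation, not my full pipeline). Gradients cannot flow back through an LGBMClassifier, so in this sketch the final classifier is a small linear head trained jointly, and all names and sizes are made up:

import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, n_features, hidden=64):
        super().__init__()
        # scores each slice; softmax turns the scores into per-slice weights
        self.score = nn.Sequential(
            nn.Linear(n_features, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )
        self.classifier = nn.Linear(n_features, 1)  # binary T2DM logit

    def forward(self, slices):  # slices: (n_slices, n_features)
        weights = torch.softmax(self.score(slices), dim=0)
        patient_vector = (weights * slices).sum(dim=0)  # weighted mean of slices
        return self.classifier(patient_vector), weights

model = AttentionPooling(n_features=100)
logit, weights = model(torch.randn(40, 100))  # one patient with 40 slices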
Overall, I'm open to any ideas on how to approach this issue of reducing multiple records for one observation down to the most informative / subset of the most informative records for one observation.
I have been searching for and attempting to implement a word embedding model to predict similarity between words. I have a dataset made up of 3,550 company names; the idea is that the user can provide a new word (which would not be in the vocabulary) and calculate the similarity between the new name and existing ones.
During preprocessing I got rid of stop words and punctuation (hyphens, dots, commas, etc.). In addition, I applied stemming and separated prefixes, hoping to get more precision. Words such as BIOCHEMICAL then ended up as BIO CHEMIC, i.e. the word divided in two (prefix and stem word).
The average company name is made up of 3 words (frequency plot of name lengths not shown).
The tokens that are the result of preprocessing are sent to word2vec:
# window: Maximum distance between the current and predicted word within a sentence
# min_count: Ignores all words with total frequency lower than this
# workers: Use this many worker threads to train the model
# sg: The training algorithm, either CBOW (0) or skip-gram (1). Default is 0.
from gensim.models import Word2Vec

word2vec_model = Word2Vec(prepWords, size=300, window=2, min_count=1, workers=7, sg=1)
After the model has included all the words in the vocab, the average sentence vector is calculated for each company name:
df['avg_vector'] = df2.apply(lambda row: avg_sentence_vector(row, model=word2vec_model, num_features=300, index2word_set=set(word2vec_model.wv.index2word)).tolist())
Then, the vector is saved for further lookups:
## Saving name and vector values in file
df.to_csv('name-submission-vectors.csv', encoding='utf-8', index=False)
If a new company name is not included in the vocab after preprocessing (removing stop words and punctuation), then I create the model again, calculate the average sentence vector, and save it again.
I have found this model is not working as expected. As an example, calculating the words most similar to pet gives the following results:
ms=word2vec_model.most_similar('pet')
('fastfood', 0.20879755914211273)
('hammer', 0.20450574159622192)
('allur', 0.20118337869644165)
('wright', 0.20001833140850067)
('daili', 0.1990675926208496)
('mgt', 0.1908089816570282)
('mcintosh', 0.18571510910987854)
('autopart', 0.1729743778705597)
('metamorphosi', 0.16965581476688385)
('doak', 0.16890916228294373)
The dataset contains words such as paws or petcare, yet it is other, unrelated words that end up closest to the word pet.
This is the distribution of the nearest words for pet (plot not shown).
On the other hand, when I used GoogleNews-vectors-negative300.bin.gz, I could not add new words to the vocab, but the similarity between pet and its neighbouring words was as expected:
ms=word2vec_model.most_similar('pet')
('pets', 0.771199643611908)
('Pet', 0.723974347114563)
('dog', 0.7164785265922546)
('puppy', 0.6972636580467224)
('cat', 0.6891531348228455)
('cats', 0.6719794869422913)
('pooch', 0.6579219102859497)
('Pets', 0.636363685131073)
('animal', 0.6338439583778381)
('dogs', 0.6224827170372009)
This is the distribution of the nearest words (plot not shown).
I would like to get your advice about the following:
Is this dataset appropriate to proceed with this model?
Is the dataset large enough to allow word2vec to "learn" the relationships between the words?
What can I do to improve the model so that word2vec creates relationships of the same kind as GoogleNews, where for instance the word pet is correctly placed among similar words?
Is it feasible to implement another alternative such as fastText, considering the nature of the current dataset?
Do you know any public dataset that can be used along with the current dataset to create those relationships?
Thanks
3,550 texts (company names) of just ~3 words each is only around 10k total training words, with a much smaller vocabulary of unique words.
That's very, very small for word2vec & related algorithms, which rely on lots of data, and sufficiently-varied data, to train-up useful vector arrangements.
You may be able to squeeze some meaningful training from limited data by using far more training epochs than the default epochs=5, and far smaller vectors than the default size=100. With those sorts of adjustments, you may start to see more meaningful most_similar() results.
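For example, reusing the question's own gensim call with those adjustments (gensim 3 parameter names, as in the question; in gensim 4 these become vector_size= and epochs=):

from gensim.models import Word2Vec

word2vec_model = Word2Vec(
    prepWords,   # the preprocessed token lists from the question
    size=32,     # far smaller than the default 100
    window=2,
    min_count=1,
    workers=7,
    sg=1,
    iter=200,    # far more passes than the default 5
)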
But, it's unclear that word2vec, and specifically word2vec in your averaging-of-a-name's-words comparisons, is matched to your end goals.
Word2vec needs lots of data, doesn't look at subword units, and can't say anything about word-tokens not seen during training. An average-of-many-word-vectors can often work as an easy baseline for comparing multiword texts, but might also dilute some words' influence compared to other methods.
Things to consider might include:
Word2vec-related algorithms like FastText that also learn vectors for subword units, and can thus bootstrap not-so-bad guess vectors for words not seen in training. (But, these are also data hungry, and to use on a small dataset you'd again want to reduce vector size, increase epochs, and additionally shrink the number of buckets used for subword learning.)
More sophisticated comparisons of multi-word texts, like "Word Mover's Distance". (That can be quite expensive on longer texts, but for names/titles of just a few words may be practical.)
Finding more data that's compatible with your aims for a stronger model. A larger database of company names might help. If you just want your analysis to understand English words/roots, more generic training texts might work too.
For many purposes, a mere lexicographic comparison (edit distances, count of shared character n-grams) may be helpful too, though it won't detect all synonyms/semantically-similar words.
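As a rough standard-library sketch of that last lexicographic option (the 50/50 blend of the two scores is an arbitrary choice):

from difflib import SequenceMatcher

def char_ngrams(s, n=3):
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def lexical_similarity(a, b):
    ratio = SequenceMatcher(None, a, b).ratio()  # edit-distance-like score
    ga, gb = char_ngrams(a), char_ngrams(b)
    jaccard = len(ga & gb) / len(ga | gb) if ga | gb else 0.0
    return 0.5 * ratio + 0.5 * jaccard

print(lexical_similarity("biochemical", "biochem"))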
Word2vec does not generalize to unseen words.
It does not even work well for words that are seen but rare. It really depends on having many, many examples of word usage. Furthermore, you need enough context to the left and right, but you only use company names; these are too short. That is likely why your embeddings perform so poorly: too little data and too-short texts.
Hence, it is the wrong approach for you. Retraining the model with the new company name is not enough; you still only have one data point. You may as well leave out unseen words; word2vec cannot do better than that even if you retrain.
If you only want to compute similarity between words, you probably don't need to insert new words into your vocabulary.
I think you can also use fastText without the need to stem the words. It also computes vectors for unknown words.
From the fastText FAQ:
One of the key features of fastText word representation is its ability to produce vectors for any words, even made-up ones. Indeed, fastText word vectors are built from vectors of substrings of characters contained in it. This allows to build vectors even for misspelled words or concatenation of words.
FastText seems to be useful for your purpose.
For your task, you can follow the fastText supervised tutorial.
If your corpus proves to be too small, you can build your model starting from available pretrained vectors (pretrainedVectors parameter).
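As a small illustration using gensim's FastText implementation rather than the fastText command-line tool (gensim 3 parameter names; the sizes are guesses):

from gensim.models import FastText

# prepWords: the question's preprocessed token lists, one list per company name
ft_model = FastText(prepWords, size=32, window=2, min_count=1, sg=1, iter=200)

vec = ft_model.wv["petcares"]          # a vector is built from subwords even for an unseen word
print(ft_model.wv.most_similar("pet"))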
I am new to data science and am learning about imputation and model training. Below are a few queries that came up while training on datasets. Please provide answers to these.
Suppose I have a dataset with 1000 observations. One way, I train the model on the complete dataset in one go. Another way I did it: I divided my dataset into 80% and 20% and trained my model first on the 80% and then on the 20%. Is it the same or different? Basically, if I train my already-trained model on new data, what does it mean?
Imputing Related
Another question is related to imputing. Imagine I have a dataset of ship passengers, where only first-class passengers were given cabins. There is a column that holds cabin numbers (categorical), but very few observations have these cabin numbers. I know this column is important, so I cannot remove it, but because it has many missing values most of the algorithms do not work. How do I handle imputation for this type of column?
When imputing the validation data, do we impute with the same values that were used to impute the training data, or are the imputing values calculated again from the validation data itself?
How do I impute data in the form of a string like a ticket number (like A-123)? The column is important because the first letter tells the class of the passenger. Therefore, we cannot drop it.
Suppose I have a dataset with 1000 observations. One way, I train the model on the complete dataset in one go. Another way I did it: I divided my dataset into 80% and 20% and trained my model first on the 80% and then on the 20%. Is it the same or different?
It's hard to say whether it is good or not. Generally, if your data (splits) are taken from the same distribution, you can perform additional training. However, not all model types are good for it. I advise you to run some kind of cross-validation with an 80/20 split, checking the error measurement before and after the additional training.
Basically, if I train my already-trained model on new data, what does it mean?
If you take the datasets from the same distribution, you perform additional learning, which theoretically should have a positive influence on your model.
Imagine I have a dataset of ship passengers, where only first-class passengers were given cabins. There is a column that holds cabin numbers (categorical), but very few observations have these cabin numbers. I know this column is important, so I cannot remove it, but because it has many missing values most of the algorithms do not work. How do I handle imputation for this type of column?
You need to clearly understand what you want to achieve by imputation. If only first class has values, how can you perform imputation for the second or third class? What do you need to find: the deck? The cabin number? Do you want to find new values or impute from already existing values?
When imputing the validation data, do we impute with the same values that were used to impute the training data, or are the imputing values calculated again from the validation data itself?
Very generally, you run the imputation algorithm on the whole data you have (without the target column).
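For instance, with scikit-learn's SimpleImputer, following this "fit on all available feature rows" suggestion (toy numbers; note that many practitioners instead fit the imputer on the training split only):

import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, np.nan], [2.0, 4.0], [np.nan, 6.0]])  # toy features
X_valid = np.array([[np.nan, 5.0]])

# fit once on all available feature rows, then reuse the same fitted values for both splits
imputer = SimpleImputer(strategy="mean").fit(np.vstack([X_train, X_valid]))
X_train_imp = imputer.transform(X_train)
X_valid_imp = imputer.transform(X_valid)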
How do I impute data in the form of a string like a ticket number (like A-123)? The column is important because the first letter tells the class of the passenger. Therefore, we cannot drop it.
If you have a finite number of cases, you can just impute the values as strings. If not, perform feature engineering: try to predict the letter, the number, the first digit of the number, len(number), and so on.
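A sketch of that feature engineering in pandas (the column name "Ticket" and the toy values are assumptions):

import pandas as pd

df = pd.DataFrame({"Ticket": ["A-123", "B-4567", "A-9"]})  # toy data
parts = df["Ticket"].str.extract(r"(?P<prefix>[A-Za-z]+)-(?P<number>\d+)")
df["ticket_prefix"] = parts["prefix"]                       # the class letter
df["ticket_number"] = parts["number"].astype(int)
df["ticket_number_len"] = parts["number"].str.len()
df["ticket_first_digit"] = parts["number"].str[0].astype(int)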
I need to classify website text with zero or more categories/labels (5 labels such as finance, tech, etc.). My problem is handling text that doesn't fit any of these labels.
I tried ML libraries (MaxEnt, naive Bayes), but they incorrectly match "other" text with one of the labels. How do I train a model to handle the "other" text? The "other" label is so broad that it's not possible to pick a representative sample.
Since I have no ML background and don't have much time to build a good training set, I'd prefer a simpler approach like a term frequency count, using a predefined list of terms to match for each label. But with the counts, how do I determine a relevancy score, i.e. whether the text actually belongs to that label? I don't have a corpus and can't use tf-idf, etc.
Another idea is to use a neural network with a softmax output function. Softmax gives you a probability for every class: when the network is very confident about a class, it will give it a high probability and lower probabilities to the other classes, but if it is unsure, the differences between the probabilities will be small and none of them will be very high. What if you define a threshold, e.g.: if the probability for every class is less than 70%, predict "other"?
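A sketch of that thresholding rule (the label names and the 70% cut-off are assumptions):

import numpy as np

LABELS = ["finance", "tech", "sports", "health", "politics"]  # assumed labels

def predict_with_threshold(probs, threshold=0.7):
    # fall back to "other" when no class clears the threshold
    probs = np.asarray(probs)
    best = probs.argmax()
    return LABELS[best] if probs[best] >= threshold else "other"

print(predict_with_threshold([0.10, 0.80, 0.04, 0.03, 0.03]))  # -> "tech"
print(predict_with_threshold([0.30, 0.25, 0.20, 0.15, 0.10]))  # -> "other"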
Whew! Classic ML algorithms don't combine both multi-class classification and "in/out" detection at the same time. Perhaps what you could do is train five models, one for each class, in a one-against-the-world fashion. Then use an uber-model to look for any of those five claiming the input; if none claims it, it's "other".
Another possibility is to reverse the order of evaluation: train one model as a binary classifier on your entire data set, and train a second one as a 5-class SVM (for instance) within those five classes. The first model finds "other"; everything else gets passed to the second.
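A sketch of that two-stage setup with scikit-learn, on synthetic stand-in data:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))              # placeholder features
y_in_domain = rng.integers(0, 2, size=200)  # 1 = belongs to one of the 5 labels
y_label = rng.integers(0, 5, size=200)      # which of the 5 labels

gate = LogisticRegression().fit(X, y_in_domain)                  # finds "other"
svm = SVC().fit(X[y_in_domain == 1], y_label[y_in_domain == 1])  # 5-class stage

def classify(x):
    # first model finds "other"; everything else goes to the 5-class SVM
    if gate.predict([x])[0] == 0:
        return "other"
    return int(svm.predict([x])[0])

print(classify(X[0]))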
What about creating histograms? You could use a bag-of-words approach with significant indicators for, e.g., tech and finance. You could try to identify such indicators by analyzing a website's tags and articles, or just browse the web for such indicators:
http://finance.yahoo.com/news/most-common-words-tech-finance-205911943.html
Let's say your input vector X has n dimensions, where n represents the number of indicators. For example, X_i then holds the count of occurrences of the word "asset" and X_{i+k} the count of the word "big data" in the current article.
Instead of defining 5 labels, define 6. Your last category would be something like a "catch-all" category. That's actually your zero-match category.
If you must match the zero-or-more categories, train a model which returns probability scores (such as a neural net, as Luis Leal suggested) per label/class. You could then rate your output by that score and say that every class with a score higher than some threshold t is a matching category.
Try this NBayes implementation.
For identifying the "Other" category, don't bother much. Just train on your required categories, which clearly identifies them, and introduce a threshold in the classifier.
If the values for a label do not cross the threshold, then the classifier assigns the "Other" label.
It's all in the training data.
AWS Elasticsearch percolate would be ideal, but we can't use it due to the HTTP overhead of percolating documents individually.
Classifier4J appears to be the best solution for our needs because the model looks easy to train and it doesn't require training of non-matches.
http://classifier4j.sourceforge.net/usage.html
I am quite new to machine learning, but I am looking to solve the following problem. It is a kind of reverse prediction.
I have a lot of inputs and, for each record, one output. So I could easily do a classification and predict the output for an unknown new set of data.
The problem I would like to solve is: taking one expected outcome, find a description of the input data that ends up, with very high probability, at the expected, defined output.
To make the problem more complex, I would like the flexibility to define some input criteria that are probably not changeable (e.g. male/female), add these criteria like filters, and get a new reverse prediction: what would be the most relevant inputs, besides the given ones, to end up with the expected, defined outcome?
Let's give an example: I have thousands of records of students, including their education etc., and the information whether they earn normal or extreme money after 10 years of work experience. So as a new student I could predict whether I will earn a lot of money or an average amount, based on my education, gender, age at degree, what I am studying, and so on.
What I would like to get is: given the fact that I am male and have an expected age at the time of my degree, what should I study to have a high probability of earning extreme money?
This problem does not have a unique or optimal solution, though it can be tackled in several ways, IMO.
The key fact to understand is that there is a loss of information from the vector input to the scalar/categorical output. It is not an "invertible" or "reversible" transformation, because multiple, very different input vectors can lead to the same output value, thus diluting the information.
That said, one possible angle of attack would be to cluster your input vectors, obtaining several relevant clusters for every output value. Then you could extract those cluster centers and inspect the prototypical values that lead to the desired outcome. This way you will have your desired reverse "input points of interest".
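A sketch of that clustering idea with scikit-learn, on synthetic stand-in data (the cluster count is arbitrary):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))     # placeholder student features
y = rng.integers(0, 2, size=500)  # 1 = the desired "extreme earnings" outcome

X_target = X[y == 1]              # keep only rows with the desired outcome
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_target)
print(kmeans.cluster_centers_)    # prototypical inputs that lead to the outcome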