Using currently invalid input data for prediction purposes - machine-learning

Let's say we have some data (input) with which we want to predict some output. If the possible values that a specific input can take have changed over time, is it still appropriate to use all of the data?
Let me try to clarify with an example. Suppose that one of the inputs is a categorical variable that has the unique values [A, B, C] in the data, but we know for a fact that in the current setting in which we will ultimately make predictions, only the values [A, B] are possible.
Would it still be appropriate to use all of the data, or should all of the observations that include a C be excluded?

If C does not map uniquely to the target variable, but instead shares some target values with A and/or B, then leaving C in the dataset, knowing that it will definitely not occur in future inputs (i.e. where you predict for unseen inputs), will adjust the model's hypothesis (how much depends on the model; linear models are more prone to this), and the final hypothesis will consequently be based on redundant information.
In simple terms: the in-sample data does not represent the out-of-sample data, so the model will overfit and won't generalize.
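For illustration, here is a minimal pandas sketch of dropping the C observations before training; the frame and column names are made up for this example:

```python
import pandas as pd

# hypothetical training frame; "cat" is the categorical input with values A, B, C
df = pd.DataFrame({
    "cat": ["A", "B", "C", "A", "C", "B"],
    "x":   [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y":   [0, 1, 1, 0, 1, 0],
})

# keep only the category values that can actually occur at prediction time
valid_values = ["A", "B"]
train = df[df["cat"].isin(valid_values)]
print(train)
```

Whether dropping C actually helps should still be checked on a validation set drawn from the A/B-only regime.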

Related

Using awkward1.Array for BDT

I want to implement a boosted decision tree for my analysis. But the entries in my array are of varying length, so the array is not directly convertible into numpy or pandas.
Is there any way to use existing ML libraries with awkward array?
Your ML library might assume that the arrays are NumPy arrays and not recognize an ak.Array. That problem, in itself, is easily solved: call ak.to_numpy (or equivalently, cast it with np.asarray) to put it in a form the ML library expects. Incidentally, there's also ak.to_pandas to make a DataFrame in which variable-length nested lists are represented by a MultiIndex (with limitations: there has to be only one nested list, since a DataFrame has only one index).
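For example, a quick sketch of that relabelling with made-up toy arrays (a rectangular one for the NumPy case, a jagged one for the pandas case):

```python
import awkward as ak
import numpy as np

# a rectangular ak.Array: every inner list has the same length
rect = ak.Array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
as_np = ak.to_numpy(rect)        # or equivalently: np.asarray(rect)

# a jagged ak.Array can still become a DataFrame with a MultiIndex
jagged = ak.Array([[1.0, 2.0, 3.0], [4.0], [5.5, 6.6]])
df = ak.to_pandas(jagged)        # index levels: entry, subentry
print(as_np.shape)
print(df)
```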
The above is what I'd call a "branding" issue: the ML library just doesn't recognize the ak.Array "brand" of array, so we relabel it. But there's a more fundamental issue: does the ML algorithm in question intrinsically require rectilinear data? For instance, a feedforward neural network maps N-dimensional inputs to M-dimensional outputs; N and M can't be different for each input. This is a problem even if you're not using Awkward Array. In HEP, the old solution was to run variable-length data through a recurrent neural network (thus ignoring the boundaries between lists and imposing an irrelevant order on them) and the new solution seems to be graph neural networks (which is a more theoretically correct thing to do).
I've noticed that some ML libraries are introducing their own "jagged arrays," which are the minimum structure that Awkward Array provides: TensorFlow has RaggedTensors and PyTorch is getting NestedTensors. I don't know to what degree these data types have been integrated into the ML algorithms, though. If they have been, then Awkward Array ought to get an ak.to_tensorflow and ak.to_pytorch to complement ak.to_numpy and ak.to_pandas, as a way to preserve jaggedness when sending data to these libraries. Hopefully, they'll be able to use that jaggedness in their ML algorithms! (Otherwise, what's the point? But I haven't been following these developments closely.)
You're interested in boosted decision trees (BDTs). I can't think of how a decision tree model, boosted or not, could be adapted to different-length inputs... Or maybe I can: the nodes of a decision tree choose which subtree to pass the data down to based on the value of one index in the N-dimensional input. That doesn't imply there's a maximum index value N, though a particular tree would have a set of indexes that it splits on, and there would be some maximum of that set (because the tree is finite!). Applying a tree that wants to split on index k to an input with n < k elements would require a contingency for how to split anyway, but there are already methods for applying decision trees to datasets with missing values. An input datum with n elements could be treated as an input for which indexes greater than n are considered missing values. To train such a BDT, you'd have to give it inputs with missing values beyond each list's maximum element.
In Awkward Array, the function for that is ak.pad_none. If you know the maximum list length in your sample (ak.num and ak.max), you can pad the whole array so that all lists have the same length, with missing values at the end. If you set clip=True, the resulting array type is "regular": it no longer considers the possibility that a list can have a length different from the chosen length. If you pass such an array to ak.to_numpy (and not np.asarray), it becomes a NumPy masked array, which a BDT algorithm that expects missing values should be able to recognize.
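A minimal sketch of that padding-and-converting step, using a made-up toy array:

```python
import awkward as ak

# toy jagged array standing in for variable-length event data
array = ak.Array([[1.1, 2.2, 3.3], [4.4], [5.5, 6.6]])

max_len = int(ak.max(ak.num(array)))              # longest list in the sample
padded = ak.pad_none(array, max_len, clip=True)   # regular type, None-filled at the end

masked = ak.to_numpy(padded)   # NumPy masked array; the padding entries are masked
print(masked)
```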
The only problem with this plan is that padding every list to the same length as the longest list uses more memory. If the BDT algorithm were aware of jaggedness (the way that TensorFlow is, and PyTorch soon will be, aware of jaggedness), then it should be able to build these trees and apply them to data without the memory-padding step. I don't know if there are any such BDT implementations out there, but if someone wants to write a "BDT with missing values that accepts jagged arrays," I'd be happy to help them get it set up with Awkward Arrays!

Sequence learning with discrete/categorical output available to the objective function

Please bear with me; I am still quite new to deep learning, so what I am looking for may or may not make a lot of sense. But this is why I ask: I need some guidance on where to find a guiding example or paper.
Here is what I want and have:
I am interested in sequence data (time-series actually) hence I am using RNN/LSTM/GRU and those types of things
Now suppose I have a 1-D time-series, let's call it X = [x_1, ..., x_n]
For my particular problem, it turns out that X is the output of a generator function, if you will, s.t. X = f(a,b)
That function takes two integer parameters a and b
Here is my problem: I want to find the value of a and b that best reconstructs my time-series X (assume that I can generate time-series with f(a,b))
This leads me to believe that I must include the actual values of the network output, i.e. a and b, in my objective function
My objective function could be something like objectiveFunction(X_true, X_pred), but then my X_pred is generated from f with parameters a and b
Further, the batch size may need to be the whole time-series (they are small, and I have many examples), but we can use big mini-batches if need be.
Suppose the search space over a and b is [0,10] for both (again, a and b can only take integer values). Then I have 100 pairs of (a, b) values, e.g. (6,7). As I train the network, I expect the weights of the edges leading from the Dense layer to the categorical outputs (i.e. my parameter pairs (a,b)) to be maximised as the network finds better and better time-series generator parameters a and b.
Initially, I was just going to test a network structure as such:
Input -->> RNN -->> Dense -->> Categorical output
I want to keep it simple for now and not try anything fancier such as an LSTM as my time-series only have short-term dependencies.
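For concreteness, here is a minimal Keras sketch of the structure above, assuming a and b each range over the integers 0-10 (11 classes each) and a made-up sequence length; the reconstruction loss through f(a,b), which is the part I am unsure about, is not included:

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_VALUES = 11   # a and b are integers in [0, 10]
SEQ_LEN = 100     # made-up length of the 1-D time-series X

inputs = tf.keras.Input(shape=(SEQ_LEN, 1))
h = layers.SimpleRNN(32)(inputs)              # plain RNN, no LSTM for now
h = layers.Dense(64, activation="relu")(h)

# one categorical head per generator parameter
a_out = layers.Dense(NUM_VALUES, activation="softmax", name="a")(h)
b_out = layers.Dense(NUM_VALUES, activation="softmax", name="b")(h)

model = tf.keras.Model(inputs, [a_out, b_out])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```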
Hence, any advice would be most welcome. Thanks in advance.

consistency among different training pairs of skip-gram model

Regarding the skip-gram model: for training purposes, the input is a word (one-hot representation) and the outputs are its context words (multiple one-hot representations), for example (A,B), (A,C), (A,D).
My question is: when we run the training process, do we run the model pair by pair, or do we feed in [A, B|C|D] all together?
Another question is about the word vector matrix "M" (the matrix between the input and hidden layers). Since the input is one-hot, the result of multiplying the input (size |V|) by M is a vector of the hidden-layer size, which is a row of the word vector matrix. My question is: when we run back propagation, it seems that only that row of the word vector matrix is updated.
Is this true?
If that is the case, and suppose we train the model pair by pair (A,B), (A,C), (A,D): how do we keep consistency among the back propagation updates of the different pairs? For example, once pair (A,B) is done, the corresponding row of the word vector matrix is updated, and through that update the error for (A,B) becomes smaller. Then we run pair (A,C); the same row is picked and updated through back propagation, and this time the error becomes smaller for (A,C). But the correction for (A,B) will be erased, which means the back propagation of (A,B) is discarded. Is my understanding correct here?
Thanks
You can think of it as presenting the pairs individually, as training examples: (A, B), (A, C), (A, D). That's how the algorithm is defined. (However, some implementations might in practice be able to do them as a batch.)
Note that 'outputs' are only "one-hot" when using negative-sampling. (In hierarchical-softmax, predicting one word requires a variable, Huffman-coded set of output nodes to have specific 0/1 values.)
Yes, in skip-gram, each training example (such as "A predicts B") only results in a single word's vector ("row of word vector matrix") being updated, since only one word is involved.
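As a toy NumPy sketch of that single-row update (negative-sampling style, with made-up sizes and indices), you can verify that only the centre word's row of the input matrix changes:

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 10, 4                                  # toy vocabulary and hidden sizes
W_in = rng.normal(scale=0.1, size=(V, N))     # the "word vector matrix" M
W_out = rng.normal(scale=0.1, size=(V, N))    # output-side weights

def sgns_step(center, context, negatives, lr=0.025):
    """One skip-gram negative-sampling update for a single (center, context) pair."""
    v = W_in[center].copy()
    grad_v = np.zeros(N)
    for word, label in [(context, 1.0)] + [(w, 0.0) for w in negatives]:
        u = W_out[word].copy()
        score = 1.0 / (1.0 + np.exp(-v @ u))   # sigmoid of the dot product
        g = score - label
        grad_v += g * u                        # accumulate gradient for the input row
        W_out[word] = u - lr * g * v
    W_in[center] = v - lr * grad_v             # only this one row of W_in changes

before = W_in.copy()
sgns_step(center=0, context=3, negatives=[5, 7])     # the pair (A, B), say
print(np.where(np.any(W_in != before, axis=1))[0])   # -> [0]
```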
Note that while the paper describes using a focus word to predict its nearby words, the released Google word2vec.c code (and other implementations modeled on it) as a matter of implementation uses context words to predict the focus word. Essentially it's just a different way of structuring the loops (and thus ordering the training pair examples), which I believe they did for slightly better CPU cache efficiency.
In your scenario, the subsequent backprop for the (A, C) example does not necessarily "erase" the prior (A, B) backprop – depending on the out-representations of B & C, it might reinforce it (in some ways/directions/dimensions), or weaken it (in other ways/directions/dimensions). It is essentially that repeated tug-of-war, between all the diverse examples, that eventually creates an arrangement of weights which somehow reflect some useful similarities of A, B, and C.
The best way to rigorously understand what is happening is to review the source code of implementations, such as the original Google word2vec.c or the Python implementation in gensim.

How to explain feature importance after one-hot encode used for decision tree

I know that decision trees have a feature_importances_ attribute, calculated from Gini impurity, which can be used to check which features are more important.
However, the implementations in scikit-learn or Spark only accept numeric attributes, so I have to convert string attributes to numeric and then one-hot encode them. When the features are put into the decision tree model, they are 0-1 encoded rather than in their original format. My question is: how do I explain feature importance for the original attributes? Should I avoid one-hot encoding when trying to explain feature importance?
Thanks.
Conceptually, you may want to use something along the lines of permutation importance. The basic idea is that you take your original dataset and randomly shuffle the values of each column, one at a time. Then you score the perturbed data with the model and compare the performance to the original performance. Done one column at a time, this lets you assess the performance hit you take by destroying each variable, indexing it to the variable with the largest loss (which would become 1, or 100%). If you can do this on your original dataset, prior to the one-hot encoding, you'll get an importance measure that groups the encoded columns together.
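As a rough scikit-learn sketch of that grouping idea (the data and column names are invented), permute each raw column before a one-hot encoding that lives inside the pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue"], size=500),  # string attribute
    "size": rng.normal(size=500),                             # numeric attribute
})
y = (X["color"] == "red") ^ (X["size"] > 0)

# the one-hot encoding happens inside the pipeline, so we can permute the raw column
model = make_pipeline(
    ColumnTransformer([("ohe", OneHotEncoder(), ["color"])], remainder="passthrough"),
    DecisionTreeClassifier(random_state=0),
).fit(X, y)

baseline = accuracy_score(y, model.predict(X))
for col in X.columns:
    X_perm = X.copy()
    X_perm[col] = rng.permutation(X_perm[col].to_numpy())
    drop = baseline - accuracy_score(y, model.predict(X_perm))
    print(f"{col}: performance drop {drop:.3f}")
```

Ideally the scoring would be done on held-out data rather than the training set; this is just to show the mechanics.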

How to select and use features of varying datatypes?

I'm a complete newbie to machine learning, and while I have some scikit-learn classifiers "working" on my dataset, I'm not sure if I'm using them correctly. I'm doing supervised learning with a hand-labeled training set.
The problem is: each item in my dataset is a dictionary with approx. 80 keys that are either text, boolean, or integers, which I want to use as features. I have about 40,000 items and have hand-labeled about 800 of them. Am I meant to select, for example, only boolean features, or only integers? Do I need to normalize the features (remove the mean and scale to unit variance)? I'm not even going to attempt analysis of the text yet, so it may be worth not giving those features to the classifier at all. Would it be dumb to just try various permutations/combinations of features of the same type (ints)? It could also be that I'm approaching my dataset completely wrong... it's shaped like this:
[ [a, b, c, ...], [a, b, c, ...], [a, b, c, ...], ...]
Essentially what I hope to achieve is a binary classification of each item in the dataset, basically just "Good" or "Bad" according to what I've hand labeled. I read that some classifiers work better on different data types, like Bernoulli Naive Bayes, and K Nearest Neighbors works when the "decision boundary is very irregular".
Ultimately I want a comparison of classifier accuracy across several different algorithms, in addition to hopefully isolating one that is actually accurate for classifying my data...
All classifiers in scikit-learn require numeric data. Boolean features are fine; for integer features, it depends on whether they encode categorical, ordinal, or numeric data.
The preprocessing you need to do depends on the type of feature, not on whether you want to combine them. Combining them is probably a good idea.
You can do a simple transformation for the text data using CountVectorizer or TfidfVectorizer.
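For instance, a rough sketch of combining those feature types in one scikit-learn pipeline (all column names and data here are invented):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "description": ["good item", "bad item", "great item", "broken item"],  # text
    "in_stock": [True, False, True, False],                                 # boolean
    "category_id": [1, 2, 1, 3],    # integer, but really categorical
    "num_views": [120, 5, 300, 2],  # integer, genuinely numeric
    "label": ["Good", "Bad", "Good", "Bad"],
})

preprocess = ColumnTransformer([
    ("text", TfidfVectorizer(), "description"),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["category_id"]),
    ("num", StandardScaler(), ["num_views"]),
    ("bool", "passthrough", ["in_stock"]),
])

clf = make_pipeline(preprocess, LogisticRegression())
clf.fit(df.drop(columns="label"), df["label"])
print(clf.predict(df.drop(columns="label")))
```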
