I have a series of tri-axial accelerometer data of dimension (N, 1000, 3), where N is the number of instances, 1000 is the length of the acceleration signal (i.e. 10 seconds sampled at 100 Hz) and 3 is the number of axes (X, Y and Z). The data is divided into two classes, A and B, where A accounts for 95% of the data; in total I have just under 3000 instances of class B. The aim of my project is to create a model to detect class B.
I have been creating a number of machine learning models (decision trees, boosted models, etc.) with features obtained via signal processing and statistics (e.g. standard deviation, mean, magnitude, area under the curve). These models perform well, but they seem to miss a number of real-world events that I can distinguish by eye. This led me to believe that my features are missing key components of the classes. I've been going down the rabbit hole of signal processing, but to date nothing has produced that Eureka moment.
Now, I am no expert in deep learning, but combining the data into a single axis (i.e. taking the magnitude) gave promising results (i.e. just as good as the current models). However, taking the magnitude removes information. So I was wondering if there is a way to use deep learning to 1. select features from the individual axes and 2. use these as input for another deep learner to perform the classification. Something like this:
My simple view of a multi-axis deep learner: the individual axes (i.e. X, Y and Z) are fed into separate deep learners and their outputs are then fed into a single deep learner.
Apologies for the wall of text and lack of examples; I'm not allowed to share the data, and I'm only looking for guidance on whether deep learning can be of help. Thanks for taking the time to read my post.
Since there are no specifics in the question, the answer can only be given in general terms.
If the magnitude gives good results, you can feed X, Y, Z and the magnitude into a single deep learner as 4 inputs.
In this case, your deep learner will be able to use a) features of the separate axes, b) the combined single-axis (magnitude) signal, and c) the relationships between the axes.
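To make the idea concrete, here is a minimal sketch using Keras (assuming TensorFlow is available). The branch architecture, layer sizes and kernel widths are placeholder assumptions rather than recommendations, and the four channels would be sliced out of the (N, 1000, 3) array plus its magnitude:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def branch(inp, name):
    # small 1-D CNN that learns features from a single channel
    x = layers.Conv1D(16, kernel_size=9, activation="relu", name=f"{name}_conv1")(inp)
    x = layers.MaxPooling1D(pool_size=4)(x)
    x = layers.Conv1D(32, kernel_size=9, activation="relu", name=f"{name}_conv2")(x)
    return layers.GlobalAveragePooling1D()(x)

# four inputs: X, Y, Z and the magnitude, each 1000 samples long
names = ("x", "y", "z", "mag")
inputs = [layers.Input(shape=(1000, 1), name=n) for n in names]
features = [branch(inp, n) for inp, n in zip(inputs, names)]

# the combining learner sees all per-channel features at once
merged = layers.concatenate(features)
merged = layers.Dense(32, activation="relu")(merged)
out = layers.Dense(1, activation="sigmoid")(merged)   # P(class B)

model = Model(inputs=inputs, outputs=out)
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
model.summary()
```

Training would then look something like model.fit([X[..., 0:1], X[..., 1:2], X[..., 2:3], magnitude[..., None]], labels, ...), and with a 95/5 class split you would almost certainly want class weights or oversampling as well.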
Related
Given n samples with d features of stock A, we can build a (d+1)-dimensional linear model to predict the profit. However, in some books I found that if we have m different stocks, each with n samples and d features, then these data are merged to get m*n samples with d features, and a single (d+1)-dimensional linear model is built to predict the profit.
My confusion is that different stocks usually have little connection with each other, and their profits are influenced by different factors and environments, so why can they be merged to build a single model?
If you are using R as your tool of choice, you might like the time series embedding howto and its appendix -- the mathematics behind that is Takens's theorem:
[Takens's theorem gives] conditions under which a chaotic dynamical system can be reconstructed from a sequence of observations of the state of a dynamical system.
It looks to me as if the statements you quote relate to exactly this theorem: for d features (we are lucky if we know that number - we usually don't), we need d+1 dimensions.
If more time series are to be predicted, we can use the same embedding space as long as the features are the same. The dimensions d are usually simple variables (e.g. temperature for different energy commodity stocks) - this example helped me to intuitively grasp the idea.
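As a rough illustration of a delay embedding in Python (not the R howto mentioned above; the toy series, the choice of d = 3 and the lag are assumptions for demonstration only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def delay_embed(x, dim, tau=1):
    # build the dim-dimensional delay-coordinate matrix of a 1-D series x
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

# toy series standing in for one commodity price
t = np.arange(600)
x = np.sin(0.07 * t) + 0.05 * np.random.default_rng(0).normal(size=t.size)

d = 3                                  # assumed number of features -> d + 1 embedding dimensions
E = delay_embed(x, dim=d + 1)          # columns: x(t), x(t+1), ..., x(t+d)
X_embed, y_next = E[:, :-1], E[:, -1]  # predict the last coordinate from the previous d

model = LinearRegression().fit(X_embed, y_next)
print("in-sample R^2:", model.score(X_embed, y_next))
```

A second series with the same features could be embedded with the same dim and tau and stacked row-wise, which is exactly the merging described in the question.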
Further reading
Forecasting with Embeddings
I am studying principal component analysis, and I have just learnt that before applying PCA to the data samples, we have to apply two preprocessing steps: mean normalization and feature scaling. However, I have no idea what mean normalization is or how it can be implemented.
At first I searched for it; however, I could not find an instructive explanation. Can anyone explain what mean normalization is and how it can be implemented?
Assume there is a dataset with d features (columns) and n observations (rows). For simplicity's sake let's consider d = 2 and n = 100, which means your dataset has 2 features and 100 observations.
In other words, your dataset is a 2-dimensional array with 100 rows and 2 columns (100x2).
Initially, when you visualize it, you can see that the points are scattered in two dimensions.
When you standardize the dataset and visualize it again, you can see that all the points have shifted towards the origin. In other words, each feature now has a mean of 0 and a standard deviation of 1. This process is called standardization.
How do you standardize?
It's pretty simple; the formula is straightforward:
z = (X - u) / s
Where,
X - an observation in the feature column
u - mean of the feature column
s - standard deviation of the feature column
Note: You have to apply standardization to every feature (column) in the dataset.
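A minimal sketch of that formula in Python, alongside scikit-learn's StandardScaler (linked below); the toy data is invented for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy dataset: n = 100 observations, d = 2 features
rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, -3.0], scale=[5.0, 0.5], size=(100, 2))

# manual standardization: z = (X - u) / s, applied column by column
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# the same thing via scikit-learn
z_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(z_manual, z_sklearn))               # True
print(z_sklearn.mean(axis=0), z_sklearn.std(axis=0))  # ~[0, 0] and [1, 1]
```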
Reference:
https://machinelearningmastery.com/normalize-standardize-machine-learning-data-weka/
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
I used the caret package to train a random forest, including repeated cross-validation. I'd like to know whether the OOB error, as in the original RF by Breiman, is used or whether it is replaced by the cross-validation. If it is replaced, do I get the same advantages as described in Breiman (2001), such as increased accuracy by reducing the correlation between input data? As OOB samples are drawn with replacement and CV folds are drawn without replacement, are the two procedures comparable? What is the OOB estimate of the error rate (based on CV)?
How are the trees grown? Is CART used?
As this is my first thread, please let me know if you need more details. Many thanks in advance.
There are a lot of basic questions here and you would be better served by reading a book on machine learning or predictive modeling. That's probably why you haven't gotten much of a response.
For caret you should also consult the package website where some of these questions are answered.
Here are some notes:
CV and OOB estimation for RF are somewhat different. This post might help explain how. For this application, the OOB rate from random forest is computed while the model is being built, whereas CV uses holdout samples that are predicted after the random forest model is computed.
The original random forest model (used here) uses unpruned CART trees. Again, this is in many textbooks and papers.
Max
I recently got a little confused with this too, but reading chapter 4 in Applied Predictive Modeling by Max Kuhn helped me to understand the difference.
If you use randomForest in R, you grow a number of decision trees by sampling N cases with replacement (N is the number of cases in the training set). You then sample m variables at each node where m is less than the number of predictors. Each tree is then grown fully and terminal nodes are assigned to a class based on the mode of cases in that node. New cases are classified by sending them down all the trees and then taking a vote; the majority vote wins.
The key points to note here are:
how the trees are grown - sampling WITH replacement (a bootstrap). This means that some cases will be represented many times in your bootstrap sample and others may not be represented at all. The bootstrap sample will be the same size as your training dataset.
The cases that are not selected for building a given tree are referred to as the OOB samples; an OOB error estimate is calculated by classifying the cases that weren't selected when building each tree. About 63% of the training cases appear at least once in each bootstrap sample; the remaining ~37% form the OOB set (see the sketch below).
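A quick simulation of that 63% figure (a stand-alone Python illustration, not part of randomForest or caret):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000          # size of the training set
n_trees = 500     # number of bootstrap samples, one per tree

frac_in_bag = []
for _ in range(n_trees):
    boot = rng.integers(0, n, size=n)      # draw n cases WITH replacement
    in_bag = np.unique(boot)               # cases that appear at least once
    frac_in_bag.append(in_bag.size / n)    # everything else is OOB for this tree

print(np.mean(frac_in_bag))   # ~0.632: about 63% in-bag, ~37% out-of-bag
```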
If you use caret in R, you will normally use caret::train(....), specify the method as "rf" and use trControl with method = "repeatedcv". You can change this to "oob" if you want out-of-bag estimation. The way this works is as follows (I'm going to use the simple example of a 10-fold CV repeated 5 times): the training dataset is split into 10 folds of roughly equal size, and a forest is built using only 9 of the folds, omitting the 1st fold (which is held out). The held-out fold is predicted by running its cases through the trees and is used to estimate performance measures. The first fold is then returned to the training set and the procedure repeats with the 2nd fold held out, and so on, until the process has been repeated 10 times. This whole procedure can be repeated multiple times (in my example, 5 times); for each of the 5 runs, the training dataset is split into 10 slightly different folds. In total, 50 different held-out samples are used to estimate model efficacy.
The key points to note are:
this involves sampling WITHOUT replacement - you split the training data, build a model on 9 of the folds, predict the held-out fold (the remaining 1 of the 10) and repeat this process as above
the model is built using a dataset that is smaller than the training dataset; this is different to the bootstrap method discussed above
You are using 2 different resampling techniques, which will yield different results, so they are not directly comparable. Repeated k-fold CV tends to have low bias for large k; when k is 2 or 3, the bias is higher and comparable to that of the bootstrap method. K-fold CV tends to have higher variance, though...
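The thread is about caret in R, but the distinction between the two estimates is easy to demonstrate with an analogous Python/scikit-learn sketch (the dataset and settings are invented for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=1)

# OOB estimate: computed "for free" while the single forest is being built
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=1)
rf.fit(X, y)
print("OOB accuracy:", rf.oob_score_)

# repeated 10-fold CV: 5 repeats -> 50 held-out folds, each predicted by a freshly fitted forest
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=1)
scores = cross_val_score(RandomForestClassifier(n_estimators=500, random_state=1),
                         X, y, cv=cv)
print("repeated CV accuracy:", scores.mean())
```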
Description of classification problem:
Assume a regular dataset X with n samples and d features.
This classification problem is somewhat hard (many features, few samples, low overall AUC ~70%).
It might be useful to mention that feature selection/extraction, dimension reduction, kernels and many classifiers have already been applied, so I am not interested in trying these.
I am not looking for an improvement in the overall AUC. The goal is to find relevant features in a haystack of features.
Description of my approach:
1. I select all pairwise combinations of the d features and create many two-dimensional sub-datasets x, each with n samples.
2. On each sub-dataset x, I perform a 10-fold cross-validation (using all samples of the main dataset X). This is a very long process; assume weeks of computation.
3. I select the top k pairs (according to highest AUC, for example) and label them as +. All other pairs are labeled as -.
4. For each pair, I can compute several properties (e.g. relations within the pair derived from expert knowledge). These properties can be calculated without using the labels in the main dataset X.
5. Now I have pairs labeled as + or -, and each pair has many properties computed from expert knowledge (i.e. features). Hence, I have a new classification problem. Let's call this newly generated dataset Y.
6. I train a classifier on Y while following cross-validation rules. Surprisingly, I can predict the + and - labels with 90% AUC.
As far as I can see, this means that I am able to select relevant features. However, seeing a 90% AUC makes me worried about information leakage somewhere in this long process, especially in step 3.
I was wondering if anyone can see any leakage in this approach.
Information leakage:
Incorporation of the target labels in the actual features: your classifier will produce good predictions without having learned anything.
Showing your test set to your classifier during the training phase: your classifier will "memorize" the test set and its corresponding labels without "learning" anything.
Update 1:
I want to stress that I am indeed using all data points of X in step 1. However, I never use them again (not even for testing). The final 90% AUC is obtained from predicting the labels of dataset Y.
On the other hand, it is worth noting that even if I randomize the values of my main dataset X, the computed features for dataset Y will stay the same. However, the sample labels in Y would change, because the previously + pairs might no longer be good ones and would therefore be labeled as -.
Update 2:
Although I haven't received any opinions yet, I am going to state what I learned during 4 days of talking with pattern recognition researchers. Briefly, I became confident that there is no information leakage (as long as I don't go back to the first dataset X and use its labels). Later on, if I wanted to check whether I could get better performance on X itself (i.e. predicting sample labels), I would need to use only a part of dataset X for the pairwise comparison (as a training set). Then I could use the remaining samples of X as a test set, while using the positively predicted pairs of Y as features.
I will post this as an answer if no one can refute this method.
If your process in step 1 uses all of the data, then the features you are learning contain information from the whole dataset. Since you selected based on the whole dataset and THEN validated, you are leaking serious information.
You should probably stick with tools that are well known / already done for you before running out and trying unusual strategies like this. Try using a model with L1 regularization to do feature selection for you, or start with one of the simpler searches like Sequential Backward Selection.
If you do cross-validation correctly, each training run performs its own independent feature selection. If you do one global feature selection and then do CV, you are doing it wrong and probably leaking information (see the sketch below).
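A hedged sketch of that last point in Python/scikit-learn, using pure-noise data so that any AUC well above 0.5 can only come from leakage (the selector, classifier and sizes are arbitrary choices):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))        # many features, few samples
y = rng.integers(0, 2, size=100)        # labels are pure noise

# WRONG: select features on the whole dataset, THEN cross-validate
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_sel, y,
                        cv=10, scoring="roc_auc")
print("global selection then CV:", leaky.mean())   # optimistically high

# RIGHT: selection is refitted inside every training fold
pipe = Pipeline([("select", SelectKBest(f_classif, k=20)),
                 ("clf", LogisticRegression(max_iter=1000))])
honest = cross_val_score(pipe, X, y, cv=10, scoring="roc_auc")
print("selection inside CV:", honest.mean())       # ~0.5, as it should be
```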
I have been working through the concepts of principal component analysis in R.
I am comfortable with applying PCA to a (say, labeled) dataset and ultimately extracting out the most interesting first few principal components as numeric variables from my matrix.
The ultimate question is, in a sense, now what? Most of the reading I've come across on PCA immediately halts after the computations are done, especially with regards to machine learning. Pardon my hyperbole, but I feel as if everyone agrees that the technique is useful, but nobody wants to actually use it after they do it.
More specifically, here's my real question:
I understand that principal components are linear combinations of the variables you started with. So, how does this transformed data play a role in supervised machine learning? How could someone use PCA as a way to reduce the dimensionality of a dataset, and THEN use these components with a supervised learner, say, an SVM?
I'm absolutely confused about what happens to our labels. Once we are in eigenspace, great. But I don't see any way to continue to move forward with machine learning if this transformation blows apart our concept of classification (unless there's some linear combination of "Yes" or "No" I haven't come across!)
Please step in and set me straight if you have the time and wherewithal. Thanks in advance.
Old question, but I don't think it's been satisfactorily answered (and I just landed here myself through Google). I found myself in your same shoes and had to hunt down the answer myself.
The goal of PCA is to represent your data X (an n x p matrix) in an orthonormal basis W; the coordinates of your data in this new basis are Z, as expressed below:
Z = X W
Because of orthonormality, we can invert W simply by transposing it and write:
X = Z W^T
Now, to reduce dimensionality, let's pick some number of components k < p. Assuming our basis vectors in W are ordered from largest to smallest eigenvalue (i.e., the eigenvector corresponding to the largest eigenvalue comes first, etc.), this amounts to simply keeping the first k columns of W; call this p x k matrix W_k.
This gives a k-dimensional representation Z = X W_k of our training data X; the training labels y are untouched by the transformation. You then run some supervised classifier using the new features in Z together with y.
The key is to realize that W_k is in some sense a canonical transformation from our space of p features down to a space of k features (or at least the best such transformation we could find using our training data). Thus, we can hit our test data with the same W_k transformation, resulting in a k-dimensional set of test features:
Z_test = X_test W_k
We can now use the same classifier, trained on the k-dimensional representation Z of our training data, to make predictions on the k-dimensional representation Z_test of our test data.
The point of going through this whole procedure is that you may have thousands of features, but (1) not all of them carry a meaningful signal and (2) your supervised learning method may be far too complex to train on the full feature set (either it would take too long or your computer wouldn't have enough memory to process the calculations). PCA allows you to dramatically reduce the number of features it takes to represent your data without eliminating the features of your data that truly add value.
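A minimal sketch of that workflow in Python/scikit-learn (the digits dataset and k = 20 components are arbitrary choices for illustration):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)              # p = 64 features, labels 0-9
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# PCA learns W_k from the training data only; the labels y are never touched,
# they simply stay attached to their (now k-dimensional) rows.
pipe = Pipeline([("pca", PCA(n_components=20)),  # Z = X W_k with k = 20
                 ("svm", SVC())])
pipe.fit(X_train, y_train)                       # fit PCA on the training set, then SVM on Z
print(pipe.score(X_test, y_test))                # test data is projected with the SAME W_k
```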
After you have used PCA on a portion of your data to compute the transformation matrix, you apply that matrix to each of your data points before submitting them to your classifier.
This is useful when the intrinsic dimensionality of your data is much smaller than the number of original features, and when the gain in performance you get during classification is worth the loss in accuracy and the cost of PCA. Also, keep in mind the limitations of PCA:
In performing a linear transformation, you implicitly assume that all dimensions are expressed in comparable units.
Beyond variance, PCA is blind to the structure of your data. It may very well happen that the classes are separated along low-variance dimensions; in that case, a classifier trained on the transformed (truncated) data won't be able to learn the separation (see the sketch below).
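A small sketch of that failure mode (toy data invented for illustration): the classes are perfectly separable along a tiny-variance feature, and keeping only the first principal component throws that direction away.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)
X = np.column_stack([
    rng.normal(scale=10.0, size=n),            # huge variance, no class information
    0.5 * y + rng.normal(scale=0.05, size=n),  # tiny variance, carries the label
])

clf = LogisticRegression()
print("all features:", cross_val_score(clf, X, y, cv=5).mean())       # ~1.0

X_pc1 = PCA(n_components=1).fit_transform(X)   # keeps (essentially) the high-variance axis
print("first PC only:", cross_val_score(clf, X_pc1, y, cv=5).mean())  # ~0.5, i.e. chance
```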