Training data and testing data from the same sensor - machine-learning

When using learning methods, we have training and testing data.
I'd like to confirm:
1) whether the training data and testing data must be captured from the same sensor; 2) what happens if they are from different sensors;
3) if they must be captured from the same sensor, whether there are methods to make the data uniform even when they are not from the same sensor.
Thank you.

Yes, you would need both training and test data from the same sensor because of measurement error and detection bias specific to that sensor. If the test data came from a sensor that is always different from the sensor the training data came from, you could have total system failure. Each sensor has its own precision, bias, detection limits, and so on, so data from that sensor has to be distributed across both training and testing.
The point of splitting into training and testing sets is not so much what you are thinking about, but rather that the objects used in testing were never used in training; otherwise you introduce selection bias. You can certainly have objects from the same sensor in both the training and the testing set, as sketched below.
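As a rough illustration of that last point, here is a minimal sketch (the `sensor_id`, `reading`, and `label` columns are made up for the example) of splitting so that every sensor contributes measurements to both the training and the test set, while no single measurement appears in both:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: each row is one measurement, tagged with the sensor it came from.
df = pd.DataFrame({
    "sensor_id": ["A"] * 50 + ["B"] * 50,
    "reading":   list(range(100)),
    "label":     [0, 1] * 50,
})

# Stratify on sensor_id so each sensor's measurements are distributed
# across both training and testing.
train_df, test_df = train_test_split(
    df, test_size=0.3, stratify=df["sensor_id"], random_state=0
)

print(train_df["sensor_id"].value_counts())
print(test_df["sensor_id"].value_counts())
```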
If, however, the measurement wavelength or angle (pitch) of each sensor is different, then you are dealing more with a problem requiring Multiple Signal Classification (MUSIC) or Pisarenko harmonic decomposition.


What is meant by stability in relation to neural networks

I hear the terms stability/instability thrown around a lot when reading up on Deep Q Networks. I understand that stability is improved with the addition of a target network and a replay buffer, but I fail to understand exactly what it's referring to.
What would the loss graph look like for an unstable vs. a stable neural network?
What does it mean when a neural network converges/diverges?
Stability, also known as algorithmic stability, is a notion in computational learning theory of how a machine learning algorithm is perturbed by small changes to its inputs. A stable learning algorithm is one for which the prediction does not change much when the training data is modified slightly.
Here, stability means: suppose you have 1000 training examples that you use to train the model and it performs well. In terms of model stability, if you train the same model with 900 training examples, the model should still perform well; that is why it is also called algorithmic stability.
As for the loss graph, if the model is stable the loss curves should look roughly the same for both training set sizes (1000 and 900), and different for an unstable model.
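A minimal sketch of that idea, assuming a synthetic dataset and a scikit-learn classifier (any model would do): train on 1000 samples, retrain on a 900-sample subset, and compare the held-out scores.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the 1000 training examples in the text.
X, y = make_classification(n_samples=1250, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=1000, random_state=0
)

# Train on all 1000 samples, then on a 900-sample subset.
full = LogisticRegression(max_iter=1000).fit(X_train, y_train)
reduced = LogisticRegression(max_iter=1000).fit(X_train[:900], y_train[:900])

# For a stable algorithm the two held-out scores should be close.
print("score with 1000 samples:", full.score(X_test, y_test))
print("score with  900 samples:", reduced.score(X_test, y_test))
```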
In machine learning we want to minimize loss, so when we say a model converges we mean that the model's loss value is within an acceptable margin and the model has reached a stage where additional training would not improve it.
Divergence, by contrast, is a non-symmetric measure of the difference between continuous distributions. For example, to quantify the difference between two distributions you would use a divergence rather than a traditional symmetric metric such as a distance.
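For instance, the Kullback-Leibler divergence (one common divergence, used here purely as an illustration of the non-symmetry) gives different values depending on argument order:

```python
from scipy.stats import entropy

p = [0.1, 0.4, 0.5]
q = [0.8, 0.15, 0.05]

# entropy(p, q) computes KL(p || q); swapping the arguments gives a
# different value, which is what "non-symmetric" means here.
print(entropy(p, q))  # KL(p || q)
print(entropy(q, p))  # KL(q || p)
```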

Unsupervised Learning for regression analysis

I am a geophysics student and I am trying to predict shear wave velocity, which is numerical data. Since it is numerical data, I feel this should be a regression analysis, but the problem is that I don't have a shear wave log I can use as a target, which makes the project unsupervised. How do I go about it, please?
I want to know if it's possible to predict this numerical data, because I have tried picking out random logs that I feel would predict it, but how do I check the accuracy?
The solution here is to create labels out of the signal data. I was working on a similar kind of problem where I had to predict the intensity of a fall, and the data I had was signal data with x, y, z axes. I solved it by first creating labels with a clustering methodology suited to my use case. Once I had supervised data, I proceeded with further analysis and predictions.
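A hedged sketch of that workflow (the synthetic features and the number of clusters are assumptions, not part of the original problem): cluster the unlabeled signal features, then treat the cluster assignments as pseudo-labels for a supervised model.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical unlabeled signal features (e.g. statistics of x, y, z channels).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))

# Step 1: create labels from the data itself via clustering.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: with these pseudo-labels the problem becomes supervised.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=0
)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("held-out accuracy on pseudo-labels:", clf.score(X_test, y_test))
```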

Do we need to care about target variable distribution in train and validation set in regression problem?

In a classification problem, we care about the distribution of the labels in the train and validation sets. In sklearn, there is a stratify option in train_test_split to ensure that the distribution of the labels in the train and validation sets is similar.
In a regression problem, let's say we want to predict the housing price based on a bunch of features. Do we need to care about the distribution of the housing price in the train and validation sets?
If yes, how do we achieve this in sklearn?
Forcing features to have similar distributions in your training and validation sets assumes that the data you have is highly representative of the data you will encounter in real life (i.e., in a production environment), which is often not the case.
Also, doing so may artificially inflate your validation score compared to your test score.
Instead of adjusting feature distributions in the train and validation sets, I would suggest performing cross-validation (available in sklearn), which may be more representative of a testing situation.
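For example, a plain K-fold cross-validation in sklearn (the regressor and dataset here are placeholders; swap in your own):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Placeholder housing data and model.
X, y = fetch_california_housing(return_X_y=True)

# Each fold plays the role of a validation set once, which tends to be
# more representative than a single fixed train/validation split.
scores = cross_val_score(Ridge(), X, y, cv=5, scoring="r2")
print("R^2 per fold:", scores)
print("mean R^2:", scores.mean())
```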
This book (A. Géron, Hands-On Machine Learning with Scikit-Learn & TensorFlow, O'Reilly, 2017) provides an excellent introductory discussion of this in chapter 2. To paraphrase:
Generally, for large datasets you don't need to perform stratified sampling: your training set should be a fair representation of the range of observed instances (there are of course exceptions to this). For smaller datasets, you could introduce sampling bias (i.e., disproportionately recording data from only a particular region of the expected range of target attributes) if you performed purely random sampling, and stratified sampling is probably required.
Practically, you will need to create a new categorical feature by binning this continuous variable. You can then perform stratified sampling on this categorical feature. Make sure to remove the new categorical feature before training your model! A minimal sketch is shown below.
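This sketch loosely follows that binning idea (the column names, synthetic data, and number of bins are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical housing data with a continuous target.
rng = np.random.default_rng(42)
housing = pd.DataFrame({
    "median_income": rng.lognormal(mean=1.0, sigma=0.5, size=1000),
    "price": rng.normal(loc=200_000, scale=50_000, size=1000),
})

# Bin the continuous target into a temporary categorical feature
# (quantile bins keep every bin reasonably populated).
housing["price_cat"] = pd.qcut(housing["price"], q=5, labels=False)

# Stratify the split on the binned target...
train_set, val_set = train_test_split(
    housing, test_size=0.2, stratify=housing["price_cat"], random_state=42
)

# ...then drop the helper column before training.
train_set = train_set.drop(columns="price_cat")
val_set = val_set.drop(columns="price_cat")
```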
However, to do this you will need a good understanding of your features; I doubt there is much point in performing stratified sampling on features of weak predictive power. It could even do harm if you introduce some unintentional bias into the data by performing non-random sampling.
Take home message:
My instinct is that stratified sampling of a continuous feature should always be led by information and understanding. That is, if you know a feature is a strong predictor of the target variable and you also know that the sampling across its values is not uniform, you probably want to perform stratified sampling to make sure the range of values is properly represented in both the training and validation sets.

State-of-art for sensor's anomaly detection

I am working on an anomaly detection problem and I need your help and expertise. I have a sensor that records episodic time series data. For example, once in a while, the sensor activates for 10 seconds and records values at millisecond intervals. My task is to identify whether the recorded pattern is abnormal. In other words, I need to detect anomalies in that pattern compared to other recorded patterns.
What would be the state-of-the-art approaches to that?
After doing my own research, I found the following methods have proven to work very well in practice:
- Variational Inference for On-line Anomaly Detection in High-Dimensional Time Series
- Multivariate Industrial Time Series with Cyber-Attack Simulation: Fault Detection Using an LSTM-based Predictive Data Model

Testing an image processing algorithm on noisy data

I wrote an image processing program that trains a classifier to recognize certain objects in an image. Now I want to test my algorithm's response to noise; I would like it to have some robustness to noise.
My question is: should I train the classifier on a noisy version of the training dataset, or train it on the original dataset and evaluate its performance on noisy data?
Thank you.
To show the robustness of a classifier, one might run highly noisy test data through the originally trained classifier. Depending on that performance, one can retrain using noisy data and then test again. Obviously, for application development, if including extremely noisy samples increases accuracy then that's the way to go. The literature says to use as large a range of training samples as possible; however, this sometimes degrades performance in specific cases.
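As a rough sketch of that first step (the dataset, noise level, and classifier are assumptions chosen for illustration): add synthetic Gaussian noise to a held-out test set and compare scores on clean versus noisy images.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Small image dataset as a stand-in for your own images.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Classifier trained on the original (clean) training data.
clf = SVC().fit(X_train, y_train)

# Add Gaussian noise to the test images only.
rng = np.random.default_rng(0)
X_test_noisy = X_test + rng.normal(scale=2.0, size=X_test.shape)

print("accuracy on clean test data:", clf.score(X_test, y_test))
print("accuracy on noisy test data:", clf.score(X_test_noisy, y_test))
```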
