At what rate should I sample to make a dependent data stream independent?

I am an undergraduate student volunteering on a computer vision research project. As part of the project, I wish to make a dependent data stream (a stream in which the value of each data sample depends on the previous sample seen) independent. To do this, I need to determine a scalar interval at which to sample the stream so that no two consecutive data samples in the result are dependent.
For instance, maybe at a jump factor of 10, that is, sampling after every 10 data points in the stream, the resultant reduced data stream is independent.
My question is: how can we determine this scalar jump factor so that the new, subsampled data stream has independent data points?
From my research, I have been unable to find any statistical test that could be helpful.
Thanks in advance.
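One way to sanity-check a candidate jump factor (not from the original post; the `stream` variable and the AR(1) data below are purely illustrative) is to thin the stream by every k-th point and look at the lag-1 autocorrelation of the result. A value near zero only rules out linear dependence between consecutive samples, not dependence in general, but it is a useful first check:

```python
import numpy as np

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a 1-D sequence."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return np.dot(x[:-1], x[1:]) / np.dot(x, x)

# Illustrative dependent stream: an AR(1) process with strong correlation.
rng = np.random.default_rng(0)
stream = np.zeros(10_000)
for t in range(1, len(stream)):
    stream[t] = 0.9 * stream[t - 1] + rng.normal()

# Thin the stream by every `jump`-th point and check the remaining correlation.
for jump in (1, 5, 10, 20, 50):
    thinned = stream[::jump]
    print(f"jump={jump:2d}  lag-1 autocorrelation={lag1_autocorr(thinned):+.3f}")
```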

Related

Multiple data from different sources on time series forecasting

I have an interesting question about time series forecasting. Suppose someone has temporal data from multiple sensors, each dataset covering, e.g., 2010 to 2015, and wants to train a single forecasting model using all the data from those different sensors. How should the data be organized? If one just stacked the datasets, the result would be sensorDataset1 (2010–2015), followed by sensorDataset2 (2010–2015), and so on for sensors 3, 4, ..., n, with the timeline starting over at each sensor. Is this a problem with time series data or not?
If yes, what is the proper way to handle this?
I tried stacking all the data and training the model anyway, and it actually achieves a good error, but I wonder whether that approach is valid.
Try resampling your individual sensor datasets to the same period.
For example, if sensor 1 has a data entry every 5 minutes and sensor 2 has an entry every 10 minutes, resample both to a common period across all sensors. Each data point you show to your model will then be of more consistent quality, which should improve the model's performance.
How much this influences your error depends on what you're trying to forecast and on the relationships between the variables in your data.
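A rough sketch of that resampling idea with pandas (the sensor names, values, and 10-minute period below are hypothetical):

```python
import pandas as pd

# Hypothetical raw readings: sensor 1 every 5 minutes, sensor 2 every 10 minutes.
s1 = pd.DataFrame(
    {"value": range(12)},
    index=pd.date_range("2010-01-01", periods=12, freq="5min"),
)
s2 = pd.DataFrame(
    {"value": range(6)},
    index=pd.date_range("2010-01-01", periods=6, freq="10min"),
)

# Resample both sensors to a common 10-minute period (mean within each bin),
# and keep them identifiable with a sensor_id column instead of blindly stacking.
common = []
for sensor_id, df in [("sensor1", s1), ("sensor2", s2)]:
    r = df.resample("10min").mean()
    r["sensor_id"] = sensor_id
    common.append(r)

panel = pd.concat(common)
print(panel)
```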

What is 'Refresh Rate' in the context of machine learning algorithms?

I've recently been using an AI/ML platform called Monument (Monument.Ai) to project time series. The platform contains various ML algorithms, along with parameters within each algorithm to tune the projections. When using algorithms such as LightGBM and LSTM, there is a parameter called 'Refresh Rate', which takes an integer value. The platform describes refresh rate as
How frequently windows are constructed. Every window is used to validate this number of data points
where the windows in this context are 'sub-windows' within the main training period. My question is: what is the underlying purpose of the refresh rate, and how does changing it between 1, 10, and 50 impact the projections?
Monument worker here. I think we should set up an FAQ page somewhere, as these questions could be confusing to others without context :-)
Back to your question: the refresh rate affects only the "validation" part of a time series analysis. It is interpreted as a frequency, so 1 = high refresh rate and 50 = low refresh rate. A higher refresh rate gives you better validation, but is slower than a lower refresh rate; hence you usually choose a moderate one (10 is a good choice).
====== More technical explanations below. ======
On Monument, you choose an algorithm to make future "predictions" on your time series data, and you look at the "validation" results to see how suitable the algorithm is for your problem. The prediction task is specified by two "window" parameters: lookback and lookahead. Selecting lookback=10 and lookahead=5 means you are trying to "predict 5 data points into the future by using the last 10 data points".
Validation needs to reflect the results of that exact same prediction task. In particular, for each historical data point, you want to train a new model on the 10 points before it to predict the 5 points ahead. This is refresh rate=1, i.e., refresh for every data point: for each historical data point, you create a "sub-window" of length 15 (10+5). That is a lot of new models to train and can be very slow.
If time and memory are not a concern, then refresh rate=1 is a good choice, but usually we want to be more efficient. Here we exploit a "local reusability" assumption: a model trained on one sub-window is also useful for adjacent sub-windows. We can then train a model on one sub-window and reuse it for 10 historical points, that is, refresh rate=10. Much less computation is required, and validation is still accurate to a certain extent. Note that you may not want to set refresh rate=200, because it is not very convincing that a model is still useful for data 200 points away. As you can see, there is a tradeoff between speed and accuracy.
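Monument's internals aren't public, so the following is only a sketch of the sub-window scheme described above, showing how the number of models trained for validation shrinks as the refresh rate grows (lookback=10, lookahead=5, as in the example):

```python
def validation_subwindows(n_points, lookback=10, lookahead=5, refresh_rate=10):
    """Yield (train_range, test_range) index pairs for a sliding validation.

    With refresh_rate=1 a new model is fit at every position; with
    refresh_rate=10 one model is reused for the next 10 positions.
    """
    window = lookback + lookahead
    for start in range(0, n_points - window + 1, refresh_rate):
        train = (start, start + lookback)          # fit on these points
        test = (start + lookback, start + window)  # validate on these points
        yield train, test

# Fewer sub-windows (fewer models to train) as the refresh rate grows.
for rr in (1, 10, 50):
    n_models = len(list(validation_subwindows(1000, refresh_rate=rr)))
    print(f"refresh_rate={rr:2d} -> {n_models} models trained for validation")
```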

Applying machine learning to training data parameters

I'm new to machine learning. I understand that there are parameters and choices in the model you attach to a certain set of inputs, and that these can be tuned/optimised, but those inputs ultimately tie back to fields you generated by slicing and dicing whatever source data you had in a way that made sense to you. What if the way you decided to model and cut up your source data, and therefore your training data, isn't optimal? Are there ways or tools that extend the power of machine learning beyond the model to the way the training data was created in the first place?
Say you're analysing the accelerometer, GPS, heart rate, and surrounding topography data of someone moving. You want to try to determine where this person is likely to become exhausted and stop, assuming they'll continue moving in a straight line based on their trajectory, and that going up any hill will increase heart rate to some point where they must stop. Whether they're running or walking obviously modifies these things.
So you cut up your data (feel free to correct how you'd do this, but it's less relevant to the main question):
- Slice up the raw accelerometer data along the X, Y, Z axes for the past A seconds into B slices to profile it, probably applying a CNN to it, to determine whether the person is running or walking.
- Cut up the most recent C seconds of raw GPS data into a sequence of D (Lat, Long) pairs, each pair representing the average of E seconds of raw data.
- Based on that sequence, determine speed and trajectory, and determine the upcoming slope by slicing the next F units of distance (or seconds, another option to determine, of G) into H slices, profiling each, etc.
You get the idea. How do you effectively determine A through H, some of which would completely change the number and behaviour of the model inputs? I want to take out any bias I may have about what's right and let the system determine everything end to end. Are there practical solutions to this? Each time it changes the data-creation parameters, it would go back, regenerate the training data, feed it into the model, train it, and tune it, over and over, until it gets the best result.
What you call your bias is actually the greatest strength you have: you can include your knowledge of the system. Machine learning, including glorious deep learning, is, to put it bluntly, stupid. Although it can figure out features for you, interpreting them will be difficult.
Deep learning in particular has a great capacity to memorise (not learn!) patterns, making it easy to overfit to the training data. Building machine learning models that generalise well in the real world is tough.
In most successful approaches (look at what Kaggle Masters do), people create features by hand. In your case I'd probably want to calculate the magnitude and direction of the force. Depending on the scenario, I might transform (Lat, Long) into the distance from a specific point (say, the point of origin/activation, or one re-established every minute), or maybe use a different coordinate system.
Since your data is a time series, I'd probably use something well suited to time series modelling that you can understand and troubleshoot. CNNs and the like are typically a last resort in the majority of cases.
If you really would like to automate it, check e.g. AutoKeras or Ludwig. When it comes to learning which features matter most, I'd recommend going with gradient boosting (GBDT).
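As a small illustration of those two suggestions (a hand-crafted magnitude feature plus gradient boosting for feature importances), here is a sketch using scikit-learn's GradientBoostingClassifier; the column names and synthetic data are made up:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical raw data: accelerometer axes plus heart rate, with a
# running/walking label. In practice this would come from your sensors.
rng = np.random.default_rng(42)
n = 2000
df = pd.DataFrame({
    "acc_x": rng.normal(size=n),
    "acc_y": rng.normal(size=n),
    "acc_z": rng.normal(size=n),
    "heart_rate": rng.normal(90, 15, size=n),
})
df["label_running"] = (df["heart_rate"] > 95).astype(int)

# Hand-crafted feature: magnitude of the acceleration vector.
df["acc_magnitude"] = np.sqrt(df.acc_x**2 + df.acc_y**2 + df.acc_z**2)

features = ["acc_x", "acc_y", "acc_z", "acc_magnitude", "heart_rate"]
model = GradientBoostingClassifier().fit(df[features], df["label_running"])

# Which engineered features does the model actually rely on?
for name, importance in zip(features, model.feature_importances_):
    print(f"{name:15s} {importance:.3f}")
```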
I'd also recommend reading this article from AirBnB, which takes a deeper dive into the journey of building such systems and into feature engineering.

Normalization methods for stream data

I am using the CluStream algorithm, and I have figured out that I need to normalize my data. I decided to use min-max normalization, but I think the values of newly arriving data objects will then be calculated differently, as the values of min and max may change over time. Do you think I'm correct? If so, which method should I use?
Instead of computing the global min-max over the whole dataset, you can use a local normalization based on a sliding window (e.g., using just the last 15 seconds of data). This approach is very common for computing a local mean filter in signal and image processing.
I hope this helps.
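A minimal sketch of that sliding-window normalization (the window length of 50 values below is arbitrary):

```python
from collections import deque

def sliding_minmax_normalize(stream, window_size=50):
    """Min-max normalize each incoming value using only the last `window_size` values."""
    window = deque(maxlen=window_size)
    for x in stream:
        window.append(x)
        lo, hi = min(window), max(window)
        yield 0.5 if hi == lo else (x - lo) / (hi - lo)

# Example: normalize a drifting stream on the fly.
drifting = (i * 0.01 + (i % 7) for i in range(1000))
normalized = list(sliding_minmax_normalize(drifting, window_size=50))
print(normalized[:5])
```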
When normalizing stream data, you need to use the statistical properties of the training set. During streaming, you just clip values that are too big or too small to the min/max values. There is no other way; it's a stream, after all.
As a tradeoff, though, you can continuously collect the statistical properties of all your data and retrain your model from time to time to adapt to evolving data. I don't know CluStream, but after a short search it seems to be an algorithm that helps make exactly such tradeoffs.
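And a tiny sketch of the clip-to-training-range approach described above (the training data here is synthetic):

```python
import numpy as np

def make_stream_normalizer(train_data):
    """Fix min/max from the training set; clip and scale streaming values with them."""
    lo, hi = float(np.min(train_data)), float(np.max(train_data))
    def normalize(x):
        return (np.clip(x, lo, hi) - lo) / (hi - lo)
    return normalize

normalize = make_stream_normalizer(np.random.default_rng(0).normal(size=1000))
print(normalize(0.3), normalize(10.0))  # out-of-range values are clipped to 1.0
```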

Recover the original analog signal (time varying Voltage) from digitized version?

I have been looking into how to convert my digital data into analog.
I have a two-column ASCII data file (x: time, y: voltage amplitude) which I would like to convert into an analog signal (voltage varying with time). There are digital-to-analog converters, but the good ones are quite expensive. There should be a simpler way to achieve this.
Ultimately what I'd like to do is reconstruct the original time-varying voltage, which was sampled every nanosecond and recorded as an ASCII data file.
I thought I might feed the data into my laptop's sound card and regenerate the time-varying voltage, which I could then feed into the analyzer via the audio jack. Does this sound feasible?
I am not looking to recover just the "shape" but the signal (voltage) itself.
I'm puzzled on several counts.
You want to convert the data into an analog signal (voltage varying with time). But what you already have, the discrete signal, is indeed a "voltage varying with time"; it's just that both the values (voltages) and the times are discrete. That's the way computers (and digital equipment in general) work.
Only when the signal goes to some non-discrete medium (e.g., a classical audio cable and plug) do we have an analog signal. In fact, your computer's sound card is at its core a digital-to-analog converter.
So it appears you are not trying to do digital processing of your signal (interpolation or whatever) and you are not dealing with computer programming, but with a hardware problem: getting the signal onto a cable. If so, SO is not the proper place; you might try https://electronics.stackexchange.com/ ...
On another point, you say your data was "sampled every nano-second". That means 1 billion samples per second, or a sampling frequency of 1 GHz. That's a ridiculously high frequency, at least in the audio world. You can't output that through a sound card, which is limited to the audio range (about 48 kHz = 48,000 samples per second).
You want to just fit a curve to the data. Assuming the sampling rate is sufficient, a third-order polynomial would be plenty. At each point N, you fit a cubic polynomial to points N-1, N, N+1, and N+2, and then you have an analytic expression for the data values between those points. Shift over one, and repeat. You can average the values for multiple successive curves, if you want.
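A quick sketch of that piecewise cubic idea with NumPy (the voltage samples below are invented):

```python
import numpy as np

# Invented samples of a voltage trace: time in ns, voltage in volts.
t = np.arange(0, 20, 1.0)
v = np.sin(2 * np.pi * t / 10.0)

def local_cubic(t, v, n):
    """Fit a cubic through points n-1, n, n+1, n+2; the returned polynomial
    gives an analytic expression for the signal between t[n] and t[n+1]."""
    idx = slice(n - 1, n + 3)
    coeffs = np.polyfit(t[idx], v[idx], 3)
    return np.poly1d(coeffs)

# Evaluate the local fit between samples 5 and 6 at a finer resolution,
# then shift over one sample and repeat for the next interval.
p = local_cubic(t, v, 5)
fine_t = np.linspace(t[5], t[6], 11)
print(p(fine_t))
```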
