Does a Python MNE Raw object represent a single trial? If so, how do I average across many trials? - mne-python

I'm new to Python MNE and EEG data in general.
From what I understand, an MNE Raw object represents a single trial (with many channels). Am I correct? What is the best way to average data across many trials?
Also, I'm not quite sure what mne.Epochs().average() represents. Can anyone please explain?
Thanks a lot.

From what I understand, an MNE Raw object represents a single trial (with many channels). Am I correct?
An MNE Raw object represents a whole EEG recording. If you want to separate the recording into several trials, you have to transform the Raw object into an Epochs object (with mne.Epochs()). You will receive an object whose data has the shape (n_epochs, n_channels, n_times).
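For instance, here is a minimal sketch of slicing a recording into trials; the file name, event ID, and time window are assumptions, not taken from the question:

```python
import mne

# Hypothetical recording and event ID; adapt these to your own data.
raw = mne.io.read_raw_fif("my_recording_raw.fif", preload=True)
events = mne.find_events(raw)                     # trial markers from the stim channel
epochs = mne.Epochs(raw, events, event_id={"stimulus": 1},
                    tmin=-0.2, tmax=0.5, baseline=(None, 0), preload=True)

print(epochs.get_data().shape)                    # (n_epochs, n_channels, n_times)
```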
What is the best way to average data across many trials? Also, I'm not quite sure what mne.Epochs().average() represents. Can anyone please explain?
About "mne.Epochs().average()": if you have an "epoch" object and want to combine the data of all trials into one whole recording again (for example, after you performed certain pre-processing steps on the single trials or removed some of them), then you can use the average function of the class. Depending on the method you're choosing, you can calculate the mean or median of all trials for each channel and obtain an object with the shape (n_channels, n_time).
Not quite sure about the best way to average the data across the trials, but with mne.Epochs.average() you should be able to do it with ease. (Personally, I always calculated the mean over all my trials for each channel, but I guess that depends on the problem you're trying to solve.)
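As a rough sketch, continuing with the hypothetical epochs object from the example above:

```python
# Average (or take the median of) all trials, channel by channel.
evoked_mean = epochs.average(method="mean")       # the default
evoked_median = epochs.average(method="median")

print(evoked_mean.data.shape)                     # (n_channels, n_times)
evoked_mean.plot()                                # classic evoked-response plot
```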

Related

How to deal with missing values in K-means clustering?

I am working on customer segmentation based on customers' purchases across different types of product categories.
Below is a dummy representation of my data (the data is the percentage of total revenue per category that the customer purchased):
[Image: sample of the dummy data]
As seen in the image above, this sample has only a few zeros, but the original data has many. Therefore, using this data for k-means clustering does not produce any acceptable insights and skews the data toward the left.
Dropping the rows or averaging over the missing data is misleading. :/
How to deal with missing values is your choice, and it will of course impact your clustering. There is no single "correct" way.
A few popular ways:
Fill each column's missing values with the average/mean of that feature (a minimal sketch of this is shown after this answer)
Bootstrapping: select a random row and copy its value to fill the missing value
Closest neighbor: find the closest neighbor and fill the missing values from it.
Without seeing your full data and what you're trying to do with the clustering, it's a bit hard to help. It depends on the case...
You can always do some feature extraction (e.g. PCA); maybe it will give better insights.
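As a rough illustration of the first option, mean imputation followed by k-means; the column names and numbers below are made up, not the asker's data:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.cluster import KMeans

# Hypothetical revenue-share table; NaN marks categories a customer never bought.
df = pd.DataFrame({
    "groceries":   [0.6, 0.2, None, 0.5],
    "electronics": [None, 0.5, 0.7, 0.1],
    "clothing":    [0.4, 0.3, 0.3, None],
})

X = SimpleImputer(strategy="mean").fit_transform(df)          # fill each column with its mean
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```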

What can we do with a dataset where 98 percent of the column values are null?

I want to predict server downtime before it happens. To achieve this aim, I collected a lot of data from different data sources.
One of the data sources is metric data, which contains cpu-time, cpu-percentage, memory-usage, etc. However, many of the values in this dataset are null; I mean 98% of the values across the many columns are null.
What kind of data preparation technique can be used to prepare the data before applying it to a prediction algorithm?
I appreciate any help.
If I were in your situation, my first option would be to ignore this data source. There is too much missing data for it to be a relevant source of information for any ML algorithm.
That being said, if you still want to use this source of data, you will have to fill the gaps. Inferring the missing data from only 2% of available data is hardly possible, but when you are speaking of more than 90% missing data, I would advise having a look at Non-Negative Matrix Factorization (NMF) here.
A few versions of this algorithm are implemented in R. To get better results when inferring such a large amount of missing data, you could also read this paper, which combines time-series information (which could be your case) with NMF. I ran some tests with up to 95% missing data and the results were not so bad; hence, as discussed earlier, you could discard some of your data to get down to only 80% or 90% missing data and then apply NMF for time series.
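As a sketch of the general idea, here is a generic masked NMF imputation in NumPy; it is not the R implementation or the paper's time-series method, just the plain multiplicative-update variant restricted to the observed entries:

```python
import numpy as np

def masked_nmf_impute(X, rank=5, n_iter=300, eps=1e-9, seed=0):
    """Fill NaN entries of a non-negative matrix X by fitting X ~= W @ H
    on the observed entries only (weighted multiplicative updates)."""
    rng = np.random.default_rng(seed)
    M = ~np.isnan(X)                       # True where a value is observed
    Xf = np.where(M, X, 0.0)               # zero out the missing entries
    W = rng.random((X.shape[0], rank))
    H = rng.random((rank, X.shape[1]))
    for _ in range(n_iter):
        WH = W @ H
        H *= (W.T @ (M * Xf)) / (W.T @ (M * WH) + eps)
        WH = W @ H
        W *= ((M * Xf) @ H.T) / ((M * WH) @ H.T + eps)
    return np.where(M, X, W @ H)           # keep observed values, fill the rest
```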
Normally, various data imputation techniques can be applied, but with 98% null values I don't think this would be a correct approach: you would be inferring the empty data from just 2% of available information, which would introduce an enormous amount of bias into your data. I would go for an option like this: sort your rows in descending order, so that the rows with the largest number of non-null columns come first. Then determine a cutoff from the beginning of the sorted list of rows such that, for example, only 20% of the data is missing in the selected subset. Then apply data imputation (a pandas sketch of this is given below). Of course, this assumes that you will still have enough data points (rows) after determining the cutoff, which you may not have, and that the data is not missing at random for each row (if the data is missing at random for each row, you cannot use this sorting method at all).
In any case, I can hardly see a concrete way of getting a meaningful model built by using such a high amount of missing data.
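A minimal pandas sketch of that sort-and-cutoff idea; the DataFrame df and the 20% threshold are assumptions:

```python
import pandas as pd

# df is a placeholder for your metrics table with many null values.
df_sorted = df.loc[df.notna().sum(axis=1).sort_values(ascending=False).index]

# Running average of the per-row missing fraction, starting from the most complete rows.
missing_per_row = df_sorted.isna().mean(axis=1)
cumulative_missing = missing_per_row.expanding().mean()

# Keep the largest prefix whose overall missing rate stays at or below 20%.
subset = df_sorted[cumulative_missing <= 0.20]

# Only now impute, e.g. with the column means of the retained subset.
subset_imputed = subset.fillna(subset.mean(numeric_only=True))
```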
First, there can be many reasons why your data are null. For example, collecting those metrics was not planned in the previous project version; then you upgraded it, but the change is not retroactive, so you only have access to data from the new version. In that case the 2% are perfectly fine data, they just represent very little compared to the total volume because the new version has only been up for X days; and so on.
Anyway:
Even if you have only 2% non-null data, that does not really matter. What does matter is how much data those 2% represent. If it is 2% of 5 billion rows, then it is enough to take "just" that 2% of non-null data as training data and ignore the rest!
Now, if the 2% represents only a small amount of data, then I really advise you NOT to fill the null values from it, because doing so will create enormous bias. Furthermore, it means your current process is not ready for a machine learning project => just adapt it to collect more data.

Why should I use summaries, and what can I get from them?

I'm studying deep learning and TensorBoard, and almost every example code uses summaries.
I wonder why I need to use variable summaries.
There are many types of data to summarize, like min, max, mean, variance, etc.
What should I use in a typical situation?
How do I analyze them, and what can I get from these summary graphs?
thank you :D
There is an awesome video tutorial (https://www.youtube.com/watch?v=eBbEDRsCmv4) on TensorBoard that describes almost everything about it (graphs, summaries, etc.).
Variable summaries (scalar, histogram, image, text, etc.) help you track your model through the learning process. For example, tf.summary.scalar('v_loss', validation_loss) will add one point to the loss curve each time you call the summary op, thus giving you a rough idea of whether the model has converged and when to stop.
It depends on your variable type. For values like the loss, tf.summary.scalar shows the trend across epochs; for variables like the weights of a layer, it is better to use tf.summary.histogram, which shows how the entire distribution of the weights changes. I typically use tf.summary.image and tf.summary.text to check the images/texts my model generates over different epochs.
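For example, a minimal TF 2.x-style sketch (the answers above are written in the older TF 1.x "summary op" style); model, num_epochs, and train_one_epoch are hypothetical placeholders:

```python
import tensorflow as tf

writer = tf.summary.create_file_writer("logs/run1")

for epoch in range(num_epochs):
    loss = train_one_epoch(model)                      # assumed training helper
    with writer.as_default():
        tf.summary.scalar("loss", loss, step=epoch)    # one point on the loss curve per epoch
        for var in model.trainable_variables:
            # weight distributions over time
            tf.summary.histogram(var.name.replace(":", "_"), var, step=epoch)
```

Then run `tensorboard --logdir logs` to inspect the curves and histograms.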
The graph shows your model structure and the size of the tensors flowing through each op. I found it hard at the beginning to organise ops nicely in the graph presentation, and I learnt a lot about variable scope from that. The other answer provides a link to a great tutorial for beginners.

Do you have any suggestions for a Machine Learning method that may actually learn to distinguish these two classes?

I have a dataset in which the two classes overlap a lot. So far my results with SVM are not good. Do you have any recommendations for a model that may be able to differentiate between these two classes?
[Image: scatter plot of both classes]
It is easy to fit the dataset by interpolating one of the classes and predicting the other one otherwise. The problem with this approach, though, is that it will not generalize well. The question you have to ask yourself is whether you can predict the class of a point given its attributes. If you cannot, then every ML algorithm will also fail to do so.
Then the only reasonable thing you can do is to collect more data and more attributes for every point. Maybe by adding a third dimension you can separate the data more easily.
If the data points overlap so much, they should all be of the same class, but we know they are not. So there must be some feature(s) or variable(s) that separate these data points into two classes. Try to add more features to the data.
And sometimes just transforming the data to a different scale can help (a small sketch follows this answer).
The two classes need not be equally distributed; a skewed class distribution can be handled separately.
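For instance, a minimal rescaling sketch; X is a placeholder for the asker's feature matrix, and the log step assumes non-negative values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# X: (n_samples, n_features) array, assumed non-negative for the log step.
X_log = np.log1p(X)                                  # compress large values
X_scaled = StandardScaler().fit_transform(X_log)     # zero mean, unit variance per feature
```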
First of all, what is your criterion for "good results"? What style of SVM did you use? A simple linear kernel will certainly fail for most notions of "good", but a seriously convoluted Gaussian kernel might dredge something out of the handfuls of contiguous points in the upper regions of the plot.
I suggest that you run some basic statistics on the data you've presented, to see whether the classes are actually as separable as you'd want. I suggest a t-test for starters (a sketch is given after this answer).
If you have other dimensions, I strongly recommend that you use them. Start with the greatest amount of input you can handle, and reduce from there (principal component analysis). Until we know the full shape and distribution of the data, there's not much hope of identifying a useful algorithm.
That said, I'll make a pre-emptive suggestion that you look into spectral clustering algorithms when you add the other dimensions. Some are good with density, some with connectivity, while others key on gaps.
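Putting the t-test and kernel suggestions together in a rough sketch; X and y are placeholders for the asker's features and binary labels:

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Per-feature Welch t-test between the two classes.
for j in range(X.shape[1]):
    t, p = ttest_ind(X[y == 0, j], X[y == 1, j], equal_var=False)
    print(f"feature {j}: t = {t:.2f}, p = {p:.3g}")

# Tune an RBF-kernel SVM rather than a plain linear one.
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```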

Normalization methods for stream data

I am using the Clustream algorithm and have figured out that I need to normalize my data. I decided to use min-max normalization to do this, but I think that this way the values of newly arriving data objects will be calculated differently, as the values of min and max may change. Do you think I'm correct? If so, which algorithm should I use?
Instead of computing the global min-max over the whole dataset, you can use a local normalization based on a sliding window (e.g. using just the last 15 seconds of data). This approach is very common when computing a local mean filter in signal and image processing.
I hope this helps you.
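A small sketch of the idea; the window size is an assumption and would depend on your sampling rate:

```python
from collections import deque

class SlidingMinMax:
    """Normalize each incoming value with the min/max of a recent window."""
    def __init__(self, window_size=150, eps=1e-12):
        self.buffer = deque(maxlen=window_size)
        self.eps = eps

    def update(self, x):
        self.buffer.append(x)
        lo, hi = min(self.buffer), max(self.buffer)
        return (x - lo) / (hi - lo + self.eps)

norm = SlidingMinMax(window_size=150)     # e.g. ~15 s of data at 10 samples/s
# scaled_stream = (norm.update(v) for v in incoming_values)
```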
When normalizing stream data, you need to use the statistical properties of the training set. During streaming you just clip values that are too large or too small to the training min/max. There is no other way; it's a stream, after all.
As a trade-off, you can continuously collect the statistical properties of all your data and retrain your model from time to time to adapt to evolving data. I don't know Clustream, but after a short search it seems to be an algorithm designed to help with exactly such trade-offs.
