Log transformation of data does not normalize the data

I have phenotype data that are not normally distributed, so I log-transformed the data to normalize them and center them at zero. The distribution improved, but it is still not normal. What can I do to make it normal, or should I proceed with the analysis as is? My goal is to build a co-expression network.
Thank you!

You should continue your analysis as is. There is no point in transforming the data further if they are simply not normally distributed.
You're trying to fit your model to the data, not your data to the model.
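If you want to see how far the transformed data still are from normal, a minimal sketch along these lines can help (the `phenotype` array is a hypothetical placeholder for your own values):

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in for your raw phenotype values
phenotype = np.random.lognormal(mean=0.0, sigma=1.0, size=200)

log_pheno = np.log(phenotype)      # log-transform (use np.log1p if zeros are possible)
log_pheno -= log_pheno.mean()      # center at zero

# Shapiro-Wilk test: a small p-value suggests the data are still not normal
stat, p = stats.shapiro(log_pheno)
print(f"Shapiro-Wilk W={stat:.3f}, p={p:.3g}")
```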

Related

Does a Python MNE Raw object represent a single trial? If so, how do I average across many trials?

I'm new to Python MNE and to EEG data in general.
From what I understand, an MNE Raw object represents a single trial (with many channels). Am I correct? What is the best way to average data across many trials?
Also, I'm not quite sure what mne.Epochs().average() represents. Can anyone please explain?
Thanks a lot.
From what I understand, an MNE Raw object represents a single trial (with many channels). Am I correct?
An MNE Raw object represents a whole EEG recording. If you want to split the recording into several trials, you have to transform the Raw object into an Epochs object (with mne.Epochs()). You will receive an object with the shape (n_epochs, n_channels, n_times).
What is the best way to average data across many trials? Also, I'm not quite sure what mne.Epochs().average() represents. Can anyone please explain?
About "mne.Epochs().average()": if you have an "epoch" object and want to combine the data of all trials into one whole recording again (for example, after you performed certain pre-processing steps on the single trials or removed some of them), then you can use the average function of the class. Depending on the method you're choosing, you can calculate the mean or median of all trials for each channel and obtain an object with the shape (n_channels, n_time).
Not quite sure about the best way to average the data across the trials, but with mne.epochs.average you should be able to do it with ease. (Personally, I always calculated the mean for all my trials for each channel. But I guess that depends on the problem you try to solve)
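As a minimal sketch (the file name, stim channel and epoching window are assumptions; adapt them to your recording):

```python
import mne

# Hypothetical FIF recording with a stim channel marking the trial onsets
raw = mne.io.read_raw_fif("recording_raw.fif", preload=True)

events = mne.find_events(raw)                      # trial onsets from the stim channel
epochs = mne.Epochs(raw, events, event_id=None,    # keep all event types
                    tmin=-0.2, tmax=0.5,
                    baseline=(None, 0), preload=True)

print(epochs.get_data().shape)           # (n_epochs, n_channels, n_times)

evoked = epochs.average(method="mean")   # or method="median"
print(evoked.data.shape)                 # (n_channels, n_times)
```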

Machine learning data modeling

I am a beginner in machine learning. I have watched videos that teach machine learning, but my question is: how do we model our data?
Mostly we get unstructured data. What is the best way to convert that unstructured data into a structured format, so that we can find the most useful information in the data?
Any help in the form of books or links would be much appreciated.
As a machine learning engineer, you will be responsible for preprocessing your data so that the model can accept it.
There is no single best way to do this; moreover, it depends on what type of data you have, such as 1. CSV datasets, 2. text datasets, 3. files (images and audio).
In the real world, not all data will be in a structured form. When we get the data, the very first things to do are:
1. Find out what the data is all about.
2. Identify its features and its output.
Ex: in a dataset to predict a person's height, you may have information such as country, weight, gender, hair colour, etc.; these are what we usually call features in machine learning.
3. Look at what form the features take, e.g. text or numerical. We need to pre-process the data before doing any analysis. For example, if a feature in your data is a review text, you need to remove all the special characters and build a corpus from your data (see the cleaning sketch after this list).
4. Understand how the model accepts the data and which parameters it has, and how we can improve the data (we can do some feature engineering to improve the models, etc.).
There is no hard and fast rule that you have to do it exactly this way.
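For the review-text case in step 3, a minimal cleaning sketch might look like this (purely illustrative; the regex and example string are made up):

```python
import re

def clean_review(text):
    """Tiny text-cleaning sketch: lowercase, strip special characters, squeeze spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop punctuation / special characters
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(clean_review("Great product!!! 10/10 -- would BUY again :)"))
# -> "great product 10 10 would buy again"
```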
First, you need to learn about preprocessing and feature extraction. If you build a model in Python, then libraries like Pandas or scikit-learn are very useful. As a first step, try to formulate sentences like "when x occurs, my output y becomes ...".
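For tabular data, a small sketch with Pandas and scikit-learn could look like this (the columns follow the height-prediction example above and are entirely made up):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical tabular data: predict height from weight, gender, country
df = pd.DataFrame({
    "weight":  [70, 80, 55, 90],
    "gender":  ["m", "f", "f", "m"],
    "country": ["IN", "US", "US", "DE"],
    "height":  [175, 168, 160, 182],
})

X, y = df.drop(columns="height"), df["height"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["weight"]),                                    # scale numeric features
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender", "country"]),   # encode categoricals
])

X_ready = preprocess.fit_transform(X)   # numeric matrix a model can accept
print(X_ready.shape)
```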
Before modeling, the data has to be cleaned, and there are several methods for cleaning data. Go through the link below on how to convert unstructured data to structured data.
https://www.geeksforgeeks.org/how-to-convert-unstructured-data-to-structured-data-using-python/

Data augmentation in test/validation set?

Is it common practice to augment data (adding samples programmatically, such as random crops in the case of an image dataset) on both the training and test sets, or just on the training set?
Only on the training set. Data augmentation is used to increase the size of the training set and to obtain more varied images.
Technically, you could use data augmentation on the test set to see how the model behaves on such images, but usually people don't do it.
Data augmentation is done only on the training set, as it helps the model generalize and become more robust, so there is no point in augmenting the test set.
This answer on stats.SE makes the case for applying crops on the validation/test sets so as to make that input similar to the input the network was trained on.
Do it only on the training set. And, of course, make sure that the augmentation does not make the label wrong (e.g. when rotating a 6 or a 9 by about 180°).
The reason we use a training and a test set in the first place is that we want to estimate the error our system will have in reality, so the data in the test set should be as close to real data as possible.
If you augment the test set, you might introduce errors. For example, say you want to recognize digits and you augment by rotating; then a 6 might look like a 9. But not all examples are that easy. Better safe than sorry.
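To make the "training set only" advice concrete, here is a minimal sketch with torchvision (the dataset and the specific transforms are just examples, not a recommendation):

```python
import torchvision.transforms as T
from torchvision.datasets import MNIST

# Augmentation lives only in the training pipeline; the test pipeline stays deterministic.
train_tf = T.Compose([
    T.RandomCrop(28, padding=4),   # random crops add training variation
    T.RandomRotation(10),          # small rotations only, so a 6 cannot turn into a 9
    T.ToTensor(),
])
test_tf = T.Compose([
    T.ToTensor(),                  # no augmentation on test data
])

train_set = MNIST("data", train=True, download=True, transform=train_tf)
test_set = MNIST("data", train=False, download=True, transform=test_tf)
```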
I would argue that, in some cases, using data augmentation for the validation set can be helpful.
For example, I train a lot of CNNs for medical image segmentation. Many of the augmentation transforms I use are meant to reduce the image quality so that the network is trained to be robust against such data. If the training set looks bad and the validation set looks nice, it will be hard to compare the losses during training, and therefore assessing overfitting will be complicated.
I would never use augmentation for the test set unless I'm using test-time augmentation to improve results or estimate aleatoric uncertainty.
In computer vision, you can use data augmentation at test time to obtain different views of the test image. You then have to aggregate the results obtained from each view, for example by averaging them.
For example, an ambiguous symbol can be interpreted differently depending on the point of view.
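A minimal test-time-augmentation sketch (everything here is illustrative; `model` is assumed to expose a Keras-style predict method):

```python
import numpy as np

def tta_predict(model, image, n_aug=8):
    """Predict on several views of one image and average the class probabilities.
    `model.predict(batch)` is assumed to return an array of shape (n, n_classes)."""
    views = [image, np.fliplr(image)]                    # original + horizontal flip
    for _ in range(n_aug - 2):
        shift = tuple(np.random.randint(-2, 3, size=2))  # small random translation
        views.append(np.roll(image, shift, axis=(0, 1)))
    batch = np.stack(views)
    probs = model.predict(batch)                         # (n_aug, n_classes)
    return probs.mean(axis=0)                            # aggregate by averaging
```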
Some image preprocessing tools like Roboflow (https://roboflow.com/) apply data augmentation to test data as well. I'd say that if one is dealing with small and rare objects, say cerebral microbleeds (which are tiny and difficult to spot on magnetic resonance images), augmenting one's test set could be useful. You can then verify that your model has learned to detect these objects under different orientation and brightness conditions (given that your training data has been augmented in the same way).
The goal of data augmentation is to generalize the model and make it learn more orientations of the images, so that during testing the model can handle the test data well. So it is common practice to use augmentation techniques only on the training set.
The point of adding validation data is to build a generalized model, which ultimately means predicting real-world data. In order to predict real-world data, the validation set should contain real data. There is no problem with augmenting the validation data, but it won't increase the accuracy of the model.
Here are my two cents:
You train your model on the training data and the validation data: the former to optimize your parameters, and the latter to give you an appropriate stopping condition. The test data is to give you a real-world estimate of how well you can expect your model to perform.
For training, you can augment your training data to increase robustness to various factors including, but not limited to, sampling error, bias between data sources, shifts in global data distribution, positioning, and any other sort of variation you would like to account for.
The validation data should indicate to the training method when the model is most generalizable. By this logic, if you expect to see some variation in real-world data that can be simulated using data augmentation, then by all means, the validation dataset should be augmented.
The test data, on the other hand, should not be augmented, except potentially in special scenarios where data is very limited, and an estimate of real-world performance on test data has too much variance.
You can use augmented data in the training, validation and test sets.
The only thing to avoid is using the same data from the training set in the validation or test sets.
For example, if you generate 3 augmented instances from a record in the training data, make sure that none of those 3 augmented instances accidentally ends up in the validation or test sets.
Using data from the training set, even augmented data, to validate or test a model is a methodological mistake.
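One simple way to guarantee this is to split first and only then augment, for example (the data and the `augment` helper are toy stand-ins):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 8)          # dummy features
y = np.random.randint(0, 2, 100)    # dummy labels

# Split FIRST, then augment only the training split, so no augmented copy
# of a training example can end up in the validation/test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

def augment(X, y, noise=0.01):
    """Toy augmentation: one jittered copy per training example."""
    return np.vstack([X, X + noise * np.random.randn(*X.shape)]), np.hstack([y, y])

X_train_aug, y_train_aug = augment(X_train, y_train)
```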

How do I learn parameters per input example in TensorFlow without using too much memory?

I want to train my TensorFlow model to learn some parameters separately for each input element; in particular, I want it to learn the initial states of an LSTM network for every input in my training set. The goal is to use these initial states as an embedding.
To loop through my input data, I use a PaddingFIFOQueue that goes over my input stored in .tfrecord files as SequenceExamples. If I just allocate one huge tensor for the learned initial states, my memory gets exhausted. As my input data are loaded in batches, I was wondering whether there is some way to split the learned initial states into batches, too. But when and how can I save those data to disk? And how do I make sure the same initial states come up when the same batch is read again? Also, I know that Adam keeps track of changes to all trainable variables; can I somehow tell Adam to swap out its memory, too?
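For reference, the "one big tensor of learned initial states, indexed by example id" setup described above might look roughly like this (TF1-style API via tf.compat.v1; all sizes and names are assumptions, and this is only the memory-hungry baseline, not a solution):

```python
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

num_examples = 100000    # total number of sequences in the dataset (assumption)
state_size = 128         # LSTM state size (assumption)
n_features = 40          # features per time step (assumption)

# One learnable initial state (c and h) per training example, looked up by example id.
# This is the single huge tensor mentioned above: it grows with num_examples,
# and Adam allocates two more slot tensors of the same shape.
init_states = tf.get_variable("init_states", [num_examples, 2 * state_size])

example_ids = tf.placeholder(tf.int64, [None])                  # ids of the examples in the batch
inputs = tf.placeholder(tf.float32, [None, None, n_features])   # padded batch of sequences

batch_states = tf.nn.embedding_lookup(init_states, example_ids)
c0, h0 = tf.split(batch_states, 2, axis=1)

cell = tf.nn.rnn_cell.LSTMCell(state_size)
outputs, _ = tf.nn.dynamic_rnn(
    cell, inputs,
    initial_state=tf.nn.rnn_cell.LSTMStateTuple(c0, h0))
```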

Normalization methods for stream data

I am using the CluStream algorithm and I have figured out that I need to normalize my data. I decided to use min-max normalization, but I think the values of newly arriving data objects will then be computed differently, as the values of min and max may change. Do you think I'm correct? If so, which algorithm should I use?
Instead of computing the global min-max over the whole dataset, you can use a local normalization based on a sliding window (e.g. using just the last 15 seconds of data). This approach is very common for computing a local mean filter in signal and image processing.
I hope it can help you.
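A minimal sliding-window min-max sketch with pandas (the series and window length are made up):

```python
import numpy as np
import pandas as pd

x = pd.Series(np.random.randn(200).cumsum())   # stand-in for the incoming stream
window = 15                                     # e.g. the last 15 samples / seconds

roll_min = x.rolling(window, min_periods=1).min()
roll_max = x.rolling(window, min_periods=1).max()
span = (roll_max - roll_min).replace(0, 1)      # avoid division by zero
x_norm = (x - roll_min) / span                  # each value scaled within its own window
```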
When normalizing stream data, you need to use the statistical properties of the training set. During streaming, you just clip values that are too large or too small to that min/max range. There is no other way; it's a stream, after all.
But as a trade-off, you can continuously collect the statistical properties of all your data and retrain your model from time to time to adapt to evolving data. I don't know CluStream, but after a short search it seems to be an algorithm that helps make exactly such trade-offs.
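In code, that clipping step could look like this (the min/max statistics from the training data are assumed to be precomputed):

```python
import numpy as np

train_min, train_max = 0.0, 10.0       # statistics collected on the training set (assumed)

def normalize_stream_value(x):
    """Clip an incoming value to the training range, then min-max scale it."""
    x = np.clip(x, train_min, train_max)
    return (x - train_min) / (train_max - train_min)

print(normalize_stream_value(12.5))    # values above train_max are cut to 1.0
```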
