In time series data, a three-dimensional shape of (samples, timesteps, features) is used for CNN1D/LSTM models. For a CNN2D model, time series data needs a 5D shape (samples, timesteps, features, rows, columns).
How can I convert data from a 3D shape to a 5D shape for a CNN2D model?
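A minimal NumPy sketch of the conversion, assuming the per-timestep feature vector can be laid out as a rows × cols grid (so features = rows × cols) with a single channel; note that Keras layers such as ConvLSTM2D or TimeDistributed(Conv2D) typically expect the 5D order (samples, timesteps, rows, cols, channels):

import numpy as np

# Hypothetical 3D batch: (samples, timesteps, features)
samples, timesteps, features = 8, 10, 12
data_3d = np.random.rand(samples, timesteps, features)

# Split the feature axis into a rows x cols grid and add a channel axis;
# this only works when features == rows * cols * channels.
rows, cols, channels = 3, 4, 1
data_5d = data_3d.reshape(samples, timesteps, rows, cols, channels)
print(data_5d.shape)  # (8, 10, 3, 4, 1)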
I have a matrix of data where the first column contains dates and the rest (nearly 100 columns) contain binary data (0s and 1s). I was tasked with turning them into time series in R, but I'm lost on how to do it.
I'm especially lost because all the information I can find is about time series involving data other than binary data.
I have developed a Random Forest model which takes two inputs X and produces one output Y. I normalized both the X and Y values for the training process.
After the model was trained, I selected a dataset of unseen data, coming from another source, as input for the model. I normalized its X values, fed them to the trained model, and got the normalized Y value as output. I wonder what the denormalizing process looks like: by which value do I have to multiply the output to get the denormalized value?
I'd appreciate it if someone could help me in this regard.
You need to apply the preprocessing inversely. But for that you need the mean and sd (standard deviation) values that were used for normalization: a standardized value is mapped back as x = x_scaled * sd + mean.
For example, with scikit-learn you can do it easily:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data = ...                                       # the values that were normalized
scaled_data = scaler.fit_transform(data)         # fit() stores each column's mean and sd
inverse = scaler.inverse_transform(scaled_data)  # undoes the scaling: x_scaled * sd + mean
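Since the question normalizes X and Y separately, here is a minimal sketch (with hypothetical training arrays) that keeps one scaler per variable, so the Y scaler's inverse_transform can later be applied to the model's predictions:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: two input columns X, one output column Y
X_train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y_train = np.array([[10.0], [20.0], [30.0]])

x_scaler = StandardScaler().fit(X_train)  # remembers mean/sd of each X column
y_scaler = StandardScaler().fit(y_train)  # remembers mean/sd of Y

X_scaled = x_scaler.transform(X_train)
y_scaled = y_scaler.transform(y_train)
# ... train the Random Forest on X_scaled / y_scaled ...

# At prediction time: scale the unseen X with the same x_scaler, predict,
# then invert the Y scaling on the output.
y_pred_scaled = y_scaled                  # stand-in for model.predict(...)
y_pred = y_scaler.inverse_transform(y_pred_scaled)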
I'm trying to learn the basics of Linear Regression.
I tried to build the simplest model with the simplest data to start.
For data I had:
## Data (Apple stock prices)
apple = np.array([155, 160, 165])
days = np.array([1, 2, 3])
X would be the days and y would be the Apple stock price.
I tried to build the model with a one-liner:
model = LinearRegression().fit(X=days, y=apple)
Then I get an error saying the model expects 2D data as input.
But why? Both X and y in this case, the number of days and the Apple stock prices, are one-dimensional. Why should they be converted into 2D arrays?
scikit-learn estimators expect X as a 2D array of shape (n_samples, n_features), so the same API works whether you have one feature or many. A 1D array of shape (3,) does not say which axis is samples and which is features, so with a single feature you reshape it to (3, 1): three samples, one feature each. Reshape your data and the model will fit correctly. Try this:
apple = np.array([155, 160, 165]).reshape(-1, 1)  # (3, 1): 3 samples, 1 column
days = np.array([1, 2, 3]).reshape(-1, 1)         # (3, 1): 3 samples, 1 feature
model = LinearRegression().fit(X=days, y=apple)
The input is an array of shape (n × m), with m being the number of x variables.
In your case m = 1, so you need an array of shape (3, 1); your current input has shape (3,).
Try:
days = np.array([1, 2, 3]).reshape(-1,1)
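Putting the fix together, a minimal runnable sketch (only X strictly has to be 2D in scikit-learn; a 1D y is accepted as-is):

import numpy as np
from sklearn.linear_model import LinearRegression

days = np.array([1, 2, 3]).reshape(-1, 1)  # X: (3 samples, 1 feature)
apple = np.array([155, 160, 165])          # y may stay 1D

model = LinearRegression().fit(X=days, y=apple)
print(model.coef_, model.intercept_)       # slope 5.0, intercept 150.0
print(model.predict(np.array([[4]])))      # extrapolates day 4 -> [170.]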
The Problem
After pre-processing a raw dataset, I obtained a clean but severely imbalanced dataset: 341 observations with label 1 and only 3 with label 0 (more details about the dataset at the bottom).
Dataset shape: (344, 1500)
Dataset class label distribution: Counter({1: 341, 0: 3})
What can I do to proceed with this dataset for classification?
What I have tried:
Split the dataset into train and test sets with a 70:30 ratio, stratified on the class label
Train data shape: (240, 1500)
Train data class label distribution: Counter({1: 238, 0: 2})
Test data shape: (104, 1500)
Test data class label distribution: Counter({1: 103, 0: 1})
Perform oversampling on the train data using SMOTE (synthetic minority oversampling technique) with k_neighbors set to 1 (a sketch of both steps follows below)
After SMOTE:
Train data shape: (476, 1500)
Train data class label distribution: Counter({1: 238, 0: 238})
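The two steps above as a minimal runnable sketch, assuming imbalanced-learn's SMOTE and stand-in random data with the shapes from the question:

from collections import Counter
import numpy as np
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Stand-in data with the question's shapes: 341 positives, 3 negatives
rng = np.random.default_rng(0)
X = rng.normal(size=(344, 1500))
y = np.array([1] * 341 + [0] * 3)

# 70:30 stratified split keeps the 341:3 ratio roughly intact
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# k_neighbors=1 because only 2 minority samples remain in the train set
X_res, y_res = SMOTE(k_neighbors=1, random_state=42).fit_resample(X_train, y_train)
print(Counter(y_res))  # balanced: Counter({1: 238, 0: 238})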
I plan to train a classifier on the oversampled train data and use the test data to get the classification result.
But does this make sense? In my opinion it does not, since:
The oversampled train data might cause the model to overfit, because it now contains many class-0 observations that were all synthesized from only 2 real observations.
The minority class in the test data has only 1 observation out of 104 samples, so the classifier will achieve high accuracy just by always predicting the majority class. (Initially I planned to perform SMOTE on the test data too, but I read somewhere that oversampling techniques are only applied to train data.)
I am really stuck here and could not find any relevant information on this problem.
A brief summary of the acquired multi-omics dataset:
The raw lung cancer (LUSC) dataset was obtained from http://acgt.cs.tau.ac.il/multi_omic_benchmark/download.html. It consists of 3 omics data types plus 1 clinical dataset. The 3 omics data types are 3 different omics expressions (gene expression, DNA methylation & miRNA expression), while the clinical dataset contains the binary class label sample_type (along with other unimportant attributes) for the 3 omics data types.
The aim is to obtain a multi-omics dataset by combining the 3 omics data types.
To obtain the multi-omics data, the 3 omics data types were concatenated with the clinical data (with sample_type as the class label) based on the sampleID shared by all 4 datasets. The end product is a severely imbalanced dataset of 344 observations: 341 with the Primary Tumour label (has cancer, referred to as 1) and 3 with the Solid Tissue Normal label (no cancer, referred to as 0).
This is more of a statistics question. In my opinion, you should not try to estimate anything on these data: you do not know what sets the 0s apart. Even for a simple logistic regression, I'd recommend having at least 30-40 observations in the minority class (ideally more).
The simplest estimator based on your data would be to guess 1 every time. That alone yields 341/344 ≈ 99.1% accuracy; you can't realistically expect to beat that with any complex model.
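That always-guess-1 baseline can be made concrete with scikit-learn's DummyClassifier, shown here on stand-in data with the question's label counts:

import numpy as np
from sklearn.dummy import DummyClassifier

X = np.zeros((344, 1500))          # the features are irrelevant to this baseline
y = np.array([1] * 341 + [0] * 3)

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(baseline.score(X, y))        # 0.991... = 341/344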
I want to use k-means to discretize time series data into two values (0 or 1). My data is a matrix of time by genes (row = time, column = gene). Ex:
t\x    x1      x2      x3
1      0.122   0.324   0.723
2      0.543   0.573   0.329
3      0.901   0.445   0.343
4      0.612   0.353   0.435
5      0.192   0.233   0.023
My question: Should I use k clusters for all the data in the matrix, or k clusters per column (so I would have k × number_of_columns clusters in total)? Note that my genes are independent.
Either may work.
Discretising all attributes at once has the benefit of giving you only one symbol per time point, i.e. a univariate series.
But on the other hand, if the columns are independent, the quality may be better if you discretise them individually. Note that for one-dimensional data, if it is noisy, quantiles may work much better than k-means (which is sensitive to noise).
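Both options, sketched with scikit-learn on the example matrix (KBinsDiscretizer treats each column independently; a plain KMeans on the rows gives one symbol per time point; strategy="quantile" is the noise-robust alternative mentioned above):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import KBinsDiscretizer

# The example matrix from the question: rows = time, columns = genes
X = np.array([[0.122, 0.324, 0.723],
              [0.543, 0.573, 0.329],
              [0.901, 0.445, 0.343],
              [0.612, 0.353, 0.435],
              [0.192, 0.233, 0.023]])

# Option 1: discretise each column independently into 2 bins (1D k-means per gene)
km = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="kmeans")
X_binary = km.fit_transform(X)   # 0/1 matrix, one discretisation per gene

# Quantile-based alternative, often more robust on noisy 1D data
qt = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="quantile")
X_binary_q = qt.fit_transform(X)

# Option 2: cluster the whole rows at once -> one symbol per time point
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)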