Why does sklearn's LinearRegression model not accept one-dimensional data?

I'm trying to learn the basics of linear regression.
I tried to build the simplest possible model with the simplest possible data as a starting point.
For the data I had:
## Data (Apple stock prices)
apple = np.array([155, 160, 165])
days = np.array([1, 2, 3])
X would be the days and y would be the Apple stock price.
I tried to build the model with a one-liner:
model = LinearRegression().fit(X=days,y=apple)
Then I get an error saying that the model expects 2D data as input.
But why? Both X and y (here, the day numbers and the Apple stock prices) are one-dimensional. Why should they be converted into a 2D array?

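The model was built to accept 2D input for flexibility. scikit-learn estimators expect X as a 2D array of shape (n_samples, n_features), because a model usually has several features. If the API took 1D input, multi-feature data could not be passed at all, whereas a 2D API covers both cases: a single feature is simply a 2D array with one column. A bare 1D array is also ambiguous (n samples of one feature, or one sample with n features), so you have to state which one you mean by reshaping. Only X strictly needs this; y can stay 1D. So just reshape your data and the model will work. Try this:
import numpy as np
from sklearn.linear_model import LinearRegression
apple = np.array([155, 160, 165]).reshape(-1,1)
days = np.array([1, 2, 3]).reshape(-1,1)
model = LinearRegression().fit(X=days,y=apple)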

The input is an array of shape (n, m), where n is the number of samples and m is the number of x variables (features).
In your case m = 1, so you need an array of shape (3, 1). Your current input has shape (3,).
Try:
days = np.array([1, 2, 3]).reshape(-1,1)
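
To make the shape requirement concrete, here is a minimal sketch that fits the three data points from the question after reshaping only X. The variable names `X`, `slope`, `intercept` and `next_price` are just illustrative, and day 4 is a made-up query point that is not part of the original data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

days = np.array([1, 2, 3])           # shape (3,): 1D, rejected as X
apple = np.array([155, 160, 165])    # the target may stay 1D

# One column = one feature; -1 lets NumPy infer the number of rows (samples).
X = days.reshape(-1, 1)              # shape (3, 1)

model = LinearRegression().fit(X, apple)
slope = float(model.coef_[0])
intercept = float(model.intercept_)

# predict() also expects a 2D array, even for a single query point.
next_price = float(model.predict(np.array([[4]]))[0])
print(X.shape, slope, intercept, next_price)
```

Because the sample data lies exactly on a line, the fitted slope should come out at about 5 per day. Note that `days.reshape(1, -1)` would instead describe a single sample with three features, which is why the library asks you to reshape explicitly rather than guessing.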

Related

Representing an array as a feature in ML training

I have a set of features x1, x2, x3, x4, where x1, x2, x3 are floats and x4 is an array of floats.
To give an example, say that I am trying to predict the price of a house. I could use the size of the house as an array (e.g. length, width, and height) along with other features like the number of bedrooms, the age of the house, the number of bathrooms, etc.
This sounds simple, but I am struggling with how to represent it.
Here is a similar sample based on heart attack prediction: https://colab.research.google.com/drive/1CQX2d0vkjlZKjX6wbG4ga6tRcI-zOMNA
I tried to append an array feature as extra columns at the end with np.c_:
##################################-Check-########################
print("Before",X_s[:1])
X_s =np.c_[ X_s,np.random.rand(303,2)] # add a numpy array here as a feature
print("After",X_s[:1])
print("shape of X_s",X_s.shape)
print(X_s[:1])
dataset = tf.data.Dataset.from_tensor_slices((X_s, y_s))
But the problem is that the array is added as two extra columns at the end:
shape of X_s (303, 13)
shape of X_s (303, 15)
So if I have a feature array of, say, 330*300, this approach will add 300 columns to the end, which is not what I want.
I am aware of CNNs, and one option is to model the whole problem as a CNN: pad the other features into arrays as well, create an n-dimensional tensor, and use a CNN.
Is there something simpler and better than these approaches?

Can a deep learning model use different output lengths in the training and testing stages?

As far as I know, a deep learning model is trained on given inputs to produce an output, and the trained model is then used to predict the output for new inputs of the same length as those used in training; the predicted output also has the same length as in the training stage. My concern is a little different.
Suppose the input of a deep learning model is a vector y of length N and the output is a vector x of length M. Can the model be trained on the input vector y until some of the values of x are correct, and then be used to predict the remaining values of x? If so, what procedure should be followed?
For example, I have a random vector y of size 50 x 1 and the output vector x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]. Can the model be trained on y until the first four values of x are [0,1,2,3], and then be left to predict the remaining values of x? It is important to mention that the values of the output vector depend on each other, so predicting some of them can yield the others too.
I tried to follow the conventional approach, but found that it requires the same input/output sizes in the training and testing stages, whereas what I am looking for is a little different.

Creating an ML algorithm where the training data does not have the same number of columns in all records

So I have the following training data (no header, explanation below):
[1.3264,1.3264,1.3263,1.32632]
[2.32598,2.3256,2.3257,2.326,2.3256,2.3257,2.32566]
[10.3215,10.3215,10.3214,10.3214,10.3214,10.32124]
It does not have a header because all elements except the last one in each array are inputs, and the last one is the result/output.
Taking the first example: 1.3264, 1.3264, 1.3263 are the inputs I want to feed to the algorithm, and 1.32632 is the outcome/result.
All of these are historical values that should lead to pattern recognition.
I would like to give some test data to the algorithm and have it return an outcome/result based on the pattern it identified.
In all the ML and sklearn examples I have looked into, I have never seen one with a varying number of entries of the same type of data. They all seem to have the same number of columns and different types of inputs, whereas mine is always the same type of input.

You can try two different approaches:
Extract features from your variable-length data so that every record has a fixed number of features. After that you can use any algorithm from sklearn or other packages. Feature extraction is a highly domain-specific process that requires knowing what the data actually represents. For example, you could try features like these:
import numpy as np

def extract_features_one_row(row):
    row = np.asarray(row, dtype=float)
    # All values except the last are inputs; the last value is the target.
    arr, y = row[:-1], row[-1]
    features = [
        np.mean(arr),
        np.sum(arr),
        np.median(arr),
        np.std(arr),
        np.percentile(arr, 5),
        np.percentile(arr, 95),
        np.percentile(arr, 25),
        np.percentile(arr, 75),
        (arr[1:] > arr[:-1]).sum(),  # number of increasing pairs
        (arr > arr.mean()).sum(),    # number of elements > mean value
        # extract trends, number of modes, etc
    ]
    return features, y

data = [
    [1.3264, 1.3264, 1.3263, 1.32632],
    [2.32598, 2.3256, 2.3257, 2.326, 2.3256, 2.3257, 2.32566],
    [10.3215, 10.3215, 10.3214, 10.3214, 10.3214, 10.32124],
]
X, y = zip(*[extract_features_one_row(row) for row in data])
X = np.array(X)  # (3, 10)
print(X.shape, y)
So now every row of X has the same number of columns.
Use an ML algorithm that supports variable-length data: recurrent neural networks, transformers, or convolutional networks with padding.
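
For the second approach, the padding step could be sketched roughly as follows. This is only an illustration under my own assumptions: `pad_sequences` is a hypothetical helper written here (not a library function), zero is an arbitrary pad value, and the boolean mask is one common way to tell a sequence model which positions are real.

```python
import numpy as np

def pad_sequences(rows, pad_value=0.0):
    """Split each row into inputs and target, then right-pad the inputs."""
    inputs = [np.asarray(r[:-1], dtype=float) for r in rows]
    targets = np.array([r[-1] for r in rows], dtype=float)
    max_len = max(len(x) for x in inputs)

    X = np.full((len(rows), max_len), pad_value)       # padded input matrix
    mask = np.zeros((len(rows), max_len), dtype=bool)  # True where data is real
    for i, x in enumerate(inputs):
        X[i, :len(x)] = x
        mask[i, :len(x)] = True
    return X, mask, targets

data = [
    [1.3264, 1.3264, 1.3263, 1.32632],
    [2.32598, 2.3256, 2.3257, 2.326, 2.3256, 2.3257, 2.32566],
    [10.3215, 10.3215, 10.3214, 10.3214, 10.3214, 10.32124],
]
X, mask, targets = pad_sequences(data)
print(X.shape, mask.sum(axis=1), targets)
```

The padded matrix and mask can then be handed to a sequence model that knows how to ignore masked positions; how the mask is consumed depends on the framework you pick.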

How do I match samples with their predictions when doing inference with PyTorch's DistributedSampler?

I have trained a torch model for NLP tasks and would like to perform some inference using a multi GPU machine (in this case with two GPUs).
Inside the processing code, I use this
dataset = TensorDataset(encoded_dict['input_ids'], encoded_dict['attention_mask'])
sampler = DistributedSampler(
    dataset, num_replicas=args.nodes * args.gpus, rank=args.node_rank * args.gpus + gpu_number, shuffle=False
)
dataloader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
For those familiar with NLP, encoded_dict is the output from the tokenizer.batch_encode_plus function where the tokenizer is an instance of transformers.BertTokenizer.
The issue I’m having is that when I call the code through the torch.multiprocessing.spawn function, each GPU is doing predictions (i.e. inference) on a subset of the full dataset, and saving the predictions separately; for example, if I have a dataset with 1000 samples to predict, each GPU is predicting 500 of them. As a result, I have no way of knowing which samples out of the 1000 were predicted by which GPU, as their order is not preserved, therefore the model predictions are meaningless as I cannot trace each of them back to their input sample.
I have tried saving the dataloader instance (as a pickle) together with the predictions and then extracting the input_ids via dataloader.dataset.tensors; however, this requires a tokenizer decoding step which I would rather avoid, as the tokenizer will have slightly changed the text (for example, double whitespaces will have been removed, words with dashes will have been split, and so on).
What is the cleanest way to save the input text samples together with their predictions when doing inference in distributed mode, or alternatively to keep track of which prediction refers to which sample?

As I understand it, basically your dataset returns for an index idx [data,label] during training and [data] during inference. The issue with this is that the idx is not preserved by the dataloader object, so there is no way to obtain the idx values for the minibatch after the fact.
One way to handle this issue is to define a very simple custom dataset object that returns [data,id] instead of only data during inference. Probably the easiest way to do this is to make the dataset return a dictionary with the keys id and data. The dictionary return type is convenient because PyTorch collates this type (converts it into batches) automatically; otherwise you'd have to define a custom collate_fn and pass it to the dataloader object, which is not very hard but is an extra step.
In any case, here is how I would define a new dataset object, which should be almost a one-to-one substitute for your current dataset (I believe):
import torch

class TensorDictDataset(torch.utils.data.Dataset):
    def __init__(self, ids, attention_mask):
        self.ids = ids
        self.mask = attention_mask

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        # Return a dict so the default collate function batches each key.
        datum = {
            "mask": self.mask[idx],
            "id": self.ids[idx],
        }
        return datum
The only change you'll then have to make is that, rather than returning only mask, your dataset will now return a dict {"mask": mask, "id": id}, so you'll have to unpack that accordingly.

Thanks for your answer. I have done further debugging, found another solution, and wanted to post it.
Your solution is quite elegant (there was one minor misunderstanding, in that the predictions contain only the predicted labels and not the data, contrary to what you understood, but this doesn't affect your answer anyway). Mask in NLP is also something else, and instead of having the mask tokens together with the predictions I would like to have the untokenized text string. This is not so easy to achieve because the splitting of the data across GPUs happens AFTER the tokenization; however, I believe that with a slight adaptation your answer could work.
However, I’ve done some further debugging and I’ve noticed that the data are not actually randomly split across GPUs as I thought. If I set shuffle=False in the DistributedSampler then this happens:
in the case of two GPUs, GPU 0 and GPU 1, all the samples with even index (starting from 0) will be passed to GPU 0, and all those with odd index will be passed to GPU 1.
So for example, if you have 10 samples whose indices are [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], then samples 0, 2, 4, 6, 8 will go to GPU 0 and samples 1, 3, 5, 7, 9 will go to GPU 1. This allows me to map the predictions back to the original text samples just by using this ordering. I'm not sure this is the best solution, as keeping the original text string next to its prediction would be ideal, but at least it works.
N.B. Special case: the two GPUs must be passed the SAME number of inputs, so if the number of inputs is odd, for example 9 samples with indices [0, 1, 2, 3, 4, 5, 6, 7, 8], then GPU 0 will be passed samples 0, 2, 4, 6, 8 and GPU 1 will be passed samples 1, 3, 5, 7, 0 (in this exact order). In other words, the first sample, with index 0, is repeated at the very end of the dataset to make sure each GPU has the same number of samples. In that case we can write some code that drops the last prediction from GPU 1, as it is redundant.
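
The ordering described above can be turned into a small bookkeeping helper. The following is a pure-Python sketch that only mirrors the observed behaviour with shuffle=False (interleaved indices, wrapping around to pad); it does not call PyTorch, and `shard_indices` and `merge_predictions` are hypothetical names introduced here for illustration. The "predictions" are dummy strings.

```python
import math

def shard_indices(num_samples, num_replicas, rank):
    """Sample indices one replica sees, assuming shuffle=False with padding."""
    per_replica = math.ceil(num_samples / num_replicas)
    total = per_replica * num_replicas
    # Wrap around to the start so every replica gets the same count.
    padded = [i % num_samples for i in range(total)]
    return padded[rank:total:num_replicas]

def merge_predictions(preds_per_rank, num_samples):
    """Put per-replica predictions back into the original sample order."""
    num_replicas = len(preds_per_rank)
    merged = [None] * num_samples
    for rank, preds in enumerate(preds_per_rank):
        indices = shard_indices(num_samples, num_replicas, rank)
        for idx, pred in zip(indices, preds):
            if merged[idx] is None:  # ignore duplicates created by padding
                merged[idx] = pred
    return merged

texts = [f"sample {i}" for i in range(9)]
# Dummy per-GPU outputs: each prediction just echoes the index it came from.
preds_per_rank = [[f"pred {i}" for i in shard_indices(9, 2, r)] for r in range(2)]
merged = merge_predictions(preds_per_rank, len(texts))
for text, pred in zip(texts, merged):
    print(text, "->", pred)
```

With this kind of mapping the original text strings never need to pass through the tokenizer again: each GPU saves its predictions in order, and the merge step pairs them with the untouched input list afterwards.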

Reshaping numpy array as an input to CNN

I have seen multiple posts on reshaping numpy arrays as inputs to CNNs; however, I haven't been able to successfully reshape my array as an input to my CNN!
I have a CNN that merges with another model further downstream. The input shape of the CNN is (4,4,1). The real input is bigger, but I have purposefully made it smaller to establish the pipeline and get it running before I put in the proper size.
The format will be the same, however: a 1-channel n x n np.array. I am getting errors when reshaping, which I will mention after the code. The input dimensions are passed to the model as follows:
cnn_branch_input = tf.keras.layers.Input(shape=(4,4,1))
cnn_branch_two = tf.keras.layers.Conv2D(etc....)(cnn_branch_input)
The np array (originally a pandas DataFrame) has the following shape and is reshaped as follows:
np.array(array).shape
(4,4)
input = np.array(array).reshape(-1,1,4,4)
input.shape
(1,1,4,4)
The input to my merged model is as follows:
model.fit([cnn_input,gnn_input, gnn_node_feat], y,
#sample_weight=train_mask,
#validation_data=validation_data,
batch_size=4,
shuffle=False)
This causes an error, which makes sense to me:
ValueError: Data cardinality is ambiguous:
x sizes: 1, 4, 4 -- Please provide data which shares the same first dimension.
So now when reshaping to intentionally have a 4x4 plus 1 channel shape as follows:
input = np.array(array).reshape(-1,4,4,1)
input.shape
(1,4,4,1)
Two things: the array is reshaped into 4, 1x1 arrays, so it seems the structure of the original array is lost, and I get the same error!
Notice that with both reshape calls the shape is either (1,4,4,1) or (1,1,4,4): the -1 entry simply becomes a 1, making the CNN think the first dimension has size 1. I thought the -1 would let me add the sample dimension as 'any number of samples'.
When I simply pass the original (4,4) array, I get an error saying that the CNN received a 2-dimensional array while a 4-dimensional array is required.
I'm really confused as to how to correctly reshape this array! I would appreciate any help!
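
The reshape behaviour described above can be checked in isolation with NumPy. This is a small sketch using a stand-in 4x4 array (the real data is not shown in the question); it suggests that -1 is inferred from the element count, so one 4x4 array can only ever become one sample, and that the values are not scrambled by the reshape. The four-sample `stacked` array is a made-up illustration of what a batch whose first dimension is 4 would look like.

```python
import numpy as np

array = np.arange(16, dtype=float).reshape(4, 4)   # stand-in for the 4x4 data

# -1 is computed as 16 / (4 * 4 * 1) = 1, i.e. a batch holding one sample.
batch = array.reshape(-1, 4, 4, 1)
print(batch.shape)

# The 4x4 layout survives; it only prints as nested one-element lists.
print(np.array_equal(batch[0, :, :, 0], array))

# Equivalent way to add the sample and channel axes explicitly.
same = array[np.newaxis, ..., np.newaxis]

# A batch of four samples needs four separate 4x4 arrays stacked together.
stacked = np.stack([array + k for k in range(4)])[..., np.newaxis]
print(stacked.shape)
```

If the other two model inputs each hold four samples, that would be consistent with the reported "x sizes: 1, 4, 4", since all inputs passed to fit must share the same first dimension; this is an inference from the error message, not something the question confirms.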
