Is a linear stack of layers equal to multilinear regression? - machine-learning

For an application I'm building, I'm using tf.keras.models.Sequential. I know that there are linear and multilinear regression models in machine learning. The documentation of Sequential says that the model is a linear stack of layers. Is that the same as multilinear regression? The only explanation of "linear stack of layers" I could find was this question on Stack Overflow.
import numpy as np
import tensorflow as tf

def trainModel(bow, unitlabels, units):
    # Bag-of-words features and integer class labels
    x_train = np.array(bow)
    print("X_train: ", x_train)
    y_train = np.array(unitlabels)
    print("Y_train: ", y_train)
    # Layers are applied one after another, in the order listed
    model = tf.keras.models.Sequential([
        tf.keras.layers.Dense(256, activation=tf.nn.relu),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(len(units), activation=tf.nn.softmax)])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=50)
    return model

You are confusing two very important things here: one is the model, the other is the architecture (the "model of the model").
The architecture is indeed a linear one, because it follows a direct line from beginning to end.
The model itself is not linear: the ReLU activation is there to make sure that the function it learns is not linear.
The linear stack is neither a linear regression nor a multilinear one. "Linear" here is not an ML term but plain English for "straightforward".
Tell me if I misunderstood the question in any regard.

In the documentation of Sequential is said that the model is a linear stack of layers. Is that equal to multilinear regression?
Assuming you mean a regression with multiple variables, no.
tf.keras.models.Sequential() defines how the layers in your model are connected: each layer feeds only the next one, in order. In this case the layers are Dense, so they are also fully connected (every output from the first layer is an input to every neuron in the next layer). The term linear is used to mean that there is no funny business going on, e.g. recurrence (connections can go backwards) or residual connections (connections can skip layers).
For context, a regression with multiple variables is comparable to a single layered network with a single neuron with multiple inputs and no transfer function.
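To make the comparison concrete, here is a minimal pure-Python sketch (no Keras involved). The helper names dense, relu, linear_regression and tiny_sequential, and all the weights, are made up for illustration. It shows a single neuron without activation behaving as a multiple linear regression, while a two-layer stack with a ReLU in between computes something that is not linear.

```python
def dense(x, weights, bias, activation=None):
    """One fully connected layer: every output is a weighted sum of all inputs."""
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [activation(v) for v in out] if activation else out

def relu(v):
    return max(0.0, v)

def linear_regression(x):
    # One layer, one neuron, no activation: y = 2*x0 - 1*x1 + 0.5
    return dense(x, weights=[[2.0, -1.0]], bias=[0.5])[0]

def tiny_sequential(x):
    # A "linear stack" of two layers; the ReLU makes the overall function non-linear.
    hidden = dense(x, weights=[[1.0, 0.0], [-1.0, 0.0]], bias=[0.0, 0.0], activation=relu)
    return dense(hidden, weights=[[1.0, 1.0]], bias=[0.0])[0]

print(linear_regression([1.0, 2.0]))
print(tiny_sequential([3.0, 0.0]), tiny_sequential([-3.0, 0.0]))
```

With these hand-picked weights the stacked version returns the absolute value of the first input, which no single linear regression can represent.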

Related

Can logistic and linear regression produce a prediction on a scale?

I currently have a dataset of drawings, each drawing being represented by some features. Each feature (independent variable) is a continuous number. None of the drawings has a label yet, which is why I am planning to run a sort of questionnaire with people. However, before I can set up such a questionnaire correctly, I need an idea of what kind of labels to use for my training data.
My first thought was to let people rate the drawings on a scale, for example from 1 to 5, with 1 being bad, 3 being average and 5 being good. Alternatively, I could reduce the question to a simple good-or-bad question. The latter would mean I lose some valuable information, but the dependent variable could then be considered 'binary'.
Using the training data I then composed, I would need a machine learning algorithm (model) which, given a drawing, predicts whether the drawing is good or not. Ideally, I would have some way of tuning the strictness of this prediction. For example, instead of simply predicting 'good' or 'bad', the model could predict the likelihood of a painting being good on a scale of 0 to 1. I could then say "Well, let's say all paintings which are 70% likely to be good are considered good". Another example would be that the model predicts the goodness using the same categorical values the people used to rate the drawing initially, so it would predict the drawing being a 1, 2, 3, 4 or 5. Similar to my first example, I could then say "Well, all paintings which are rated at least a 4 are considered good paintings" and tune this threshold to my liking.
After doing some research, I came up with logistic and linear regression as good candidates. However, which of the two would be best for my scenario? Equally important, how would I need to format my labels? Just simple 0's and 1's, or a scale?
You could use a one-vs-all representation if you wanted to use a multi-class categorical classification:
Essentially, you train one classifier for every category you have (you have 5 categories, so you have 5 classifiers), and each classifier is trained only to predict whether or not a drawing belongs to its specific class.
There are alternative ways to make multi-class logistic regression work that only require training a single model, such as using categorical cross-entropy, but given that your data is ordinal, a linear regression used as a regression model is likely a better fit. You'd predict a value between 1 and 5 and then just round to the nearest integer. This way you aren't penalizing close guesses as much as far-away guesses.
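The rounding step can be sketched as follows, assuming the 1-to-5 scale from the question. The helper names to_rating and is_good and the default threshold of 4 are hypothetical. The sketch rounds halves upwards with floor(x + 0.5), because Python's built-in round uses banker's rounding.

```python
import math

def to_rating(prediction, lowest=1, highest=5):
    """Round a continuous prediction to the nearest rating and clip it to the scale."""
    nearest = math.floor(prediction + 0.5)
    return min(highest, max(lowest, nearest))

def is_good(prediction, threshold=4):
    """Tunable strictness: a drawing counts as good from `threshold` upwards."""
    return to_rating(prediction) >= threshold

for raw in (3.6, 7.2, -0.3, 3.4):
    print(raw, to_rating(raw), is_good(raw))
```

Raising or lowering threshold gives the adjustable strictness the question asks for, without retraining the regression model.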
What keeps you from using a logistic regression model? For lack of a better dataset I used the standard diabetes data. The target variable is an integer between 25 and 346. I normalised it to [0, 1] so that I can use the sigmoid as the activation function. For the loss I decided to use the mean squared error:
import tensorflow as tf
from sklearn import datasets

diabetes = datasets.load_diabetes()
x_train = diabetes.data
# Rescale the targets to [0, 1] so that they match the output range of the sigmoid
y_min, y_max = diabetes.target.min(), diabetes.target.max()
y_train = (diabetes.target - y_min) / (y_max - y_min)

# A single neuron with a sigmoid activation: logistic regression trained as a regressor
inputs = tf.keras.Input(shape=(x_train.shape[1],))
outputs = tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)(inputs)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='mse',
              metrics=['mae'])
# Validation on the training data itself, for illustration only
history = model.fit(x_train, y_train,
                    batch_size=64,
                    epochs=300,
                    validation_data=(x_train, y_train))
You could also use a linear regression model; you only need to replace the activation function with a linear one. However, I think the squashing character of the sigmoid is an advantage, since it ensures that no predicted rating falls below 0 or above 1.
A last alternative would be to train on pair-wise preferences. The idea is to show a person two drawings and ask which one they like more, then build a binary model, e.g. logistic regression. This approach appears preferable to me, as it is easier for a person to answer.
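One common way to turn such pair-wise answers into training data for a binary model is to use the difference of the two feature vectors as the input. The answer does not spell this out, so treat the following as an assumption; the function name pairwise_examples and the toy features are hypothetical.

```python
def pairwise_examples(features, comparisons):
    """Turn (preferred_id, other_id) judgements into difference vectors with binary labels."""
    rows, labels = [], []
    for preferred, other in comparisons:
        diff = [a - b for a, b in zip(features[preferred], features[other])]
        # The preferred drawing minus the other one is a positive example ...
        rows.append(diff)
        labels.append(1)
        # ... and the mirrored pair is a negative example, which keeps the classes balanced.
        rows.append([-d for d in diff])
        labels.append(0)
    return rows, labels

features = {"a": [1.0, 2.0], "b": [0.5, 4.0]}
rows, labels = pairwise_examples(features, [("a", "b")])
print(rows, labels)
```

The resulting rows and labels could then be passed to any binary classifier, such as a logistic regression.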

Classification Using DBSCAN w/ Test-Train Split

The question proposed reads as follows: Use scikit-learn to split the data into a training and test set. Classify the data as either cat or dog using DBSCAN.
I am trying to figure out how to use DBSCAN to fit a model on training data and then predict the labels of a test set. I am well aware that DBSCAN is meant for clustering and not prediction. I have also looked at "Use sklearn DBSCAN model to classify new entries" as well as numerous other threads. DBSCAN only comes with fit and fit_predict functions, which don't seem particularly useful when trying to fit the model on the training data and then test it on the test data.
Is the question worded poorly or am I missing something? I have looked at the scikit-learn documentation as well as looked for examples, but have not had any luck.
from sklearn.cluster import DBSCAN
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Split the samples into two subsets, use one for training and the other for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Instantiate the learning model
dbscan = DBSCAN()
# Fit the model
dbscan.fit(X_train, y_train)
# Predict the response (this is the missing step: DBSCAN has no predict method,
# so dbscan_pred below is never defined)
# Confusion matrix and quantitative metrics
print("The confusion matrix is: " + str(confusion_matrix(y_test, dbscan_pred)))
print("The accuracy score is: " + str(accuracy_score(y_test, dbscan_pred)))
Whoever gave you that assignment has no clue...
DBSCAN will never predict "cat" or "dog". It just can't.
Because it is an unsupervised algorithm, it doesn't use training labels. y_train is ignored (see the parameter documentation), and it is stupid that sklearn will allow you to pass it at all! It will output sets of points that are clusters. Many tools will enumerate these sets as 1, 2, ... But it won't name a set "dogs".
Furthermore it can't predict on new data either - which you need for predicting on "test" data. So it can't work with a train-test split, but that does not really matter because it does not use labels anyway.
The accepted answer in the question you linked is a pretty good one for you, too: you want to perform classification, not discover structure (which is what clustering does).
DBSCAN, as implemented in scikit-learn, is a transductive algorithm, meaning you can't do predictions on new data. There's an old discussion from 2012 on the scikit-learn repository about this.
Suffice to say, when you're using a clustering algorithm, the concept of train/test splits is less defined. Cross-validation usually involves a different metric; for example, in K-means, the cross-validation is often over the hyperparameter k, rather than mutually exclusive subsets of the data, and the metric that is optimized is the intra-vs-inter cluster variance, rather than F1 accuracy.
Bottom line: trying to perform classification using a clustering technique is effectively square-peg-round-hole. You can jam it through if you really want to, but it'd be considerably easier to just use an off-the-shelf classifier.
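As a rough illustration of that bottom line, the sketch below contrasts DBSCAN with an off-the-shelf classifier on a train/test split. The synthetic blobs standing in for cat and dog features, the eps and min_samples values, and the choice of KNeighborsClassifier are all assumptions made for the example, not part of the original assignment.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated synthetic blobs stand in for the "cat" and "dog" features.
X, y = make_blobs(n_samples=200, centers=[[0, 0], [10, 10]],
                  cluster_std=1.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# DBSCAN only assigns cluster ids to the points it was fitted on; labels are not used.
cluster_ids = DBSCAN(eps=1.5, min_samples=5).fit_predict(X_train)
has_predict = hasattr(DBSCAN(), "predict")

# A supervised classifier learns from y_train and can label unseen points.
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

print("DBSCAN has predict():", has_predict)
print("Classifier test accuracy:", accuracy)
```

The cluster ids returned by DBSCAN are arbitrary integers (with -1 for noise), so even on the training set they would still have to be mapped to "cat" and "dog" by hand.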

difference between LinearRegression and svm.SVR(kernel="linear")

First, there are questions on this forum very similar to this one, but trust me, none of them matches, so please don't mark this as a duplicate.
I have encountered two methods of linear regression using scikit-learn, and I am failing to understand the difference between the two, especially since the first snippet calls train_test_split() while the other one calls fit directly.
I am studying with multiple resources and this single issue is very confusing to me.
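The first, which uses SVR:
# Features: every column except the label (older pandas: df.drop(['label'], 1))
X = np.array(df.drop(columns=['label']))
X = preprocessing.scale(X)
y = np.array(df['label'])
# Formerly cross_validation.train_test_split (that module was removed in scikit-learn 0.20)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)
clf = svm.SVR(kernel='linear')
clf.fit(X_train, y_train)
# For regressors, score returns the R^2 coefficient of determination
confidence = clf.score(X_test, y_test)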
And the second is this one:
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
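So my main question is the difference between using svm.SVR(kernel="linear") and using LinearRegression().
cross_validation.train_test_split: splits arrays or matrices into random train and test subsets (in scikit-learn 0.20 and later it lives in sklearn.model_selection).
In the second snippet, the split is not random: the last 20 samples are held out for testing.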
svm.SVR: Support Vector Regression (SVR) uses the same principles as the SVM for classification, with only a few minor differences. Because the output is a real number, which has infinitely many possible values, exact prediction is not a sensible goal; instead, a margin of tolerance (epsilon) is set, and errors smaller than epsilon are ignored. The optimisation problem is also somewhat more complicated than for classification. However, the main idea is always the same: to minimize error by finding the hyperplane that maximizes the margin, keeping in mind that part of the error is tolerated.
Linear Regression: In statistics, linear regression is a linear approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression.
Reference:
https://cs.adelaide.edu.au/~chhshen/teaching/ML_SVR.pdf
This is what I found:
Intuitively, like all regressors, it tries to fit a line to the data by minimising a cost function. However, the interesting part about SVR is that you can deploy a non-linear kernel. In this case you end up performing non-linear regression, i.e. fitting a curve rather than a line.
This process is based on the kernel trick and the representation of the solution/model in the dual rather than in the primal. That is, the model is represented as combinations of the training points rather than a function of the features and some weights. At the same time the basic algorithm remains the same: the only real change in the process of going non-linear is the kernel function, which changes from a simple inner product to some non-linear function.
So SVR can handle non-linear fitting problems as well, while LinearRegression() only fits a linear function, i.e. a straight line or hyperplane (both can use any number of features).
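The margin of tolerance (epsilon) mentioned above is the other practical difference, and it applies even with a linear kernel. A minimal sketch of the two per-sample losses, with hypothetical function names (the default epsilon=0.1 mirrors scikit-learn's SVR default):

```python
def squared_loss(y_true, y_pred):
    """Per-sample loss minimised by ordinary least-squares linear regression."""
    return (y_true - y_pred) ** 2

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss used by SVR: errors inside the epsilon tube cost nothing."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

# A small error is penalised by the squared loss but ignored by SVR.
print(squared_loss(1.0, 1.05), epsilon_insensitive_loss(1.0, 1.05))
# A large error grows quadratically in one case and linearly in the other.
print(squared_loss(1.0, 3.0), epsilon_insensitive_loss(1.0, 3.0))
```

Because large errors are penalised only linearly, SVR tends to be less sensitive to outliers than least squares.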
The main difference between these methods is in their mathematical background.
We have samples X and want to predict the target Y.
Linear regression just minimizes the least-squares error.
For one object the prediction is y = x^T * w, where w is the vector of model weights.
Loss(w) = Sum_{n=1..N} (x_n^T * w - y_n)^2 --> min over w
As this is a convex function, the global minimum can always be found.
After taking the derivative of the loss with respect to w, setting it to zero and rewriting the sums in matrix form, you get:
w = (X^T * X)^(-1) * (X^T * Y)
So w can be computed directly from the formula above (scikit-learn solves the same least-squares problem, although with a numerically stabler least-squares solver rather than an explicit matrix inverse).
X is the matrix of training samples that you pass to the fit method.
In predict, these weights are simply multiplied by X_test.
So the solution is explicit, and usually faster to obtain than with iterative methods such as SVM (except for very large datasets, where solving the linear system becomes expensive).
In addition: Lasso and Ridge solve the same task but additionally have regularization on the weights in their losses.
For Ridge you can also calculate the weights explicitly; Lasso has no closed-form solution in general and is solved iteratively.
svm.SVR with a linear kernel does almost the same thing, except that its optimization task also maximizes the margin (I apologize, but it is difficult to write down because I didn't find out how to write in TeX format here).
That problem has no closed-form solution, so it is solved with an iterative numerical solver.
scikit-learn's SVR even has a max_iter parameter, which limits the number of solver iterations.
To sum up: linear regression has an explicit solution, while SVM finds its solution numerically, by iterating until convergence.
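The closed-form solution can be checked with a few lines of NumPy. This is only a sketch on made-up, noiseless data without an intercept (matching y = x^T * w above); the variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 training samples, 3 features
true_w = np.array([2.0, -1.0, 0.5])   # weights used to generate the targets
Y = X @ true_w                        # noiseless targets

# Normal equations: solve (X^T X) w = X^T Y rather than forming the inverse explicitly.
w_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# A general least-squares solver reaches the same weights.
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(w_normal)
print(w_lstsq)
```

Both routes recover the generating weights up to floating-point error; prediction is then simply X_test @ w.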

Transfer Learning and linear classifier

In cs231n handout here, it says
New dataset is small and similar to original dataset. Since the data
is small, it is not a good idea to fine-tune the ConvNet due to
overfitting concerns... Hence, the best idea might be to train a
linear classifier on the CNN codes.
I'm not sure what linear classifier means. Does the linear classifier refer to the last fully connected layer? (For example, in AlexNet, there are three fully connected layers. Is the linear classifier the last fully connected layer?)
When people say "linear classifier" they often mean a linear SVM (support vector machine), although a softmax (logistic regression) classifier is linear as well. A linear classifier learns a weight vector w and a threshold (aka "bias") b such that for each example x the sign of
<w, x> + b
is positive for the "positive" class and negative for the "negative" class.
The last (usually fully connected) layer of a neural-net can be considered as a form of a linear classifier.
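The decision rule above fits in a few lines. In the transfer-learning setting, x would be the CNN code, i.e. the feature vector the pre-trained network produces for an image. The function name and the numbers below are made up for illustration.

```python
def linear_classify(w, x, b):
    """Return +1 or -1 depending on the sign of <w, x> + b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1

w, b = [1.0, -1.0], -1.0                    # hypothetical learned weights and bias
print(linear_classify(w, [3.0, 1.0], b))    # score = 3 - 1 - 1 = 1
print(linear_classify(w, [0.0, 2.0], b))    # score = 0 - 2 - 1 = -3
```

Training a linear SVM or a logistic regression differs only in how w and b are found; the prediction rule is the same.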

Fine Tuning of GoogLeNet Model

I trained a GoogLeNet model from scratch, but it didn't give me promising results.
As an alternative, I would like to fine-tune a GoogLeNet model on my dataset. Does anyone know which steps I should follow?
Assuming you are trying to do image classification, these should be the steps for fine-tuning a model:
1. Classification layer
The original classification layer "loss3/classifier" outputs predictions for 1000 classes (its num_output is set to 1000). You'll need to replace it with a new layer with the appropriate num_output. Replacing the classification layer:
Change layer's name (so that when you read the original weights from caffemodel file there will be no conflict with the weights of this layer).
Change num_output to the right number of output classes you are trying to predict.
Note that you need to change ALL classification layers. Usually there is only one, but GoogLeNet happens to have three: "loss1/classifier", "loss2/classifier" and "loss3/classifier".
2. Data
You need to make a new training dataset with the new labels you want to fine tune to. See, for example, this post on how to make an lmdb dataset.
3. How extensive a fine-tuning do you want?
When fine-tuning a model, you can train ALL of the model's weights, or choose to fix some weights (usually the filters of the lower layers, closer to the input) and train only the weights of the top-most layers. This choice is up to you and it usually depends on the amount of training data available (the more examples you have, the more weights you can afford to fine-tune).
Each layer that holds trainable parameters has param { lr_mult: XX }. This coefficient determines how susceptible these weights are to SGD updates. Setting param { lr_mult: 0 } means you FIX the weights of this layer, and they will not be changed during the training process.
Edit your train_val.prototxt accordingly.
4. Run caffe
Run caffe train, but supply it with the caffemodel weights as initial weights:
~$ $CAFFE_ROOT/build/tools/caffe train -solver /path/to/solver.prototxt -weights /path/to/orig_googlenet_weights.caffemodel
Fine-tuning is a very useful trick for reaching promising accuracy compared with earlier hand-crafted features. @Shai already posted a good tutorial on fine-tuning GoogLeNet with Caffe, so I just want to give some recommendations and tricks for fine-tuning in the general case.
Most of the time, we face a classification task in which the new dataset (e.g. the Oxford 102 flower dataset or Cat&Dog) falls into one of the following four common situations (CS231n):
New dataset is small and similar to original dataset.
New dataset is small but different from the original dataset (the most common case).
New dataset is large and similar to original dataset.
New dataset is large but different from the original dataset.
In practice, most of the time we do not have enough data to train the network from scratch, but we may have enough to fine-tune a pre-trained model. Whichever of the cases above applies, the one thing we must care about is: do we have enough data to train the CNN?
If yes, we can train the CNN from scratch. However, in practice it is still beneficial to initialize the weights from a pre-trained model.
If no, we need to check whether the data is very different from the original dataset. If it is very similar, we can just fine-tune the fully connected layers or train an SVM on top of the CNN features. However, if it is very different from the original dataset, we may need to fine-tune the convolutional layers as well to improve generalization.
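The decision procedure described above can be written down as a small helper. The function name and the returned strings are only a paraphrase of this answer, not an established API.

```python
def transfer_strategy(enough_data, similar_to_original):
    """Map the two questions from the text to a suggested training strategy."""
    if enough_data:
        # Enough data to train the CNN; pre-trained initialisation still helps.
        return "train the whole network, initialised from pre-trained weights"
    if similar_to_original:
        # Little data, similar domain: keep the convolutional features fixed.
        return "train only the fully connected layers or an SVM on the CNN features"
    # Little data, different domain: the convolutional features must adapt as well.
    return "fine-tune the convolutional layers as well"

for enough in (True, False):
    for similar in (True, False):
        print(enough, similar, "->", transfer_strategy(enough, similar))
```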
