Is setting LogisticRegression penalty to l1 or l2 equivalent to using Lasso or Ridge? - machine-learning

from sklearn.linear_model import LogisticRegression
I want to know: is using LogisticRegression(penalty='l1') equivalent to using Lasso()?
Or is LogisticRegression(penalty='l2') equivalent to using Ridge()?
I know that Lasso defaults to alpha=1.0 and LogisticRegression() defaults to C=1.0, but what is the difference between them?
In addition, when the problem is binary classification, which of them should I use? What about multi-class classification? Regression?
PS: I'm not very good at English, sorry.
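For reference, here is a minimal sketch contrasting the two pairs of estimators: Lasso/Ridge minimize a squared-error regression loss, LogisticRegression minimizes a logistic classification loss, and C is the inverse of the regularization strength (larger C means weaker regularization).
from sklearn.linear_model import LogisticRegression, Lasso, Ridge
# Classification with an L1 or L2 penalty; smaller C = stronger regularization
log_l1 = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')
log_l2 = LogisticRegression(penalty='l2', C=1.0)
# Regression with an L1 (Lasso) or L2 (Ridge) penalty; larger alpha = stronger regularization
lasso = Lasso(alpha=1.0)
ridge = Ridge(alpha=1.0)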

Related

How can I import Elastic-Net, Lasso and Ridge regression in Pyspark?

Could you please tell me how I can use Elastic-Net, Lasso and Ridge regression in Pyspark? I chose these 4 algorithms (Linear, Elastic-Net, Lasso and Ridge regression) according to a machine learning cheat sheet. However, I don't know how to import Elastic-Net, Lasso and Ridge regression in Pyspark and can't find the right answers by googling. I only know how to use Linear Regression in Pyspark.
Have a look at https://spark.apache.org/docs/1.5.2/ml-linear-methods.html
You can use something like:
from pyspark.ml.regression import LinearRegression
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils
# Load training data
training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
# Fit the model
lrModel = lr.fit(training)
# Print the weights and intercept for linear regression
print("Weights: " + str(lrModel.weights))
print("Intercept: " + str(lrModel.intercept))
If you read into the setup, you'll find that:
By setting α properly, elastic net contains both L1 and L2 regularization as special cases. For example, if a linear regression model is trained with the elastic net parameter α set to 1, it is equivalent to a Lasso model. On the other hand, if α is set to 0, the trained model reduces to a ridge regression model.
Where:
elasticNetParam corresponds to α and regParam corresponds to λ.
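Under that parameterization, here is a minimal sketch of Lasso- and Ridge-style configurations (regParam=0.3 is just the placeholder value from the example above):
from pyspark.ml.regression import LinearRegression
# elasticNetParam=1.0 -> pure L1 penalty (Lasso-like)
lasso_like = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=1.0)
# elasticNetParam=0.0 -> pure L2 penalty (Ridge-like)
ridge_like = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.0)
lassoModel = lasso_like.fit(training)  # 'training' as loaded above
ridgeModel = ridge_like.fit(training)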

How to run PCA with dask_ml? I am getting the error "This function (tsqr) supports QR decomposition in the case of tall-and-skinny matrices"

I want to perform dimensionality reduction on data with around 3000 rows and 6000 columns. Here the number of observations (n_samples) < number of features (n_columns). I am not able to achieve this using dask-ml, whereas the same is possible with scikit-learn. What modifications do I need to make to my existing code?
#### dask_ml
from dask_ml.decomposition import PCA
from dask_ml import preprocessing
import dask.array as da
import numpy as np
train = np.random.rand(3000,6000)
train = da.from_array(train,chunks=(100,100))
complete_pca = PCA().fit(train)
#### scikit learn
from sklearn.decomposition import PCA
from sklearn import preprocessing
import numpy as np
train = np.random.rand(3000,6000)
complete_pca = PCA().fit(train)
The PCA algorithm in Dask-ML is only designed for tall-and-skinny matrices. You could try using the raw SVD algorithms in dask.array. Also, with a 3000x6000 matrix you can probably just use a single machine.
Adding in something like Dask-ML for a problem of this size might be adding more complexity than you need. If Scikit-Learn works for you then I would stick with that.
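For example, a rough sketch of the dask.array route using svd_compressed (an approximate, randomized truncated SVD that does not require a tall-and-skinny matrix; k=10 is an arbitrary number of components):
import dask.array as da
import numpy as np
train = da.from_array(np.random.rand(3000, 6000), chunks=(1000, 1000))
centered = train - train.mean(axis=0)        # center the columns, as PCA does
u, s, v = da.linalg.svd_compressed(centered, k=10)
components = v.compute()                     # principal directions, shape (10, 6000)
projected = (centered @ v.T).compute()       # data projected onto 10 components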

Accuracy metric isn't working with linear regression

Kindly help here:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
X = [[1.1],[1.3],[1.5],[2],[2.2],[2.9],[3],[3.2],[3.2],[3.7],[3.9],[4],[4],[4.1],[4.5],[4.9],[5.1],[5.3],[5.9],[6],[6.8],[7.1],[7.9],[8.2],[8.7],[9],[9.5],[9.6],[10.3],[10.5]]
y = [39343,46205,37731,43525,39891,56642,60150,54445,64445,57189,63218,55794,56957,57081,61111,67938,66029,83088,81363,93940,91738,98273,101302,113812,109431,105582,116969,112635,122391,121872]
#implement the dataset for train & test
from sklearn.cross_validation import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 1/3,random_state=0)
#implement our classifier based on Simple Linear Regression
from sklearn.linear_model import LinearRegression
SimpleLinearRegression = LinearRegression()
SimpleLinearRegression.fit(X_train,y_train)
y_predict= SimpleLinearRegression.predict(X_test)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,y_predict))
I'm sure I'm missing something here. Is there some other way to calculate an accuracy score for regression? Thanks in advance :)
Accuracy as a metric is applicable to a classification problem, as it is defined as the fraction of labels that is predicted correctly. In your case you are doing regression (LinearRegression), i.e. your target variable is continuous. So either you picked the wrong model by mistake, or accuracy is the wrong metric for your problem.
You can use mean absolute error and mean squared error.
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np
MAE = mean_absolute_error(y_test, y_predict)
RMSE = np.sqrt(mean_squared_error(y_test, y_predict))
We can't use accuracy for regression problems; it's only used in classification problems.
You can use MSE, RMSE, MAPE, or MAE as the metric to determine how good your regression model is.
These values tell us how far we are from the correct predictions; lower values are better.
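A minimal sketch computing these for the example above (MAPE is computed by hand here, since older scikit-learn versions do not include mean_absolute_percentage_error):
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np
mae = mean_absolute_error(y_test, y_predict)
mse = mean_squared_error(y_test, y_predict)
rmse = np.sqrt(mse)
mape = np.mean(np.abs((np.array(y_test) - y_predict) / np.array(y_test))) * 100
print("MAE: %.2f  MSE: %.2f  RMSE: %.2f  MAPE: %.2f%%" % (mae, mse, rmse, mape))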

Can I use PCA for dimensionality reduction and then use its output for a one-class SVM classifier in Python?

I want to use PCA for dimensionality reduction and then use its output for a one-class SVM classifier in Python. My training data set is on the order of 16000x60. Also, how do I map a principal component back to the original columns to use it in the SVM, or can I use the principal components directly?
It is unclear what the problem is and what you have tried already. Of course you can. You can either add the PCA output to your original feature set or just use the PCA output by itself. I encourage you to use sklearn pipelines.
Simple example:
from sklearn import decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn import svm
svc = svm.SVC()
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('svc', svc)])
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
pipe.fit(X_digits, y_digits)
print(pipe.score(X_digits,y_digits))
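Since the question asks about a one-class SVM specifically, here is a minimal sketch along the same lines, assuming you want to feed the principal components directly into the classifier (n_components=10 and nu=0.1 are arbitrary placeholders; OneClassSVM is unsupervised, so no labels are passed to fit):
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import OneClassSVM
import numpy as np
X = np.random.rand(16000, 60)  # stand-in for the 16000x60 training set
pipe = Pipeline(steps=[('pca', PCA(n_components=10)),
                       ('ocsvm', OneClassSVM(nu=0.1, kernel='rbf'))])
pipe.fit(X)              # fit on "normal" samples only
pred = pipe.predict(X)   # +1 = inlier, -1 = outlier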

Using Scorer Object for Classifier Score Method

I have written my custom scorer object which is necessary for my problem and which I've called "p_value_scoring_object".
For the function sklearn.cross_validation.cross_val_score one of the parameters is "scoring", which allows me to use this scorer object.
However, this option is not available for the score method of a classifier. Is sklearn just lacking that feature, or is there a way around it?
from sklearn.datasets import load_iris
from sklearn.cross_validation import cross_val_score
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)
iris = load_iris()
cross_val_score(clf, iris.data, iris.target, cv=10,scoring=p_value_scoring_object)
This works. However, this doesn't:
clf.fit(iris.data,iris.target)
clf.score(iris.data,iris.target,scoring=p_value_scoring_object)
sklearn is just lacking that feature. score is internally bound to a different metric for each type of estimator: classifiers, for example, are bound to classification accuracy, while regressors are bound to r2_score.
You can see these bindings in sklearn.base; every mixin (for example ClassifierMixin) provides this score method.
Instead of this you can just run:
p_value_scoring_object(clf, iris.data, iris.target)
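For illustration, a minimal sketch with a stand-in scorer built via make_scorer (the actual p_value_scoring_object is not shown in the question, so plain accuracy is used here); a scorer object is called as scorer(estimator, X, y):
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import make_scorer, accuracy_score
my_scorer = make_scorer(accuracy_score)  # stand-in for p_value_scoring_object
iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
print(my_scorer(clf, iris.data, iris.target))  # call the scorer instead of clf.score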
