Being a novice to machine learning, I have been exploring the background of regression and classification algorithms.
I have been working on a classification dataset from Kaggle using Logistic Regression and KNN. I noticed that KNeighborsClassifier doesn't have a fit_transform() method. My understanding of fit and transform goes like this:
fit(): fit uses the required formula to perform a calculation on the feature variables of the input data and stores the result of that calculation.
transform(): transform applies the calculation learned by fit() to change the data.
If this is the case and my understanding is correct, can someone help me understand why KNeighborsClassifier doesn't have a fit_transform()?
Thanks
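For reference, here is a minimal sketch of the distinction, assuming scikit-learn's StandardScaler as an example transformer and using synthetic placeholder data:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(20, 3)          # toy feature matrix (placeholder data)
y = np.random.randint(0, 2, 20)    # toy binary labels

# A transformer: fit() learns the column means/stds, transform() rescales the data,
# so fit_transform() is just a shortcut for the two calls in sequence.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# A classifier: fit() learns from X and y, and what you ask for afterwards is a
# prediction, not a transformed version of X, hence predict() instead of transform().
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_scaled, y)
print(knn.predict(X_scaled[:5]))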
I understand that both the LinearRegression class and the SGDRegressor class from scikit-learn perform linear regression. However, only SGDRegressor uses gradient descent as the optimization algorithm.
Then what is the optimization algorithm used by LinearRegression, and what are the other significant differences between these two classes?
LinearRegression always uses least squares as its loss function.
For SGDRegressor you can specify a loss function, and it uses stochastic gradient descent (SGD) to fit the model. With SGD you run through the training set one data point at a time and update the parameters according to the error gradient.
In simple words: you can train SGDRegressor on a training dataset that does not fit into RAM. Also, you can update an SGDRegressor model with a new batch of data (via partial_fit) without retraining on the whole dataset.
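As an illustration, here is a minimal sketch of that incremental usage; the synthetic batches and the number of iterations are just placeholders:

import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
model = SGDRegressor(random_state=0)

# Feed the data in batches, as you would if the full dataset did not fit into RAM.
for _ in range(100):
    X_batch = rng.randn(50, 3)                      # placeholder batch of features
    y_batch = X_batch @ np.array([1.0, 2.0, 3.0])   # placeholder targets
    model.partial_fit(X_batch, y_batch)

print(model.coef_)   # the coefficients should move toward [1, 2, 3]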
To understand the algorithm used by LinearRegression, keep in mind that there is (in favorable cases) an analytical solution, i.e. a closed-form formula, for the coefficients that minimize the least-squares error:
theta = (X'X)^(-1)X'Y (1)
where X' is the transpose of X.
If X'X is not invertible, the inverse can be replaced by the Moore-Penrose pseudo-inverse, computed using the singular value decomposition (SVD). Even when X'X is invertible, the SVD approach is faster and more numerically stable than applying formula (1) directly.
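To make the link concrete, here is a small sketch comparing the closed-form solution (written with the pseudo-inverse) to LinearRegression; the random data is just a placeholder:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)
X = rng.randn(100, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + 0.01 * rng.randn(100)

# Closed-form least squares, theta = (X'X)^(-1)X'Y, written with the
# Moore-Penrose pseudo-inverse so it also works when X'X is not invertible.
X_b = np.hstack([np.ones((100, 1)), X])   # add a column of ones for the intercept
theta = np.linalg.pinv(X_b) @ y           # pinv is computed via SVD internally
print(theta)                              # intercept followed by the coefficients

# LinearRegression solves the same least-squares problem with an SVD-based solver.
lr = LinearRegression().fit(X, y)
print(lr.intercept_, lr.coef_)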
I am training a linear regression model with multiple features (say, 10) on my dataset. I have to choose the parameters, but this results in overfitting. I have read that by using a genetic algorithm to tune the parameters we may achieve the minimum possible cost-function error.
I am a complete beginner in this area and am unable to understand how a genetic algorithm can help choose the parameters using the parents' MSE. Any help is appreciated.
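To give a feel for the idea, below is a hand-rolled toy sketch (not from any particular library): each individual is a candidate coefficient vector, its fitness is the MSE it achieves on the data, the fittest individuals become parents, and children are produced by crossover and mutation. Note that using the training MSE as fitness does not by itself address overfitting; you would typically evaluate fitness on a validation set or add a regularization term.

import numpy as np

rng = np.random.RandomState(0)

# Placeholder data: 10 features, linear ground truth with noise.
X = rng.randn(200, 10)
true_w = rng.randn(10)
y = X @ true_w + 0.1 * rng.randn(200)

def mse(w):
    return np.mean((X @ w - y) ** 2)

# Initial population of random coefficient vectors.
pop = rng.randn(50, 10)

for generation in range(200):
    scores = np.array([mse(w) for w in pop])
    parents = pop[np.argsort(scores)[:10]]       # keep the 10 fittest as parents

    children = []
    while len(children) < 40:
        a = parents[rng.randint(10)]
        b = parents[rng.randint(10)]
        mask = rng.rand(10) < 0.5                # uniform crossover
        child = np.where(mask, a, b)
        child = child + 0.1 * rng.randn(10)      # mutation
        children.append(child)

    pop = np.vstack([parents, np.array(children)])

best = pop[np.argmin([mse(w) for w in pop])]
print(mse(best), mse(true_w))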
My question is regarding the novelty detection algorithms Isolation Forest and One-Class SVM.
I have a training dataset (with 4-5 features) where all the sample points are inliers, and I need to classify any new data as an inlier or outlier and ingest it into another dataframe accordingly.
While trying to use Isolation Forest or One-Class SVM, I have to input the contamination percentage (nu) during the training phase. However, as the training dataset doesn't have any contamination, do I need to add outliers to the training dataframe and set that outlier fraction as nu?
Also, while using Isolation Forest, I noticed that the outlier percentage changes every time I predict, even though I don't change the model. Is there a way to take care of this problem apart from moving to the Extended Isolation Forest algorithm?
Thanks in advance.
Regarding contamination for Isolation Forest:
If you are training on normal instances only (all inliers), you should set contamination to zero. If you don't specify it, contamination defaults to 0.1 (as of scikit-learn 0.20).
The following is a simple piece of code to show this:
1- Import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
rng = np.random.RandomState(42)
2- Generate a 2D dataset
X = 0.3 * rng.randn(1000, 2)
3- Train iForest model and predict the outliers
clf = IsolationForest(random_state=rng, contamination=0)
clf.fit(X)
y_pred_train = clf.predict(X)
4- Print # of anomalies
print(sum(y_pred_train==-1))
This gives you 0 anomalies. Now if you change contamination to 0.15, the model flags 150 anomalies in the same dataset you already had (the same dataset because of RandomState(42)).
[References]:
1. Liu, Fei Tony, Ting, Kai Ming, and Zhou, Zhi-Hua. "Isolation Forest." Data Mining, 2008. ICDM '08. Eighth IEEE International Conference on Data Mining.
2. Liu, Fei Tony, Ting, Kai Ming, and Zhou, Zhi-Hua. "Isolation-Based Anomaly Detection." ACM Transactions on Knowledge Discovery from Data (TKDD), 2012.
"Training with normal data(inliers) only".
This is against the nature of Isolation Forest. The training is here completely different than training in the Neural Networks. Because everyone is using these without clarifying what is going on, and writing blogs with 20% of ML knowledge, we are having questions like this.
clf = IsolationForest(random_state=rng, contamination=0)
clf.fit(X)
What does fit do here? Is it training? If yes, what is trained?
In Isolation Forest:
First, we build trees,
Then, we pass each data point through each tree,
Then, we calculate the average path length required to isolate the point.
The shorter the path, the higher the anomaly score.
contamination will determine your threshold. If it is 0, then what is your threshold?
Please read the original paper first to understand the logic behind it. Not all anomaly detection algorithms suit every occasion.
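If the goal is to control the threshold yourself rather than through contamination, one possible sketch is the following, using IsolationForest's score_samples with made-up data and an arbitrary percentile as the cut-off:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_train = 0.3 * rng.randn(1000, 2)           # inliers only (placeholder data)
X_new = np.array([[0.0, 0.0], [4.0, 4.0]])   # one obvious inlier, one obvious outlier

clf = IsolationForest(random_state=rng).fit(X_train)

# score_samples returns the (negated) anomaly score: lower means more anomalous.
scores = clf.score_samples(X_new)

# Pick your own threshold, e.g. the 1st percentile of the training scores,
# instead of letting contamination choose it for you.
threshold = np.percentile(clf.score_samples(X_train), 1)
print(scores, scores < threshold)   # True marks points flagged as outliers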
The assignment question reads as follows: "Use scikit-learn to split the data into a training and test set. Classify the data as either cat or dog using DBSCAN."
I am trying to figure out how to use DBSCAN to fit a model on training data and then predict the labels of a testing set. I am well aware that DBSCAN is meant for clustering and not prediction. I have also looked at Use sklearn DBSCAN model to classify new entries as well as numerous other threads. DBSCAN only comes with fit and fit_predict functions, which don't seem particularly useful when trying to fit the model on the training data and then test it on the testing data.
Is the question worded poorly, or am I missing something? I have looked at the scikit-learn documentation and searched for examples, but have not had any luck.
from sklearn.cluster import DBSCAN
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

# Split the samples into two subsets, use one for training and the other for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Instantiate the learning model
dbscan = DBSCAN()
# Fit the model (y_train is ignored by DBSCAN)
dbscan.fit(X_train, y_train)
# Predict the response -- but DBSCAN has no predict(), so dbscan_pred never gets produced
# Confusion matrix and quantitative metrics
print("The confusion matrix is: " + str(confusion_matrix(y_test, dbscan_pred)))
print("The accuracy score is: " + str(accuracy_score(y_test, dbscan_pred)))
Whoever gave you that assignment has no clue...
DBSCAN will never predict "cat" or "dog". It just can't.
Because it is an unsupervised algorithm, it doesn't use training labels. y_train is ignored (see the parameter documentation), and it is stupid that sklearn will allow you to pass it at all! It will output sets of points that are clusters. Many tools will enumerate these sets as 1, 2, ... But it won't name a set "dogs".
Furthermore, it can't predict on new data either, which is exactly what you need for the "test" data. So it can't work with a train-test split, but that does not really matter because it does not use the labels anyway.
The accepted answer in the question you linked is a pretty good one for you, too: you want to perform classification, not discover structure (which is what clustering does).
DBSCAN, as implemented in scikit-learn, is a transductive algorithm, meaning you can't do predictions on new data. There's an old discussion from 2012 on the scikit-learn repository about this.
Suffice it to say, when you're using a clustering algorithm, the concept of a train/test split is less well defined. Cross-validation usually involves a different metric; for example, in k-means the cross-validation is often over the hyperparameter k, rather than over mutually exclusive subsets of the data, and the metric optimized is the intra- versus inter-cluster variance, rather than classification accuracy or F1.
Bottom line: trying to perform classification using a clustering technique is effectively a square peg in a round hole. You can jam it through if you really want to, but it'd be considerably easier to just use an off-the-shelf classifier.
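For example, a minimal sketch of the classifier route, assuming X holds the feature matrix and y the cat/dog labels from the assignment (KNeighborsClassifier is just one possible choice):

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

# Split, fit a plain supervised classifier, and evaluate - no clustering involved.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))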
I've been going through Andrew Ng's machine learning course and just finished the learning curve lecture. I created a learning curve for a logistic regression model I built, and it looks like the training and CV scores converge, which suggests my model could benefit from more features. How can I do a similar analysis for something like a random forest? When I create a learning curve for a random forest classifier on the same data in sklearn, my training score just stays very close to 1. Do I need a different way of measuring the training error?
Learning curves are a tool for understanding the bias-variance trade-off. Since your random forest model's training score stays very close to 1, your random forest is able to fit the underlying function. If the underlying function were more non-linear or more complex, you would have had to add more features. See the following example (figure "Learning Curves").
Start with only 2 features and train your random forest model. Then use all of your features and train your random forest model again.
You should see a similar pattern in the graphs for your own example.
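As a concrete sketch, something along these lines plots both curves for a random forest; the synthetic dataset here is just a stand-in for your own X and y:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for your dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

train_sizes, train_scores, cv_scores = learning_curve(
    RandomForestClassifier(random_state=0), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5))

# Average over the cross-validation folds and plot both curves.
plt.plot(train_sizes, train_scores.mean(axis=1), label="training score")
plt.plot(train_sizes, cv_scores.mean(axis=1), label="cross-validation score")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.show()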