Sklearn CountVectorizer on a custom vocabulary

I have a set of webpages and I am trying to build a webpage count matrix. I tried the standard CountVectorizer from sklearn but am not getting the required results. The sample code is below:
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['www.google.com www.google.com', 'www.google.com www.facebook.com', 'www.google.com', 'www.facebook.com']
vocab = {'www.google.com':0, 'www.facebook.com':1}
vectorizer = CountVectorizer(vocabulary=vocab)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(X.toarray())
It gives
['www.google.com', 'www.facebook.com']
[[0 0]
[0 0]
[0 0]
[0 0]]
But the required result is
['www.google.com', 'www.facebook.com']
[[2 0]
[1 1]
[1 0]
[0 1]]
How do we apply CountVectorizer with such a custom vocabulary?

As per the input from a related question, the issue occurred because of the tokenizer. A custom tokenizer was written and now it works:
def mytokenizer(text):
    return text.split()
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['www.google.com www.google.com', 'www.google.com www.facebook.com', 'www.google.com', 'www.facebook.com']
vocab = {'www.google.com':0, 'www.facebook.com':1}
vectorizer = CountVectorizer(vocabulary=vocab, tokenizer = mytokenizer)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(X.toarray())
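For reference, the default token_pattern, r"(?u)\b\w\w+\b", splits the URLs on the dots, so tokens such as www, google and com never match the vocabulary entries, which is why every count came back zero. A minimal sketch of an equivalent fix that overrides token_pattern instead of supplying a tokenizer:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['www.google.com www.google.com', 'www.google.com www.facebook.com',
          'www.google.com', 'www.facebook.com']
vocab = {'www.google.com': 0, 'www.facebook.com': 1}

# \S+ keeps whole whitespace-delimited tokens, so the dotted domain names
# survive tokenization and match the vocabulary entries.
vectorizer = CountVectorizer(vocabulary=vocab, token_pattern=r'\S+')
X = vectorizer.fit_transform(corpus)
print(X.toarray())  # [[2 0], [1 1], [1 0], [0 1]]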

Related

How to plot decision boundaries for Random Forest classifier

How to go about plotting the decision boundaries for a Random Forest analysis with 10 classes?
I get the error:
ValueError: X has 2 features, but RandomForestClassifier is expecting
240 features as input.
Can you help me get the decision boundaries for the 10 classes if possible? Thanks for your time!
Here is my code:
from sklearn.datasets import make_classification
import seaborn as sns
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
f, (ax1,ax2) = plt.subplots(nrows=1, ncols=2,figsize=(20,8))
# Generate noisy Data
num_trainsamples = 500
num_testsamples = 50
X_train, y_train = make_classification(n_samples=num_trainsamples,
                                       n_features=240,
                                       n_informative=9,
                                       n_redundant=0,
                                       n_repeated=0,
                                       n_classes=10,
                                       n_clusters_per_class=1,
                                       class_sep=9,
                                       flip_y=0.2,
                                       #weights=[0.5,0.5],
                                       random_state=17)
X_test, y_test = make_classification(n_samples=50,
                                     n_features=num_testsamples,
                                     n_informative=9,
                                     n_redundant=0,
                                     n_repeated=0,
                                     n_classes=10,
                                     n_clusters_per_class=1,
                                     class_sep=10,
                                     flip_y=0.2,
                                     #weights=[0.5,0.5],
                                     random_state=17)
model = RandomForestClassifier()
parameter_space = {
    'n_estimators': [10, 50, 100],
    'criterion': ['gini', 'entropy'],
    'max_depth': np.linspace(10, 50, 11),
}
clf = GridSearchCV(model, parameter_space, cv = 5, scoring = "accuracy", verbose = True) # model
my_model = clf.fit(X_train, y_train)
# define bounds of the domain
min1, max1 = X_train[:, 0].min()-1, X_train[:, 0].max()+1
min2, max2 = X_train[:, 1].min()-1, X_train[:, 1].max()+1
# define the x and y scale
x1grid = np.arange(min1, max1, 0.1)
x2grid = np.arange(min2, max2, 0.1)
# create all of the lines and rows of the grid
xx, yy = np.meshgrid(x1grid, x2grid)
# flatten each grid to a vector
r1, r2 = xx.flatten(), yy.flatten()
r1, r2 = r1.reshape((len(r1), 1)), r2.reshape((len(r2), 1))
# horizontal stack vectors to create x1,x2 input for the model
grid = np.hstack((r1,r2))
yhat = clf.predict(grid)
# reshape the predictions back into a grid
zz = yhat.reshape(xx.shape)
# plot the grid of x, y and z values as a surface
plt.contourf(xx, yy, zz, cmap='Paired')
# create scatter plot for samples from each class
for class_value in range(2):
    # get row indexes for samples with this class
    row_ix = np.where(y == class_value)
    # create scatter of these samples
    plt.scatter(X_train[row_ix, 0], X_train[row_ix, 1], cmap='Paired')
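The error itself comes from a feature-count mismatch: the forest is fitted on 240-dimensional data, while the meshgrid only supplies the first two coordinates. One common workaround, sketched below under that assumption (this is not code from the original post), is to project the data to 2-D, for example with PCA, and fit a separate classifier on the projection purely for visualization; X_train and y_train are the arrays from the snippet above.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Project the 240-D training data onto its first two principal components.
pca = PCA(n_components=2).fit(X_train)
X_2d = pca.transform(X_train)

# Fit a forest on the 2-D projection purely so its boundaries can be drawn.
viz_model = RandomForestClassifier().fit(X_2d, y_train)

xx, yy = np.meshgrid(np.arange(X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1, 0.1),
                     np.arange(X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1, 0.1))
grid = np.c_[xx.ravel(), yy.ravel()]            # two features, matching viz_model
zz = viz_model.predict(grid).reshape(xx.shape)

plt.contourf(xx, yy, zz, cmap='Paired', alpha=0.5)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_train, cmap='Paired', s=10)
plt.show()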

RandomForestRegressor in Julia

I'm trying to train a RandomForestRegressor using DecisionTree.jl
and RandomizedSearchCV (contained in ScikitLearn.jl) in Julia. Primary datasets like x_train and y_train are provided in my Google Drive as well, so you can test it on your machine. The code is as follows:
using CSV
using DataFrames
using ScikitLearn: fit!, predict
using ScikitLearn.GridSearch: RandomizedSearchCV
using DecisionTree
x = CSV.read("x.csv", DataFrames.DataFrame)
x_test = CSV.read("x_test.csv", DataFrames.DataFrame)
y_train = CSV.read("y_train.csv", DataFrames.DataFrame)
mod = RandomForestRegressor()
param_dist = Dict("n_trees"=>[50 , 100, 200, 300],
"max_depth"=> [3, 5, 6 ,8 , 9 ,10])
model = RandomizedSearchCV(mod, param_dist, n_iter=10, cv=5)
fit!(model, Matrix(x), Matrix(DataFrames.dropmissing(y_train)))
predict(x_test)
This throws a MethodError like this:
ERROR: MethodError: no method matching fit!(::RandomForestRegressor, ::Matrix{Float64}, ::Matrix{Float64})
Closest candidates are:
fit!(::ScikitLearn.Models.FixedConstant, ::Any, ::Any) at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\constant_model.jl:26
fit!(::ScikitLearn.Models.ConstantRegressor, ::Any, ::Any) at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\constant_model.jl:10
fit!(::ScikitLearn.Models.LinearRegression, ::AbstractArray{XT}, ::AbstractArray{yT}) where {XT, yT} at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\linear_regression.jl:27
...
Stacktrace:
 [1] _fit!(self::RandomizedSearchCV, X::Matrix{Float64}, y::Matrix{Float64}, parameter_iterable::Vector{Any})
   @ ScikitLearn.Skcore C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\grid_search.jl:332
 [2] fit!(self::RandomizedSearchCV, X::Matrix{Float64}, y::Matrix{Float64})
   @ ScikitLearn.Skcore C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\grid_search.jl:748
 [3] top-level scope
   @ c:\Users\Shayan\Desktop\AUT\Thesis\test.jl:17
If you're curious about the shape of the data:
julia> size(x)
(1550, 71)
julia> size(y_train)
(1550, 10)
How can I solve this problem?
PS: Also I tried:
julia> fit!(model, Matrix{Any}(x), Matrix{Any}(DataFrames.dropmissing(y_train)))
ERROR: MethodError: no method matching fit!(::RandomForestRegressor, ::Matrix{Any}, ::Matrix{Any})
Closest candidates are:
fit!(::ScikitLearn.Models.FixedConstant, ::Any, ::Any) at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\constant_model.jl:26
fit!(::ScikitLearn.Models.ConstantRegressor, ::Any, ::Any) at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\constant_model.jl:10
fit!(::ScikitLearn.Models.LinearRegression, ::AbstractArray{XT}, ::AbstractArray{yT}) where {XT, yT} at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\linear_regression.jl:27
...
Stacktrace:
 [1] _fit!(self::RandomizedSearchCV, X::Matrix{Any}, y::Matrix{Any}, parameter_iterable::Vector{Any})
   @ ScikitLearn.Skcore C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\grid_search.jl:332
 [2] fit!(self::RandomizedSearchCV, X::Matrix{Any}, y::Matrix{Any})
   @ ScikitLearn.Skcore C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\grid_search.jl:748
 [3] top-level scope
   @ c:\Users\Shayan\Desktop\AUT\Thesis\MyWork\Thesis.jl:327
Looking at the Random Forest Regression example in the DecisionTree.jl docs, the example doesn't follow the fit!() / predict() design pattern, and the error confirms that this fit!() doesn't support RandomForestRegressor. Alternatively, you might look at the RandomForest.jl package, which does follow the fit!() / predict() pattern.
As stated here, DecisionTree.jl doesn't support multi-output random forests yet. So I gave up on DecisionTree.jl, and ScikitLearn.jl is adequate in my case:
using ScikitLearn: @sk_import, fit!, predict
@sk_import ensemble: RandomForestRegressor
using ScikitLearn.GridSearch: RandomizedSearchCV
using CSV
using DataFrames
x = CSV.read("x.csv", DataFrames.DataFrame)
x_test = CSV.read("x_test.csv", DataFrames.DataFrame)
y_train = CSV.read("y_train.csv", DataFrames.DataFrame)
x_test = reshape(x_test, 1,length(x_test))
mod = RandomForestRegressor()
param_dist = Dict("n_estimators"=>[50 , 100, 200, 300],
"max_depth"=> [3, 5, 6 ,8 , 9 ,10])
model = RandomizedSearchCV(mod, param_dist, n_iter=10, cv=5)
fit!(model, Matrix(x), Matrix(DataFrames.dropmissing(y_train)))
predict(model, x_test)
This works fine for me, but it's super slow! Much slower than Python. I'll add a benchmark with the same data sets across the two languages.
Benchmarking
Here I report the result of benchmarking the same action, with the same values and the same data. All the data and code files are available in my Google Drive, so feel free to test it yourself. First, I start with Julia.
Julia
using CSV
using DataFrames
using ScikitLearn: @sk_import, fit!, predict
@sk_import ensemble: RandomForestRegressor
using ScikitLearn.GridSearch: RandomizedSearchCV
using BenchmarkTools
x = CSV.read("x.csv", DataFrames.DataFrame)
y_train = CSV.read("y_train.csv", DataFrames.DataFrame)
mod = RandomForestRegressor(max_leaf_nodes=2)
param_dist = Dict("n_estimators"=>[50 , 100, 200, 300],
"max_depth"=> [3, 5, 6 ,8 , 9 ,10])
model = RandomizedSearchCV(mod, param_dist, n_iter=10, cv=5, n_jobs=1)
@btime fit!(model, Matrix(x), Matrix(DataFrames.dropmissing(y_train)))
# 52.123 s (6965 allocations: 44.34 MiB)
Python
>>> import cProfile, pstats
>>> import pandas as pd
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.model_selection import RandomizedSearchCV
>>> x = pd.read_csv("x.csv")
>>> y_train = pd.read_csv("y_train.csv")
>>> mod = RandomForestRegressor(max_leaf_nodes=2)
>>> parameters = {
...     'n_estimators': [50, 100, 200, 300],
...     'max_depth': [3, 5, 6, 8, 9, 10]}
>>> model = RandomizedSearchCV(mod, param_distributions=parameters, cv=5, n_iter=10, n_jobs=1)
>>> pr = cProfile.Profile()
>>> pr.enable()
>>> model.fit(x , y_train)
>>> pr.disable()
>>> stats = pstats.Stats(pr).strip_dirs().sort_stats("cumtime")
>>> stats.print_stats(5)
12097437 function calls (11936452 primitive calls) in 73.452 seconds
Ordered by: cumulative time
List reduced from 736 to 5 due to restriction <5>
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   73.445   73.445 _search.py:738(fit)
    102/2    0.027    0.000   73.370   36.685 parallel.py:960(__call__)
12252/152    0.171    0.000   73.364    0.483 parallel.py:798(dispatch_one_batch)
12150/150    0.058    0.000   73.324    0.489 parallel.py:761(_dispatch)
12150/150    0.025    0.000   73.323    0.489 _parallel_backends.py:206(apply_async)
So I conclude that, for this specific problem, Julia outperforms Python in terms of speed.

Multiple Linear Regression Machine Learning in Python --ValueError: shapes (8,15) and (390,) not aligned

I am trying to evaluate output based on certain input using multiple linear regression. I have trained the model and get the correct expected values when running the code below:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
#dataset = pd.read_csv('50_Startups.csv')
dataset = pd.read_excel('MAHI2.xlsx')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 5].values
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:, 0] = labelencoder.fit_transform(X[:, 0])
labelencoder1 = LabelEncoder()
X[:, 1] = labelencoder.fit_transform(X[:, 1])
labelencoder2 = LabelEncoder()
X[:, 2] = labelencoder.fit_transform(X[:, 2])
labelencoder3 = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])
onehotencoder = OneHotEncoder(categorical_features = "all")
#X = onehotencoder.fit_transform(X).toarray()
X = onehotencoder.fit_transform(X).toarray()
# Avoiding the Dummy Variable Trap
X = X[:, 1:]
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X, y)
y_pred = regressor.predict(X)
df = pd.DataFrame({'Actual': y.flatten(), 'Predicted': y_pred.flatten()})
df
Now I am trying to use the same model to evaluate another set of input data, as below:
dataset1 = pd.read_excel('MAHI3.xlsx')
#dataset2 = pd.get_dummies(dataset1)
X1 = dataset1.iloc[:, :-1].values
y2 = dataset1.iloc[:, 5].values
# Encoding categorical data
#labelencoder3 = LabelEncoder()
X1[:, 0] = labelencoder.fit_transform(X1[:, 0])
#labelencoder4 = LabelEncoder()
X1[:, 1] = labelencoder.fit_transform(X1[:, 1])
#labelencoder5 = LabelEncoder()
X1[:, 2] = labelencoder.fit_transform(X1[:, 2])
#labelencoder6 = LabelEncoder()
X1[:, 3] = labelencoder.fit_transform(X1[:, 3])
#onehotencoder2 = OneHotEncoder(categorical_features = "all")
X1 = onehotencoder.fit_transform(X1).toarray()
output = regressor.predict(X1)
df1 = pd.DataFrame({'Actual1': y2.flatten(), 'Predicted1': output.flatten()})
df1
But when I run this code I get the error below:
ValueError: shapes (6,13) and (390,) not aligned: 13 (dim 1) != 390 (dim 0)
It would be great if anyone could help me resolve this issue.
I don't have access to your dataset, but it seems that your problem is a dimensionality problem. The thing that changes the dimensions is the OneHotEncoder. Try to use the same one-hot encoder for both:
ohe = onehotencoder.fit(X)
X = ohe.transform(X).toarray()
X1 = ohe.transform(X1).toarray()
You should make sure that the number of features the regressor receives at prediction time is the same as when it was trained.
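A minimal end-to-end sketch of the same idea using the current scikit-learn API (ColumnTransformer plus OneHotEncoder); the file names and the assumption that the first four columns are categorical are taken from the question and may need adapting:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

train = pd.read_excel('MAHI2.xlsx')
new = pd.read_excel('MAHI3.xlsx')

X_train, y_train = train.iloc[:, :-1], train.iloc[:, 5]
X_new = new.iloc[:, :-1]

# One-hot encode the first four (categorical) columns; handle_unknown='ignore'
# keeps the feature count fixed even if the new data contains unseen categories.
pre = ColumnTransformer(
    [('cat', OneHotEncoder(handle_unknown='ignore'), [0, 1, 2, 3])],
    remainder='passthrough')

pipe = Pipeline([('pre', pre), ('reg', LinearRegression())])
pipe.fit(X_train, y_train)      # the encoder is fitted only once, on the training data
preds = pipe.predict(X_new)     # the same fitted encoder transforms the new data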

CountVectorizer MultinomialNB ValueError: dimension mismatch

I am trying to make my MultinomialNB work. I use CountVectorizer on my training and test sets, and of course there are different words in both sets. So I see why the error
ValueError: dimension mismatch
occurs, but I don't know how to fix it. I tried CountVectorizer().transform instead of CountVectorizer().fit_transform, as was suggested in another post (SciPy and scikit-learn - ValueError: Dimension mismatch), but that just gives me
NotFittedError: CountVectorizer - Vocabulary wasn't fitted.
How can I use CountVectorizer correctly?
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cross_validation import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
import sklearn.feature_extraction
df = data
y = df["meal_parent_category"]
X = df['name_cleaned']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3)
X_train = CountVectorizer().fit_transform(X_train)
X_test = CountVectorizer().fit_transform(X_test)
algo = MultinomialNB()
algo.fit(X_train,y_train)
y = algo.predict(X_test)
print(classification_report(y_test,y_pred))
Ok, so after asking this question I figured it out :)
Here is the solution with vocabulary and such:
df = train
y = df["meal_parent_category_cleaned"]
X = df['name_cleaned']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3)
vectorizer_train = CountVectorizer()
X_train = vectorizer_train.fit_transform(X_train)
vectorizer_test = CountVectorizer(vocabulary=vectorizer_train.vocabulary_)
X_test = vectorizer_test.transform(X_test)
algo = MultinomialNB()
algo.fit(X_train,y_train)
y_pred = algo.predict(X_test)
print(classification_report(y_test,y_pred))
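The same effect can be achieved a little more directly by reusing the fitted vectorizer itself, since transform reuses the vocabulary learned by fit_transform; this is a sketch of the same idea rather than a different method:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)   # learns the vocabulary on the training set
X_test = vectorizer.transform(X_test)         # reuses that vocabulary, so the dimensions match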

How to encode categorical data for use with Semi-supervised algorithm LabelPropagation

I am attempting to use the anneal.arff dataset with Python scikit-learn's semi-supervised algorithm LabelPropagation. The anneal dataset is categorical data, so I preprocessed it so that the output class for each instance looks like [0. 0. 1. 0. 0.]. This is a numeric list that encodes the output class over 5 possible values, with 0 everywhere and 1 in the position of the corresponding class. This is what I would expect.
For semi-supervised learning, most of the training data must be unlabeled, so
I modified the training set so that the unlabeled data has output [-1, -1, -1, -1, -1]. I previously tried just using -1, but the code emits the same error as shown below.
I train the classifier as follows, Y_train includes labeled and "unlabeled" data:
lp_model = LabelSpreading(gamma=0.25, max_iter=5)
lp_model.fit(X, Y_train)
I receive the error shown below after calling the fit method:
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\semi_supervised\label_propagation.py", line 221, in fit
X, y = check_X_y(X, y)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 526, in check_X_y
y = column_or_1d(y, warn=True)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 562, in column_or_1d
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (538, 5)
This suggests that something is wrong with the shape of my Y_train list,
but this is the correct shape. What am I doing wrong?
Can LabelPropagation take training data in this form, or does it only accept unlabeled data marked with a scalar -1?
--- edit ---
Here is the code that generates the error. I'm sorry about the confusion over algorithms--I want to use both LabelSpreading and LabelPropagation, and choosing one or the other doesn't fix this error.
from scipy.io import arff
import pandas as pd
import numpy as np
import math
from pandas.tools.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from copy import deepcopy
from sklearn.semi_supervised import LabelPropagation
from sklearn.semi_supervised import LabelSpreading
f = "../../Documents/UCI/anneal.arff"
dataAsRecArray, meta = arff.loadarff(f)
dataset_raw = pd.DataFrame.from_records(dataAsRecArray)
dataset = pd.get_dummies(dataset_raw)
class_names = [col for col in dataset.columns if 'class_' in col]
print (dataset.shape)
number_of_output_columns = len(class_names)
print (number_of_output_columns)
def run(name, model, dataset, percent):
    # Split-out validation dataset
    array = dataset.values
    X = array[:, 0:-number_of_output_columns]
    Y = array[:, -number_of_output_columns:]
    validation_size = 0.40
    seed = 7
    X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
    num_samples = len(Y_train)
    num_labeled_points = math.floor(percent*num_samples)
    indices = np.arange(num_samples)
    unlabeled_set = indices[num_labeled_points:]
    Y_train[unlabeled_set] = [-1, -1, -1, -1, -1]
    lp_model = LabelSpreading(gamma=0.25, max_iter=5)
    lp_model.fit(X_train, Y_train)
    """
    predicted_labels = lp_model.transduction_[unlabeled_set]
    print(predicted_labels[:10])
    """
if __name__ == "__main__":
#percentages = [0.1, 0.2, 0.3, 0.4]
percentages = [0.1]
models = []
models.append(('LS', LabelSpreading()))
#models.append(('CART', DecisionTreeClassifier()))
#models.append(('NB', GaussianNB()))
#models.append(('SVM', SVC()))
# evaluate each model in turn
results = []
names = []
for name, model in models:
for percent in percentages:
run(name, model, dataset, percent)
print ("bye")
Your Y_train has shape (538, 5) but should be 1d. LabelPropagation doesn't support multi-label or multi-output multi-class right now.
The error message could be more informative, though :-/
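A minimal sketch of that fix for the snippet above, assuming Y_train is the (n_samples, 5) one-hot block produced by pd.get_dummies and unlabeled_set indexes the rows to hide: collapse each one-hot row back to an integer class id with argmax, then mark the hidden rows with a scalar -1.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

y_1d = np.argmax(Y_train, axis=1)            # one-hot rows -> integer class ids (1-D)
y_1d[unlabeled_set] = -1                     # scikit-learn's marker for "unlabeled"

lp_model = LabelSpreading(gamma=0.25, max_iter=5)
lp_model.fit(X_train, y_1d)
predicted_labels = lp_model.transduction_[unlabeled_set]  # predictions for the hidden rows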
