How to split datasets into training and test data with sklearn - image-processing

I'm using the AT&T faces dataset. The main directory contains 40 sub-directories, and each sub-directory contains different images of a specific person. I created a list that contains the sub-directory names. I want to use the data to train a neural network, so I want to split it into 80% training and 20% testing. Here is what I have done so far:
import os
import cv2
path = r"C:\Users\Desktop\att_faces"
directory = []
directory = [x[1] for x in os.walk(path)]
non_empty_dirs = [x for x in directory if x]
directory = [item for subitem in non_empty_dirs for item in subitem]
How should I proceed after this step?

You want to split your data into train and test sets. For that, you can either:
Manually, or using a script, separate the train and test images into folders and load them for training with the help of a data generator (a quick sketch of this option follows the list).
Load the whole dataset and split it into train and test in memory.
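For the first option, here is a minimal sketch. It assumes Keras (tensorflow.keras) is used for the network and that you have already copied roughly 80% of each person's images into a Train// folder and the rest into a Test// folder, keeping one sub-folder per person; the folder names are assumptions, not part of the original answer:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255)
# AT&T faces images are 92x112 pixels, hence target_size=(112, 92)
train_gen = datagen.flow_from_directory('Train//', target_size=(112, 92),
                                        color_mode='grayscale', class_mode='categorical')
test_gen = datagen.flow_from_directory('Test//', target_size=(112, 92),
                                       color_mode='grayscale', class_mode='categorical')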
Let's discuss the second option.
You said your main directory contains 40 sub-directories. Let's assume the main directory is Train// and the sub-folders are named 1-40. I also assume the class label is the folder name.
# imports
import cv2
import numpy as np
import os
from sklearn.model_selection import train_test_split

# seed for reproducibility
SEED = 44000

# lists to store data
data = []
label = []

# folder where data is placed
BASE_FOLDER = 'Train//'
folders = os.listdir(BASE_FOLDER)

# loading data into lists
for folder in folders:
    for file in os.listdir(BASE_FOLDER + folder + '//'):
        img = cv2.imread(BASE_FOLDER + folder + '//' + file)
        # do any pre-processing if needed, like resize, sharpen etc.
        data.append(img)       # list.append works in place; `data = data.append(img)` would set data to None
        label.append(folder)

# now split the data into train and test with the help of train_test_split
train_data, test_data, train_label, test_label = train_test_split(data, label, test_size=0.2, random_state=SEED)
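If you then feed these into a network, it is usually convenient to convert the Python lists to NumPy arrays first. A small sketch, assuming the images were resized to a common shape in the pre-processing step above:

# convert lists to arrays so they can be fed to a network
train_data = np.asarray(train_data)
test_data = np.asarray(test_data)
train_label = np.asarray(train_label)
test_label = np.asarray(test_label)
print(train_data.shape, test_data.shape)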

Related

How to save/serialize a glm model as a zip/pickle file?

I built a Tweedie GLM model using statsmodels.
I'm just wondering how to save/serialize it as a zip or pkl file.
I tried:
import statsmodels.api as sm
from statsmodels.formula.api import glm

formula4 = "y ~ x1 + C(x2)"
mod4 = glm(formula=formula4, var_weights='one', data=train, family=sm.families.Tweedie())
res4 = mod4.fit()

import pickle
filename = 'test.pkl'
# Use pickle to save your object to a file:
pickle.dump(mod4, open(filename, 'wb'))
But the saved pickle file is too large.
Any idea?
Answer:
Don't use the formula directly when building the model. Use dmatrices to build the design matrices ahead of time, then save the model; the result is around 10 KB.
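A minimal sketch of what that suggests, assuming patsy and the same train DataFrame from the question (the weight column 'one' also comes from the question):

import pickle
import statsmodels.api as sm
from patsy import dmatrices

# build the design matrices yourself instead of passing a formula to glm()
y, X = dmatrices("y ~ x1 + C(x2)", data=train, return_type='dataframe')
mod4 = sm.GLM(y, X, var_weights=train['one'], family=sm.families.Tweedie())
res4 = mod4.fit()

# pickle the fitted results object
with open('test.pkl', 'wb') as f:
    pickle.dump(res4, f)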

Saving tensors to a .pt file in order to create a dataset

I was tasked with the creation of a dataset to test the functionality of the code we're working on.
The dataset must have a group of tensors that will be used later on in a generative model.
I'm trying to save the tensors to a .pt file, but I keep overwriting the file, so it ends up containing only one tensor. I've read about torch.utils.data.Dataset, but I'm not able to figure out on my own how to use it.
Here is my code:
import torch
import numpy as np
from torch.utils.data import Dataset

# variables that will be used to create the size of the tensors:
num_jets, num_particles, num_features = 1, 30, 3

for i in range(100):
    # tensor from a gaussian dist with mean=5, std=1 and shape=size:
    tensor = torch.normal(5, 1, size=(num_jets, num_particles, num_features))
    # We will need the tensors to be of the cpu type
    tensor = tensor.cpu()
    # save the tensor to 'tensor_dataset.pt'
    torch.save(tensor, 'tensor_dataset.pt')

# open the recently created .pt file
tensor_list = torch.load('tensor_dataset.pt')
# prints the contents. Just one tensor inside the .pt file
print(tensor_list)
Reason: you overwrite 'tensor_dataset.pt' on every iteration of the loop, so at the end the file only contains the tensor from the last iteration.
Solution: since you know the size of the tensors, you can initialize one big tensor first and fill it by iterating over its first dimension:
import torch
import numpy as np
from torch.utils.data import Dataset

num_jets, num_particles, num_features = 1, 30, 3

# one tensor that holds all 100 samples
lst_tensors = torch.empty(size=(100, num_jets, num_particles, num_features))

for i in range(100):
    lst_tensors[i] = torch.normal(5, 1, size=(num_jets, num_particles, num_features))
    lst_tensors[i] = lst_tensors[i].cpu()

# save all tensors at once and load them back
torch.save(lst_tensors, 'tensor_dataset.pt')
tensor_list = torch.load('tensor_dataset.pt')
print(tensor_list.shape)   # torch.Size([100, 1, 30, 3])
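If you then want to expose the saved tensors as a dataset for a DataLoader, one option (a sketch, not part of the original answer) is torch.utils.data.TensorDataset:

from torch.utils.data import TensorDataset, DataLoader

loaded = torch.load('tensor_dataset.pt')   # shape [100, 1, 30, 3]
dataset = TensorDataset(loaded)            # each item is a 1-tuple holding one [1, 30, 3] tensor
loader = DataLoader(dataset, batch_size=10, shuffle=True)
for (batch,) in loader:
    print(batch.shape)                     # torch.Size([10, 1, 30, 3])
    break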

AttributeError: 'DecisionTreeRegressor' object has no attribute 'save' in GCS

I was trying to deploy my custom DecisionTreeRegressor for house price prediction to Vertex AI. The tutorial I followed was the Vertex AI MPG dataset tutorial.
However, when I tried to build and test the container locally using commands:
docker build ./ -t $IMAGE_URI
docker run $IMAGE_URI
The error message came out:
AttributeError: 'DecisionTreeRegressor' object has no attribute 'save'
The code I run as train.py:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit
# Load the Boston housing dataset
data = pd.read_csv('trainer/housing.csv')
prices = data['MEDV']
features = data.drop('MEDV', axis = 1)
# Import 'train_test_split'
from sklearn.model_selection import train_test_split
# Shuffle and split the data into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state = 42)
#Defining model fitting and tuning functions
# Import 'make_scorer', 'DecisionTreeRegressor', and 'GridSearchCV'
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score # Import 'r2_score'
from sklearn.metrics import accuracy_score
# TODO: replace `your-gcs-bucket` with the name of the Storage bucket you created earlier
BUCKET = 'gs://gardena-dps-bucket'
def performance_metric(y_true, y_predict):
    """ Calculates and returns the performance score between
        true (y_true) and predicted (y_predict) values based on the metric chosen. """
    score = r2_score(y_true, y_predict)
    # Return the score
    return score

def fit_model(X, y):
    """ Performs grid search over the 'max_depth' parameter for a
        decision tree regressor trained on the input data [X, y]. """
    # Create cross-validation sets from the training data
    cv_sets = ShuffleSplit(n_splits=10, test_size=0.20, random_state=0)
    # Create a decision tree regressor object
    regressor = DecisionTreeRegressor()
    # Create a dictionary for the parameter 'max_depth' with a range from 1 to 10
    params = {'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
    # Transform 'performance_metric' into a scoring function using 'make_scorer'
    scoring_fnc = make_scorer(performance_metric)
    # Create the grid search cv object --> GridSearchCV()
    # Make sure to include the right parameters in the object:
    # (estimator, param_grid, scoring, cv) which have values 'regressor', 'params', 'scoring_fnc', and 'cv_sets' respectively.
    grid = GridSearchCV(estimator=regressor, param_grid=params, scoring=scoring_fnc, cv=cv_sets)
    # Fit the grid search object to the data to compute the optimal model
    grid = grid.fit(X, y)
    # Return the optimal model after fitting the data
    return grid.best_estimator_
# Fit the training data to the model using grid search
reg = fit_model(X_train, y_train)
# Produce a matrix for client data
client_data = [[12, 26.3, 16.99885]] # Client data in 2D array
# Show predictions
reprice = reg.predict(client_data).astype(int)
reprice
# Export model and save to GCS
reg.save(BUCKET + '/housing/model')
Scikit-learn estimators do not provide a method to save their state directly. Per the Google documentation, the recommended way to store a fitted model in GCS is to serialize it locally with joblib and then upload it to GCS, as follows:
import datetime
import joblib   # sklearn.externals.joblib was removed in newer scikit-learn versions
from google.cloud import storage

# Export the fitted model (reg from train.py above) to a file
model = 'model.joblib'
joblib.dump(reg, model)

# Upload the model to GCS (BUCKET_NAME is your bucket name, without the gs:// prefix)
bucket = storage.Client().bucket(BUCKET_NAME)
blob = bucket.blob('{}/{}'.format(
    datetime.datetime.now().strftime('model_%Y%m%d_%H%M%S'),
    model))
blob.upload_from_filename(model)
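To get the model back later, the reverse path works the same way (a small sketch, assuming the same blob as above):

# download the serialized model from GCS and deserialize it
blob.download_to_filename(model)
restored = joblib.load(model)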

How to save sentence-Bert output vectors to a file?

I am using BERT to get the similarity between multi-term words. Here is the code I used for the embedding:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-large-uncased-whole-word-masking')
words = [
    "Artificial intelligence",
    "Data mining",
    "Political history",
    "Literature book"]
I also have a dataset which contains 540,000 other words:
Vocabs = [
    "Winter flooding",
    "Cholesterol diet", ....]
The problem is that embedding Vocabs into vectors takes forever:
words_embeddings = model.encode(words)
Vocabs_embeddings = model.encode(Vocabs)
Is there any way to make it faster? Or can I embed Vocabs in a for loop and save the output vectors to a file, so I don't have to embed the 540,000 vocabs every time I need them? Is there a way to save the embeddings to a file and use them again?
I would really appreciate your time in helping me.
You can pickle your corpus and embeddings like this; you could also pickle a dictionary instead, or write them to a file in any other format you prefer.
import pickle

with open("my-embeddings.pkl", "wb") as fOut:
    pickle.dump({'sentences': words, 'embeddings': words_embeddings}, fOut)
Or, more generally, like below: you encode the corpus only when the embeddings don't exist yet; after that, any time you need them you load them from the file instead of re-encoding the corpus:
import os
import pickle
import numpy as np

if not os.path.exists(embedding_cache_path):
    # read your corpus etc
    corpus_sentences = ...
    print("Encoding the corpus. This might take a while")
    corpus_embeddings = model.encode(corpus_sentences, show_progress_bar=True, convert_to_numpy=True)
    corpus_embeddings = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)

    print("Storing file on disc")
    with open(embedding_cache_path, "wb") as fOut:
        pickle.dump({'sentences': corpus_sentences, 'embeddings': corpus_embeddings}, fOut)
else:
    print("Loading pre-computed embeddings from disc")
    with open(embedding_cache_path, "rb") as fIn:
        cache_data = pickle.load(fIn)
        corpus_sentences = cache_data['sentences']
        corpus_embeddings = cache_data['embeddings']
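Once the cached embeddings are loaded, the similarities the question asks about can be computed from them directly. A sketch using scikit-learn's cosine_similarity (not part of the original answer):

from sklearn.metrics.pairwise import cosine_similarity

words_embeddings = model.encode(words)
similarities = cosine_similarity(words_embeddings, corpus_embeddings)
# similarities[i, j] is the cosine similarity between words[i] and corpus_sentences[j]
print(similarities.shape)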

How to display categorical values on the exported tree image of a decision tree classifier?

I am trying to export the decision tree as an image with the original labels of all categorical fields.
The current data I have is like so:
I transformed the categorical features into numerical:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('data.csv')
X = dataset.iloc[:, 0:4]
y = dataset.iloc[:, 4]
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
X['Outlook'] = lb.fit_transform(X['Outlook'])
X['Temp'] = lb.fit_transform(X['Temp'])
X['Humidity'] = lb.fit_transform(X['Humidity'])
X['Windy'] = lb.fit_transform(X['Windy'])
y = lb.fit_transform(y)
Afterwards, I applied the DecisionTreeClassifier:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(criterion="entropy")
dtc.fit(X, y)
At the end, I needed to check the tree generated from the model using the following:
# Import tools needed for visualization
from sklearn.tree import export_graphviz
import pydot
# Export the tree to a dot file
export_graphviz(dtc, out_file='tree.dot', feature_names=X.columns, rounded=True, precision=1)
# Use dot file to create a graph
(graph, ) = pydot.graph_from_dot_file('tree.dot')
# Write graph to a png file
graph.write_png('tree.png')
The tree.png:
But what I really need is to see the original labels of each feature inside the nodes, or at each branch, instead of True/False or a numeric representation.
I tried the following:
y=lb.inverse_transform(y)
I did the same for the X features, but the tree is generated exactly the same as above.
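A note on that last attempt: inverse_transform only changes the arrays in memory; the fitted tree still splits on the encoded values, so re-exporting produces the same image. What export_graphviz can do is print the original target labels at the nodes via its class_names argument. A small sketch (lb.classes_ still holds the original target labels because lb was last fitted on y):

export_graphviz(dtc, out_file='tree.dot',
                feature_names=X.columns,
                class_names=lb.classes_,
                rounded=True, precision=1)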
