Coiled: Use local file to train XGBoost classifier - dask

I want to train an XGBoost classifier with Coiled and Dask.
The problem is that my training data is really big and is stored in an HDF5 file on my computer. Is there a way to upload the HDF5 file directly to the workers?
To show my problem, I created an example. For this example, I create some random data and store it in an HDF5 file so you can see what my data looks like. In my real use case, the data has 7,245,346 features and 2,157 samples.
import coiled
import h5py
import numpy as np
import dask.array as da
from dask.distributed import Client
import xgboost as xgb

input_path = "test.h5"

# create some random data
n_features = 500
n_samples = 200
X = np.random.randint(0, 3, size=[n_samples, n_features])
y = np.random.randint(0, 5, size=[n_samples])
with h5py.File(input_path, mode='w') as file:
    file.create_dataset('X', data=X)
    file.create_dataset('y', data=y)

rows_per_chunk = 100
coiled.create_software_environment(
    name="xgboost-on-coiled",
    pip=["coiled", "h5py", "dask", "xgboost"])
with coiled.Cluster(
        name="xgboost-cluster",
        n_workers=2,
        worker_cpu=8,
        worker_memory="16GiB",
        software="xgboost-on-coiled") as cluster:
    with Client(cluster) as client:
        file = h5py.File(input_path, mode='r')
        n_features = file["X"].shape[1]
        X = da.from_array(file["X"], chunks=(rows_per_chunk, n_features))
        X = X.astype("int8")  # astype returns a new array, so assign the result
        X = X.persist()       # persist also returns a new collection
        y = da.from_array(file["y"], chunks=rows_per_chunk)
        n_class = np.unique(y.compute()).size
        y = y.astype("int8")
        y = y.persist()
        dtrain = xgb.dask.DaskDMatrix(
            client,
            X,
            y,
            feature_names=['%i' % i for i in range(n_features)])
        model_params = {
            'objective': 'multi:softprob',
            'eval_metric': 'mlogloss',
            'num_class': n_class}
        # train model
        output = xgb.dask.train(
            client,
            params=model_params,
            dtrain=dtrain)
        booster = output["booster"]
The error message:
FileNotFoundError: [Errno 2] Unable to open file (unable to open file: name = 'test.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
For smaller amounts of data, I can load everything into RAM first, but for more data this does not work anymore. Just so you know what I am talking about:
input_path = "test.h5"
n_features = 500
n_samples = 200
X = np.random.randint(0,3,size=[n_samples, n_features])
y = np.random.randint(0,5,size=[n_samples])
with h5py.File(input_path, mode='w') as file:
file.create_dataset('X', data=X)
file.create_dataset('y', data=y)
rows_per_chunk = 100
coiled.create_software_environment(
name="xgboost-on-coiled",
pip=["coiled", "h5py", "dask", "xgboost"])
with coiled.Cluster(
name="xgboost-cluster",
n_workers=2,
worker_cpu=8,
worker_memory="16GiB",
software="xgboost-on-coiled") as cluster:
with Client(cluster) as client:
file = h5py.File(input_path, mode='r')
n_features = file["X"].shape[1]
X = file["X"][:]
X = da.from_array(X, chunks=(rows_per_chunk, n_features))
y = file["y"][:]
n_class = np.unique(y).size
y = da.from_array(y, chunks=rows_per_chunk)
dtrain = xgb.dask.DaskDMatrix(
client,
X,
y,
feature_names=['%i' % i for i in range(n_features)])
model_params = {
'objective': 'multi:softprob',
'eval_metric': 'mlogloss',
'num_class': n_class}
# train model
output = xgb.dask.train(
client,
params=model_params,
dtrain=dtrain)
booster = output["booster"]
If this code is used with large amounts of data, no error message is displayed; simply nothing happens, and I do not see the data being uploaded.
I have tried so many things and nothing has worked. I would be very grateful for any advice on how to do this.
(Just in case you are wondering why I am trying to train a model on 7 million features: I want to get the feature importances for feature selection.)

Is there a way to upload the HDF5 file directly to the workers?
When using Coiled, the recommended way is to upload the data to an AWS S3 bucket (or similar) and read it directly from there. This is because Coiled provisions Dask clusters in the cloud, and there is a cost to moving data (e.g., from your local machine to the cloud). It's more efficient to have your data in the cloud and, if possible, in the same AWS region. Also see the Coiled documentation: How do I access my data from Coiled?.
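For example, one way to stage the data is to convert the local HDF5 datasets to Zarr on S3, which Dask workers can then read chunk by chunk. This is only a sketch, not an official Coiled recipe: the bucket name is hypothetical, it reuses input_path and rows_per_chunk from the question, and it assumes s3fs and zarr are installed (add them to the software environment) and AWS credentials are configured.

import s3fs
import zarr

store = s3fs.S3Map("s3://my-example-bucket/test.zarr", s3=s3fs.S3FileSystem())
# Convert the local HDF5 datasets to Zarr on S3 (computed locally, written to S3).
with h5py.File(input_path, mode='r') as file:
    da.from_array(file["X"], chunks=(rows_per_chunk, file["X"].shape[1])).to_zarr(store, component="X")
    da.from_array(file["y"], chunks=rows_per_chunk).to_zarr(store, component="y")
# Inside the cluster context, each worker then reads only its own chunks from S3:
X = da.from_zarr(store, component="X")
y = da.from_zarr(store, component="y")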

Related

Saving tensors to a .pt file in order to create a dataset

I was tasked with the creation of a dataset to test the functionality of the code we're working on.
The dataset must have a group of tensors that will be used later on in a generative model.
I'm trying to save the tensors to a .pt file, but I'm overwriting the tensors, thus creating a file with only one. I've read about torch.utils.data.Dataset, but I'm not able to figure out on my own how to use it.
Here is my code:
import torch
import numpy as np
from torch.utils.data import Dataset

# variables that will be used to create the size of the tensors:
num_jets, num_particles, num_features = 1, 30, 3
for i in range(100):
    # tensor from a gaussian dist with mean=5, std=1 and shape=size:
    tensor = torch.normal(5, 1, size=(num_jets, num_particles, num_features))
    # We will need the tensors to be of the cpu type
    tensor = tensor.cpu()
    # save the tensor to 'tensor_dataset.pt'
    torch.save(tensor, 'tensor_dataset.pt')

# open the recently created .pt file
tensor_list = torch.load('tensor_dataset.pt')
# prints the result: just one tensor inside the .pt file
print(tensor_list)
Reason: You overwrote tensor on each iteration of the loop, so you never built a list; the file only contained the last tensor at the end.
Solution: Since you know the size of the tensors, you can preallocate one big tensor and fill it index by index:
import torch
import numpy as np
from torch.utils.data import Dataset

num_jets, num_particles, num_features = 1, 30, 3
lst_tensors = torch.empty(size=(100, num_jets, num_particles, num_features))
for i in range(100):
    lst_tensors[i] = torch.normal(5, 1, size=(num_jets, num_particles, num_features))
    lst_tensors[i] = lst_tensors[i].cpu()

torch.save(lst_tensors, 'tensor_dataset.pt')
tensor_list = torch.load('tensor_dataset.pt')
print(tensor_list.shape)  # torch.Size([100, 1, 30, 3])
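Since the question mentions torch.utils.data.Dataset: once the tensors are saved this way, they can be wrapped in a TensorDataset for batching. A minimal sketch, assuming the tensor_dataset.pt file created above:

from torch.utils.data import TensorDataset, DataLoader

# TensorDataset indexes along the first dimension, so each sample is one jet tensor.
tensors = torch.load('tensor_dataset.pt')
dataset = TensorDataset(tensors)
loader = DataLoader(dataset, batch_size=10)
for (batch,) in loader:
    print(batch.shape)  # torch.Size([10, 1, 30, 3])
    break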

FastAI PyTorch train_loss and valid_loss look very good, but the model recognizes nothing

Update 1
I'm thinking that it might be a mistake in my detector code.
So, here is my code for using the trained learner/model to predict images.
import requests
import cv2
import numpy as np
from fastai.vision import Image, pil2tensor

# `url`, `learn`, and `self` come from the surrounding class (not shown)
bytes = b''
stream = requests.get(url, stream=True)
bytes = bytes + stream.raw.read(1024)  # my mobile streams video to this url; the stream resolution is 2048 x 1080
a = bytes.find(b'\xff\xd8')  # JPEG start-of-image marker
b = bytes.find(b'\xff\xd9')  # JPEG end-of-image marker
if a != -1 and b != -1:
    jpg = bytes[a:b+2]
    bytes = bytes[b+2:]
    img = cv2.imdecode(np.frombuffer(jpg, dtype=np.uint8), cv2.IMREAD_COLOR)  # np.fromstring is deprecated
    processedImg = Image(pil2tensor(img, np.float32).div_(255))
    predict = learn.predict(processedImg)
    self.objectClass = predict[0].obj
I read the documentation of the imdecode() method, and it returns the image in B G R order.
Could it be because different channel orders were used in training and detection?
Or could it be because I trained with image size 299 x 450, but when detecting, the input image from the video stream is 2048 x 1080 without resizing it?
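(If either of those is the culprit, the fix would look roughly like the sketch below; it assumes the fastai v1 Image/pil2tensor calls from the snippet above.)

# Convert BGR (OpenCV's order) to RGB to match training, and resize to the
# training size; note cv2.resize takes (width, height).
img = cv2.imdecode(np.frombuffer(jpg, dtype=np.uint8), cv2.IMREAD_COLOR)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
img = cv2.resize(img, (450, 299))
processedImg = Image(pil2tensor(img, np.float32).div_(255))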
I'm new to FastAI, ML, and Python. I trained my "Birds or Not-Birds" model. The train_loss, valid_loss, and error_rate were improving. If I trained for only 3 epochs, the model worked (meaning it could recognize whether there were birds in images). Then I increased to 30 epochs: all metrics look very good, but the model no longer recognizes anything; whatever images I input, the model always returns Not-Birds.
Here are the training output and the plots from learn.recorder (images not reproduced here).
Here is my code:
from fastai.vision import *
from fastai.metrics import error_rate
from fastai.callbacks import EarlyStoppingCallback, SaveModelCallback
from datetime import datetime as dt
from functools import partial

path_img = '/minidata'
train_folder = 'train'
valid_folder = 'validation'
tunedTransform = partial(get_transforms, max_zoom=1.5)
data = ImageDataBunch.from_folder(path=path_img, train=train_folder, valid=valid_folder, ds_tfms=tunedTransform(),
                                  size=(299, 450), bs=40, classes=['birds', 'others'],
                                  resize_method=ResizeMethod.SQUISH)
data = data.normalize(imagenet_stats)
learn = cnn_learner(data, models.resnet50, metrics=error_rate)
learn.fit_one_cycle(30, max_lr=slice(5e-5, 5e-4))
learn.recorder.plot_lr()
learn.recorder.plot()
learn.recorder.plot_losses()
Here is my dataset folder structure:
minidata
    train
        birds (7500 images)
        others (around 7300 images)
    validation
        birds (1008 images)
        others (around 872 images)
Your learning rate schedule is sub-optimal for this dataset. First figure out the best learning rate for this network and dataset with the LR Finder. This can be done by exploring the loss behavior for different learning rates with
learn.lr_find()
learn.recorder.plot()
Edit:
It looks like you are only re-training the last layer group of your network. Instead, try unfreezing and training more layers, e.g.:
learn.unfreeze()  # unfreeze() takes no arguments in fastai v1; use learn.freeze_to(-2) to unfreeze just the last two layer groups
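Putting both suggestions together, the retraining loop might look like the following sketch; it uses the fastai v1 API from the question, and the learning rates are placeholders to be read off the lr_find plot:

learn = cnn_learner(data, models.resnet50, metrics=error_rate)
learn.lr_find()
learn.recorder.plot()                      # pick a learning rate from this plot
learn.fit_one_cycle(5, max_lr=1e-3)        # train the head first
learn.unfreeze()                           # then fine-tune the whole network
learn.fit_one_cycle(5, max_lr=slice(1e-5, 1e-4))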

My speaker recognition neural network doesn’t work well

I have a final project in my first degree, and I want to build a neural network that takes the first 13 MFCC coefficients of a WAV file and returns who talked in the audio file, out of a bunch of speakers.
I want you to notice that:
My audio files are text-independent, therefore they have different lengths and words.
I have trained the machine on about 35 audio files of 10 speakers (the first speaker had about 15, the second 10, and the third and fourth about 5 each).
I defined:
X = mfcc(sound_voice)
Y = zero_array + 1 in the i-th position (where the i-th position is 0 for the first speaker, 1 for the second, 2 for the third...)
And then I trained the machine and checked its output for some files...
So that's what I did... but unfortunately it looks like the results are completely random...
Can you help me understand why?
This is my code in Python -
from sklearn.neural_network import MLPClassifier
import python_speech_features
import scipy.io.wavfile as wav
import numpy as np
from os import listdir
from os.path import isfile, join
from random import shuffle
import matplotlib.pyplot as plt
from tqdm import tqdm

winner = []  # this array counts how many Bingos we had when we test the NN
for TestNum in tqdm(range(5)):  # in every round we build an NN from X, Y and hold out 50 samples to check it
    X = []
    Y = []
    onlyfiles = [f for f in listdir("FinalAudios/") if isfile(join("FinalAudios/", f))]  # files in dir
    names = []  # names of the speakers
    for file in onlyfiles:  # for each wav sound
        # NOT NECESSARY TO UNDERSTAND THE CODE
        if " " not in file.split("_")[0]:
            names.append(file.split("_")[0])
        else:
            names.append(file.split("_")[0].split(" ")[0])
    names = list(dict.fromkeys(names))  # unique names of speakers
    vector_names = []  # one-hot vector for each name
    i = 0
    vector_for_each_name = [0] * len(names)
    for name in names:
        vector_for_each_name[i] += 1
        vector_names.append(np.array(vector_for_each_name))
        vector_for_each_name[i] -= 1
        i += 1
    for f in onlyfiles:
        if " " not in f.split("_")[0]:
            f_speaker = f.split("_")[0]
        else:
            f_speaker = f.split("_")[0].split(" ")[0]
        (rate, sig) = wav.read("FinalAudios/" + f)  # read the file
        try:
            mfcc_feat = python_speech_features.mfcc(sig, rate, winlen=0.2, nfft=512)  # mfcc coeffs
            for index in range(len(mfcc_feat)):  # adding each mfcc coeff to X, meaning if there are 50000 coeffs then
                # X will be [first coeff, second, ..., 50000'th coeff] and Y will be [f_speaker_vector] * 50000
                X.append(np.array(mfcc_feat[index]))
                Y.append(np.array(vector_names[names.index(f_speaker)]))
        except IndexError:
            pass
    Z = list(zip(X, Y))
    shuffle(Z)  # we shuffle X, Y so the train/test split is random
    X, Y = zip(*Z)
    X = list(X)
    Y = list(Y)
    X = np.asarray(X)
    Y = np.asarray(Y)
    Y_test = Y[:50]  # choose 50 for test, the others for train
    X_test = X[:50]
    X = X[50:]
    Y = Y[50:]
    clf = MLPClassifier(solver='lbfgs', alpha=1e-2, hidden_layer_sizes=(5, 3), random_state=2)  # create the NN
    clf.fit(X, Y)  # train it
    for sample in range(len(X_test)):  # append 1 to winner if we are correct and 0 if not; at the end we plot it
        if list(clf.predict([X_test[sample]])[0]) == list(Y_test[sample]):  # predict on the held-out sample
            winner.append(1)
        else:
            winner.append(0)

# plot winner
plot_x = []
plot_y = []
for i in range(1, len(winner)):
    plot_y.append(sum(winner[0:i]) * 1.0 / len(winner[0:i]))
    plot_x.append(i)
plt.plot(plot_x, plot_y)
plt.xlabel('x - axis')
# naming the y axis
plt.ylabel('y - axis')
# giving a title to my graph
plt.title('My first graph!')
# function to show the plot
plt.show()
This is my zip file that contains the code and the audio files: https://ufile.io/eggjm1gw
You have a number of issues in your code, and it will be close to impossible to get it right in one go, but let's give it a try. There are two major issues:
Currently you're trying to teach your neural network with very few training examples, as few as a single one per speaker (!). It's impossible for any machine learning algorithm to learn anything from that.
To make matters worse, you feed the ANN only the MFCC for the first 25 ms of each recording (25 comes from the winlen parameter of python_speech_features). In each of these recordings, the first 25 ms will be close to identical. Even if you had 10k recordings per speaker, with this approach you'd not get anywhere.
I will give you concrete advice, but won't do all the coding - it's your homework after all.
Use all the MFCCs, not just the first 25 ms. Many of them should be skipped, simply because there's no voice activity. Normally there would be a VAD (Voice Activity Detector) telling you which ones to take, but in this exercise I'd skip it for starters (you need to learn the basics first).
Don't use dictionaries. Not only will they not fly with more than one MFCC vector per speaker, they're also a very inefficient data structure for your task. Use numpy arrays; they're much faster and more memory efficient. There are a ton of tutorials, including scikit-learn's, that demonstrate how to use numpy in this context. In essence, you create two arrays: one with the training data, the second with the labels. Example: if the speaker omersk "produces" 50000 MFCC vectors, you will get a (50000, 13) training array. The corresponding label array would have 50000 entries with a single constant value (id) that corresponds to the speaker (say, omersk is 0, lucas is 1, and so on). I'd consider taking longer windows (perhaps 200 ms; experiment!) to reduce the variance.
Don't forget to split your data into training, validation and test sets. You will have more than enough data. Also, for this exercise I'd watch out for feeding too much data from any single speaker, or take steps to make sure the algorithm is not biased.
Later, when you make a prediction, you will again compute MFCCs for the speaker. With a 10 sec recording, a 200 ms window and 100 ms overlap, you'll get 99 MFCC vectors, shape (99, 13). The model should run on each of the 99 vectors, producing a probability for each. When you sum them (and normalise, to make it nice) and take the top value, you'll get the most likely speaker.
There are a dozen other things that would typically be taken into account, but in this case (homework) I'd focus on getting the basics right.
EDIT: I decided to take a stab at creating the model with your idea at heart, but with the basics fixed. It's not exactly clean Python, because it's adapted from a Jupyter Notebook I was running.
import python_speech_features
import scipy.io.wavfile as wav
import numpy as np
import glob
import os
from collections import defaultdict
from sklearn.neural_network import MLPClassifier
from sklearn import preprocessing
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier

audio_files_path = glob.glob('audio/*.wav')
win_len = 0.04  # in seconds
step = win_len / 2
nfft = 2048

mfccs_all_speakers = []
names = []
data = []
for path in audio_files_path:
    fs, audio = wav.read(path)
    if audio.size > 0:
        mfcc = python_speech_features.mfcc(audio, samplerate=fs, winlen=win_len,
                                           winstep=step, nfft=nfft, appendEnergy=False)
        filename = os.path.splitext(os.path.basename(path))[0]
        speaker = filename[:filename.find('_')]
        data.append({'filename': filename,
                     'speaker': speaker,
                     'samples': mfcc.shape[0],
                     'mfcc': mfcc})
    else:
        print(f'Skipping {path} due to 0 file size')

speaker_sample_size = defaultdict(int)
for entry in data:
    speaker_sample_size[entry['speaker']] += entry['samples']

person_with_fewest_samples = min(speaker_sample_size, key=speaker_sample_size.get)
print(person_with_fewest_samples)

max_accepted_samples = int(speaker_sample_size[person_with_fewest_samples] * 0.8)
print(max_accepted_samples)

training_idx = []
test_idx = []
accumulated_size = defaultdict(int)
for entry in data:
    if entry['speaker'] not in accumulated_size:
        training_idx.append(entry['filename'])
        accumulated_size[entry['speaker']] += entry['samples']
    elif accumulated_size[entry['speaker']] < max_accepted_samples:
        accumulated_size[entry['speaker']] += entry['samples']
        training_idx.append(entry['filename'])

X_train = []
label_train = []
X_test = []
label_test = []
for entry in data:
    if entry['filename'] in training_idx:
        X_train.append(entry['mfcc'])
        label_train.extend([entry['speaker']] * entry['mfcc'].shape[0])
    else:
        X_test.append(entry['mfcc'])
        label_test.extend([entry['speaker']] * entry['mfcc'].shape[0])

X_train = np.concatenate(X_train, axis=0)
X_test = np.concatenate(X_test, axis=0)
assert (X_train.shape[0] == len(label_train))
assert (X_test.shape[0] == len(label_test))
print(f'Training: {X_train.shape}')
print(f'Testing: {X_test.shape}')

le = preprocessing.LabelEncoder()
y_train = le.fit_transform(label_train)
y_test = le.transform(label_test)

clf = MLPClassifier(solver='lbfgs', alpha=1e-2, hidden_layer_sizes=(5, 3), random_state=42, max_iter=1000)
cv_results = cross_validate(clf, X_train, y_train, cv=4)
print(cv_results)
{'fit_time': array([3.33842635, 4.25872731, 4.73704267, 5.9454329 ]),
'score_time': array([0.00125694, 0.00073504, 0.00074005, 0.00078583]),
'test_score': array([0.40380048, 0.52969121, 0.48448687, 0.46043165])}
The test_score isn't stellar. There's a lot to improve (for starters, the choice of algorithm), but the basics are there. Notice how I get the training samples: it's not random, I only consider recordings as a whole. You can't put samples from a given recording into both training and test, as the test set is supposed to be novel.
What was not working in your code? I'd say a lot. You were taking 200 ms samples and yet a very short FFT. python_speech_features likely warned you that the FFT should be longer than the frame you're processing.
I leave testing the model to you. It won't be good, but it's a starting point.
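As a concrete follow-up to the prediction recipe above, a minimal sketch (the recording name is hypothetical; clf, le, and the MFCC parameters come from the code above, and clf has to be fitted explicitly because cross_validate only fits clones):

clf.fit(X_train, y_train)  # cross_validate fits clones, so fit the classifier itself

fs, audio = wav.read('new_recording.wav')  # hypothetical recording
mfcc_new = python_speech_features.mfcc(audio, samplerate=fs, winlen=win_len,
                                       winstep=step, nfft=nfft, appendEnergy=False)
probs = clf.predict_proba(mfcc_new)   # shape (n_frames, n_speakers)
mean_probs = probs.mean(axis=0)       # average (sum and normalise) over frames
print(le.inverse_transform([mean_probs.argmax()])[0])  # most likely speaker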

How to save a TensorFlow model (which doesn't contain any variables) to port it to OpenCV

I wanted to know the correct way to save a TensorFlow model that I have trained in Python so that I can import it into OpenCV using OpenCV's dnn module. This is my TensorFlow graph:
import tensorflow as tf

X = tf.placeholder(tf.float32, [None, training_set.shape[1]], name='X')
Y = tf.placeholder(tf.float32,[None,training_labels.shape[1]], name = 'Y')
A1 = tf.contrib.layers.fully_connected(X, num_outputs = 50, activation_fn = tf.nn.relu)
A1 = tf.nn.dropout(A1, 0.8)
A2 = tf.contrib.layers.fully_connected(A1, num_outputs = 2, activation_fn = None)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = A2, labels = Y))
global_step = tf.Variable(0, trainable=False)
start_learning_rate = 0.001
learning_rate = tf.train.exponential_decay(start_learning_rate, global_step, 100, 0.1, True )
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
As you can see, it doesn't contain any variables. So my question is: how should this graph be saved in TensorFlow so that it can be loaded using cv::dnn::readNetFromTensorflow? Should I save the model as a .pb or a .pbtxt file? Will the .pb or .pbtxt contain the graph as well as the weights, or just the graph? And how do I load both the graph and the weights in OpenCV?
The code from the link posted by the OP is reproduced here, since URLs may change and code may be renamed or vanish.
I guess the first question is how to save the graph so that it can at least be loaded in TensorFlow again, because you need a way to restore it. Here is one way to do it:
-- Save
# Save a graph definition (once)
tf.train.write_graph(sess.graph.as_graph_def(), "", "graph.pb")
# Weights initialization
sess.run(tf.global_variables_initializer())
# Training
...
# Save a checkpoint (weights only, no graph definition)
saver = tf.train.Saver()
saver.save(sess, 'tmp.ckpt')
-- Freeze (merge graph definition with weights, remove training-only nodes)
python ~/tensorflow/tensorflow/python/tools/freeze_graph.py \
--input_graph=graph.pb \
--input_checkpoint=tmp.ckpt \
--output_graph=frozen_graph.pb \
--output_node_names="NameOfOutputNode"
Only after these steps can you load frozen_graph.pb, which contains both the graph definition and the weights, using OpenCV.
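Loading it in OpenCV might then look like the sketch below. readNetFromTensorflow, setInput, and forward are the actual dnn API; the input size is an assumption standing in for training_set.shape[1].

import numpy as np
import cv2

net = cv2.dnn.readNetFromTensorflow('frozen_graph.pb')
# For this fully-connected graph the input is a plain feature vector, not an image.
sample = np.random.rand(1, 100).astype(np.float32)  # 100 is a placeholder feature count
net.setInput(sample)
out = net.forward()  # activations of the output node
print(out)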

Labeling images using Inception: getting ValueError: GraphDef cannot be larger than 2GB

I am using the TensorFlow for Poets codelab to guide me as I retrain the Inception v3 CNN to classify a list of images. I have successfully trained the model, and it works when I employ the given code to classify individual images. But when I try to use it on a large batch of images, I get the error GraphDef cannot be larger than 2GB. Please advise.
import pandas as pd
import os, sys
import tensorflow as tf

test_images = pd.read_csv('test_images.csv')
testid = test_images['Id']
listx = list(range(4320))
predlist = []
output = pd.DataFrame({'Id': listx})
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
for x in listx:
    path = 'test/' + str(x + 1) + '.jpg'
    # change this as you see fit
    image_path = path
    # Read in the image_data
    image_data = tf.gfile.FastGFile(image_path, 'rb').read()
    # Loads label file, strips off carriage return
    label_lines = [line.rstrip() for line
                   in tf.gfile.GFile("retrained_labels.txt")]
    # Unpersists graph from file
    with tf.gfile.FastGFile("retrained_graph.pb", 'rb') as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
        tf.import_graph_def(graph_def, name='')
    with tf.Session() as sess:
        # Feed the image_data as input to the graph and get first prediction
        with tf.Graph().as_default():
            softmax_tensor = sess.graph.get_tensor_by_name('final_result:0')
            predictions = sess.run(softmax_tensor,
                                   {'DecodeJpeg/contents:0': image_data})
            # Sort to show labels of first prediction in order of confidence
            top_k = predictions[0].argsort()[-len(predictions[0]):][::-1]
            # print('the top result is' + label_lines[node_id])
            flag = 0
            for node_id in top_k:
                while flag == 0:
                    human_string = label_lines[node_id]
                    score = predictions[0][node_id]
                    predlist.append(int(human_string[:3]))
                    print('%s' % (human_string))
                    flag = 1  # we only want the top prediction

output['Prediction'] = predlist
output.to_csv('outputtest.csv')
One way this error can be solved is by placing
with tf.Graph().as_default():
immediately after the for statement (i.e., at the top of the loop body): each iteration then imports the graph into a fresh graph object instead of growing the default graph past the 2GB limit.
This is the piece of code that worked for me while trying to read bulk images:

for filename in os.listdir(image_path):
    with tf.Graph().as_default():
        # Read in the image_data
        image_data = tf.gfile.FastGFile(image_path + filename, 'rb').read()
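Alternatively, since the graph never changes between images, it can be imported once and a single session reused; a sketch using the same TF1 API as the question:

# Import the frozen graph a single time...
with tf.gfile.FastGFile("retrained_graph.pb", 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name='')

# ...then reuse one session for every image, so the GraphDef never grows.
with tf.Session() as sess:
    softmax_tensor = sess.graph.get_tensor_by_name('final_result:0')
    for x in listx:
        image_data = tf.gfile.FastGFile('test/' + str(x + 1) + '.jpg', 'rb').read()
        predictions = sess.run(softmax_tensor,
                               {'DecodeJpeg/contents:0': image_data})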
