Correct way to use librosa mfcc with Random Forest - random-forest

I would like to know what is the best approach in order to use librosa.feature.mfcc feature extraction in a Random Forest classifier?
2 cases is as follows:
Case 1:
I have 1000 audio files and use the librosa mfcc feature extraction as is:
def extract_features(file_name):
try:
durationSeconds = 1
audio, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
trimmed = librosa.util.fix_length(audio, size=int(sample_rate * durationSeconds))
mfccs = librosa.feature.mfcc(y=trimmed, sr=sample_rate, n_mfcc=40)
pad_width = max_pad_len - mfccs.shape[1]
mfccs = np.pad(mfccs, pad_width=((0, 0), (0, pad_width)), mode='constant')
except Exception as e:
print("Error encountered while parsing file: ", file_name)
return None
return mfccs
I would then flatten the 3D array generated by this feature extraction via:
X_train_flat = np.array([features_2d.flatten() for features_2d in X_train])
and then send this to the Random Forest classifier.
Case 2:
For the same 1000 audio files, I use:
def extract_features(file_name):
try:
durationSeconds = 1
audio, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
trimmed = librosa.util.fix_length(audio, size=int(sample_rate * durationSeconds))
mfccs = librosa.feature.mfcc(y=trimmed, sr=sample_rate, n_mfcc=40)
pd.Series(np.hstack((np.mean( mfccs, axis=1), np.std( mfccs, axis=1)))) #returns mfcc features with mean and standard deviation along time
except Exception as e:
print("Error encountered while parsing file: ", file_name)
return None
return pd.Series([0]*40)
I would then pass this X_train data to the Random Forest Classifier.
rfc.fit(X_train, y_train)
Note that in the first use case, when I flatten the data, I get X_train of size 1000 x 6920. This, in effect, have me parsing 6920 features to the Random Forest classifier for analysis compared to the 2nd use case of 40.
Can you tell me which approach is correct?
Thanks!

Related

R: Error in predict.xgboost: Feature names stored in `object` and `newdata` are different

I wrote a script using xgboost to predict soil class for a certain area using data from field and satellite images. The script as below:
`
rm(list=ls())
library(xgboost)
library(caret)
library(raster)
library(sp)
library(rgeos)
library(ggplot2)
setwd("G:/DATA")
data <- read.csv('96PointsClay02finalone.csv')
head(data)
summary(data)
dim(data)
ras <- stack("Allindices04TIFF.tif")
names(ras) <- c("b1", "b2", "b3", "b4", "b5", "b6", "b7", "b10", "b11","DEM",
"R1011", "SCI", "SAVI", "NDVI", "NDSI", "NDSandI", "MBSI",
"GSI", "GSAVI", "EVI", "DryBSI", "BIL", "BI","SRCI")
set.seed(27) # set seed for generating random data.
# createDataPartition() function from the caret package to split the original dataset into a training and testing set and split data into training (80%) and testing set (20%)
parts = createDataPartition(data$Clay, p = .8, list = F)
train = data[parts, ]
test = data[-parts, ]
#define predictor and response variables in training set
train_x = data.matrix(train[, -1])
train_y = train[,1]
#define predictor and response variables in testing set
test_x = data.matrix(test[, -1])
test_y = test[, 1]
#define final training and testing sets
xgb_train = xgb.DMatrix(data = train_x, label = train_y)
xgb_test = xgb.DMatrix(data = test_x, label = test_y)
#defining a watchlist
watchlist = list(train=xgb_train, test=xgb_test)
#fit XGBoost model and display training and testing data at each iteartion
model = xgb.train(data = xgb_train, max.depth = 3, watchlist=watchlist, nrounds = 100)
#define final model
model_xgboost = xgboost(data = xgb_train, max.depth = 3, nrounds = 86, verbose = 0)
summary(model_xgboost)
#use model to make predictions on test data
pred_y = predict(model_xgboost, xgb_test)
# performance metrics on the test data
mean((test_y - pred_y)^2) #mse - Mean Squared Error
caret::RMSE(test_y, pred_y) #rmse - Root Mean Squared Error
y_test_mean = mean(test_y)
rmseE<- function(error)
{
sqrt(mean(error^2))
}
y = test_y
yhat = pred_y
rmseresult=rmseE(y-yhat)
(r2 = R2(yhat , y, form = "traditional"))
cat('The R-square of the test data is ', round(r2,4), ' and the RMSE is ', round(rmseresult,4), '\n')
#use model to make predictions on satellite image
result <- predict(model_xgboost, ras[1:(nrow(ras)*ncol(ras))])
#create a result raster
res <- raster(ras)
#fill in results and add a "1" to them (to get back to initial class numbering! - see above "Prepare data" for more information)
res <- setValues(res,result+1)
#Save the output .tif file into saved directory
writeRaster(res, "xgbmodel_output", format = "GTiff", overwrite=T)
`
The script works well till it reachs
result <- predict(model_xgboost, ras[1:(nrow(ras)*ncol(ras))])
it takes some time then gives this error:
Error in predict.xgb.Booster(model_xgboost, ras[1:(nrow(ras) * ncol(ras))]) :
Feature names stored in `object` and `newdata` are different!
I realize that I am doing something wrong in that line. However, I do not know how to apply the xgboost model to a raster image that represents my study area.
It would be highly appreciated if someone give a hand, enlightened me, and helped me solve this problem....
My data as csv and raster image can be found here.
Finally, I got the reason for this error.
It was my mistake as the number of columns in the traning data was not the same as in the number of layers in the satellite image.

XGBoost Hyperparameter Tuning using Hyperopt

I am trying to tune my XGBClassifier model. But I am failing to do so. Please find the code below and please help me clean and edit the code.
import csv
from hyperopt import STATUS_OK
from timeit import default_timer as timer
MAX_EVALS = 200
N_FOLDS = 10
def objective(params, n_folds = N_FOLDS):
"""Objective function for Gradient Boosting Machine Hyperparameter Optimization"""
# Keep track of evals
global ITERATION
ITERATION += 1
# Retrieve the subsample if present otherwise set to 1.0
subsample = params['boosting_type'].get('subsample', 1.0)
# Extract the boosting type
params['boosting_type'] = params['boosting_type']['boosting_type']
params['subsample'] = subsample
# Make sure parameters that need to be integers are integers
for parameter_name in ['num_leaves', 'subsample_for_bin',
'min_child_samples']:
params[parameter_name] = int(params[parameter_name])
start = timer()
# Perform n_folds cross validation
cv_results = lgb.cv(params, train_set, num_boost_round = 10000,
nfold = n_folds, early_stopping_rounds = 100,
metrics = 'auc', seed = 50)
run_time = timer() - start
# Extract the best score
best_score = np.max(cv_results['auc-mean'])
# Loss must be minimized
loss = 1 - best_score
# Boosting rounds that returned the highest cv score
n_estimators = int(np.argmax(cv_results['auc-mean']) + 1)
# Write to the csv file ('a' means append)
of_connection = open(out_file, 'a')
writer = csv.writer(of_connection)
writer.writerow([loss, params, ITERATION, n_estimators,
run_time])
# Dictionary with information for evaluation
return {'loss': loss, 'params': params, 'iteration': ITERATION,
'estimators': n_estimators, 'train_time': run_time,
'status': STATUS_OK}
I believe I am doing something wrong in the objective function, as I am trying to edit the objective function of LightGBM.
Please help me.
I created the hgboost library which provides XGBoost Hyperparameter Tuning using Hyperopt.
pip install hgboost
Examples can be found here

seq2seq - Encoder backpropagation

I have a trouble understanding how back-propagation works in Encoder in seq2seq model. There are no labels, therefore it's not possible to calculate error, which is back-propagated, however weights of LSTM layer are somehow updated.
l_enc_input = Input(batch_shape=(batch_size, None, embedding_size))
l_enc_lstm = LSTM(encoding_size, return_sequences=False, return_state=True, stateful=True, dropout=0.2)
l_dec_input = Input(batch_shape=(batch_size, None, embedding_size))
l_dec_lstm = LSTM(encoding_size, return_sequences=False, stateful=True, dropout=0.2)
l_dec_dense = Dense(embedding_size, activation="softmax")
t_enc_out = l_enc_lstm(l_enc_input)
state = t_enc_out[1:]
t_dec_out = l_dec_dense(l_dec_lstm(l_dec_input, initial_state=state))
model_train = Model(inputs=[l_enc_input, l_dec_input], outputs=[t_dec_out])
model_train.compile(optimizer="adam", loss="categorical_crossentropy")
A seq2seq/autoencoder consists of an encoder that processes the input and a decoder that generates the output. During training, input is provided to the encoder and the output of the encoder is provided to the decoder. The goal is that the output of the decoder should be close to the input. So this is how the loss is computed, between the output of the decoder and the input.
In high level pseudo-code:
Let x be the input.
x' = decoder(encoder(x))
loss = f(x', x)
Hope that helps!
There is a great explanation here.
Also the wikipedia page is very detailed.

How to get class labels from TensorFlow prediction

I have a classification model in TF and can get a list of probabilities for the next class (preds). Now I want to select the highest element (argmax) and display its class label.
This may seems silly, but how can I get the class label that matches a position in the predictions tensor?
feed_dict={g['x']: current_char}
preds, state = sess.run([g['preds'],g['final_state']], feed_dict)
prediction = tf.argmax(preds, 1)
preds gives me a vector of predictions for each class. Surely there must be an easy way to just output the most likely class (label)?
Some info about my model:
x = tf.placeholder(tf.int32, [None, num_steps], name='input_placeholder')
y = tf.placeholder(tf.int32, [None, 1], name='labels_placeholder')
batch_size = batch_size = tf.shape(x)[0]
x_one_hot = tf.one_hot(x, num_classes)
rnn_inputs = [tf.squeeze(i, squeeze_dims=[1]) for i in
tf.split(x_one_hot, num_steps, 1)]
tmp = tf.stack(rnn_inputs)
print(tmp.get_shape())
tmp2 = tf.transpose(tmp, perm=[1, 0, 2])
print(tmp2.get_shape())
rnn_inputs = tmp2
with tf.variable_scope('softmax'):
W = tf.get_variable('W', [state_size, num_classes])
b = tf.get_variable('b', [num_classes], initializer=tf.constant_initializer(0.0))
rnn_outputs = rnn_outputs[:, num_steps - 1, :]
rnn_outputs = tf.reshape(rnn_outputs, [-1, state_size])
y_reshaped = tf.reshape(y, [-1])
logits = tf.matmul(rnn_outputs, W) + b
predictions = tf.nn.softmax(logits)
A prediction is an array of n types of classes(labels). It represents the model's "confidence" that the image corresponds to each of its classes(labels). You can check which label has the highest confidence value by using:
prediction = np.argmax(preds, 1)
After getting this highest element index using (argmax function) out of other probabilities, you need to place this index into class labels to find the exact class name associated with this index.
class_names[prediction]
Please refer to this link for more understanding.
You can use tf.reduce_max() for this. I would refer you to this answer.
Let me know if it works - will edit if it doesn't.
Mind that there are sometimes several ways to load a dataset. For instance with fashion MNIST the tutorial could lead you to use load_data() and then to create your own structure to interpret a prediction. However you can also load these data by using tensorflow_datasets.load(...) like here after installing tensorflow-datasets which gives you access to some DatasetInfo. So for instance if your prediction is 9 you can tell it's a boot with:
import tensorflow_datasets as tfds
_, ds_info = tfds.load('fashion_mnist', with_info=True)
print(ds_info.features['label'].names[9])
When you use softmax, the labels you train the model on are either numbers 0..n or one-hot encoded values. So if original labels of your data are let's say string names, you must map them to integers first and keep the mapping as a variable (such as 0 -> "apple", 1 -> "orange", 2 -> "pear" ...).
When using integers (with loss='sparse_categorical_crossentropy'), you get predictions as an array of probabilities, you just find the array index with the max value. You can use this predicted index to reverse-map to your label:
predictedIndex = np.argmax(predictions) // 2
predictedLabel = indexToLabelMap[predictedIndex] // "pear"
If you use one-hot encoded labels (with loss='categorical_crossentropy'), the predicted index corresponds with the "hot" index of your label.
Just for reference, I needed this info when I was working with MNIST dataset used in Google's Machine learning crash course. There is also a good classification tutorial in the Tensorflow docs.

kNN Consistently Overusing One Label

I am using a kNN to do some classification of labeled images. After my classification is done, I am outputting a confusion matrix. I noticed that one label, bottle was being applied incorrectly more often.
I removed the label and tested again, but then noticed that another label, shoe was being applied incorrectly, but was fine last time.
There should be no normalization, so I'm unsure what is causing this behavior. Testing showed it continued no matter how many labels I removed.
Not totally sure how much code to post, so I'll put some things that should be relevant and pastebin the rest.
def confusionMatrix(classifier, train_DS_X, train_DS_y, test_DS_X, test_DS_y):
# Will output a confusion matrix graph for the predicion
y_pred = classifier.fit(train_DS_X, train_DS_y).predict(test_DS_X)
labels = set(set(train_DS_y) | set(test_DS_y))
def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title)
plt.colorbar()
tick_marks = np.arange(len(labels))
plt.xticks(tick_marks, labels, rotation=45)
plt.yticks(tick_marks, labels)
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
# Compute confusion matrix
cm = confusion_matrix(test_DS_y , y_pred)
np.set_printoptions(precision=2)
print('Confusion matrix, without normalization')
#print(cm)
plt.figure()
plot_confusion_matrix(cm)
# Normalize the confusion matrix by row (i.e by the number of samples
# in each class)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print('Normalized confusion matrix')
#print(cm_normalized)
plt.figure()
plot_confusion_matrix(cm_normalized, title='Normalized confusion matrix')
plt.show()
Relevant Code from Main Function:
# Select training and test data
PCA = decomposition.PCA(n_components=.95)
zscorer = ZScoreMapper(param_est=('targets', ['rest']), auto_train=False)
DS = getVoxels (1, .5)
train_DS = DS[0]
test_DS = DS[1]
# Apply PCA and ZScoring
train_DS = processVoxels(train_DS, True, zscorer, PCA)
test_DS = processVoxels(test_DS, False, zscorer, PCA)
print 3*"\n"
# Select the desired features
# If selecting samples or PCA, that must be the only feature
featuresOfInterest = ['pca']
trainDSFeat = selectFeatures(train_DS, featuresOfInterest)
testDSFeat = selectFeatures(test_DS, featuresOfInterest)
train_DS_X = trainDSFeat[0]
train_DS_y = trainDSFeat[1]
test_DS_X = testDSFeat[0]
test_DS_y = testDSFeat[1]
# Optimization of neighbors
# Naively searches for local max starting at numNeighbors
lastScore = 0
lastNeightbors = 1
score = .0000001
numNeighbors = 5
while score > lastScore:
lastScore = score
lastNeighbors = numNeighbors
numNeighbors += 1
#Classification
neigh = neighbors.KNeighborsClassifier(n_neighbors=numNeighbors, weights='distance')
neigh.fit(train_DS_X, train_DS_y)
#Testing
score = neigh.score(test_DS_X,test_DS_y )
# Confusion Matrix Output
neigh = neighbors.KNeighborsClassifier(n_neighbors=lastNeighbors, weights='distance')
confusionMatrix(neigh, train_DS_X, train_DS_y, test_DS_X, test_DS_y)
Pastebin: http://pastebin.com/U7yTs3vs
The issue was in part the result of my axis being mislabeled, when I thought I was removing the faulty label I was in actuality just removing a random label, meaning the faulty data was still being analyzed. Fixing the axis and removing the faulty label which was actually rest yielded:
The code I changed is:
cm = confusion_matrix(test_DS_y , y_pred, labels)
Basically I manually set the ordering based on my list of ordered labels.

Resources