CountVectorizer() : AttributeError: 'numpy.float64' object has no attribute 'lower' - countvectorizer

I am trying to fit a dataset that has event_type and notes (free text) columns. Before calling the MultinomialNB model, I processed the text and converted it to an array in order to vectorize it and calculate the TF-IDF. The code is provided below:
Convert Event types from string to integer for easy processing
ACLED['category_id'] = ACLED['event_type'].factorize()[0]
category_id_ACLED = ACLED[['event_type', 'category_id']].drop_duplicates().sort_values('category_id')
category_to_id = dict(category_id_ACLED.values)
id_to_category = dict(category_id_ACLED[['category_id', 'event_type']].values)
Text Representation
I also converted notes and category_id into features and labels as follows:
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
features = tfidf.fit_transform(ACLED.notes).toarray()
labels = ACLED.category_id
print(features.shape)
Then I split the dataset into training and testing sets using features and labels:
X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state=0)
print('Original dataset shape {}'.format(Counter(y_train)))
output
Original dataset shape Counter({1: 1280, 2: 819, 0: 676, 3: 593, 4: 138, 5: 53, 7: 50, 6: 21, 8: 10})
Since the classes are imbalanced, I used SMOTE to address the minority classes by creating synthetic copies.
Apply the random over-sampling to overcome imbalanced classes
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_sample(X_train, y_train)
print('Resampled dataset shape {}'.format(Counter(y_resampled)))
output after oversampling
Resampled dataset shape Counter({3: 1280, 1: 1280, 2: 1280, 0: 1280, 7: 1280, 6: 1280, 4: 1280, 5: 1280, 8: 1280})
Everything worked fine up to this point, until I tried to calculate the term frequencies using CountVectorizer() as follows:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_resampled)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
Output error :
'numpy.ndarray' object has no attribute 'lower'
I tried using the ravel() function to flatten the array, but the error persists. Any ideas? Thanks in advance.

I found the solution to this issue. CountVectorizer expects raw text documents, not an already-vectorized numeric array (which is why it ends up calling .lower() on float values). So instead of using features and labels, I split the dataset columns directly:
X_train, X_test, y_train, y_test = train_test_split(ACLED['notes'] ,ACLED['event_type'], random_state=0)
Then I moved SMOTE after the CountVectorizer, since SMOTE must operate on the vectorized features and has its own pipeline:
Vectorize the notes column of the training set
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
Apply the random over-sampling to overcome imbalanced classes
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_sample(X_train_tfidf, y_train)
print('Resampled dataset shape {}'.format(Counter(y_resampled)))
output
Original dataset shape Counter({'Riots/Protests': 1280, 'Battle-No change of territory': 819, 'Remote violence': 676, 'Violence against civilians': 593, 'Strategic development': 138, 'Battle-Government regains territory': 53, 'Battle-Non-state actor overtakes territory': 50, 'Non-violent transfer of territory': 21, 'Headquarters or base established': 10})
Resampled dataset shape Counter({'Violence against civilians': 1280, 'Riots/Protests': 1280, 'Battle-No change of territory': 1280, 'Remote violence': 1280, 'Battle-Non-state actor overtakes territory': 1280, 'Non-violent transfer of territory': 1280, 'Strategic development': 1280, 'Battle-Government regains territory': 1280, 'Headquarters or base established': 1280})
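For reference, imbalanced-learn also ships its own Pipeline that accepts samplers such as SMOTE as intermediate steps and applies them during fit only. A minimal sketch under that assumption (the step names are illustrative, and MultinomialNB stands in for the model mentioned in the question):
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

clf = Pipeline([
    ('vect', CountVectorizer()),        # raw notes -> token counts
    ('tfidf', TfidfTransformer()),      # counts -> TF-IDF weights
    ('smote', SMOTE(random_state=42)),  # oversample minority classes (fit only)
    ('nb', MultinomialNB()),
])
clf.fit(X_train, y_train)          # X_train is the raw notes column
print(clf.score(X_test, y_test))   # no resampling at prediction time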

Related

Errors with LSTM input shapes with time series data

I'm trying to predict torque from 8 features with an LSTM layer in my neural network. I'm having trouble with the input shape and have looked around on many sites for a solution. I'm quite new to machine learning and am having trouble understanding the problem and how I can fix this. Here is my code, dataset, and error message.
file = r'/content/drive/MyDrive/only_force_pt1.csv'
df = pd.read_csv(file)
X = df.iloc[:, 1:9]
y = df.iloc[:,9]
print(X)
print(y)
df.head()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, shuffle = True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.1, shuffle = True)
[verbose, epochs, batch_size] = [1, 200, 32]
input_shape = (X_train.shape[0],X_train.shape[1])
model = Sequential()
# LSTM
model.add(LSTM(64, input_shape=input_shape, return_sequences = True))
model.add(Dense(32, activation='relu', kernel_regularizer=keras.regularizers.l2(0.001)))
#model.add(Dropout(0.2))
#model.add(Dense(32, activation='relu', kernel_regularizer=keras.regularizers.l2(0.001)))
model.add(Dense(1,activation='relu'))
earlystopper = EarlyStopping(monitor='val_loss', min_delta=0, patience = 20, verbose =1, mode = 'auto')
model.summary()
model.compile(loss = 'mse', optimizer = Adam(learning_rate = 0.001), metrics=[tf.keras.metrics.RootMeanSquaredError()])
history = model.fit(X_train, y_train, batch_size = batch_size, epochs = epochs, verbose = verbose, validation_data=(X_val,y_val), callbacks = [earlystopper])
ValueError: Input 0 of layer "sequential_17" is incompatible with the layer: expected shape=(None, 3634, 8), found shape=(None, 8)
dataset: https://drive.google.com/drive/folders/1BQOXffFYioCiPug2VcBZEZVD-u3y9bcl?usp=sharing
As I understand your problem, you are passing the number of data points as an additional dimension in the input shape of the LSTM layer. Your feature dimensionality is 8, and 3634 (= X_train.shape[0]) is the number of data points; that count corresponds to the first (None) dimension of the input tensors and is determined by the batch size, so it should not be passed to the LSTM.
If that's the case, drop the sample count from the input_shape definition:
input_shape = (X_train.shape[1],)
Keep in mind, though, that an LSTM layer expects 3-D input of shape (batch, timesteps, features), so the data also needs an explicit timestep dimension.
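A minimal sketch of one way to add that dimension, assuming each row is treated as a sequence of length 1 (a modeling assumption, not something stated in the question):
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# add a timestep axis: (samples, 8) -> (samples, 1, 8)
X_train3 = np.asarray(X_train).reshape(-1, 1, 8)
X_val3 = np.asarray(X_val).reshape(-1, 1, 8)

model = Sequential()
model.add(LSTM(64, input_shape=(1, 8)))  # (timesteps, features), no sample count
model.add(Dense(32, activation='relu'))
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam')
model.fit(X_train3, y_train, validation_data=(X_val3, y_val), epochs=10)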

How to handle stateful=True in hyperparameter tuning of a LSTM

Hello, I'm using RandomizedSearchCV for hyperparameter tuning of my LSTM. The code works fine with stateful=False. However, I also want to try this with stateful=True, but I'm not sure how.
I arranged my data in a sliding window with shape (211845 datapoints, 4 window size, 16 features).
The following function creates the architecture of the model:
def create_lstm(dropout_rate=0.0, neurons=32, lr=1e-3):
    lstm = Sequential()
    lstm.add(InputLayer((4, 16)))
    lstm.add(LSTM(neurons, return_sequences=True))
    lstm.add(Dropout(dropout_rate))
    lstm.add(LSTM(neurons))
    lstm.add(Dropout(dropout_rate))
    lstm.add(Dense(neurons // 4, activation='relu'))
    lstm.add(Dense(1))
    lstm.compile(loss='mse',
                 optimizer=Adam(learning_rate=lr),
                 metrics=['mean_squared_error']
                 )
    return lstm
I pass the function to the wrapper
lstm_estimator = KerasRegressor(build_fn=create_lstm, verbose=1)
The following code is for my param grid and RandomSearch
lstm_param_grid = {
    'dropout_rate': [0, 0.2, 0.4],
    'neurons': [32, 64, 128],
    'batch_size': [100, 200, 400],
    'epochs': [50, 100, 150],
    'lr': [1e-3, 1e-4, 1e-5]
}
lstm_RandomGrid = RandomizedSearchCV(estimator=lstm_estimator,
                                     param_distributions=lstm_param_grid,
                                     n_iter=10,
                                     verbose=10,
                                     n_jobs=-1,
                                     cv=5
                                     )
In my create_lstm function, the input shape of my data is equal to the window size and the number of features. After the random search I pass the epochs and batch_size arguments when I fit the model. However, with stateful=True you have to define batch_input_shape=(x, y, z).
I'm really not sure what exactly to do now. How can I change my code so that the random search still tests multiple batch sizes? And what exactly are (x, y, z) in my example? I tried (batch_size=100, window size, num of features), but that didn't work out.
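For reference, with stateful=True the first layer takes batch_input_shape=(batch_size, timesteps, features), and that same fixed batch size must then be used for every fit/predict call, with the number of samples a multiple of it. A minimal sketch under those assumptions (batch_size is exposed as a build parameter so a search could still vary it; the default of 100 is illustrative):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense
from tensorflow.keras.optimizers import Adam

def create_stateful_lstm(dropout_rate=0.0, neurons=32, lr=1e-3, batch_size=100):
    lstm = Sequential()
    # (x, y, z) = (batch size, window size, number of features), all fixed
    lstm.add(LSTM(neurons, stateful=True, return_sequences=True,
                  batch_input_shape=(batch_size, 4, 16)))
    lstm.add(Dropout(dropout_rate))
    lstm.add(LSTM(neurons, stateful=True))
    lstm.add(Dropout(dropout_rate))
    lstm.add(Dense(neurons // 4, activation='relu'))
    lstm.add(Dense(1))
    lstm.compile(loss='mse', optimizer=Adam(learning_rate=lr),
                 metrics=['mean_squared_error'])
    return lstm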

Mel Spectrogram feature extraction to CNN

This question is in line with the question posted here, but with a slight nuance for the CNN. Using the feature extraction definition:
max_pad_len = 174
n_mels = 128
def extract_features(file_name):
    try:
        audio, sample_rate = librosa.core.load(file_name, res_type='kaiser_fast')
        mely = librosa.feature.melspectrogram(y=audio, sr=sample_rate, n_mels=n_mels)
        #pad_width = max_pad_len - mely.shape[1]
        #mely = np.pad(mely, pad_width=((0, 0), (0, pad_width)), mode='constant')
    except Exception as e:
        print("Error encountered while parsing file: ", file_name)
        return None
    return mely
How do you go about getting the correct num_rows, num_columns, and num_channels dimensions for the train and test data?
In constructing the CNN model, how do you determine the correct input shape?
model = Sequential()
model.add(Conv2D(filters=16, kernel_size=2, input_shape=(num_rows, num_columns, num_channels), activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))
I don't know if this is exactly your problem, but I also had to use a mel spectrogram as input to a CNN.
Short answer:
input_shape = (x_train.shape[1], x_train.shape[2], 1)
x_train = x_train.reshape(x_train.shape[0], x_train.shape[1], x_train.shape[2], 1)
or
x_train = x_train.reshape(x_train.shape[0], x_train.shape[1], x_train.shape[2], 1)
input_shape = x_train.shape[1:]
Long answer
In my case I have a DataFrame with speakers_id and mel spectrograms (previously calculated with librosa).
Keras CNN models expect images with width, height, and color channels (grayscale or RGB).
The mel spectrograms given by librosa are image-like arrays with width and height, so you need to reshape them to add the channel dimension.
Define the input and expected output
# It looks stupid, but that way I could convert the pandas.Series to a np.array
x = np.array(list(df.mel))
y = df.speaker_id
print('X shape:', x.shape)
X shape: (2204, 128, 24)
2204 Mels, 128x24
Split in train-test
x_train, x_test, y_train, y_test = train_test_split(x, y)
print(f'Train: {len(x_train)}', f'Test: {len(x_test)}')
Train: 1653 Test: 551
Reshape to add the extra dimension
x_train = x_train.reshape(x_train.shape[0], x_train.shape[1], x_train.shape[2], 1)
x_test = x_test.reshape(x_test.shape[0], x_test.shape[1], x_test.shape[2], 1)
print('Shapes:', x_train.shape, x_test.shape)
Shapes: (1653, 128, 24, 1) (551, 128, 24, 1)
Set input_shape
# The input shape is independent of the amount of inputs
input_shape = x_train.shape[1:]
print('Input shape:', input_shape)
Input shape: (128, 24, 1)
Put it into the model
model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=input_shape))
model.add(MaxPooling2D())
# More layers...
model.compile(optimizer='adam',loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),metrics=['accuracy'])
Run model
model.fit(x_train, y_train, epochs=20, validation_data=(x_test, y_test))
Hope this is helpful.

Error in Input Dimensions - Keras

I'm training a neural net with keras with input data that has a shape of (116, 2, 3, 58) and output data that has a shape of (116, 2). I'm getting this error:
ValueError: Error when checking target: expected dense_3 to have 4 dimensions, but got array with shape (116, 2)
What could I be doing wrong? Here is my code:
trainingInput = np.load("trainingInput.npy")
trainingOutput = np.load("trainingOutput.npy")
inp = Input(batch_shape=(116, 2, 3, 58))
d1 = Dense(16, activation='relu')(inp)
d2 = Dense(32, activation='relu')(d1)
out = Dense(2, activation='softmax')(d2)
model = Model(inputs=inp, outputs=out)
lrSet = SGD(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=lrSet, metrics=['accuracy'])
model.fit(trainingInput, trainingOutput, batch_size=16, epochs=50, verbose=1, validation_split=0.1)
Dense is a fully connected layer; it transforms only the last dimension and does not change the rank of the input, so a 4-D input stays 4-D. If you want to use Dense here, you should reshape (116, 2, 3, 58) -> (116, 2*3*58) first.
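A minimal sketch of that fix, keeping the question's layer sizes (using a Flatten layer is one option; reshaping the arrays with NumPy before fit would work as well):
from tensorflow.keras.layers import Input, Dense, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD

inp = Input(shape=(2, 3, 58))
flat = Flatten()(inp)                      # (None, 2*3*58) = (None, 348)
d1 = Dense(16, activation='relu')(flat)
d2 = Dense(32, activation='relu')(d1)
out = Dense(2, activation='softmax')(d2)   # now matches targets of shape (116, 2)
model = Model(inputs=inp, outputs=out)
model.compile(loss='categorical_crossentropy',
              optimizer=SGD(learning_rate=0.01), metrics=['accuracy'])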

TFLearn model evaluation

I am new to machine learning and TensorFlow. I am trying to train a simple model to recognize gender. I use a small data set of height, weight, and shoe size. However, I have encountered a problem with evaluating the model's accuracy.
Here's the entire code:
import tflearn
import tensorflow as tf
import numpy as np
# [height, weight, shoe_size]
X = [[181, 80, 44], [177, 70, 43], [160, 60, 38], [154, 54, 37], [166, 65, 40],
     [190, 90, 47], [175, 64, 39], [177, 70, 40], [159, 55, 37], [171, 75, 42],
     [181, 85, 43], [170, 52, 39]]
# 0 - for female, 1 - for male
Y = [1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0]
data = np.column_stack((X, Y))
np.random.shuffle(data)
# Split into train and test set
X_train, Y_train = data[:8, :3], data[:8, 3:]
X_test, Y_test = data[8:, :3], data[8:, 3:]
# Build neural network
net = tflearn.input_data(shape=[None, 3])
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 1, activation='linear')
net = tflearn.regression(net, loss='mean_square')
# fix for tflearn with TensorFlow 12:
col = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
for x in col:
    tf.add_to_collection(tf.GraphKeys.VARIABLES, x)
# Define model
model = tflearn.DNN(net)
# Start training (apply gradient descent algorithm)
model.fit(X_train, Y_train, n_epoch=100, show_metric=True)
score = model.evaluate(X_test, Y_test)
print('Training test score', score)
test_male = [176, 78, 42]
test_female = [170, 52, 38]
print('Test male: ', model.predict([test_male])[0])
print('Test female:', model.predict([test_female])[0])
Even though the model's predictions are not very accurate:
Test male: [0.7158362865447998]
Test female: [0.4076206684112549]
model.evaluate(X_test, Y_test) always returns 1.0. How do I calculate the real accuracy on the test data set using TFLearn?
You want to do binary classification in this case, but your network is set up to perform linear regression.
First, transform the labels (gender) to categorical features:
from tflearn.data_utils import to_categorical
Y_train = to_categorical(Y_train, nb_classes=2)
Y_test = to_categorical(Y_test, nb_classes=2)
The output layer of your network needs two output units for the two classes you want to predict, and the activation needs to be softmax for classification. The tflearn default loss is cross-entropy and the default metric is accuracy, so this is already correct.
# Build neural network
net = tflearn.input_data(shape=[None, 3])
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 2, activation='softmax')
net = tflearn.regression(net)
The output will now be a vector with the probability for each gender. For example:
[0.991, 0.009] #female
Bear in mind that you will hopelessly overfit the network with your tiny data set. This means that during training the accuracy will approach 1, while the accuracy on your test set will be quite poor.
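With this categorical setup, model.evaluate reports the default accuracy metric, and class predictions can be recovered with an argmax over the probability vector. A short sketch (variable names follow the question; the printed values are illustrative):
import numpy as np

score = model.evaluate(X_test, Y_test)
print('Test accuracy:', score[0])  # first entry is the accuracy metric

probs = model.predict([test_male])                       # e.g. [[0.009, 0.991]]
print('Predicted class:', np.argmax(probs, axis=1)[0])   # 1 = male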