one classifier for multi-labels with autokeras, data table - machine-learning

i have a data table with 5 labels. i want to use autokeras to Build one classifier that predict all the labels by same X.
i tried:
clf0 = ak.StructuredDataClassifier(multi_label=True, max_trials=15)
clf0.fit(X_train,[labels_train[0],labels_train[1],labels_train[2],labels_train[3],labels_train[4]],epochs=10)
model0 = clf0.export_model()
a = model0.predict(X_test)
but get:
`
ValueError: Expected y to have 1 arrays, but got
`
how can i fix it?
thanks.

Related

Creating a new dataset of hidden state probabilities using a HMM results in different shapes after each run

I'm trying to create a new dataset of hidden state probabilities using a hidden Markov model. Everything works fine unless each time the output dataset comes up with different values (sometimes the same values) for hidden_states_train and hidden_states_test hence resulting a different column sizes in the columns stack/ a feature mismatch. e.g New dataset size (15261, 197) (5087, 194), New dataset size (15261, 197) (5087, 197) etc.
I can't figure out why this is happening each time I run the code. I tried to give same number of samples for both X_train_st and X_test_st but this keeps happening. If I set n_comp in range a smaller range e.g for n_comp in range(1,6) then often it results the same shapes.
Can someone shed some light to what's going on and a possible fix, please?
newX = X_train_st
newXtest = X_test_st
for n_comp in range(1,16):
print("fitting to HMM and decoding %d ..." % n_comp , end="")
modelHMM = GaussianHMM(n_components=n_comp, covariance_type="diag").fit(X_train_st)
hidden_states_train = to_categorical(modelHMM.predict(X_train_st))
hidden_states_test = to_categorical(modelHMM.predict(X_test_st))
print("done")
newX = np.column_stack((newX,hidden_states_train))
newXtest = np.column_stack((newXtest,hidden_states_test))
print('New dataset size',newX.shape,newXtest.shape)

Scikit-Learn issues error for RandomForestClassifier for multilabel classification - Jagged arrays

Scikit-Learn RandomForestClassifier throws an error for a multilabel classification problem.
This code creates a RandomForestClassifier multilabel object, given predictors C and multi-labels out with no error.
C = np.array([[2,4,6],[4,2,1],[8,3,1]])
out = np.array([[0,1],[0,1],[1,0]])
rf = RandomForestClassifier(n_estimators=100, oob_score=True)
rf.fit(C,out)
If I modify the multilabels, so that all the elements at a certain index are the same, say (where all the first components of the multilabels equals zero)
out = np.array([[0,1],[0,1],[0,0]])
I get an error and traceback:
VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a
list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated.
If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
y_pred = np.array(y_pred, copy=False)
raise ValueError(
507 "The type of target cannot be used to compute OOB "
508 f"estimates. Got {y_type} while only the following are "
509 "supported: continuous, continuous-multioutput, binary, "
510 "multiclass, multilabel-indicator."
511 )
ValueError: could not broadcast input array from shape (2,1) into shape (2,)
Not requesting OOB predictions does not result in an error:
rf_err = RandomForestClassifier(n_estimators=100, oob_score=False)
I cannot figure out why keeping the OOB predictions would trigger such an error, when all the n-component of a multilabel are equal.
In your setup out_err = np.array([[0,1],[0,1],[0,0]]) you do not have any examples of the second class, so you only have elements of 1 class.
That means that there is no 'class label' dimension and it can be omitted. That's why you see (2,) shape.
Please, describe your initial intent: why would you need to set a particular position in labels to 0. If you try to go with N-1 classes instead of N classes I suggest removing the position itself and the elements of the class from the dataset, not putting all zeros:
out=[[1,0,0],[0,1,0],[0,1,0],[0,0,1],[1,0,0]] # 3 classes
# remove the second class:
out=[[1,0],[0,1],[1,0]] # 2 classes

How to add cluster label columns back into original dataframe- python, for supervised learning

I have a column in my data frame which contains Url information. It has 1200+ unique values. I wanted to use text mining to generate features from these values. I have used tfidfvectorizer to generate vectors and then used kmeans to identify clusters. I now want to assign these cluster labels back into my original dataframe, so that I can bin the URL information into these clusters.
Below code to generate vectors and cluster labels
from scipy.spatial.distance import cdist
vectorizer = TfidfVectorizer(min_df = 1,lowercase = False, ngram_range = (1,1), use_idf = True, stop_words='english')
X = vectorizer.fit_transform(sample\['lead_lead_source_modified'\])
X = X.toarray()
distortions=\[\]
K = range(1,10)
for k in K:
kmeanModel = KMeans(n_clusters=k).fit(X)
kmeanModel.fit(X)
distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape\[0\])
#append cluster labels
km = KMeans(n_clusters=4, random_state=0)
km.fit_transform(X)
cluster_labels = km.labels_
cluster_labels = pd.DataFrame(cluster_labels, columns=\['ClusterLabel_lead_lead_source'\])
cluster_labels
Through the elbow method, I decided on 4 clusters. I now have cluster labels, but I am not sure how to add them bank to dataframe on its respective index. Concatenating along axis=1 is creating Nans due to indexing issues. Below is the sample output after concatenation.
lead_lead_source_modified ClusterLabel_lead_lead_source
0 NaN 3.0
1 NaN 0.0
2 NaN 0.0
3 ['direct', 'salesline', 'website', ''] 0.0
I want to know if this approach is the right way to do, if so then how to solve this issue. If not, is there a better way to do.
Adding index value during dataframe conversion solved the issue.
But it still want to know if this is the right approach

if (freq) x$counts else x$density length > 1 and only the first element will be used

for my thesis I have to calculate the number of workers at risk of substitution by machines. I have calculated the probability of substitution (X) and the number of employee at risk (Y) for each occupation category. I have a dataset like this:
X Y
1 0.1300 0
2 0.1000 0
3 0.0841 1513
4 0.0221 287
5 0.1175 3641
....
700 0.9875 4000
I tried to plot a histogram with this command:
hist(dataset1$X,dataset1$Y,xlim=c(0,1),ylim=c(0,30000),breaks=100,main="Distribution",xlab="Probability",ylab="Number of employee")
But I get this error:
In if (freq) x$counts else x$density
length > 1 and only the first element will be used
Can someone tell me what is the problem and write me the right command?
Thank you!
It is worth pointing out that the message displayed is a Warning message, and should not prevent the results being plotted. However, it does indicate there are some issues with the data.
Without the full dataset, it is not 100% obvious what may be the problem. I believe it is caused by the data not being in the correct format, with two potential issues. Firstly, some values have a value of 0, and these won't be plotted on the histogram. Secondly, the observations appear to be inconsistently spaced.
Histograms are best built from one of two datasets:
A dataframe which has been aggregated grouped into consistently sized bins.
A list of values X which in the data
I prefer the second technique. As originally shown here The expandRows() function in the package splitstackshape can be used to repeat the number of rows in the dataframe by the number of observations:
set.seed(123)
dataset1 <- data.frame(X = runif(900, 0, 1), Y = runif(900, 0, 1000))
library(splitstackshape)
dataset2 <- expandRows(dataset1, "Y")
hist(dataset2$X, xlim=c(0,1))
dataset1$bins <- cut(dataset1$X, breaks = seq(0,1,0.01), labels = FALSE)

Using Keras ImageDataGenerator in a regression model

I want to use the flow_from_directory method of the ImageDataGenerator
to generate training data for a regression model, where the target value can be any float value between 1 and -1. flow_from_directory has a "class_mode" parameter with the description
class_mode: one of "categorical", "binary", "sparse" or None. Default:
"categorical". Determines the type of label arrays that are returned:
"categorical" will be 2D one-hot encoded labels, "binary" will be 1D
binary labels, "sparse" will be 1D integer labels.
Which of these values should I take? None of them seems to really fit...
With Keras 2.2.4 you can use flow_from_dataframe which solves what you want to do, allowing you to flow images from a directory for regression problems. You should store all your images in a folder and load a dataframe containing in one column the image IDs and in the other column the regression score (labels) and set class_mode='other' in flow_from_dataframe.
Here you can find an example where the images are in image_dir, the dataframe with the image IDs and the regression scores is loaded with pandas from the "train file"
train_label_df = pd.read_csv(train_file, delimiter=' ', header=None, names=['id', 'score'])
train_datagen = ImageDataGenerator(rescale = 1./255, horizontal_flip = True,
fill_mode = "nearest", zoom_range = 0.2,
width_shift_range = 0.2, height_shift_range=0.2,
rotation_range=30)
train_generator = train_datagen.flow_from_dataframe(dataframe=train_label_df, directory=image_dir,
x_col="id", y_col="score", has_ext=True,
class_mode="other", target_size=(img_width, img_height),
batch_size=bs)
I think that organizing your data differently, using a DataFrame (without necessarily moving your images to new locations) will allow you to run a regression model. In short, create columns in your DataFrame containing the file path of each image and the target value. This allows your generator to keep regression values and images properly synced even when you shuffle your data at each epoch.
Here is an example showing how to link images with binomial targets, multinomial targets and regression targets just to show that "a target is a target is a target" and only the model might change:
df['path'] = df.object_id.apply(file_path_from_db_id)
df
object_id bi multi path target
index
0 461756 dog white /path/to/imgs/756/61/blah_461756.png 0.166831
1 1161756 cat black /path/to/imgs/756/61/blah_1161756.png 0.058793
2 3303651 dog white /path/to/imgs/651/03/blah_3303651.png 0.582970
3 3367756 dog grey /path/to/imgs/756/67/blah_3367756.png -0.421429
4 3767756 dog grey /path/to/imgs/756/67/blah_3767756.png -0.706608
5 5467756 cat black /path/to/imgs/756/67/blah_5467756.png -0.415115
6 5561756 dog white /path/to/imgs/756/61/blah_5561756.png -0.631041
7 31255756 cat grey /path/to/imgs/756/55/blah_31255756.png -0.148226
8 35903651 cat black /path/to/imgs/651/03/blah_35903651.png -0.785671
9 44603651 dog black /path/to/imgs/651/03/blah_44603651.png -0.538359
10 49557622 cat black /path/to/imgs/622/57/blah_49557622.png -0.295279
11 58164756 dog grey /path/to/imgs/756/64/blah_58164756.png 0.407096
12 95403651 cat white /path/to/imgs/651/03/blah_95403651.png 0.790274
13 95555756 dog grey /path/to/imgs/756/55/blah_95555756.png 0.060669
I describe how to do this in great detail with examples here:
https://techblog.appnexus.com/a-keras-multithreaded-dataframe-generator-for-millions-of-image-files-84d3027f6f43
At this moment (newest version of Keras from January 21st 2017) the flow_from_directory could only work in a following manner:
You need to have a directories structured in a following manner:
directory with images\
1st label\
1st picture from 1st label
2nd picture from 1st label
3rd picture from 1st label
...
2nd label\
1st picture from 2nd label
2nd picture from 2nd label
3rd picture from 2nd label
...
...
flow_from_directory returns batches of a fixed size in a format of (picture, label).
So as you can see it could only be used for a classification case and all options provided in a documentation specify only a way in which the class is provided to your classifier. But, there is a neat hack which could make a flow_from_directory useful for a regression task:
You need to structure your directory in a following manner:
directory with images\
1st value (e.g. -0.95423)\
1st picture from 1st value
2nd picture from 1st value
3rd picture from 1st value
...
2nd value (e.g. - 0.9143242)\
1st picture from 2nd value
2nd picture from 2nd value
3rd picture from 2nd value
...
...
You also need to have a list list_of_values = [1st value, 2nd value, ...]. Then your generator is defined in a following manner:
def regression_flow_from_directory(flow_from_directory_gen, list_of_values):
for x, y in flow_from_directory_gen:
yield x, list_of_values[y]
And it's crucial for a flow_from_directory_gen to have a class_mode='sparse' to make this work. Of course this is a little bit cumbersome but it works (I used this solution :) )
There's just one glitch in the accepted answer that I would like to point out. The above code fails with an error message like:
TypeError: only integer scalar arrays can be converted to a scalar index
This is because y is an array. The fix is simple:
def regression_flow_from_directory(flow_from_directory_gen,
list_of_values):
for x, y in flow_from_directory_gen:
values = [list_of_values[y[i]] for i in range(len(y))]
yield x, values
The method to generate the list_of_values can be found in https://stackoverflow.com/a/47944082/4082092

Resources