how create char dataset like minst_digits dataset - machine-learning

I have 62000 font images( 0-9,A-Z and a-z images) data set in which for single character have 1000 image.I have created csv file of 62000 row of images normalized pixel value and labels. I don't know to extract this csv file in training,validation and testing dataset so that i can get better accuracy.
enter image description here

You can use SciKit-Learn's train_test_split.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
X, y = your.data, your.target #input your own data here
train, test = train_test_split(X, test_size = 0.2, random_state=0)
Also, read this sklearn tutorial

Related

Clustering on 97 features of categorical data

I am trying to apply unsupervised learning on a data with 97 features and around 6500 rows/samples. All features have discrete data (mostly from 1-10) with some being binary (0/1). What are some of the best clustering algorithms to apply on this data. Thank You!
It's impossible to say which clustering algo will perform best on your given dataset. You just have to try several methodologies and inspect the final results that you get. Here are several clustering algos that you can try.
https://github.com/ASH-WICUS/Notebooks/blob/master/Clustering%20Algorithms%20Compared.ipynb
Here is a small sample.
import statsmodels.api as sm
import numpy as np
import pandas as pd
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df_cars = pd.DataFrame(mtcars)
df_cars.head()
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from matplotlib import pyplot
# define dataset
X = df_cars[['mpg','hp']]
# define the model
model = KMeans(n_clusters=8)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
X['kmeans']=yhat
pyplot.scatter(X['mpg'], X['hp'], c=X['kmeans'], cmap='rainbow', s=50, alpha=0.8)
# plot X & Y coordinates and color by cluster number
import plotly.express as px
df = px.data.iris()
fig = px.scatter(df_cars, x="hp", y="mpg", color="kmeans", size='mpg', hover_data=['kmeans'])
fig.show()

How to know if my data has been scaled by StandardScaler?

"I have scaled my dataset by using Standard Scaler , Now how to know it has been scaled, I am sure it has been scaled but how to see it"
As #Coderji said you can always find out the mean and standard deviation, which should be equal to 0 and 1 respectively.
However, there is another method to visualize it.
from sklearn import datasets
import numpy as np
from sklearn.preprocessing import StandardScaler
I am using iris dataset for this example.
iris = datasets.load_iris()
X = iris.data
sc = StandardScaler()
sc.fit(X)
x = sc.transform(X)
import matplotlib.pyplot as plt
import seaborn as sns
sns.distplot(x[:,1])
See this Output for sepel length
Similarly you can see for all variables or a simple pairplot will do the job.
This gives an idea that the data is standardised visually.

How to prevent simple keras autoencoder from over compressing data?

I am trying to use the keras frontend with tensorflow backend for a simple autoencoder as a multidimensional scaling technique to plot multidimensional data into 2 dimensions. Many times when I run it (not sure how to set random seed for keras btw) one of the dimensions is collapsed to yield a 1 dimensional embedding (the plot should help explain). Why is this happening? How can I make sure the dimensions are preserved and utilized by the autoencoder? I realize this is the most simple and basic form of an autoencoder that I have implemented but I would like to build on this to make better autoencoders for this task.
from sklearn.datasets import load_iris
from sklearn import model_selection
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Load data
X = load_iris().data
Y = pd.get_dummies(load_iris().target).as_matrix()
X_tr, X_te, Y_tr, Y_te = model_selection.train_test_split(X,Y, test_size=0.3, stratify=Y.argmax(axis=1))
dims = X_tr.shape[1]
n_classes = Y_tr.shape[1]
# Autoencoder
encoding_dim = 2
# this is our input placeholder
input_data = tf.keras.Input(shape=(4,))
# "encoded" is the encoded representation of the input
encoded = tf.keras.layers.Dense(encoding_dim,
activation='relu',
)(input_data)
# "decoded" is the lossy reconstruction of the input
decoded = tf.keras.layers.Dense(4, activation='sigmoid')(encoded)
# this model maps an input to its reconstruction
autoencoder = tf.keras.models.Model(input_data, decoded)
# this model maps an input to its encoded representation
encoder = tf.keras.models.Model(input_data, encoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
network_training = autoencoder.fit(X_tr, X_tr,
epochs=100,
batch_size=5,
shuffle=True,
verbose=False,
validation_data=(X_te, X_te))
# Plot data
embeddings = encoder.predict(X_te)
plt.scatter(embeddings[:,0], embeddings[:,1], c=Y_te.argmax(axis=1), edgecolor="black", linewidth=1)
Run algorithm once
Run algorithm again

how to use string data for svm (smo) in weka

I have an arff file containing some sentences (Persian language) and a word in front of each sentence which shows its class in #data part. I need to use smo for classification. The questions:
1) Is it necessary to change the sentences to vectors ?
2) I selected "string to word vector", but the smo is inactive and still doesn't work. (and of course other algorithms like naive bayes).
How can I use this text data with smo ?
The above picture is a very small sample file.
file sample:
https://www.dropbox.com/s/ohpyortve8jbwhe/shoor.arff?dl=0
First, you need apply "string to word vector" filter. After, on classify tab, you need to change the target class to "(Nom) class". This is enought to enable the naive bayes and SVM algorithms. I downloaded the dataset, and it worked well.
You can follow this tutorial:
https://www.youtube.com/watch?v=zlVJ2_N_Olo
Hope it can help you
from sklearn.feature_extraction.text import TfidfVectorizer
import arff
from sklearn import svm
import numpy as np
from sklearn.model_selection import train_test_split
data=list(arff.load('shoor.arff'))
text=[]
label=[]
for r in data:
if (len(r)>1):
text.append(r[0])
label.append(r[1])
tfidf = TfidfVectorizer().fit_transform(text)
features = (tfidf * tfidf.T).A
X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.5, random_state=0)
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)
1.0

Bagging using random forest classifier in sklearn

I built a random forest and I want to find the out of bag score.But my out of bag score is coming out to be 1.0,but it should be less than 1.My sample size consists of 20000 elements.Here is the python code.Please tell the changes to be done.Here X is a numpy array of datasets and Z contains true labels.
import csv
import numpy as np
from sklearn import preprocessing
from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier
with open('C:\Users\Harsh Bhandari\Desktop\letter.csv') as f:
reader = csv.reader(f, delimiter='\t')
data = [(col1, int(col2), int(col3), int(col4),int(col5),int(col6),int(col7),int(col8),int(col9),int(col10),int(col11),int(col12),int(col13),int(col14),int(col15),int(col16),int(col17))
for col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17 in reader]
X=[]
Y=[]
i=0
while i<20000:
t=data[i][1:]
X.append(t)
t=data[i][0]
Y.append(t)
i=1+i
X=np.asarray(X)
Y=np.asarray(Y)
le = preprocessing.LabelEncoder()
Z=le.fit_transform(Y)
clf = RandomForestClassifier(n_estimators=100,oob_score=True)
clf=clf.fit(X,Z)
a=clf.predict(X)
scores=clf.score(X,a)
print scores
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
In score you send the Test Data and its actual labels, here you are passing the predicted labels itself which match the prediction hence you are
getting 1.0 score.
i see a couple things here.
you are doing clf.score(X, a)
but you should be doing clf.score(X, Z)
where Z is the true label for X
the score parameter is defined as such clf.score(X, true_labels_for_X)
you instead put the values that you predicted as y_true which dosen't make sense. since Sklearn will already run predict on X, you don't need to pass a.
Also, you can find the oobscore of by doing
print clf.oob_score_

Resources