How to preprocess nominal data attributes in data mining? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 5 years ago.
I have done one-hot encoding on the nominal data attributes, but later I want to cluster the data, so please suggest a feasible solution. I am new to data mining.

Since you did not provide much information about your problem, and it seems you are new to these concepts, I will provide a complete solution on fake data. You can learn from it and pick up the points you need for your own solution.
I have implemented it in Python and assume that you are familiar with scikit-learn:
from numpy import array
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn import cluster
# define example
data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
values = array(data)
print(values)
# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print(integer_encoded)
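# binary (one-hot) encode
# OneHotEncoder returns a sparse matrix by default; .toarray() makes it dense
# (the older sparse=False argument was renamed in recent scikit-learn versions)
onehot_encoder = OneHotEncoder()
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded).toarray()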
print(onehot_encoded)
# Clustering step:
kmeans = cluster.KMeans(n_clusters=3)
kmeans.fit(onehot_encoded)
print(kmeans.labels_)
And here is the result (the output of print(integer_encoded) is omitted):
['cold' 'cold' 'warm' 'cold' 'hot' 'hot' 'warm' 'cold' 'warm' 'hot']
[[1. 0. 0.]
[1. 0. 0.]
[0. 0. 1.]
[1. 0. 0.]
[0. 1. 0.]
[0. 1. 0.]
[0. 0. 1.]
[1. 0. 0.]
[0. 0. 1.]
[0. 1. 0.]]
The cluster labels for the above data are (the numbering of the clusters can vary between runs):
[1 1 2 1 0 0 2 1 2 0]
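As a side note, and only as a sketch: recent scikit-learn versions let OneHotEncoder take string categories directly, so the LabelEncoder step can likely be skipped. The example below reuses the fake data from above; the n_init and random_state values are arbitrary illustration choices and not part of the original answer.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

# Same fake data as above, shaped as a single column of nominal values
data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
values = np.array(data).reshape(-1, 1)

# One-hot encode the strings directly; .toarray() converts the sparse result to dense
onehot_encoded = OneHotEncoder().fit_transform(values).toarray()

# Cluster the encoded rows (n_init and random_state are arbitrary illustration values)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(onehot_encoded)

print(onehot_encoded.shape)
print(labels)
```

Rows with the same category end up in the same cluster; which integer each cluster receives is arbitrary.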

Related

Feature Hashing of zip codes with Scikit in machine learning

I am working on a machine learning problem where I have a lot of zip codes (~8k unique values) in my data set. I therefore decided to hash the values into a smaller feature space instead of using something like one-hot encoding (OHE).
The problem I encountered is that only a very small percentage (about 20%) of the rows in my hash are unique, which, as I understand it, means that I have a lot of duplicates/collisions. Even when I increased the number of features in my hash table to ~200, I never got more than 20% unique values. This does not make sense to me, since a growing number of columns in my hash should make more unique combinations possible.
I used the following code to hash my zip codes with scikit-learn and calculate the collisions based on the unique values in the resulting array:
import numpy as np
import pandas as pd
from sklearn.feature_extraction import FeatureHasher
D = pd.unique(Daten["PLZ"])
print("Zipcode Data:", D,"\nZipcode Shape:", D.shape)
h = FeatureHasher(n_features=2**5, input_type="string")
f = h.transform(D)
f = f.toarray()
print("Feature Array:\n",f ,"\nFeature Shape:", f.shape)
unq = np.unique(f, axis=0)
print("Unique values:\n",unq,"\nUnique Shape:",unq.shape)
print("Percentage of unique values in hash array:",unq.shape[0]/f.shape[0]*100)
As output I received:
Zipcode Data: ['86916' '01445' '37671' ... '82387' '83565' '83550']
Zipcode Shape: (8158,)
Feature Array:
[[ 2. 1. 0. ... 0. 0. 0.]
[ 0. -1. 0. ... 0. 0. 0.]
[ 1. 0. 0. ... 0. 0. 0.]
...
[ 0. 0. 0. ... 0. 0. 0.]
[ 1. 0. 0. ... 0. 0. 0.]
[ 0. -1. 0. ... 0. 0. 0.]]
Feature Shape: (8158, 32)
Unique values:
[[ 0. -3. 0. ... 0. 0. 0.]
[ 0. -2. 0. ... 0. 0. 0.]
[ 0. -2. 0. ... 0. 0. 0.]
...
[ 4. 0. 0. ... 0. 0. 0.]
[ 4. 0. 0. ... 0. 0. 0.]
[ 4. 0. 0. ... 0. 0. 0.]]
Unique Shape: (1707, 32)
Percentage of unique values in hash array: 20.9242461387595
Any help and insights are greatly appreciated.

That very first 2 in the transformed data should be a clue. I think you'll also find that many of the columns are all-zero.
From the documentation,
Each sample must be iterable...
So the hasher treats the zip code '86916' as the collection of elements 8, 6, 9, 1, 6, and you get at most ten nonzero columns, one per digit (the first column is presumably the 6, which appears twice, hence the 2 noted at the beginning). You should be able to rectify this by making the input 2-dimensional, so that each sample is a one-element collection containing the whole zip code.
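A minimal sketch of that fix, assuming each zip code should count as a single token: wrap every zip code in its own one-element list before hashing. The four zip codes are taken from the output in the question; the rest follows the question's code.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

# A few of the zip codes shown in the question's output
zips = np.array(['86916', '01445', '37671', '82387'])

hasher = FeatureHasher(n_features=2**5, input_type="string")

# Each sample is now a list holding ONE string, so the whole zip code is hashed
# as a single feature instead of being split into its characters
hashed = hasher.transform([[z] for z in zips]).toarray()

nonzero_per_row = (hashed != 0).sum(axis=1)
print(hashed.shape)
print(nonzero_per_row)
```

Each row now has a single nonzero entry (+1 or -1). With only 2**5 columns that allows at most 64 distinct rows, so n_features would likely need to be much larger to keep collisions low for ~8k zip codes.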

How to turn a frequency domain graph to time domain graph

I wanted to convert the graph in the top red box for the frequency domain to the graph in the bottom red box for the time domain. What should I do?

Assuming you are using Python, you can use the irfft function from the SciPy library to get back the time-domain signal.
Example:
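from scipy.fftpack import rfft, irfft
data = [0, 1, 2, 3, 4, 5]
# forward transform (named spectrum so it does not shadow scipy's fft function)
spectrum = rfft(data)
print("FFT : ", spectrum)
# inverse transform back to the time domain
original_data = irfft(spectrum)
print("Original : ", original_data)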
Output :
FFT : [15. -3. 5.2 -3. 1.73 -3. ]
Original : [0. 1. 2. 3. 4. 5.]
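The same round trip can be sketched with NumPy's numpy.fft module. This is an alternative swapped in for illustration, not what the answer used: it returns the spectrum as complex numbers rather than in fftpack's packed real format, so the printed spectrum looks different from the output above.

```python
import numpy as np

data = np.array([0, 1, 2, 3, 4, 5], dtype=float)

# Forward real FFT: complex bins for the non-negative frequencies
spectrum = np.fft.rfft(data)

# Inverse real FFT; passing n makes the output length explicit
recovered = np.fft.irfft(spectrum, n=len(data))

print(spectrum)
print(recovered)
```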

How to handle multiple values of features in a particular record in machine learning?

I have a use case that I'm trying to solve using machine learning. Let's say I have input data in the form (X1, X2, X3, X4, X5, X6) and an output value Y. Consider the scenario where, for the same fixed set of (X1, X2, X3, X4), there are multiple values of (X5, X6), the values within each (X5, X6) pair are correlated, and Y changes with each pair. How do you formulate the training data for a machine learning model?
I could only think of the following ways to address the problem:
i. Have one row for each (X5, X6) pair and introduce a categorical rank column to indicate that these rows are correlated, i.e.:
X1   X2  X3   X4   X5    X6    Rank  Y
1.5  2   3.4  5.4  6.7   7.8   1     2.3
1.5  2   3.4  5.4  4.32  6.3   1     7.4
1.5  2   3.4  5.4  2.1   2.3   1     3.24
2.1  1   12   34   2     3.23  2     1.24
1.5  2   3.4  5.4  6.7   7.8   3     24.4
and so on...
ii. Explode the X5 and X6 features into multiple columns, one per value, but the problem here is that we have to limit the number of columns, and the correlation between X5 and X6 is lost.
The links below point to the code file and the input file for the existing real-time use case, with the actual feature names and output variable.
https://drive.google.com/open?id=178XEzd_5iPXGMBJUrqI5kvlwnDspwMhP
https://drive.google.com/file/d/18SA42kDlQto-PnR5fUpAcvXKlimGidOj/view

I can't see your code, but I got your data.
I think PRIORITY and SCHEDQT are the y data.
I used LEADTIME, BOMINV, BOMINFSW, SKUINV, SKUINFSW and COQTY as the x data. Note that the data is simply normalized to the range 0 to 1.
The result is better than before,
but it still does not predict well. Please see the result below:
[`LEADTIME`, `BOMINV`, `BOMINFSW`, `SKUINV`, `SKUINFSW`, `COQTY`] [Pre. PRIORITY] [Real PRIORITY]
[0.03333333 0.33333333 0. 0. 0. 0.05666667] [8.221004] [18.]
[0.03333333 0.33333333 0. 0. 0. 0.26666667] [8.221004] [19.]
[0.03333333 0.33333333 0. 0. 0. 0.16666667] [8.221004] [20.]
[1. 0. 1. 0. 0. 0.16666667] [8.221004] [1.]
[1. 0. 1. 0. 0. 0.26666667] [8.221004] [2.]
I think the field values do not differ enough to explain the differences in PRIORITY.
In the example above, LEADTIME, BOMINV, BOMINFSW, SKUINV and SKUINFSW are identical across the first three rows and again across the last two; only COQTY changes.
Then I tried removing the records in which LEADTIME, BOMINV or SKUINV is 0:
[0.2 0.33333333 0. 0.66666667 0. 0.33333333] [20.035915] [36.]
[0.2 0.33333333 0. 0.66666667 0. 0.46666667] [20.035915] [38.]
[0.2 0.33333333 0. 0.66666667 0. 0.6 ] [20.0352] [40.]
[0.2 0.33333333 0. 0.33333333 0. 0.16666667] [11.69006] [1.]
[0.2 0.33333333 0. 0.33333333 0. 0.26666667] [11.5476885] [2.]
But you can see that the result is still very far from the real values, because the x data does not show enough difference.
For now, I can only say that you need more features in the data for the model to learn enough.
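A per-column min-max scaling is one common way to get values into the 0 to 1 range mentioned above. The sketch below assumes that is what was meant (the answer does not say exactly how the scaling was done) and uses made-up numbers rather than the real file.

```python
import numpy as np

# Made-up raw feature rows (hypothetical values, not taken from the real data)
x_raw = np.array([
    [1.0, 10.0, 5.0],
    [3.0, 20.0, 5.0],
    [5.0, 40.0, 5.0],
])

col_min = x_raw.min(axis=0)
col_range = x_raw.max(axis=0) - col_min

# Constant columns have zero range; divide by 1 there so they map to 0
safe_range = np.where(col_range == 0, 1.0, col_range)
x_scaled = (x_raw - col_min) / safe_range

print(x_scaled)
```

A column that is constant, like the third one here, carries no information after scaling, which is the same kind of problem the answer describes for rows whose features barely differ.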

Neural Network for Regression with tflearn

My question is about coding a neural network which does regression (and NOT classification) using tflearn.
Dataset:
fixed acidity volatile acidity citric acid ... alcohol quality
7.4 0.700 0.00 ... 9.4 5
7.8 0.880 0.00 ... 9.8 5
7.8 0.760 0.04 ... 9.8 5
11.2 0.280 0.56 ... 9.8 6
7.4 0.700 0.00 ... 9.4 5
I want to build a neural network which takes in 11 features (chemical values of the wine) and predicts a score, i.e. quality (out of 10). I DON'T want to classify the wine into classes like quality_1, quality_2, ... I want the model to perform a regression on my features and predict a value out of 10 (it could even be a float).
The quality column in my data only has values = [3, 4, 5, 6, 7, 8, 9].
It does not contain 1, 2, and 10.
Due to my lack of experience, I could only code a neural network that CLASSIFIES the wine into classes like [score_3, score_4, ...], and I used one-hot encoding to do so.
Processed Data:
Features:
[[ 7.5999999 0.23 0.25999999 ..., 3.02999997 0.44
9.19999981]
[ 6.9000001 0.23 0.34999999 ..., 2.79999995 0.54000002
11. ]
[ 6.69999981 0.17 0.37 ..., 3.25999999 0.60000002
10.80000019]
...,
[ 6.30000019 0.28 0.47 ..., 3.11999989 0.50999999
9.5 ]
[ 5.19999981 0.64499998 0. ..., 3.77999997 0.61000001
12.5 ]
[ 8. 0.23999999 0.47999999 ..., 3.23000002 0.69999999
10. ]]
Labels:
[[ 0. 1. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 1. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 1. ..., 0. 0. 0.]]
Code for a neural network which CLASSIFIES into different classes:
import pandas as pd
import numpy as np
import tflearn
from tflearn.layers.core import input_data, fully_connected
from tflearn.layers.estimator import regression
from sklearn.model_selection import train_test_split
def preprocess():
    # Raw string so the backslashes in the Windows path are not treated as escapes
    data_source_red = r'F:\Gautam\...\Datasets\winequality-red.csv'
    data_red = pd.read_csv(data_source_red, index_col=False, sep=';')
    # One-hot encode the quality column (use data_red, the frame that was just loaded)
    data = pd.get_dummies(data_red, columns=['quality'], prefix=['score'])
    x = data[data.columns[0:11]].values
    y = data[data.columns[11:18]].values
    x = np.float32(x)
    y = np.float32(y)
    return (x, y)
x, y = preprocess()
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size = 0.2)
network = input_data(shape=[None, 11], name='Input_layer')
network = fully_connected(network, 10, activation='relu', name='Hidden_layer_1')
network = fully_connected(network, 10, activation='relu', name='Hidden_layer_2')
network = fully_connected(network, 7, activation='softmax', name='Output_layer')
network = regression(network, batch_size=2, optimizer='adam', learning_rate=0.01)
model = tflearn.DNN(network)
model.fit(train_x, train_y, show_metric=True, run_id='wine_regression',
validation_set=0.1, n_epoch=1000)
The neural network above is a poor one (accuracy=0.40). Moreover, it classifies the data into different classes. I would like to know how to code a regression neural network which gives a score out of 10 for the input features (and NOT a CLASSIFICATION). I would also prefer tflearn, as I'm quite comfortable with it.

This is the line in your code which makes your network a classifier with seven categories, instead of a regressor:
network = fully_connected(network, 7, activation='softmax', name='Output_layer')
I don't use TFLearn any more; I have switched over to Keras (which is similar and has better support). However, I suggest the following output layer instead:
network = fully_connected(network, 1, activation='linear', name='Output_layer')
Your training data will also need to change. For a regression you want a single scalar label per sample instead of a one-hot vector. I assume that you still have the original data, which you say you altered? If not, the UCI Machine Learning Repository has the wine quality data with a single numerical quality column.
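The label change can be sketched as follows: keep the quality column as one float per sample, shaped (n, 1) to match a one-unit output layer, instead of calling get_dummies on it. The tiny DataFrame reuses the first rows and three of the columns shown in the question in place of the real CSV. On the tflearn side, the regression layer most likely also needs a squared-error loss such as loss='mean_square' rather than its default; treat that as an assumption to check against the tflearn documentation.

```python
import numpy as np
import pandas as pd

# First rows of the wine table from the question (only three feature columns shown)
data = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 7.8, 11.2, 7.4],
    "volatile acidity": [0.700, 0.880, 0.760, 0.280, 0.700],
    "citric acid": [0.00, 0.00, 0.04, 0.56, 0.00],
    "quality": [5, 5, 5, 6, 5],
})

# Features: every column except the target
x = data.drop(columns=["quality"]).values.astype(np.float32)

# Target: one scalar per sample, as a column vector of shape (n, 1)
y = data["quality"].values.astype(np.float32).reshape(-1, 1)

print(x.shape, y.shape)
```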

Keras and the input layer

So I'm trying to learn ANNs with Keras, as I heard it is simpler than Theano or TensorFlow. I have a number of questions; the first is to do with the input layer.
So far I have this line of code as the input:
model.add(Dense(3 ,input_shape=(2,), batch_size=50 ,activation='relu'))
Now the data I want to add into the model is of the following shape:
Index(['stock_price', 'stock_volume', 'sentiment'], dtype='object')
[[ 3.01440000e+02 7.87830000e+04 0.00000000e+00]
[ 3.01440000e+02 7.87830000e+04 0.00000000e+00]
[ 3.01440000e+02 7.87830000e+04 1.42857143e-01]
[ 3.01440000e+02 7.87830000e+04 5.88235294e-02]
[ 3.01440000e+02 7.87830000e+04 0.00000000e+00]
[ 3.01440000e+02 7.87830000e+04 0.00000000e+00]
[ 3.01440000e+02 7.87830000e+04 0.00000000e+00]
[ 3.01440000e+02 7.87830000e+04 0.00000000e+00]
[ 3.01440000e+02 7.87830000e+04 0.00000000e+00]
[ 3.01440000e+02 7.87830000e+04 5.26315789e-02]]
I want to make a model to see if I can find a correlation between stock prices and tweet sentiment, and I just threw volume in there because eventually I want to see if it can find a pattern with that as well.
So my second question is after running my input layer with several different parameters I get this problem which I can't explain. So when I run this line:
model.add(Dense(3 ,input_shape=(2,), batch_size=50 ,activation='relu'))
with the following line I get this output error:
ValueError: Error when checking model input: expected dense_1_input to have shape (50, 2) but got array with shape (50, 3)
But when I change the input shape to the requested '3' I get this error:
ValueError: Error when checking model target: expected dense_2 to have shape (50, 1) but got array with shape (50, 302)
Why has the 2 changed into '302' on the error message?
I'm probably overlooking some really basic problems, since this is the first neural net I've tried to implement; I've only used the application form of Weka before.
Anyway here is a copy of my full code:
from keras.models import Sequential, Model
from keras.layers import Dense, Activation, Input
from keras.optimizers import SGD
from keras.utils import np_utils
import pymysql as mysql
import pandas as pd
import config
import numpy
import pprint
model = Sequential()
try:
    sql = "SELECT stock_price, stock_volume, sentiment FROM tweets LIMIT 50"
    con = mysql.connect(config.dbhost, config.dbuser, config.dbpassword, config.dbname, charset='utf8mb4', autocommit=True)
    results = pd.read_sql(sql=sql, con=con, columns=['stock_price', 'stock_volume', 'sentiment'])
finally:
    con.close()
npResults = results.as_matrix()
cols = np_utils.to_categorical(results['stock_price'].values)
data = results.values
print(cols)
# inputs:
# 1st = stock price
# 2nd = tweet sentiment
# 3rd = volume
model.add(Dense(3 ,input_shape=(3,), batch_size=50 ,activation='relu'))
model.add(Dense(20, activation='linear'))
sgd = SGD(lr=0.3, decay=0.01, momentum=0.2)
model.compile(loss='sparse_categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
model.summary()
model.fit(x=data, y=cols, epochs=100, batch_size=100, verbose=2)
EDIT:
Here is all the output I get from the console:
C:\Users\Def\Anaconda3\python.exe C:/Users/Def/Dropbox/Dissertation/ann.py
Using Theano backend.
C:\Users\Def\Dropbox\Dissertation
[[ 0. 0. 0. ..., 0. 0. 1.]
[ 0. 0. 0. ..., 0. 0. 1.]
[ 0. 0. 0. ..., 0. 0. 1.]
...,
[ 0. 0. 0. ..., 0. 0. 1.]
[ 0. 0. 0. ..., 0. 0. 1.]
[ 0. 0. 0. ..., 0. 0. 1.]]
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (50, 3) 12
_________________________________________________________________
dense_2 (Dense) (50, 20) 80
=================================================================
Traceback (most recent call last):
File "C:/Users/Def/Dropbox/Dissertation/ann.py", line 38, in <module>
model.fit(x=data, y=cols, epochs=100, batch_size=100, verbose=2)
File "C:\Users\Def\Anaconda3\lib\site-packages\keras\models.py", line 845, in fit
initial_epoch=initial_epoch)
File "C:\Users\Def\Anaconda3\lib\site-packages\keras\engine\training.py", line 1405, in fit
batch_size=batch_size)
File "C:\Users\Def\Anaconda3\lib\site-packages\keras\engine\training.py", line 1299, in _standardize_user_data
exception_prefix='model target')
File "C:\Users\Def\Anaconda3\lib\site-packages\keras\engine\training.py", line 133, in _standardize_input_data
str(array.shape))
ValueError: Error when checking model target: expected dense_2 to have shape (50, 20) but got array with shape (50, 302)
Total params: 92.0
Trainable params: 92
Non-trainable params: 0.0
_________________________________________________________________
Process finished with exit code 1

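I think you are using the wrong loss: sparse_categorical_crossentropy
Is there a reason you prefer this over the usual categorical_crossentropy?
When using categorical_crossentropy, you should encode your targets in one-hot fashion (for instance with cols = np_utils.to_categorical(results['stock_price'].values)). Note that to_categorical creates one column per integer class from 0 up to the largest label, so a stock price of about 301 yields 302 columns, which is where the 302 in your error message comes from.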
On the other hand, sparse_categorical_crossentropy uses integer-based labels.
So either use:
cols = np_utils.to_categorical(results['stock_price'].values)
with
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
and an output layer of (num-categories) neurons
or use:
cols = results['stock_price'].values.astype(numpy.int32)
with
model.compile(loss='sparse_categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
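and an output layer that still has (num-categories) neurons; only the targets change, from one-hot rows to a single integer class index per sample.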
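The two target encodings can be compared in plain NumPy. This is a sketch that mimics the one-hot layout to_categorical produces, written without Keras so it runs on its own; the three prices are the repeated stock_price value from the question's data.

```python
import numpy as np

# Repeated stock_price value from the question's data
prices = np.array([301.44, 301.44, 301.44])

# Integer class labels: one index per sample (what the sparse loss expects)
int_labels = prices.astype(np.int32)

# One-hot labels: one column per class from 0 up to the largest label
num_classes = int(int_labels.max()) + 1
one_hot = np.eye(num_classes, dtype=np.float32)[int_labels]

print(int_labels.shape)
print(one_hot.shape)
```

The one-hot array has 302 columns, matching the (50, 302) shape in the error message, while the integer labels stay one-dimensional.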
