Feature Hashing of zip codes with Scikit in machine learning

I am working on a machine learning problem where I have a lot of zip codes (~8k unique values) in my data set, so I decided to hash the values into a smaller feature space instead of using something like one-hot encoding (OHE).
The problem I encountered is that only a very small percentage (20%) of the rows in my hashed array are unique, which from my understanding means I have a lot of duplicates/collisions. Even when I increased the number of features in my hash table to ~200, I never got more than 20% unique values. This does not make sense to me, since with a growing number of columns in the hash, more unique combinations should be possible.
I used the following code to hash my zip codes with scikit-learn and to estimate the collisions from the number of unique rows in the final array:
import numpy as np
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

D = pd.unique(Daten["PLZ"])
print("Zipcode Data:", D, "\nZipcode Shape:", D.shape)
h = FeatureHasher(n_features=2**5, input_type="string")
f = h.transform(D)
f = f.toarray()
print("Feature Array:\n", f, "\nFeature Shape:", f.shape)
unq = np.unique(f, axis=0)
print("Unique values:\n", unq, "\nUnique Shape:", unq.shape)
print("Percentage of unique values in hash array:", unq.shape[0]/f.shape[0]*100)
As output I received:
Zipcode Data: ['86916' '01445' '37671' ... '82387' '83565' '83550']
Zipcode Shape: (8158,)
Feature Array:
[[ 2. 1. 0. ... 0. 0. 0.]
[ 0. -1. 0. ... 0. 0. 0.]
[ 1. 0. 0. ... 0. 0. 0.]
...
[ 0. 0. 0. ... 0. 0. 0.]
[ 1. 0. 0. ... 0. 0. 0.]
[ 0. -1. 0. ... 0. 0. 0.]]
Feature Shape: (8158, 32)
Unique values:
[[ 0. -3. 0. ... 0. 0. 0.]
[ 0. -2. 0. ... 0. 0. 0.]
[ 0. -2. 0. ... 0. 0. 0.]
...
[ 4. 0. 0. ... 0. 0. 0.]
[ 4. 0. 0. ... 0. 0. 0.]
[ 4. 0. 0. ... 0. 0. 0.]]
Unique Shape: (1707, 32)
Percentage of unique values in hash array: 20.9242461387595
Any help and insights are greatly appreciated.

That very first 2 in the transformed data should be a clue. I think you'll also find that many of the columns are all-zero.
From the documentation,
Each sample must be iterable...
So the hasher is treating the zip code '86916' as a collection of the characters 8, 6, 9, 1, 6, and since there are only ten possible digits, at most ten columns can ever be nonzero (the first column presumably being the 6, which appears twice, as noted at the beginning). You should be able to rectify this by reshaping the input to be 2-dimensional, so that each sample is a one-element collection containing the whole zip code string.
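A minimal sketch of that fix, assuming the Daten DataFrame from the question; wrapping each zip code in a one-element list makes the hasher see the whole string as a single token rather than iterating over its characters:

import pandas as pd
from sklearn.feature_extraction import FeatureHasher

D = pd.unique(Daten["PLZ"])
h = FeatureHasher(n_features=2**5, input_type="string")
# One-element list per sample: the hasher now hashes '86916' as a single
# token instead of hashing each of its five characters separately.
f = h.transform([[z] for z in D]).toarray()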


Deterministic action collision handling in population simulations

I'm working on the theoretical framework for my own simulation environment. I want to simulate evolutionary algorithms on populations, but I don't know how to handle conflicting actions between multiple individuals.
The simulation has discrete time steps and takes place on a board (or tile grid) with a random population of different individuals. Each individual has some inputs, internal state and a list of possible actions to take each step. For example, an individual can read its position on the grid (input) and move one tile in a certain direction (action). Now, let's say I have two individuals A and B. Both perform a certain action during the same simulation step which would result in both individuals ending up on the same tile. This, however, is forbidden by the rules of the environment.
In more abstract terms: my simulation is in a valid state S1. Due to independent actions taken by multiple individuals, the next state S2 (after one simulation step) would be an invalid state.
How does one resolve these conflicts / collisions so I never end up in an invalid state?
I want the simulation to be replayable so the behavior should be deterministic.
Another question is fairness. Let's say I resolve conflicts on a "whoever comes first wins" basis. Because, in theory, all actions happen at the same time (discrete time steps), "whoever comes first" isn't a measure of time but of data layout: individuals that are processed earlier have an advantage simply because they happen to sit in favorable locations in the internal data structures (i.e., at a lower index in the array).
Is there a way to guarantee fairness? If not, how can I reduce unfairness?
I know these are very broad questions, but since I haven't worked out all the constraints and rules of the simulation yet, I wanted to get an overview of what's even possible, or perhaps common practice, in these systems. I'm happy about any pointers for further research.
The question is overwhelmingly broad, but here is my answer for the following case:
Agents move on a square (grid) board with cyclic boundary conditions
Possible moves are: a) stay where you are, b) move to one of the 8 adjacent positions
The possible moves are assigned a probability
Conflicts will happen for every cell targeted by two agents, since we assume no two agents can occupy the same cell (exclusion). We'll solve each conflict by re-rolling the dice until we obtain no conflict.
The general idea:
1. Roll the dice for every agent i --> targeted cell of agent i.
2. Are any two agents targeting the same cell? If yes: for every pair of conflicting agents, re-roll the dice for BOTH agents.
Notes:
a) Conflicts are detected when an agent is missing from the proposed board, because it has been "crushed" (overwritten) by another agent owing to the relative positions in the agents' list.
b) Here, I assume that re-rolling the dice for both agents is a "fair" treatment, since no arbitrary decision has to be taken.
3. Did that solve the problem? If not, go back to 2).
4. Move the agents to the new positions and go back to 1).
I provide a Python program below. No fancy graphics: it runs in a terminal.
The default parameters are:
Board_size = 4
Nb_of_agents = 8 (50% occupation)
If you want to see how it scales with problem size, set Verbose=False,
otherwise you'll be flooded with output. Note: -1 means an empty cell.
EXAMPLES OF OUTPUT:
Note: I used pretty high occupancies (50% and 25%) for
examples 1 and 2. Much lower occupancies will result in
no conflicts most of the time.
##################################################################
EXAMPLE 1:
VERBOSE=True
Board = 4 x 4
Nb. of agents: 8 (occupation 50%)
==============================================================
Turn: 0
Old board:
[[-1. 7. 3. 0.]
[ 6. -1. 4. 2.]
[-1. -1. 5. -1.]
[-1. 1. -1. -1.]]
Proposed new board:
[[ 1. -1. -1. -1.]
[-1. 4. -1. -1.]
[-1. 6. -1. 2.]
[-1. 7. 5. -1.]]
# of conflicts to solve: 2
Conflicts to solve: [agent_a, agent_b, targeted cell]:
[0, 1, array([0, 0])]
[3, 5, array([2, 3])]
Proposed new board:
[[-1. -1. 1. 3.]
[-1. 4. -1. 5.]
[-1. 6. -1. 2.]
[-1. 7. -1. 0.]]
No conflicts
<<< OUTPUT >>>
Old board:
[[-1. 7. 3. 0.]
[ 6. -1. 4. 2.]
[-1. -1. 5. -1.]
[-1. 1. -1. -1.]]
Definitive new board:
[[-1. -1. 1. 3.]
[-1. 4. -1. 5.]
[-1. 6. -1. 2.]
[-1. 7. -1. 0.]]
==============================================================
Turn: 1
Old board:
[[-1. -1. 1. 3.]
[-1. 4. -1. 5.]
[-1. 6. -1. 2.]
[-1. 7. -1. 0.]]
Proposed new board:
[[ 3. -1. -1. -1.]
[ 5. -1. 4. -1.]
[ 7. -1. -1. -1.]
[ 6. 1. -1. 2.]]
# of conflicts to solve: 1
Conflicts to solve: [agent_a, agent_b, targeted cell]:
[0, 6, array([0, 3])]
Proposed new board:
[[ 3. -1. -1. -1.]
[ 5. -1. 4. -1.]
[ 7. -1. -1. -1.]
[-1. 6. -1. 2.]]
# of conflicts to solve: 2
Conflicts to solve: [agent_a, agent_b, targeted cell]:
[0, 7, array([0, 2])]
[1, 6, array([1, 3])]
Proposed new board:
[[ 3. 1. -1. -1.]
[ 5. -1. 4. -1.]
[ 0. 7. -1. -1.]
[ 6. -1. -1. 2.]]
No conflicts
<<< OUTPUT >>>
Old board:
[[-1. -1. 1. 3.]
[-1. 4. -1. 5.]
[-1. 6. -1. 2.]
[-1. 7. -1. 0.]]
Definitive new board:
[[ 3. 1. -1. -1.]
[ 5. -1. 4. -1.]
[ 0. 7. -1. -1.]
[ 6. -1. -1. 2.]]
==============================================================
##################################################################
EXAMPLE 2:
VERBOSE=False
Board = 200 x 200
Nb. of agents: 10000 (occupation 25%)
==============================================================
Turn: 0
# of conflicts to solve: 994
# of conflicts to solve: 347
# of conflicts to solve: 137
# of conflicts to solve: 63
# of conflicts to solve: 24
# of conflicts to solve: 10
# of conflicts to solve: 6
# of conflicts to solve: 4
# of conflicts to solve: 2
No conflicts
==============================================================
Turn: 1
# of conflicts to solve: 1002
# of conflicts to solve: 379
# of conflicts to solve: 150
# of conflicts to solve: 62
# of conflicts to solve: 27
# of conflicts to solve: 9
# of conflicts to solve: 2
No conflicts
==============================================================
The program (in Python):
#!/usr/bin/env python
# coding: utf-8
import numpy as np

np.random.seed(1)  # will reproduce the examples

# Verbose: if True, show the boards (ok for small boards)
Verbose = True
# max nb of turns
MaxTurns = 2

Board_size = 4
Nb_of_cells = Board_size**2
Nb_of_agents = 8  # should be < Board_size**2

agent_health = np.ones(Nb_of_agents)          # Example 1: all agents move (see choose_move)
# agent_health = np.random.rand(Nb_of_agents) # With this: the probability of moving is given by health
# agent_health = 0.8*np.ones(Nb_of_agents)    # With this: 80% of the time they move, 20% they stay in place

possible_moves = np.array([[0, 0],
                           [-1, -1], [-1, 0], [-1, +1],
                           [0, -1],           [0, +1],
                           [+1, -1], [+1, 0], [+1, +1]])
Nb_of_possible_moves = len(possible_moves)


def choose_move(agent, health):
    # Each agent randomly chooses a move among the possible moves,
    # with a mobility proportional to health (uses the global prob array).
    prob[0] = 1 - agent_health[agent]  # low health --> low mobility
    prob[1:9] = (1 - prob[0]) / 8
    move = np.random.choice(Nb_of_possible_moves, 1, p=prob)
    return move


def identify_conflicts_to_solve(missing_agents, Nb_of_agents, Old_X, Old_Y):
    # 1) Identify conflicts to solve:
    # target_A: cells targeted by the missing (overwritten) agents
    # target_B: cells targeted by all the other agents
    target_A = []
    target_B = []
    [target_A.append([a, (Old_X[a]+possible_moves[move[a]][0]) % Board_size,
                      (Old_Y[a]+possible_moves[move[a]][1]) % Board_size])
     for a in missing_agents]
    [target_B.append([a, (Old_X[a]+possible_moves[move[a]][0]) % Board_size,
                      (Old_Y[a]+possible_moves[move[a]][1]) % Board_size])
     for a in range(Nb_of_agents) if a not in missing_agents]
    target_A = np.array(target_A)
    target_B = np.array(target_B)
    conflicts_to_solve = []
    for m in range(len(target_A)):
        for opponent in range(len(target_B[:, 0])):
            if all(target_A[m, 1:3] == target_B[opponent, 1:3]):  # they target the same cell
                conflicts_to_solve.append([target_A[m, 0], target_B[opponent, 0], target_A[m, 1:3]])
    return conflicts_to_solve


# Fill the board with -1 (-1 meaning: empty cell)
Old_Board = -np.ones(Nb_of_cells)
# Choose a cell on the board for each agent:
# position = index of the occupied cell
Old_indices = np.random.choice(Nb_of_cells, size=Nb_of_agents, replace=False)
# We populate the board
for i in range(Nb_of_agents):
    Old_Board[Old_indices[i]] = i
New_Board = Old_Board
# Coordinates: we assume a cyclic board
Old_X = np.array([Old_indices[i] % Board_size for i in range(len(Old_indices))])   # X position of agent i
Old_Y = np.array([Old_indices[i] // Board_size for i in range(len(Old_indices))])  # Y position of agent i
# Define other properties
move = np.zeros(Nb_of_agents, dtype=int)
prob = np.zeros(Nb_of_possible_moves)

print('==============================================================')
for turn in range(MaxTurns):
    print("Turn: ", turn)
    if Verbose:
        print('Old board:')
        print(New_Board.reshape(Board_size, Board_size))
    Nb_of_occupied_cells_before_the_move = len(Old_Board[Old_Board > -1])
    Legal_move = False
    while not Legal_move:
        for i in range(0, Nb_of_agents):
            move[i] = choose_move(agent=i, health=agent_health[i])
        conflicts_to_solve = -1
        while conflicts_to_solve != []:
            # New coordinates (with cyclic boundary conditions):
            New_X = np.array([(Old_X[i]+possible_moves[move[i]][0]) % Board_size for i in range(Nb_of_agents)])
            New_Y = np.array([(Old_Y[i]+possible_moves[move[i]][1]) % Board_size for i in range(Nb_of_agents)])
            # New board
            New_indices = New_Y*Board_size + New_X
            New_Board = -np.ones(Board_size**2)  # fill the board with -1 (-1 meaning: empty cell)
            for i in range(Nb_of_agents):  # populate the new board
                New_Board[New_indices[i]] = i
            # Look for missing agents: an agent is missing if it has been "overwritten" by another agent,
            # indicating conflicts in reaching a particular cell
            missing_agents = [agent for agent in range(Nb_of_agents) if agent not in New_Board]
            # 1) Identify conflicts to solve:
            conflicts_to_solve = identify_conflicts_to_solve(missing_agents, Nb_of_agents, Old_X, Old_Y)
            if Verbose:
                print('Proposed new board:')
                print(New_Board.reshape(Board_size, Board_size))
            if len(conflicts_to_solve) > 0:
                print("# of conflicts to solve: ", len(conflicts_to_solve))
                if Verbose:
                    print('Conflicts to solve: [agent_a, agent_b, targeted cell]: ')
                    for c in conflicts_to_solve:
                        print(c)
            else:
                print("No conflicts")
            # 2) Solve conflicts
            # Re-rolling the dice for BOTH conflicting agents is "fair",
            # since no arbitrary decision has to be taken.
            for c in conflicts_to_solve:
                move[c[0]] = choose_move(c[0], agent_health[c[0]])  # re-choose a move for "a"
                move[c[1]] = choose_move(c[1], agent_health[c[1]])  # re-choose a move for "b"
        Nb_of_occupied_cells_after_the_move = len(New_Board[New_Board > -1])
        Legal_move = Nb_of_occupied_cells_before_the_move == Nb_of_occupied_cells_after_the_move
        if not Legal_move:
            # Note: in principle this should never happen, but
            # better to check than be sorry...
            print("Problem: Illegal move")
            raise SystemExit  # we stop there
    if Verbose:
        print("<<< OUTPUT >>>")
        print("Old board:")
        print(Old_Board.reshape(Board_size, Board_size))
        print()
        print("Definitive new board:")
        print(New_Board.reshape(Board_size, Board_size))
    print('==============================================================')
    Old_X = New_X
    Old_Y = New_Y
    Old_indices = New_indices
    Old_Board = New_Board
Due to the "independent actions taken by multiple individuals", I suppose there is no way to avoid potential collisions, and hence you need some mechanism for resolving them.
A fair version of your "whoever comes first" approach could involve shuffling the individuals randomly at the beginning of each time step, i.e. choosing a new, random processing order for your individuals in every time step.
If you fix the random seed, the simulation results will still be deterministic.
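A minimal sketch of that idea, assuming hypothetical individuals objects with an act() method; the seeded generator and the fresh permutation per step are the load-bearing parts:

import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed -> replayable simulation

def process_time_step(individuals):
    # A fresh random processing order every step removes the fixed
    # array-index advantage while staying deterministic under the seed.
    order = rng.permutation(len(individuals))
    for idx in order:
        individuals[idx].act()  # hypothetical action hook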
If the individuals acquire some type of score / fitness during the simulation, this could also be used to resolve conflicts, e.g. a conflict is always won by whoever has the highest fitness (you would then need an additional rule for ties).
Or choose a random winner with winning probability proportional to fitness: if individuals 1 and 2 have fitness f1 and f2, then the probability of 1 winning would be f1/(f1+f2) and the probability of 2 winning would be f2/(f1+f2). Ties (f1 = f2) would then be resolved automatically.
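A minimal sketch of that fitness-proportional rule, reusing the seeded generator from above to keep it deterministic:

import numpy as np

rng = np.random.default_rng(seed=42)

def resolve_conflict(f1, f2):
    # Individual 1 wins with probability f1/(f1+f2); ties (f1 == f2)
    # come out 50/50 automatically.
    return 1 if rng.random() < f1 / (f1 + f2) else 2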
I guess those fitness-based rules could be called fair, as long as:
Every individual has the same starting fitness (or starting fitness is also set randomly)
Every individual has the same chance of acquiring a high fitness, e.g. all starting positions have the same outlook or starting positions are set randomly

How to handle multiple values of features in a particular record in machine learning?

I have a use case that I'm trying to solve using machine learning. Let's say I have input data in the form (X1, X2, X3, X4, X5, X6) and an output value Y. Consider the scenario where you have multiple values of (X5, X6) (each such pair is correlated) for the same fixed set of (X1, X2, X3, X4), and the Y value changes for each (X5, X6) pair. How do you formulate the training data for a machine learning model?
I could only think of the following ways to address the problem:
i. Have one row per (X5, X6) pair and introduce a rank/group categorical column to mark which input rows are correlated, i.e.:
X1   X2  X3   X4   X5    X6    Rank  Y
1.5  2   3.4  5.4  6.7   7.8   1     2.3
1.5  2   3.4  5.4  4.32  6.3   1     7.4
1.5  2   3.4  5.4  2.1   2.3   1     3.24
2.1  1   12   34   2     3.23  2     1.24
1.5  2   3.4  5.4  6.7   7.8   3     24.4
and so on...
ii. Explode the X5 and X6 features into multiple columns, one column pair per value set (a rough sketch of this widening follows below); the problems here are that we have to limit the number of columns, and that the correlation between X5 and X6 is lost.
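A hypothetical pandas sketch of option ii, assuming a long-format DataFrame df with the columns above (handling of Y is left out); every (X5, X6) pair within a fixed (X1..X4) group becomes its own pair of wide columns:

import pandas as pd

# Number the (X5, X6) pairs within each fixed (X1..X4) group...
df['set'] = df.groupby(['X1', 'X2', 'X3', 'X4']).cumcount()
# ...then pivot them out into wide columns X5_0, X6_0, X5_1, X6_1, ...
wide = (df.set_index(['X1', 'X2', 'X3', 'X4', 'set'])[['X5', 'X6']]
          .unstack('set'))
wide.columns = [f'{col}_{k}' for col, k in wide.columns]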
Below are links to the code file and the input file for the existing real-world use case, with the actual feature names and output variable:
https://drive.google.com/open?id=178XEzd_5iPXGMBJUrqI5kvlwnDspwMhP
https://drive.google.com/file/d/18SA42kDlQto-PnR5fUpAcvXKlimGidOj/view
I can't see your code, but I did get your data.
I think PRIORITY and SCHEDQT are the y data.
And I used LEADTIME, BOMINV, BOMINFSW, SKUINV, SKUINFSW and COQTY as the x data. Note that the data was simply normalized to the range 0~1.
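That normalization is presumably plain min-max scaling; a sketch, assuming a DataFrame df with the columns named above:

from sklearn.preprocessing import MinMaxScaler

x_cols = ['LEADTIME', 'BOMINV', 'BOMINFSW', 'SKUINV', 'SKUINFSW', 'COQTY']
X = MinMaxScaler().fit_transform(df[x_cols])  # each column mapped onto [0, 1]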
It's better than before, but it still doesn't predict well. Please refer to the result below:
[`LEADTIME`, `BOMINV`, `BOMINFSW`, `SKUINV`, `SKUINFSW`, `COQTY`] [Pre. PRIORITY] [Real PRIORITY]
[0.03333333 0.33333333 0. 0. 0. 0.05666667] [8.221004] [18.]
[0.03333333 0.33333333 0. 0. 0. 0.26666667] [8.221004] [19.]
[0.03333333 0.33333333 0. 0. 0. 0.16666667] [8.221004] [20.]
[1. 0. 1. 0. 0. 0.16666667] [8.221004] [1.]
[1. 0. 1. 0. 0. 0.26666667] [8.221004] [2.]
I think the field values cannot make enough of a difference in the PRIORITY result.
In the example above, LEADTIME, BOMINV, BOMINFSW, SKUINV and SKUINFSW are all the same.
So I then tried removing records where LEADTIME, BOMINV or SKUINV is 0:
[0.2 0.33333333 0. 0.66666667 0. 0.33333333] [20.035915] [36.]
[0.2 0.33333333 0. 0.66666667 0. 0.46666667] [20.035915] [38.]
[0.2 0.33333333 0. 0.66666667 0. 0.6 ] [20.0352] [40.]
[0.2 0.33333333 0. 0.33333333 0. 0.16666667] [11.69006] [1.]
[0.2 0.33333333 0. 0.33333333 0. 0.26666667] [11.5476885] [2.]
But you can see that the result is still very far from the real values, because the x data does not show enough variation.
For now, I can only say that you need more features in your data for the model to learn enough.

Neural Network for Regression with tflearn

My question is about coding a neural network which does regression (and NOT classification) using tflearn.
Dataset:
fixed acidity  volatile acidity  citric acid  ...  alcohol  quality
7.4            0.700             0.00         ...  9.4      5
7.8            0.880             0.00         ...  9.8      5
7.8            0.760             0.04         ...  9.8      5
11.2           0.280             0.56         ...  9.8      6
7.4            0.700             0.00         ...  9.4      5
I want to build a neural network which takes in 11 features (chemical values in wine) and outputs or predicts a score, i.e. quality (out of 10). I DON'T want to classify the wine as quality_1, quality_2, ...; I want the model to perform regression on my features and predict a value out of 10 (it could even be a float).
The quality column in my data only has the values [3, 4, 5, 6, 7, 8, 9]; it does not contain 1, 2, or 10.
Due to my lack of experience, I could only code a neural network that CLASSIFIES the wine into classes like [score_3, score_4, ...], and I used one-hot encoding to do so.
Processed Data:
Features:
[[ 7.5999999 0.23 0.25999999 ..., 3.02999997 0.44
9.19999981]
[ 6.9000001 0.23 0.34999999 ..., 2.79999995 0.54000002
11. ]
[ 6.69999981 0.17 0.37 ..., 3.25999999 0.60000002
10.80000019]
...,
[ 6.30000019 0.28 0.47 ..., 3.11999989 0.50999999
9.5 ]
[ 5.19999981 0.64499998 0. ..., 3.77999997 0.61000001
12.5 ]
[ 8. 0.23999999 0.47999999 ..., 3.23000002 0.69999999
10. ]]
Labels:
[[ 0. 1. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 1. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 1. ..., 0. 0. 0.]]
Code for a neural network which CLASSIFIES into different classes:
import pandas as pd
import numpy as np
import tflearn
from tflearn.layers.core import input_data, fully_connected
from tflearn.layers.estimator import regression
from sklearn.model_selection import train_test_split

def preprocess():
    data_source_red = r'F:\Gautam\...\Datasets\winequality-red.csv'
    data_red = pd.read_csv(data_source_red, index_col=False, sep=';')
    # One-hot encode the quality column into score_3 ... score_9
    data = pd.get_dummies(data_red, columns=['quality'], prefix=['score'])
    x = data[data.columns[0:11]].values
    y = data[data.columns[11:18]].values
    x = np.float32(x)
    y = np.float32(y)
    return (x, y)

x, y = preprocess()
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2)

network = input_data(shape=[None, 11], name='Input_layer')
network = fully_connected(network, 10, activation='relu', name='Hidden_layer_1')
network = fully_connected(network, 10, activation='relu', name='Hidden_layer_2')
network = fully_connected(network, 7, activation='softmax', name='Output_layer')
network = regression(network, batch_size=2, optimizer='adam', learning_rate=0.01)
model = tflearn.DNN(network)
model.fit(train_x, train_y, show_metric=True, run_id='wine_regression',
          validation_set=0.1, n_epoch=1000)
The neural network above is a poor one (accuracy = 0.40), and moreover it classifies the data into different classes. I would like to know how to code a regression neural network which gives a score out of 10 for the input features (and NOT a classification). I would also prefer tflearn, as I'm quite comfortable with it.
This is the line in your code which makes your network a classifier with seven categories, instead of a regressor:
network = fully_connected(network, 7, activation='softmax', name='Output_layer')
I don't use TFLearn any more, I have switched over to Keras (which is similar, and has better support). However, I will suggest that you want the following output layer instead:
network = fully_connected(network, 1, activation='linear', name='Output_layer')
Also, your training data will need to change. If you want to perform a regression, you want a one-dimensional scalar label instead. I assume that you still have the original data, which you say you altered? If not, the UC Irvine Machine Learning Repository has the wine quality data with a single, numerical quality column.
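Putting the pieces together, a hedged sketch of the regression variant in tflearn, assuming the 11 input features from the question and labels kept as the raw quality scores (no one-hot encoding); 'mean_square' is tflearn's mean-squared-error loss:

import tflearn
from tflearn.layers.core import input_data, fully_connected
from tflearn.layers.estimator import regression

net = input_data(shape=[None, 11], name='Input_layer')
net = fully_connected(net, 10, activation='relu', name='Hidden_layer_1')
net = fully_connected(net, 10, activation='relu', name='Hidden_layer_2')
net = fully_connected(net, 1, activation='linear', name='Output_layer')  # one scalar output
net = regression(net, optimizer='adam', learning_rate=0.01,
                 loss='mean_square')  # MSE instead of cross-entropy

model = tflearn.DNN(net)
# train_y should now be the raw quality column, shaped (n_samples, 1):
# model.fit(train_x, train_y.reshape(-1, 1), n_epoch=100, validation_set=0.1)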

How to preprocess nominal data attributes in data mining? [closed]

I have done one-hot encoding on the nominal data attributes (dataset shown in the attached image), but later I want to do clustering on the data, so please suggest a feasible solution. I am new to data mining.
Considering that you did not provide proper information about your problem, and since it seems you are new to these concepts, I provide an entire solution on fake data; you can learn from it and take the points you need to work out your own solution.
I have implemented it in Python and assume that you are familiar with scikit-learn:
from numpy import array
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn import cluster
# define example
data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
values = array(data)
print(values)
# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print(integer_encoded)
# binary encode
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)
# Clustering step:
kmeans = cluster.KMeans(n_clusters=3)
kmeans.fit(onehot_encoded)
print(kmeans.labels_)
And here is the result:
['cold' 'cold' 'warm' 'cold' 'hot' 'hot' 'warm' 'cold' 'warm' 'hot']
[[1. 0. 0.]
[1. 0. 0.]
[0. 0. 1.]
[1. 0. 0.]
[0. 1. 0.]
[0. 1. 0.]
[0. 0. 1.]
[1. 0. 0.]
[0. 0. 1.]
[0. 1. 0.]]
The cluster labels for the above data are:
[1 1 2 1 0 0 2 1 2 0]

Keras and the input layer

So I'm trying to learn ANNs with Keras, as I heard it is simpler than Theano or TensorFlow. I have a number of questions; the first is to do with the input layer.
So far I have this line of code as the input:
model.add(Dense(3, input_shape=(2,), batch_size=50, activation='relu'))
Now the data I want to add into the model is of the following shape:
Index(['stock_price', 'stock_volume', 'sentiment'], dtype='object')
[[ 3.01440000e+02 7.87830000e+04 0.00000000e+00]
[ 3.01440000e+02 7.87830000e+04 0.00000000e+00]
[ 3.01440000e+02 7.87830000e+04 1.42857143e-01]
[ 3.01440000e+02 7.87830000e+04 5.88235294e-02]
[ 3.01440000e+02 7.87830000e+04 0.00000000e+00]
[ 3.01440000e+02 7.87830000e+04 0.00000000e+00]
[ 3.01440000e+02 7.87830000e+04 0.00000000e+00]
[ 3.01440000e+02 7.87830000e+04 0.00000000e+00]
[ 3.01440000e+02 7.87830000e+04 0.00000000e+00]
[ 3.01440000e+02 7.87830000e+04 5.26315789e-02]]
I want to make a model to see if I can find a correlation between stock prices and tweet sentiment, and I just threw volume in there because eventually I want to see if it can find a pattern with that as well.
So my second question is: after running my input layer with several different parameters, I get a problem which I can't explain. When I run this line:
model.add(Dense(3, input_shape=(2,), batch_size=50, activation='relu'))
I get this error:
ValueError: Error when checking model input: expected dense_1_input to have shape (50, 2) but got array with shape (50, 3)
But when I change the input shape to the requested '3', I get this error:
ValueError: Error when checking model target: expected dense_2 to have shape (50, 1) but got array with shape (50, 302)
Why has the 2 changed into '302' in the error message?
I'm probably overlooking some really basic problem, since this is the first neural net I've tried to implement; I've only used the Weka GUI application before.
Anyway here is a copy of my full code:
from keras.models import Sequential, Model
from keras.layers import Dense, Activation, Input
from keras.optimizers import SGD
from keras.utils import np_utils
import pymysql as mysql
import pandas as pd
import config
import numpy
import pprint

model = Sequential()

try:
    sql = "SELECT stock_price, stock_volume, sentiment FROM tweets LIMIT 50"
    con = mysql.connect(config.dbhost, config.dbuser, config.dbpassword, config.dbname,
                        charset='utf8mb4', autocommit=True)
    results = pd.read_sql(sql=sql, con=con, columns=['stock_price', 'stock_volume', 'sentiment'])
finally:
    con.close()

npResults = results.as_matrix()
cols = np_utils.to_categorical(results['stock_price'].values)
data = results.values
print(cols)

# inputs:
# 1st = stock price
# 2nd = tweet sentiment
# 3rd = volume
model.add(Dense(3, input_shape=(3,), batch_size=50, activation='relu'))
model.add(Dense(20, activation='linear'))
sgd = SGD(lr=0.3, decay=0.01, momentum=0.2)
model.compile(loss='sparse_categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
model.summary()
model.fit(x=data, y=cols, epochs=100, batch_size=100, verbose=2)
EDIT:
Here is all the output I get from the console:
C:\Users\Def\Anaconda3\python.exe C:/Users/Def/Dropbox/Dissertation/ann.py
Using Theano backend.
C:\Users\Def\Dropbox\Dissertation
[[ 0. 0. 0. ..., 0. 0. 1.]
[ 0. 0. 0. ..., 0. 0. 1.]
[ 0. 0. 0. ..., 0. 0. 1.]
...,
[ 0. 0. 0. ..., 0. 0. 1.]
[ 0. 0. 0. ..., 0. 0. 1.]
[ 0. 0. 0. ..., 0. 0. 1.]]
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (50, 3) 12
_________________________________________________________________
dense_2 (Dense) (50, 20) 80
=================================================================
Traceback (most recent call last):
File "C:/Users/Def/Dropbox/Dissertation/ann.py", line 38, in <module>
model.fit(x=data, y=cols, epochs=100, batch_size=100, verbose=2)
File "C:\Users\Def\Anaconda3\lib\site-packages\keras\models.py", line 845, in fit
initial_epoch=initial_epoch)
File "C:\Users\Def\Anaconda3\lib\site-packages\keras\engine\training.py", line 1405, in fit
batch_size=batch_size)
File "C:\Users\Def\Anaconda3\lib\site-packages\keras\engine\training.py", line 1299, in _standardize_user_data
exception_prefix='model target')
File "C:\Users\Def\Anaconda3\lib\site-packages\keras\engine\training.py", line 133, in _standardize_input_data
str(array.shape))
ValueError: Error when checking model target: expected dense_2 to have shape (50, 20) but got array with shape (50, 302)
Total params: 92.0
Trainable params: 92
Non-trainable params: 0.0
_________________________________________________________________
Process finished with exit code 1
I think you are using the wrong loss function: sparse_categorical_crossentropy.
Is there a reason you prefer this over the normal categorical_crossentropy?
When using categorical_crossentropy, you should encode your targets in one-hot fashion (using, for instance: cols = np_utils.to_categorical(results['stock_price'].values)).
On the other hand, sparse_categorical_crossentropy uses integer-based labels.
So either use:
cols = np_utils.to_categorical(results['stock_price'].values)
with
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
and an output layer of (num-categories) neurons
or use:
cols = results['stock_price'].values.astype(np.int32)
with
model.compile(loss='sparse_categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
and keep a (num-categories)-neuron softmax output layer; the sparse loss only changes the label encoding (integer class indices instead of one-hot rows), not the number of output neurons.
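A short sketch contrasting the two target encodings, using hypothetical 3-class integer labels:

import numpy as np
from keras.utils import np_utils

labels = np.array([0, 2, 1, 2])            # integer class ids

one_hot = np_utils.to_categorical(labels)  # shape (4, 3) -> categorical_crossentropy
sparse = labels.astype(np.int32)           # shape (4,)   -> sparse_categorical_crossentropy

Note, too, that to_categorical(results['stock_price'].values) treats each stock price as a class index, so a maximum price around 301 yields 302 one-hot columns; that is presumably where the (50, 302) target shape in your error message comes from.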
