I am struggling with the following problem:
I have a dataset of products (cars), each of which has certain work orders open at a given time. From historical data I know how much time this work caused in TOTAL for each car.
Now I want to predict that total for another car (e.g. Car 3).
Which type of algorithm or regression should I use for this?
My idea was to transform this row-based dataset into a column-based one with binary values, e.g. Brake: 0/1, Screen: 0/1. But then I will have a lot of inputs, since there are 100-200 possible work orders.
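The row-to-column transformation described above could look roughly like the sketch below. The record layout, the car names and the helper `to_indicator_matrix` are all made up for illustration; the real dataset may be shaped differently.

```python
# Hypothetical row-based records: one (car, open work order) pair per row.
records = [
    ("Car 1", "Brake"),
    ("Car 1", "Screen"),
    ("Car 2", "Brake"),
    ("Car 3", "Engine"),
    ("Car 3", "Screen"),
]

def to_indicator_matrix(records):
    """Return (cars, orders, matrix); matrix[i][j] is 1 if car i has order j open."""
    cars = sorted({car for car, _ in records})
    orders = sorted({order for _, order in records})
    row = {car: i for i, car in enumerate(cars)}
    col = {order: j for j, order in enumerate(orders)}
    matrix = [[0] * len(orders) for _ in cars]
    for car, order in records:
        matrix[row[car]][col[order]] = 1
    return cars, orders, matrix

cars, orders, X = to_indicator_matrix(records)
for car, x in zip(cars, X):
    print(car, dict(zip(orders, x)))
```

Each row of `X` is then one observation for a regression, with one 0/1 column per possible work order.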
Here's a quick idea using multi-factor regression for 30 jobs, each of which is some random accumulation of 6 tasks with a "true cost" for each task. We can regress against the task selections in each job to estimate the cost coefficients that best explain the total job costs.
First done w/ no "noise" in the system (tasks are exact), then with some random noise.
A "more thorough" job would include examining the R-squared value and plotting the residuals to ensure linearity.
In [1]: from sklearn import linear_model
In [2]: import numpy as np
In [3]: jobs = np.random.binomial(1, 0.6, (30, 6))
In [4]: true_costs = np.array([10, 20, 5, 53, 31, 42])
In [5]: jobs
Out[5]:
array([[0, 1, 1, 1, 1, 0],
[1, 0, 0, 1, 0, 1],
[1, 1, 0, 1, 0, 0],
[1, 0, 1, 1, 1, 1],
[1, 1, 0, 0, 1, 1],
[0, 1, 0, 0, 1, 0],
[1, 0, 0, 1, 1, 0],
[1, 1, 1, 1, 0, 1],
[1, 0, 0, 1, 0, 1],
[0, 1, 0, 1, 0, 0],
[0, 0, 1, 0, 1, 1],
[1, 0, 1, 1, 1, 1],
[0, 1, 1, 1, 1, 1],
[1, 0, 1, 1, 1, 1],
[0, 1, 1, 0, 1, 0],
[1, 0, 1, 0, 1, 0],
[1, 1, 1, 1, 1, 1],
[1, 0, 1, 0, 0, 1],
[0, 1, 0, 1, 1, 0],
[1, 1, 1, 0, 1, 0],
[1, 1, 1, 1, 1, 0],
[1, 0, 1, 0, 0, 1],
[0, 0, 0, 1, 1, 1],
[1, 1, 0, 1, 1, 1],
[1, 0, 1, 1, 0, 1],
[1, 1, 1, 1, 1, 1],
[1, 0, 1, 1, 1, 1],
[0, 0, 1, 1, 0, 0],
[1, 1, 0, 0, 1, 1],
[1, 1, 1, 1, 0, 0]])
In [6]: tot_job_costs = jobs @ true_costs
In [7]: reg = linear_model.LinearRegression()
In [8]: reg.fit(jobs, tot_job_costs)
Out[8]: LinearRegression()
In [9]: reg.coef_
Out[9]: array([10., 20., 5., 53., 31., 42.])
In [10]: np.random.normal?
In [11]: noise = np.random.normal(0, scale=5, size=30)
In [12]: noisy_costs = tot_job_costs + noise
In [13]: noisy_costs
Out[13]:
array([113.94632664, 103.82109478, 78.73776288, 145.12778089,
104.92931235, 48.14676751, 94.1052639 , 134.64827785,
109.58893129, 67.48897806, 75.70934522, 143.46588308,
143.12160502, 147.71249157, 53.93020167, 44.22848841,
159.64772255, 52.49447057, 102.70555991, 69.08774251,
125.10685342, 45.79436364, 129.81354375, 160.92510393,
108.59837665, 149.1673096 , 135.12600871, 60.55375843,
107.7925208 , 88.16833899])
In [14]: reg.fit(jobs, noisy_costs)
Out[14]: LinearRegression()
In [15]: reg.coef_
Out[15]:
array([12.09045186, 19.0013987 , 3.44981506, 55.21114084, 33.82282467,
40.48642199])
In [16]:
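For the "more thorough" check mentioned above, here is a minimal sketch of computing residuals and R-squared by hand. The helper names `predict` and `r_squared` and the tiny data set are made up for illustration. With scikit-learn, `reg.score(jobs, noisy_costs)` should give the same R-squared quantity for the fitted model.

```python
def predict(rows, coefs, intercept=0.0):
    """Linear prediction: intercept plus the sum of coefficient * indicator."""
    return [intercept + sum(c * x for c, x in zip(coefs, row)) for row in rows]

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical example: 4 jobs, 3 tasks, already-estimated task costs.
rows = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [1, 1, 1]]
coefs = [10.0, 20.0, 5.0]
observed = [16.0, 24.0, 31.0, 34.0]

fitted = predict(rows, coefs)
residuals = [o - f for o, f in zip(observed, fitted)]
r2 = r_squared(observed, fitted)
print(residuals, r2)
```

Residuals that scatter around zero without an obvious pattern are what one hopes to see if the additive model is adequate.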
I ran several benchmarks on a CPU cache simulation to count reuse distances (the distance between two accesses to the same cache entry during execution of the program). Here are two examples of the data I got from two different benchmarks:
Benchmark 1:
"reuse_dist_counts": {
"-1": 340,
"0": 623,
"1": 930,
"100": 1,
"107": 1,
"114": 1,
"121": 1,
"128": 1,
"135": 1,
"142": 1,
"149": 1,
"156": 1,
"163": 1,
"170": 1,
"177": 1,
"184": 1,
"191": 1,
"198": 1,
"2": 617,
"205": 1,
"212": 1,
"219": 1,
"226": 1,
"233": 1,
"240": 1,
"247": 1,
"254": 1,
"261": 1,
"268": 1,
"275": 1,
"282": 1,
"289": 1,
"296": 1,
"3": 617,
"303": 1,
"310": 1,
"311": 1,
"314": 1,
"4": 1,
"48": 1,
"55": 1,
"62": 1,
"69": 1,
"76": 1,
"79": 1,
"86": 1,
"93": 1
}
Benchmark 2:
"reuse_dist_counts": {
"-1": 58,
"0": 128,
"1": 320,
"11": 17,
"12": 1,
"13": 2,
"14": 4,
"15": 18,
"16": 14,
"17": 13,
"18": 13,
"19": 16,
"2": 256,
"20": 16,
"21": 17,
"22": 17,
"23": 2,
"24": 1,
"25": 3,
"26": 3,
"27": 2,
"28": 3,
"29": 2,
"3": 289,
"30": 2,
"31": 2,
"34": 2,
"35": 6,
"38": 2,
"39": 2,
"4": 198,
"40": 4,
"41": 1,
"43": 1,
"44": 2,
"45": 1,
"47": 1,
"48": 2,
"5": 63,
"50": 1,
"6": 81,
"7": 106,
"8": 1
}
The data has the form a:b, where a is a reuse distance and b is how many times that reuse distance was observed during the execution.
As I'm going to design a neural network based on this data, I need to find a good normalization for it. Any ideas on how to normalize data with such low variation, and how to represent it as feature vectors?
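One possible representation, offered only as a sketch and not as an established answer: turn the counts into relative frequencies and group the distances into power-of-two buckets, so every benchmark yields a vector of the same length. The function name `reuse_features`, the bucket layout, and the choice to give the -1 key its own slot are all assumptions made here for illustration.

```python
def reuse_features(counts, n_bins=8):
    """Map {"distance": count} to a fixed-length list of relative frequencies.

    Slot 0 holds the share of the -1 key; slot 1 + k holds distances d with
    2**k - 1 <= d < 2**(k + 1) - 1, and the last slot also absorbs larger d.
    """
    total = sum(counts.values())
    features = [0.0] * (n_bins + 1)
    for key, count in counts.items():
        dist = int(key)
        if dist < 0:
            idx = 0
        else:
            # floor(log2(dist + 1)) computed exactly with integer arithmetic
            idx = 1 + min((dist + 1).bit_length() - 1, n_bins - 1)
        features[idx] += count / total
    return features

# Hypothetical small histogram in the same shape as the data above.
sample = {"-1": 2, "0": 3, "1": 4, "2": 1, "100": 10}
vec = reuse_features(sample)
print(vec)
```

Because the entries are fractions of all accesses, they lie in [0, 1] and sum to 1, regardless of how long the benchmark ran.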
I have text data as follows.
X_train_orignal= np.array(['OC(=O)C1=C(Cl)C=CC=C1Cl', 'OC(=O)C1=C(Cl)C=C(Cl)C=C1Cl',
'OC(=O)C1=CC=CC(=C1Cl)Cl', 'OC(=O)C1=CC(=CC=C1Cl)Cl',
'OC1=C(C=C(C=C1)[N+]([O-])=O)[N+]([O-])=O'])
As is evident, the sequences have different lengths. How can I zero-pad each sequence on both sides to some maximum length, and then convert each sequence into a one-hot encoding based on its characters?
What I tried:
I used the following Keras API, but it doesn't work with sequences of strings.
keras.preprocessing.sequence.pad_sequences(sequences, maxlen=None, dtype='int32', padding='pre', truncating='pre', value=0.0)
I might need to convert my sequence data into one-hot vectors first and then zero-pad it. For that I tried to use Tokenizer as follows.
tk = Tokenizer(nb_words=?, split=?)
But then, what should the split and nb_words values be, given that my sequence data doesn't contain any spaces? How can I use it for character-based one-hot encoding?
My overall goal is to zero-pad my sequences and convert them to one-hot before I feed them into an RNN.
So I came across a way to do it by using Tokenizer first and then pad_sequences to zero-pad my sequences at the start, as follows.
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(X_train_orignal)
sequence_of_int = tokenizer.texts_to_sequences(X_train_orignal)
This gives me the output as follows.
[[3, 1, 4, 2, 3, 5, 1, 6, 2, 1, 4, 1, 7, 5, 1, 2, 1, 1, 2, 1, 6, 1, 7],
[3,
1,
4,
2,
3,
5,
1,
6,
2,
1,
4,
1,
7,
5,
1,
2,
1,
4,
1,
7,
5,
1,
2,
1,
6,
1,
7],
[3, 1, 4, 2, 3, 5, 1, 6, 2, 1, 1, 2, 1, 1, 4, 2, 1, 6, 1, 7, 5, 1, 7],
[3, 1, 4, 2, 3, 5, 1, 6, 2, 1, 1, 4, 2, 1, 1, 2, 1, 6, 1, 7, 5, 1, 7],
[3,
1,
6,
2,
1,
4,
1,
2,
1,
4,
1,
2,
1,
6,
5,
8,
10,
11,
9,
4,
8,
3,
12,
9,
5,
2,
3,
5,
8,
10,
11,
9,
4,
8,
3,
12,
9,
5,
2,
3]]
Now I do not understand why sequence_of_int[1] and sequence_of_int[4] are printed in column format.
After getting the tokens, I applied the pad_sequences as follows.
seq=keras.preprocessing.sequence.pad_sequences(sequence_of_int, maxlen=None, dtype='int32', padding='pre', value=0.0)
and it gives me the output as follows.
array([[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 3, 1, 4, 2, 3, 5, 1, 6, 2, 1, 4, 1, 7, 5, 1,
2, 1, 1, 2, 1, 6, 1, 7],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 1, 4,
2, 3, 5, 1, 6, 2, 1, 4, 1, 7, 5, 1, 2, 1, 4, 1,
7, 5, 1, 2, 1, 6, 1, 7],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 3, 1, 4, 2, 3, 5, 1, 6, 2, 1, 1, 2, 1, 1, 4,
2, 1, 6, 1, 7, 5, 1, 7],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 3, 1, 4, 2, 3, 5, 1, 6, 2, 1, 1, 4, 2, 1, 1,
2, 1, 6, 1, 7, 5, 1, 7],
[ 3, 1, 6, 2, 1, 4, 1, 2, 1, 4, 1, 2, 1, 6, 5, 8,
10, 11, 9, 4, 8, 3, 12, 9, 5, 2, 3, 5, 8, 10, 11, 9,
4, 8, 3, 12, 9, 5, 2, 3]], dtype=int32)
Then after that, I converted it into one hot as follows.
one_hot=keras.utils.to_categorical(seq)
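To make the three steps above (character indices, zero padding, one-hot) easier to inspect, here is a dependency-free sketch of the same idea. The helper `one_hot_pad` is hypothetical and is not part of Keras. Unlike Tokenizer, which as far as I know lowercases text by default (lower=True), this sketch keeps upper and lower case distinct. As with to_categorical, a padding position becomes a vector with a 1 in slot 0 rather than an all-zero vector.

```python
def one_hot_pad(strings, pad="pre"):
    """Character-level one-hot encoding with zero padding to the longest string.

    pad may be "pre", "post" or "both". Index 0 is reserved for padding.
    """
    chars = sorted({ch for s in strings for ch in s})
    index = {ch: i + 1 for i, ch in enumerate(chars)}
    max_len = max(len(s) for s in strings)
    depth = len(chars) + 1
    encoded = []
    for s in strings:
        ids = [index[ch] for ch in s]
        missing = max_len - len(ids)
        if pad == "pre":
            left = missing
        elif pad == "post":
            left = 0
        else:  # "both": split the padding between the two sides
            left = missing // 2
        ids = [0] * left + ids + [0] * (missing - left)
        encoded.append([[1 if k == i else 0 for k in range(depth)] for i in ids])
    return index, encoded

smiles = ['OC(=O)C1=C(Cl)C=CC=C1Cl', 'OC(=O)C1=C(Cl)C=C(Cl)C=C1Cl',
          'OC(=O)C1=CC=CC(=C1Cl)Cl', 'OC(=O)C1=CC(=CC=C1Cl)Cl',
          'OC1=C(C=C(C=C1)[N+]([O-])=O)[N+]([O-])=O']
index, encoded = one_hot_pad(smiles, pad="both")
print(len(encoded), len(encoded[0]), len(encoded[0][0]))
```

The result is a nested list shaped (samples, timesteps, symbols), which is the layout an RNN layer usually expects after conversion to an array.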
I want to take a list of non-negative integers D=[d1,...,dm] and generate a multidimensional array A of indexed symbols of the form a_[i_1,...,i_m],
where 0<=i_j<=d_j. For example if D=[2,3] then A should be
[[a_[0,0],a_[0,1],a_[0,2]],
[a_[1,0],a_[1,1],a_[1,2]]]
For this case I could nest two for loops to generate the said array, however D does not necessarily have a length of 2 and I don't know how to nest an arbitrary number of for loops!
I would appreciate if you could help me know how I can generate A from D.
P.S. What I want to finally achieve is to create a multivariate polynomial as explained here.
Here's one way to do it. The essential part is that I called cartesian_product to construct the list of all combinations of indices, and then arrayapply to create the subscripted expressions.
(%i11) ii:setify(makelist(i, i, 0, n)), n=2;
(%o11) {0, 1, 2}
(%i12) apply (cartesian_product, makelist (ii, m)), m=3;
(%o12) {[0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 1, 0], [0, 1, 1],
[0, 1, 2], [0, 2, 0], [0, 2, 1], [0, 2, 2], [1, 0, 0],
[1, 0, 1], [1, 0, 2], [1, 1, 0], [1, 1, 1], [1, 1, 2],
[1, 2, 0], [1, 2, 1], [1, 2, 2], [2, 0, 0], [2, 0, 1],
[2, 0, 2], [2, 1, 0], [2, 1, 1], [2, 1, 2], [2, 2, 0],
[2, 2, 1], [2, 2, 2]}
(%i13) map (lambda ([l], arrayapply (_a, l)), %);
(%o13) {_a[0,0,0], _a[0,0,1], _a[0,0,2], _a[0,1,0], _a[0,1,1],
        _a[0,1,2], _a[0,2,0], _a[0,2,1], _a[0,2,2], _a[1,0,0],
        _a[1,0,1], _a[1,0,2], _a[1,1,0], _a[1,1,1], _a[1,1,2],
        _a[1,2,0], _a[1,2,1], _a[1,2,2], _a[2,0,0], _a[2,0,1],
        _a[2,0,2], _a[2,1,0], _a[2,1,1], _a[2,1,2], _a[2,2,0],
        _a[2,2,1], _a[2,2,2]}
(%i14) grind (%);
{_a[0,0,0],_a[0,0,1],_a[0,0,2],_a[0,1,0],_a[0,1,1],_a[0,1,2],
_a[0,2,0],_a[0,2,1],_a[0,2,2],_a[1,0,0],_a[1,0,1],_a[1,0,2],
_a[1,1,0],_a[1,1,1],_a[1,1,2],_a[1,2,0],_a[1,2,1],_a[1,2,2],
_a[2,0,0],_a[2,0,1],_a[2,0,2],_a[2,1,0],_a[2,1,1],_a[2,1,2],
_a[2,2,0],_a[2,2,1],_a[2,2,2]}$
(%o14) done
This is just working at the top-level interactive prompt; if you need to construct a function, I think you'll see how to do it.
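For readers more at home in Python, the same Cartesian-product idea can be sketched with itertools.product. This is only an illustration of the technique, not a translation of the Maxima session. The function names are made up, and the index ranges follow the D=[2,3] example from the question (each index runs from 0 to d_j - 1); use range(d + 1) if d_j itself should be included.

```python
from itertools import product

def index_tuples(dims):
    """All index combinations, one range per entry of dims (any length)."""
    return list(product(*(range(d) for d in dims)))

def nested_symbols(dims, name="a", prefix=()):
    """Nested list of symbol names, with one nesting level per entry of dims."""
    if not dims:
        return f"{name}[{','.join(map(str, prefix))}]"
    return [nested_symbols(dims[1:], name, prefix + (i,)) for i in range(dims[0])]

print(index_tuples([2, 3]))
print(nested_symbols([2, 3]))
```

Recursion replaces the "arbitrary number of nested for loops": each call handles one dimension and delegates the rest.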
EDIT: Here's a way to create the polynomial.
(%i16) S : {[0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 1, 0], [0, 1, 1],
[0, 1, 2], [0, 2, 0], [0, 2, 1], [0, 2, 2], [1, 0, 0],
[1, 0, 1], [1, 0, 2], [1, 1, 0], [1, 1, 1], [1, 1, 2],
[1, 2, 0], [1, 2, 1], [1, 2, 2], [2, 0, 0], [2, 0, 1],
[2, 0, 2], [2, 1, 0], [2, 1, 1], [2, 1, 2], [2, 2, 0],
[2, 2, 1], [2, 2, 2]} $
(%i17) L : listify (S) $
(%i18) A : map (lambda ([l], arrayapply (_a, l)), L);
(%o18) [_a[0,0,0], _a[0,0,1], _a[0,0,2], _a[0,1,0], _a[0,1,1],
        _a[0,1,2], _a[0,2,0], _a[0,2,1], _a[0,2,2], _a[1,0,0],
        _a[1,0,1], _a[1,0,2], _a[1,1,0], _a[1,1,1], _a[1,1,2],
        _a[1,2,0], _a[1,2,1], _a[1,2,2], _a[2,0,0], _a[2,0,1],
        _a[2,0,2], _a[2,1,0], _a[2,1,1], _a[2,1,2], _a[2,2,0],
        _a[2,2,1], _a[2,2,2]]
(%i19) U : map (lambda ([l], product (u[i]^l[i], i, 1, length(l))), L);
(%o19) [1, u[3], u[3]^2, u[2], u[2]*u[3], u[2]*u[3]^2, u[2]^2,
        u[2]^2*u[3], u[2]^2*u[3]^2, u[1], u[1]*u[3], u[1]*u[3]^2,
        u[1]*u[2], u[1]*u[2]*u[3], u[1]*u[2]*u[3]^2, u[1]*u[2]^2,
        u[1]*u[2]^2*u[3], u[1]*u[2]^2*u[3]^2, u[1]^2, u[1]^2*u[3],
        u[1]^2*u[3]^2, u[1]^2*u[2], u[1]^2*u[2]*u[3],
        u[1]^2*u[2]*u[3]^2, u[1]^2*u[2]^2, u[1]^2*u[2]^2*u[3],
        u[1]^2*u[2]^2*u[3]^2]
(%i20) A.U;
(%o20) u[1]^2*u[2]^2*_a[2,2,2]*u[3]^2 + u[1]^2*u[2]*_a[2,1,2]*u[3]^2
 + u[1]^2*_a[2,0,2]*u[3]^2 + u[1]*_a[1,2,2]*u[2]^2*u[3]^2
 + _a[0,2,2]*u[2]^2*u[3]^2 + u[1]*_a[1,1,2]*u[2]*u[3]^2
 + _a[0,1,2]*u[2]*u[3]^2 + u[1]*_a[1,0,2]*u[3]^2 + _a[0,0,2]*u[3]^2
 + u[1]^2*u[2]^2*_a[2,2,1]*u[3] + u[1]^2*u[2]*_a[2,1,1]*u[3]
 + u[1]^2*_a[2,0,1]*u[3] + u[1]*_a[1,2,1]*u[2]^2*u[3]
 + _a[0,2,1]*u[2]^2*u[3] + u[1]*_a[1,1,1]*u[2]*u[3]
 + _a[0,1,1]*u[2]*u[3] + u[1]*_a[1,0,1]*u[3] + _a[0,0,1]*u[3]
 + u[1]^2*u[2]^2*_a[2,2,0] + u[1]^2*u[2]*_a[2,1,0] + u[1]^2*_a[2,0,0]
 + u[1]*_a[1,2,0]*u[2]^2 + _a[0,2,0]*u[2]^2 + u[1]*_a[1,1,0]*u[2]
 + _a[0,1,0]*u[2] + u[1]*_a[1,0,0] + _a[0,0,0]
Note that the ordering of terms within each product doesn't conform to what humans would consider the usual convention, e.g. u[1]^2*u[2]^2*_a[2,2,2]*u[3]^2 is the first term. Maxima is ordering the terms according to the subscripts, therefore _a[2,2,2] comes after u[1] and before u[3]. In some contexts this coincides with what humans expect, but here it doesn't; in any event, Maxima is consistent in hope of making programmatic manipulation work better.
(%i21) grind (%);
u[1]^2*u[2]^2*_a[2,2,2]*u[3]^2+u[1]^2*u[2]*_a[2,1,2]*u[3]^2
+u[1]^2*_a[2,0,2]*u[3]^2
+u[1]*_a[1,2,2]*u[2]^2*u[3]^2
+_a[0,2,2]*u[2]^2*u[3]^2
+u[1]*_a[1,1,2]*u[2]*u[3]^2
+_a[0,1,2]*u[2]*u[3]^2
+u[1]*_a[1,0,2]*u[3]^2
+_a[0,0,2]*u[3]^2
+u[1]^2*u[2]^2*_a[2,2,1]*u[3]
+u[1]^2*u[2]*_a[2,1,1]*u[3]
+u[1]^2*_a[2,0,1]*u[3]
+u[1]*_a[1,2,1]*u[2]^2*u[3]
+_a[0,2,1]*u[2]^2*u[3]
+u[1]*_a[1,1,1]*u[2]*u[3]
+_a[0,1,1]*u[2]*u[3]
+u[1]*_a[1,0,1]*u[3]
+_a[0,0,1]*u[3]
+u[1]^2*u[2]^2*_a[2,2,0]
+u[1]^2*u[2]*_a[2,1,0]
+u[1]^2*_a[2,0,0]
+u[1]*_a[1,2,0]*u[2]^2
+_a[0,2,0]*u[2]^2
+u[1]*_a[1,1,0]*u[2]
+_a[0,1,0]*u[2]+u[1]*_a[1,0,0]
+_a[0,0,0]$
(%o21) done
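As a numerical counterpart to the symbolic polynomial above, the sketch below stores the coefficients in a dictionary keyed by exponent tuples and evaluates the sum of a[i1,i2,i3] * u1^i1 * u2^i2 * u3^i3. The name `eval_polynomial` and the all-ones coefficients are assumptions made for illustration only.

```python
from itertools import product
from math import prod

def eval_polynomial(coeffs, u):
    """Evaluate sum over exponent tuples e of coeffs[e] * prod(u[j] ** e[j])."""
    return sum(c * prod(x ** e for x, e in zip(u, exps))
               for exps, c in coeffs.items())

# Hypothetical coefficients: every a[i1,i2,i3] set to 1, exponents 0..2.
coeffs = {exps: 1 for exps in product(range(3), repeat=3)}
value = eval_polynomial(coeffs, (2, 3, 5))
print(value)
```

With all coefficients equal to 1 the polynomial factors as (1 + u1 + u1^2)(1 + u2 + u2^2)(1 + u3 + u3^2), which gives an easy sanity check for the evaluation.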
I was trying to get the nullity and kernel of a matrix over the complex field in Maxima.
I get strange results, though.
I can define a matrix A:
M : matrix([0, 1, 1, 0], [-1, 0, 0, 1], [0, 0, 0, 1], [0, 0, -1, 0]);
A : M + %i * ident(4);
... for reference, it looks like this:
%i 1 1 0
-1 %i 0 1
0 0 %i 1
0 0 -1 %i
If I then compute the nullity with nullity(A), I get 3.
If I compute the rank with rank(A), I also get 3.
And if I compute the nullspace with nullspace(A), I get:
span([-1, %i, 0, 0], [-%i, -1, 0, 0], [2%i, 2, 0, 0])
But this is pretty weird, because -%i * second(...) is [-1, %i, 0, 0], which is the first vector.
And indeed, when I do NullSpace[{{i, 1, 1, 0}, {-1, i, 0, 1}, {0, 0, i, 1}, {0, 0, -1, i}}] in Mathematica, I get that the nullspace has basis [%i, 1, 0, 0] and is 1-dimensional (not 3-dimensional).
What am I doing wrong?
You are doing everything right, as far as I can tell. The problem is a bug in Maxima, which I have reported: https://sourceforge.net/p/maxima/bugs/3158/
I don't see any simple way to work around it. I am working on fixing the bug.
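Independently of Maxima, the expected values can be checked with a few lines of plain Python. By the rank-nullity theorem, rank plus nullity must equal the number of columns, so rank 3 and nullity 3 cannot both hold for a 4x4 matrix. The `rank` helper below is a small Gaussian-elimination sketch written for this check, not a library routine.

```python
def rank(matrix, tol=1e-12):
    """Rank of a complex matrix via Gaussian elimination with partial pivoting."""
    rows = [[complex(x) for x in r] for r in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    found = 0  # number of pivots found so far
    for col in range(n_cols):
        if found == n_rows:
            break
        # choose the remaining row with the largest entry in this column
        pivot = max(range(found, n_rows), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) < tol:
            continue  # no pivot in this column
        rows[found], rows[pivot] = rows[pivot], rows[found]
        for r in range(found + 1, n_rows):
            factor = rows[r][col] / rows[found][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[found])]
        found += 1
    return found

M = [[0, 1, 1, 0], [-1, 0, 0, 1], [0, 0, 0, 1], [0, 0, -1, 0]]
# A = M + %i * ident(4)
A = [[M[i][j] + (1j if i == j else 0) for j in range(4)] for i in range(4)]

v = [1j, 1, 0, 0]  # the basis vector reported by Mathematica
Av = [sum(a * x for a, x in zip(row, v)) for row in A]
nullity = len(A[0]) - rank(A)
print(rank(A), nullity, Av)
```

This agrees with the Mathematica result quoted in the question: the rank is 3, the nullity is 1, and [%i, 1, 0, 0] lies in the null space.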