Normalizing data that has low variation - machine-learning

I ran several benchmarks on a CPU cache simulation to count reuse distances (the distance between two accesses of the same cache entry during execution of the program). Here are two examples of the data I got from two different benchmarks:
Benchmark 1:
"reuse_dist_counts": {
"-1": 340,
"0": 623,
"1": 930,
"100": 1,
"107": 1,
"114": 1,
"121": 1,
"128": 1,
"135": 1,
"142": 1,
"149": 1,
"156": 1,
"163": 1,
"170": 1,
"177": 1,
"184": 1,
"191": 1,
"198": 1,
"2": 617,
"205": 1,
"212": 1,
"219": 1,
"226": 1,
"233": 1,
"240": 1,
"247": 1,
"254": 1,
"261": 1,
"268": 1,
"275": 1,
"282": 1,
"289": 1,
"296": 1,
"3": 617,
"303": 1,
"310": 1,
"311": 1,
"314": 1,
"4": 1,
"48": 1,
"55": 1,
"62": 1,
"69": 1,
"76": 1,
"79": 1,
"86": 1,
"93": 1
}
Benchmark 2:
"reuse_dist_counts": {
"-1": 58,
"0": 128,
"1": 320,
"11": 17,
"12": 1,
"13": 2,
"14": 4,
"15": 18,
"16": 14,
"17": 13,
"18": 13,
"19": 16,
"2": 256,
"20": 16,
"21": 17,
"22": 17,
"23": 2,
"24": 1,
"25": 3,
"26": 3,
"27": 2,
"28": 3,
"29": 2,
"3": 289,
"30": 2,
"31": 2,
"34": 2,
"35": 6,
"38": 2,
"39": 2,
"4": 198,
"40": 4,
"41": 1,
"43": 1,
"44": 2,
"45": 1,
"47": 1,
"48": 2,
"5": 63,
"50": 1,
"6": 81,
"7": 106,
"8": 1
}
The data has the form a: b, where a is a reuse distance and b is how many times that reuse distance was observed during execution.
As I'm going to design a neural network based on this data, I need to find a good normalization to represent it in a different way. Any ideas on how to normalize data with such low variation, and how to represent its feature vectors?
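One option, sketched below under the assumption that the order of magnitude of a reuse distance matters more than its exact value: collapse each {distance: count} histogram into a fixed-length vector of logarithmic distance buckets, then normalize it into a probability distribution so benchmarks with different total access counts become comparable. The log buckets also absorb the long sparse tail of distances (48, 55, ..., 314 above) that each occur only once. The bucketing scheme and the `featurize` name are illustrative choices, not a standard API.

```python
import math

def featurize(reuse_dist_counts, n_buckets=16):
    """Collapse a {reuse_distance: count} histogram into a fixed-length,
    normalized feature vector using logarithmic distance buckets.

    Bucket 0 is reserved for distance -1 (first access / cold miss);
    distance d >= 0 goes to bucket 1 + floor(log2(d + 1)), capped at n_buckets.
    """
    vec = [0.0] * (n_buckets + 1)
    for dist_str, count in reuse_dist_counts.items():
        d = int(dist_str)
        if d < 0:
            b = 0
        else:
            b = min(1 + int(math.log2(d + 1)), n_buckets)
        vec[b] += count
    total = sum(vec)
    # Normalize to a probability distribution so vectors from benchmarks
    # of different lengths are directly comparable.
    return [v / total for v in vec] if total else vec
```

Each benchmark then maps to a vector of the same length regardless of how many distinct distances it produced, which is what a fixed-width network input needs.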

Related

How to predict Total Hours needed with List as Input?

I am struggling with the following problem:
I have a dataset of different products (cars) that have certain work orders open at a given time. I know from historical data how much time this work has caused in TOTAL.
Now I want to predict it for another car (e.g. Car 3).
Which type of algorithm (regression?) should I use for this?
My idea was to transform this row-based dataset into a column-based one with binary values, e.g. Brake: 0/1, Screen: 0/1. But then I will have a lot of inputs, as the number of possible inputs is 100-200.
Here's a quick idea using multi-factor regression for 30 jobs, each of which is some random accumulation of 6 tasks with a "true cost" for each task. We can regress against the task selections in each job to estimate the cost coefficients that best explain the total job costs.
This is first done with no "noise" in the system (task costs are exact), then with some random noise added.
A more thorough job would include examining the R-squared value and plotting the residuals to ensure linearity.
In [1]: from sklearn import linear_model
In [2]: import numpy as np
In [3]: jobs = np.random.binomial(1, 0.6, (30, 6))
In [4]: true_costs = np.array([10, 20, 5, 53, 31, 42])
In [5]: jobs
Out[5]:
array([[0, 1, 1, 1, 1, 0],
[1, 0, 0, 1, 0, 1],
[1, 1, 0, 1, 0, 0],
[1, 0, 1, 1, 1, 1],
[1, 1, 0, 0, 1, 1],
[0, 1, 0, 0, 1, 0],
[1, 0, 0, 1, 1, 0],
[1, 1, 1, 1, 0, 1],
[1, 0, 0, 1, 0, 1],
[0, 1, 0, 1, 0, 0],
[0, 0, 1, 0, 1, 1],
[1, 0, 1, 1, 1, 1],
[0, 1, 1, 1, 1, 1],
[1, 0, 1, 1, 1, 1],
[0, 1, 1, 0, 1, 0],
[1, 0, 1, 0, 1, 0],
[1, 1, 1, 1, 1, 1],
[1, 0, 1, 0, 0, 1],
[0, 1, 0, 1, 1, 0],
[1, 1, 1, 0, 1, 0],
[1, 1, 1, 1, 1, 0],
[1, 0, 1, 0, 0, 1],
[0, 0, 0, 1, 1, 1],
[1, 1, 0, 1, 1, 1],
[1, 0, 1, 1, 0, 1],
[1, 1, 1, 1, 1, 1],
[1, 0, 1, 1, 1, 1],
[0, 0, 1, 1, 0, 0],
[1, 1, 0, 0, 1, 1],
[1, 1, 1, 1, 0, 0]])
In [6]: tot_job_costs = jobs @ true_costs
In [7]: reg = linear_model.LinearRegression()
In [8]: reg.fit(jobs, tot_job_costs)
Out[8]: LinearRegression()
In [9]: reg.coef_
Out[9]: array([10., 20., 5., 53., 31., 42.])
In [10]: np.random.normal?
In [11]: noise = np.random.normal(0, scale=5, size=30)
In [12]: noisy_costs = tot_job_costs + noise
In [13]: noisy_costs
Out[13]:
array([113.94632664, 103.82109478, 78.73776288, 145.12778089,
104.92931235, 48.14676751, 94.1052639 , 134.64827785,
109.58893129, 67.48897806, 75.70934522, 143.46588308,
143.12160502, 147.71249157, 53.93020167, 44.22848841,
159.64772255, 52.49447057, 102.70555991, 69.08774251,
125.10685342, 45.79436364, 129.81354375, 160.92510393,
108.59837665, 149.1673096 , 135.12600871, 60.55375843,
107.7925208 , 88.16833899])
In [14]: reg.fit(jobs, noisy_costs)
Out[14]: LinearRegression()
In [15]: reg.coef_
Out[15]:
array([12.09045186, 19.0013987 , 3.44981506, 55.21114084, 33.82282467,
40.48642199])
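The "more thorough job" mentioned earlier (checking R-squared and inspecting the residuals) might be sketched like this. It regenerates the data with a fixed seed rather than reusing the session above, so the exact numbers will differ:

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)
jobs = rng.binomial(1, 0.6, (30, 6))            # 30 jobs, 6 binary task flags
true_costs = np.array([10, 20, 5, 53, 31, 42])  # true per-task costs
noisy_costs = jobs @ true_costs + rng.normal(0, 5, size=30)

reg = linear_model.LinearRegression()
reg.fit(jobs, noisy_costs)

r_squared = reg.score(jobs, noisy_costs)        # fraction of variance explained
residuals = noisy_costs - reg.predict(jobs)     # should look like zero-mean noise
print(round(r_squared, 3))
```

If the residuals show structure when plotted against the fitted values (a curve, a funnel shape), the linear model is missing something; for this synthetic data they are just the injected Gaussian noise.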

Decode JSON with multiple keys using Codable

I have stored several JSON files in iCloud as a Bytes type. I hope that's correct so far.
I need to fetch those CKRecords, parse them, and show a graph using the values stored in the JSON. I am able to fetch the data from iCloud, but I've had no luck parsing this JSON.
{"00:00": 17, "00:10": 16, "00:20": 17, "00:30": 16, "00:40": 16, "00:50": 17, "01:00": 16, "01:10": 16, "01:20": 16, "01:30": 16, "01:40": 17, "01:50": 17, "02:00": 18, "02:10": 18, "02:20": 17, "02:30": 17, "02:40": 17, "02:50": 17, "03:00": 16, "03:10": 17, "03:20": 16, "03:30": 16, "03:40": 17, "03:50": 16, "04:00": 16, "04:10": 16, "04:20": 16, "04:30": 16, "04:40": 16, "04:50": 18, "05:00": 17, "05:10": 16, "05:20": 17, "05:30": 17, "05:40": 17, "05:50": 17, "06:00": 17, "06:10": 17, "06:20": 16, "06:30": 16, "06:40": 17, "06:50": 15, "07:00": 15, "07:10": 15, "07:20": 14, "07:30": 15, "07:40": 13, "07:50": 11, "08:00": 8, "08:10": 8, "08:20": 7, "08:30": 5, "08:40": 4, "08:50": 4, "09:00": 2, "09:10": 2, "09:20": 9, "09:30": 8, "09:40": 7, "09:50": 7, "10:00": 5, "10:10": 4, "10:20": 6, "10:30": 5, "10:40": 4, "10:50": 4, "11:00": 3, "11:10": 2, "11:20": 2, "11:30": 1, "11:40": 2, "11:50": 1, "12:00": 1, "12:10": 2, "12:20": 3, "12:30": 3, "12:40": 2, "12:50": 3, "13:00": 1, "13:10": 0, "13:20": 1, "13:30": 0, "13:40": 3, "13:50": 2, "14:00": 3, "14:10": 4, "14:20": 3, "14:30": 6, "14:40": 4, "14:50": 6, "15:00": 5, "15:10": 7, "15:20": 7, "15:30": 7, "15:40": 7, "15:50": 6, "16:00": 7, "16:10": 8, "16:20": 8, "16:30": 6, "16:40": 5, "16:50": 5, "17:00": 5, "17:10": 4, "17:20": 4, "17:30": 6, "17:40": 6, "17:50": 5, "18:00": 6, "18:10": 7, "18:20": 4, "18:30": 3, "18:40": 3, "18:50": 5, "19:00": 5, "19:10": 3, "19:20": 4, "19:30": 4, "19:40": 2, "19:50": 4, "20:00": 1, "20:10": 2, "20:20": 2, "20:30": 1, "20:40": 3, "20:50": 2, "21:00": 4, "21:10": 4, "21:20": 7, "21:30": 7, "21:40": 6, "21:50": 7, "22:00": 8, "22:10": 7, "22:20": 8, "22:30": 9, "22:40": 10, "22:50": 10, "23:00": 10, "23:10": 9, "23:20": 9, "23:30": 10, "23:40": 10, "23:50": 10}
I previously used Codable to parse JSON in Swift, but this case is different: I've got a lot of keys that are time strings, and I don't really know how to approach this.
Since every key maps to an Int, you can decode straight into a dictionary:
let res = try? JSONDecoder().decode([String: Int].self, from: data)

How to zero pad on both sides and encode the sequence into one hot in keras?

I have text data as follows.
X_train_orignal= np.array(['OC(=O)C1=C(Cl)C=CC=C1Cl', 'OC(=O)C1=C(Cl)C=C(Cl)C=C1Cl',
'OC(=O)C1=CC=CC(=C1Cl)Cl', 'OC(=O)C1=CC(=CC=C1Cl)Cl',
'OC1=C(C=C(C=C1)[N+]([O-])=O)[N+]([O-])=O'])
As is evident, different sequences have different lengths. How can I zero-pad the sequences on both sides to some maximum length, and then convert each sequence into a one-hot encoding based on its characters?
I used the following Keras API, but it doesn't work with string sequences:
keras.preprocessing.sequence.pad_sequences(sequences, maxlen=None, dtype='int32', padding='pre', truncating='pre', value=0.0)
I might need to convert my sequence data into one-hot vectors first and then zero-pad it. For that I tried to use Tokenizer as follows:
tk = Tokenizer(nb_words=?, split=?)
But then, what should the split and nb_words values be, as my sequence data doesn't have any spaces? How do I use it for character-based one-hot encoding?
My overall goal is to zero-pad my sequences and convert them to one-hot before I feed them into an RNN.
So I came across a way to do this: use Tokenizer first and then pad_sequences to zero-pad my sequences at the start, as follows.
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(X_train_orignal)
sequence_of_int = tokenizer.texts_to_sequences(X_train_orignal)
This gives me the output as follows.
[[3, 1, 4, 2, 3, 5, 1, 6, 2, 1, 4, 1, 7, 5, 1, 2, 1, 1, 2, 1, 6, 1, 7],
[3,
1,
4,
2,
3,
5,
1,
6,
2,
1,
4,
1,
7,
5,
1,
2,
1,
4,
1,
7,
5,
1,
2,
1,
6,
1,
7],
[3, 1, 4, 2, 3, 5, 1, 6, 2, 1, 1, 2, 1, 1, 4, 2, 1, 6, 1, 7, 5, 1, 7],
[3, 1, 4, 2, 3, 5, 1, 6, 2, 1, 1, 4, 2, 1, 1, 2, 1, 6, 1, 7, 5, 1, 7],
[3,
1,
6,
2,
1,
4,
1,
2,
1,
4,
1,
2,
1,
6,
5,
8,
10,
11,
9,
4,
8,
3,
12,
9,
5,
2,
3,
5,
8,
10,
11,
9,
4,
8,
3,
12,
9,
5,
2,
3]]
Now I do not understand: why is it printing sequence_of_int[1] and sequence_of_int[4] in column format?
After getting the tokens, I applied the pad_sequences as follows.
seq=keras.preprocessing.sequence.pad_sequences(sequence_of_int, maxlen=None, dtype='int32', padding='pre', value=0.0)
and it gives me the output as follows.
array([[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 3, 1, 4, 2, 3, 5, 1, 6, 2, 1, 4, 1, 7, 5, 1,
2, 1, 1, 2, 1, 6, 1, 7],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 1, 4,
2, 3, 5, 1, 6, 2, 1, 4, 1, 7, 5, 1, 2, 1, 4, 1,
7, 5, 1, 2, 1, 6, 1, 7],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 3, 1, 4, 2, 3, 5, 1, 6, 2, 1, 1, 2, 1, 1, 4,
2, 1, 6, 1, 7, 5, 1, 7],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 3, 1, 4, 2, 3, 5, 1, 6, 2, 1, 1, 4, 2, 1, 1,
2, 1, 6, 1, 7, 5, 1, 7],
[ 3, 1, 6, 2, 1, 4, 1, 2, 1, 4, 1, 2, 1, 6, 5, 8,
10, 11, 9, 4, 8, 3, 12, 9, 5, 2, 3, 5, 8, 10, 11, 9,
4, 8, 3, 12, 9, 5, 2, 3]], dtype=int32)
Then, after that, I converted it into one-hot as follows:
one_hot = keras.utils.to_categorical(seq)
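Note that pad_sequences pads only one side at a time (padding='pre' or 'post'), so for genuinely symmetric padding on both sides you may need to do it yourself. Below is a NumPy-only sketch of that step plus the one-hot conversion; the helper name and the index-0-is-padding convention are my own assumptions, not Keras APIs:

```python
import numpy as np

def pad_both_sides_one_hot(seqs, vocab_size, maxlen=None):
    """Zero-pad integer sequences equally on both sides, then one-hot encode.
    Index 0 is reserved for padding; real tokens are 1..vocab_size."""
    if maxlen is None:
        maxlen = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        left = (maxlen - len(s)) // 2          # extra zero goes on the right
        right = maxlen - len(s) - left
        padded.append([0] * left + list(s) + [0] * right)
    arr = np.array(padded)
    # Indexing an identity matrix by token id yields the one-hot rows,
    # giving shape (n_sequences, maxlen, vocab_size + 1).
    return np.eye(vocab_size + 1, dtype=np.int32)[arr]
```

For the session above, vocab_size would be 12 (the largest token id the tokenizer produced), and the result can be fed to an RNN directly.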

Disk structuring element in opencv

I know a disk structuring element can be created in MATLAB as following:
se=strel('disk',4);
0 0 1 1 1 0 0
0 1 1 1 1 1 0
1 1 1 1 1 1 1
1 1 1 1 1 1 1
1 1 1 1 1 1 1
0 1 1 1 1 1 0
0 0 1 1 1 0 0
Is there any function or method, or any other way, of creating the same structuring element as above in OpenCV? I know we can create it manually using loops, but I just want to know if a function exists for that.
The closest one (not the exact same) you can get in OpenCV is by calling getStructuringElement():
int sz = 4;
cv::Mat se = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(2*sz-1, 2*sz-1));
which gives a matrix with the values:
[0, 0, 0, 1, 0, 0, 0;
0, 1, 1, 1, 1, 1, 0;
1, 1, 1, 1, 1, 1, 1;
1, 1, 1, 1, 1, 1, 1;
1, 1, 1, 1, 1, 1, 1;
0, 1, 1, 1, 1, 1, 0;
0, 0, 0, 1, 0, 0, 0]
Or try this, which builds the disk manually with NumPy:
import numpy as np

def estructurant(radius):
    kernel = np.zeros((2*radius+1, 2*radius+1), np.uint8)
    y, x = np.ogrid[-radius:radius+1, -radius:radius+1]
    mask = x**2 + y**2 <= radius**2
    kernel[mask] = 1
    # Flatten the top, bottom, left and right edges of the disk
    kernel[0, radius-1:kernel.shape[1]-radius+1] = 1
    kernel[-1, radius-1:kernel.shape[1]-radius+1] = 1
    kernel[radius-1:kernel.shape[0]-radius+1, 0] = 1
    kernel[radius-1:kernel.shape[0]-radius+1, -1] = 1
    return kernel
You could also use skimage.morphology.disk, which produces a symmetric result (unlike cv2.getStructuringElement):
>>> disk(4)
array([[0, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 1, 1, 0, 0],
[0, 1, 1, 1, 1, 1, 1, 1, 0],
[0, 1, 1, 1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1],
[0, 1, 1, 1, 1, 1, 1, 1, 0],
[0, 1, 1, 1, 1, 1, 1, 1, 0],
[0, 0, 1, 1, 1, 1, 1, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0]], dtype=uint8)
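If you'd rather not add skimage as a dependency just for this, the same symmetric disk can be built in a few lines of NumPy. This follows the standard x² + y² ≤ r² recipe, so disk(4) reproduces the array above:

```python
import numpy as np

def disk(radius):
    """Symmetric disk structuring element: a pixel is set when its centre
    lies within `radius` of the central pixel (same recipe as skimage)."""
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    return (x**2 + y**2 <= radius**2).astype(np.uint8)
```

The result is symmetric under transposition and flips, unlike the output of cv2.getStructuringElement with MORPH_ELLIPSE, and can be passed directly as the kernel argument of cv2.erode, cv2.dilate, or cv2.morphologyEx.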

Sort by doesn't sort when includes is used in rails

I've got a distance attribute in my User model :
attr_accessor :distance
So when I calculate the distance for each user and store it in distance, I can sort them like:
users.sort_by!(&:distance)
And the users get sorted according to the distance appropriately. But when I include other associated methods i.e :
users.includes(:photo).sort_by!(&:distance)
This doesn't sort the users at all, why is this? How can I sort it with distance but include association as well?
The ! in the sort_by! method indicates that the receiver itself is modified in place rather than a new object being returned.
When you call users.includes(:photo), that method returns a different object. So what you are actually doing is equivalent to:
users2 = users.includes(:photo)
users2.sort_by!(&:distance)
This is why the users object is not sorted after you call sort_by!. A better way to do it might be
users = users.includes(:photo).sort_by(&:distance)
Well, it does for me, but I use User (the model), not users:
User.includes(:photo).sort_by!(&:distance)
What does the users variable hold, anyway? Try User.
Edited with my example; here I use Enquiry in place of User and score in place of distance.
1.9.3p385 :059 > Enquiry.all.sort_by!(&:score).map &:score
Enquiry Load (0.7ms) SELECT `enquiries`.* FROM `enquiries`
=> [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 7, 8, 8, 10, 10]
1.9.3p385 :060 > Enquiry.includes(:follow_ups).sort_by!(&:score).map &:score
Enquiry Load (0.1ms) SELECT `enquiries`.* FROM `enquiries`
FollowUp Load (0.1ms) SELECT `follow_ups`.* FROM `follow_ups` WHERE `follow_ups`.`enquiry_id` IN (55, 64, 65, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 85, 86, 89, 91, 92, 93, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127)
=> [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 7, 8, 8, 10, 10]
1.9.3p385 :057 > enquiries = Enquiry.where(status_id: [1,2,3])
1.9.3p385 :061 > enquiries.includes(:follow_ups).sort_by!(&:score).map &:score
Enquiry Load (0.5ms) SELECT `enquiries`.* FROM `enquiries` WHERE `enquiries`.`status_id` IN (1, 2, 3)
FollowUp Load (0.2ms) SELECT `follow_ups`.* FROM `follow_ups` WHERE `follow_ups`.`enquiry_id` IN (68, 75, 78, 91, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 113, 114, 115, 116, 117, 120, 122, 123, 124, 125, 126, 127)
=> [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 5, 5, 5, 5, 5, 5, 5, 5, 7, 8, 8, 10]
I believe you should do
users.sort_by!(&:distance).includes(:photo)
Use this:
User.includes(:photo).sort_by!(&:distance)
instead of
users.includes(:photo).sort_by!(&:distance)
includes is defined on ActiveRecord models and relations, not on arrays: User is the model name, while users is an array. Arrays have an include? method, but only models and relations have includes.
