I am struggling with the following problem:
I have a dataset of different products (cars) that have certain work orders open at a given time. I know from historical data how much time this work has caused in total.
Now I want to predict the total for another car (e.g. Car 3).
Which type of algorithm or regression should I use for this?
My idea was to transform this row-based dataset into a column-based one with binary values, e.g. Brake: 0/1, Screen: 0/1. But then I will have a lot of inputs, as the number of possible inputs is 100-200.
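A minimal sketch of the row-to-column transformation described above, using only the Python standard library. The car names, work-order names, and the helper `to_binary_matrix` are hypothetical and are not taken from the actual dataset.

```python
# Hypothetical row-based data: one (car, work order) pair per row.
rows = [
    ("Car 1", "Brake"),
    ("Car 1", "Screen"),
    ("Car 2", "Brake"),
    ("Car 3", "Screen"),
    ("Car 3", "Engine"),
]

def to_binary_matrix(rows):
    """Pivot (car, work_order) rows into one 0/1 feature row per car."""
    cars = sorted({car for car, _ in rows})
    orders = sorted({order for _, order in rows})
    open_orders = set(rows)
    matrix = [[1 if (car, order) in open_orders else 0 for order in orders]
              for car in cars]
    return cars, orders, matrix

cars, orders, matrix = to_binary_matrix(rows)
for car, features in zip(cars, matrix):
    print(car, dict(zip(orders, features)))
```

A few hundred 0/1 columns is generally not a large input for a linear model, provided there are comfortably more historical rows than columns.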
Here's a quick idea using multiple linear regression on 30 jobs, each of which is some random combination of 6 tasks with a "true cost" for each task. We can regress the total job costs against the task selections in each job to estimate the cost coefficients that best explain the totals.
This is done first with no "noise" in the system (task costs are exact), then with some random noise.
A more thorough job would include examining the R-squared value and plotting the residuals to check that a linear model is adequate.
In [1]: from sklearn import linear_model
In [2]: import numpy as np
In [3]: jobs = np.random.binomial(1, 0.6, (30, 6))
In [4]: true_costs = np.array([10, 20, 5, 53, 31, 42])
In [5]: jobs
Out[5]:
array([[0, 1, 1, 1, 1, 0],
[1, 0, 0, 1, 0, 1],
[1, 1, 0, 1, 0, 0],
[1, 0, 1, 1, 1, 1],
[1, 1, 0, 0, 1, 1],
[0, 1, 0, 0, 1, 0],
[1, 0, 0, 1, 1, 0],
[1, 1, 1, 1, 0, 1],
[1, 0, 0, 1, 0, 1],
[0, 1, 0, 1, 0, 0],
[0, 0, 1, 0, 1, 1],
[1, 0, 1, 1, 1, 1],
[0, 1, 1, 1, 1, 1],
[1, 0, 1, 1, 1, 1],
[0, 1, 1, 0, 1, 0],
[1, 0, 1, 0, 1, 0],
[1, 1, 1, 1, 1, 1],
[1, 0, 1, 0, 0, 1],
[0, 1, 0, 1, 1, 0],
[1, 1, 1, 0, 1, 0],
[1, 1, 1, 1, 1, 0],
[1, 0, 1, 0, 0, 1],
[0, 0, 0, 1, 1, 1],
[1, 1, 0, 1, 1, 1],
[1, 0, 1, 1, 0, 1],
[1, 1, 1, 1, 1, 1],
[1, 0, 1, 1, 1, 1],
[0, 0, 1, 1, 0, 0],
[1, 1, 0, 0, 1, 1],
[1, 1, 1, 1, 0, 0]])
In [6]: tot_job_costs = jobs @ true_costs
In [7]: reg = linear_model.LinearRegression()
In [8]: reg.fit(jobs, tot_job_costs)
Out[8]: LinearRegression()
In [9]: reg.coef_
Out[9]: array([10., 20., 5., 53., 31., 42.])
In [10]: np.random.normal?
In [11]: noise = np.random.normal(0, scale=5, size=30)
In [12]: noisy_costs = tot_job_costs + noise
In [13]: noisy_costs
Out[13]:
array([113.94632664, 103.82109478, 78.73776288, 145.12778089,
104.92931235, 48.14676751, 94.1052639 , 134.64827785,
109.58893129, 67.48897806, 75.70934522, 143.46588308,
143.12160502, 147.71249157, 53.93020167, 44.22848841,
159.64772255, 52.49447057, 102.70555991, 69.08774251,
125.10685342, 45.79436364, 129.81354375, 160.92510393,
108.59837665, 149.1673096 , 135.12600871, 60.55375843,
107.7925208 , 88.16833899])
In [14]: reg.fit(jobs, noisy_costs)
Out[14]: LinearRegression()
In [15]: reg.coef_
Out[15]:
array([12.09045186, 19.0013987 , 3.44981506, 55.21114084, 33.82282467,
40.48642199])
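The R-squared and residual check mentioned in the answer can be sketched in plain Python. The numbers below are made up for illustration and are not taken from the session above; `r_squared` is a hypothetical helper name.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical values for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 31.0, 39.0]

# Residuals should scatter around zero with no visible pattern.
residuals = [o - p for o, p in zip(observed, predicted)]
print("residuals:", residuals)
print("R^2:", r_squared(observed, predicted))
```

With scikit-learn, `reg.score(jobs, noisy_costs)` returns the same R-squared quantity for the fitted model, and `noisy_costs - reg.predict(jobs)` gives the residuals.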
I am trying to implement a weighted sampler for a very imbalanced data set. There are 182 different classes. Here is an array of the bin counts per class:
array([69487, 5770, 5753, 138, 4308, 10, 1161, 29, 5611,
350, 7, 183, 218, 4, 3, 3872, 5, 950,
33, 3, 443, 16, 20, 330, 4353, 186, 19,
122, 546, 6, 44, 6, 3561, 2186, 3, 48,
8440, 338, 9, 610, 74, 236, 160, 449, 72,
6, 37, 1729, 2255, 1392, 12, 1, 3426, 513,
44, 3, 28, 12, 9, 27, 5, 75, 15,
3, 21, 549, 7, 25, 871, 240, 128, 28,
253, 62, 55, 12, 8, 57, 16, 99, 6,
5, 150, 7, 110, 8, 2, 1296, 70, 1927,
470, 1, 1, 511, 2, 620, 946, 36, 19,
21, 39, 6, 101, 15, 7, 1, 90, 29,
40, 14, 1, 4, 330, 1099, 1248, 1146, 7414,
934, 156, 80, 755, 3, 6, 6, 9, 21,
70, 219, 3, 3, 15, 15, 12, 69, 21,
15, 3, 101, 9, 9, 11, 6, 32, 6,
32, 4422, 16282, 12408, 2959, 3352, 146, 1329, 1300,
3795, 90, 1109, 120, 48, 23, 9, 1, 6,
2, 1, 11, 5, 27, 3, 7, 1, 3,
70, 1598, 254, 90, 20, 120, 380, 230, 180,
10, 10])
Some classes have as few as 1 instance. I am trying to use WeightedRandomSampler from torch for this dataset. However, because the class imbalance is so large, when I calculate the weights using
count_occr = np.bincount(dataset.y)
lbl_weights = 1. / count_occr
weights = np.array(lbl_weights)
weights = torch.from_numpy(weights)
sampler = WeightedRandomSampler(weights.type('torch.DoubleTensor'), len(weights*2))
I get two error messages:
RuntimeWarning: divide by zero encountered in true_divide
and
RuntimeError: invalid multinomial distribution (encountering probability entry = infinity or NaN)
Does anyone have a workaround for this? I was considering multiplying lbl_weights by some scalar, but I am not sure if this is a viable option.
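A possible workaround, sketched with the standard library only. It rests on two assumptions: that the divide-by-zero warning comes from label values that never occur (np.bincount reports a count of 0 for them), and that the sampler should receive one weight per sample rather than one per class. The label list and the helper `per_sample_weights` are hypothetical.

```python
from collections import Counter

def per_sample_weights(labels):
    """Return one weight per sample, equal to 1 / (count of its class).

    Only classes that actually occur are counted, so no division by
    zero can happen, unlike 1.0 / np.bincount(...) with unused label ids.
    """
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]

# Hypothetical labels: class 0 is common, class 3 is rare, 1 and 2 are unused.
labels = [0, 0, 0, 0, 3]
weights = per_sample_weights(labels)
print(weights)

# Assumed usage with PyTorch (not executed here):
# sampler = WeightedRandomSampler(weights, num_samples=len(weights),
#                                 replacement=True)
```

With these weights every class that is present carries the same total probability mass, so multiplying by a scalar should not be needed: the sampler only uses the weights relative to each other.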
I ran several benchmarks on a CPU cache simulation to count reuse distances (the distance between two accesses to the same cache entry during execution of the program). Here are two examples of the data I got from two different benchmarks:
Benchmark 1:
"reuse_dist_counts": {
"-1": 340,
"0": 623,
"1": 930,
"100": 1,
"107": 1,
"114": 1,
"121": 1,
"128": 1,
"135": 1,
"142": 1,
"149": 1,
"156": 1,
"163": 1,
"170": 1,
"177": 1,
"184": 1,
"191": 1,
"198": 1,
"2": 617,
"205": 1,
"212": 1,
"219": 1,
"226": 1,
"233": 1,
"240": 1,
"247": 1,
"254": 1,
"261": 1,
"268": 1,
"275": 1,
"282": 1,
"289": 1,
"296": 1,
"3": 617,
"303": 1,
"310": 1,
"311": 1,
"314": 1,
"4": 1,
"48": 1,
"55": 1,
"62": 1,
"69": 1,
"76": 1,
"79": 1,
"86": 1,
"93": 1
}
Benchmark 2:
"reuse_dist_counts": {
"-1": 58,
"0": 128,
"1": 320,
"11": 17,
"12": 1,
"13": 2,
"14": 4,
"15": 18,
"16": 14,
"17": 13,
"18": 13,
"19": 16,
"2": 256,
"20": 16,
"21": 17,
"22": 17,
"23": 2,
"24": 1,
"25": 3,
"26": 3,
"27": 2,
"28": 3,
"29": 2,
"3": 289,
"30": 2,
"31": 2,
"34": 2,
"35": 6,
"38": 2,
"39": 2,
"4": 198,
"40": 4,
"41": 1,
"43": 1,
"44": 2,
"45": 1,
"47": 1,
"48": 2,
"5": 63,
"50": 1,
"6": 81,
"7": 106,
"8": 1
}
The data has the form a: b, where a is a reuse distance and b is how many times that reuse distance was observed during execution.
As I'm going to design a neural network based on this data, I need a good normalization to represent it. Any ideas on how to normalize data with such low variation, and how to represent its feature vectors?
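One possible representation, sketched under stated assumptions rather than as a definitive answer: the key -1 is treated as its own category (the post does not define it), non-negative distances are grouped into power-of-two buckets, and counts are converted to fractions so that benchmarks of different lengths become comparable. The helper `histogram_to_features` and the parameter `num_buckets` are hypothetical.

```python
def histogram_to_features(reuse_dist_counts, num_buckets=10):
    """Turn a {distance: count} histogram into a fixed-length vector.

    Slot 0 holds the share of the special key -1. Slots 1..num_buckets
    hold the share of distances d >= 0 grouped by the bit length of d + 1,
    i.e. d=0 | 1-2 | 3-6 | 7-14 | ...; the last slot absorbs the tail.
    """
    features = [0.0] * (num_buckets + 1)
    total = sum(reuse_dist_counts.values())
    for key, count in reuse_dist_counts.items():
        distance = int(key)
        if distance < 0:
            slot = 0
        else:
            slot = min((distance + 1).bit_length(), num_buckets)
        features[slot] += count / total
    return features

# Hypothetical miniature histogram in the same format as the data above.
example = {"-1": 2, "0": 4, "1": 2, "100": 2}
vector = histogram_to_features(example)
print(vector)
```

Every benchmark then maps to a vector of the same length whose entries lie in [0, 1] and sum to 1, regardless of which individual distances happen to occur.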
When running the following piece of code:
!pip install influxdb-client
from influxdb_client import InfluxDBClient, BucketRetentionRules,WritePrecision
import influxdb_client
import time
client = InfluxDBClient(url="http://localhost:8086", token= "my_password", org="primary")
write_api = client.write_api()
query_api =client.query_api()
write_api.write("myFirstBucket", "primary", ["myMeasurement,location=coyote_creek water_level=1"],write_precision=WritePrecision.MS)
time.sleep(1)
write_api.write("myFirstBucket", "primary", ["myMeasurement,location=coyote_creek water_level=2"],write_precision=WritePrecision.MS)
time.sleep(1)
write_api.write("myFirstBucket", "primary", ["myMeasurement,location=coyote_creek water_level=3"],write_precision=WritePrecision.MS)
time.sleep(1)
write_api.write("myFirstBucket", "primary", ["myMeasurement,location=coyote_creek water_level=4"],write_precision=WritePrecision.MS)
time.sleep(1)
tables=query_api.query('from(bucket:"myFirstBucket") |> range(start: -15s)') #last 15 seconds
for table in tables:
    print(table)
    for row in table.records:
        print(row.values)
I get:
FluxTable() columns: 9, records: 2
{'result': '_result', 'table': 0, '_start': datetime.datetime(2022, 3, 2, 11, 32, 18, 591779, tzinfo=tzutc()), '_stop': datetime.datetime(2022, 3, 2, 11, 32, 33, 591779, tzinfo=tzutc()), '_time': datetime.datetime(2022, 3, 2, 11, 32, 30, 595000, tzinfo=tzutc()), '_value': 2.0, '_field': 'water_level', '_measurement': 'myMeasurement', 'location': 'coyote_creek'}
{'result': '_result', 'table': 0, '_start': datetime.datetime(2022, 3, 2, 11, 32, 18, 591779, tzinfo=tzutc()), '_stop': datetime.datetime(2022, 3, 2, 11, 32, 33, 591779, tzinfo=tzutc()), '_time': datetime.datetime(2022, 3, 2, 11, 32, 32, 590000, tzinfo=tzutc()), '_value': 3.0, '_field': 'water_level', '_measurement': 'myMeasurement', 'location': 'coyote_creek'}
Why do I get only two records? I would expect four records: one record for each write command!
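A possible explanation, offered as an assumption rather than a confirmed diagnosis: `client.write_api()` without options batches writes in the background, and the lines carry no timestamp, so the server assigns one when a batch arrives. Points that share measurement, tag set, and timestamp overwrite one another, and a batch that has not been flushed yet is not visible to the query. The sketch below only builds line-protocol strings with explicit millisecond timestamps; the helper name `make_line` is hypothetical.

```python
import time

def make_line(measurement, tags, field, value, timestamp_ms):
    """Build one line-protocol record with an explicit timestamp."""
    tag_part = ",".join(f"{key}={val}" for key, val in sorted(tags.items()))
    return f"{measurement},{tag_part} {field}={value} {timestamp_ms}"

# Give each point its own timestamp, one second apart, ending at "now".
base_ms = int(time.time() * 1000)
lines = [
    make_line("myMeasurement", {"location": "coyote_creek"},
              "water_level", level, base_ms - 1000 * (3 - i))
    for i, level in enumerate([1, 2, 3, 4])
]
for line in lines:
    print(line)

# Assumed usage (not executed here; needs a running server):
# from influxdb_client.client.write_api import SYNCHRONOUS
# write_api = client.write_api(write_options=SYNCHRONOUS)
# write_api.write("myFirstBucket", "primary", lines,
#                 write_precision=WritePrecision.MS)
```

Writing synchronously means each call returns only after the server has accepted the data, so all four points should be stored before the query runs.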
I've got a distance attribute in my User model:
attr_accessor :distance
When I calculate the distance for each user and store it in distance, I can sort the users like this:
users.sort_by!(&:distance)
The users are then sorted by distance as expected. But when I also eager-load an association, e.g.:
users.includes(:photo).sort_by!(&:distance)
This doesn't sort the users at all. Why is this? How can I sort by distance and still include the association?
The ! in sort_by! indicates that the method modifies the object itself rather than returning a different object.
When you call users.includes(:photo), this method returns a different object. So what you are actually doing is equivalent to:
users2 = users.includes(:photo)
users2.sort_by!(&:distance)
This is why the users object is not sorted after you call sort_by!. A better way to do it might be:
users = users.includes(:photo).sort_by(&:distance)
Well, it works for me. I call it on "User", not "users":
User.includes(:photo).sort_by!(&:distance)
What does the "users" variable hold, anyway? Try User.
Edited with my example; here I use Enquiry in place of User and score in place of distance.
1.9.3p385 :059 > Enquiry.all.sort_by!(&:score).map &:score
Enquiry Load (0.7ms) SELECT `enquiries`.* FROM `enquiries`
=> [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 7, 8, 8, 10, 10]
1.9.3p385 :060 > Enquiry.includes(:follow_ups).sort_by!(&:score).map &:score
Enquiry Load (0.1ms) SELECT `enquiries`.* FROM `enquiries`
FollowUp Load (0.1ms) SELECT `follow_ups`.* FROM `follow_ups` WHERE `follow_ups`.`enquiry_id` IN (55, 64, 65, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 85, 86, 89, 91, 92, 93, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127)
=> [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 7, 8, 8, 10, 10]
1.9.3p385 :057 > enquiries = Enquiry.where(status_id: [1,2,3])
1.9.3p385 :061 > enquiries.includes(:follow_ups).sort_by!(&:score).map &:score
Enquiry Load (0.5ms) SELECT `enquiries`.* FROM `enquiries` WHERE `enquiries`.`status_id` IN (1, 2, 3)
FollowUp Load (0.2ms) SELECT `follow_ups`.* FROM `follow_ups` WHERE `follow_ups`.`enquiry_id` IN (68, 75, 78, 91, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 113, 114, 115, 116, 117, 120, 122, 123, 124, 125, 126, 127)
=> [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 5, 5, 5, 5, 5, 5, 5, 5, 7, 8, 8, 10]
Note: your question is wrong and you downvote me.
I believe you should do
users.sort_by!(&:distance).includes(:photo)
Use this:
User.includes(:photo).sort_by!(&:distance)
includes is a method for models, not for arrays.
User is the model name and users is an array.
An Array has an include? method,
while a model has an includes method.
So use this:
User.includes(:photo).sort_by!(&:distance)
instead of
users.includes(:photo).sort_by!(&:distance)