How to optimize a PySpark join with fill last occurrence?

I have two dataframes, stage_changes and journeys, described as follows.
In my app, every candidate is associated with a level_id on their starting day. A candidate's level changes whenever some progress is made, so the number of days between level changes is not fixed.
For example, candidate-A is on level-0 on day-0, then level-1 on day-5, then directly level-4 on day-25. This data is tracked in the dataframe stage_changes.
stage_changes:
| account_id | candidate_id | day_num | level_id |
|------------|--------------|---------|----------|
| 21         | 23097        | 0       | 0        |
| 21         | 23097        | 5       | 1        |
| 21         | 23097        | 25      | 4        |
| 45         | 53838        | 4       | 0        |
| 45         | 23097        | 30      | 7        |
| 21         | 23056        | 45      | 1        |
Every candidate is active for a specific period, from starting_day to ending_day. This is tracked in another dataframe, journeys:
journeys:
| account_id | candidate_id | starting_day | ending_day |
|------------|--------------|--------------|------------|
| 21         | 23097        | 0            | 76         |
| 45         | 53838        | 4            | 45         |
| 21         | 23056        | 45           | 101        |
I want to get the level_id of every candidate on each day of their journey. I am currently doing this as follows:
from pyspark.sql.functions import col, explode, last, udf
from pyspark.sql.window import Window

@udf("array<integer>")
def day_range(start_day, end_day):
    return list(range(start_day, end_day + 1))

all_journey_days = journeys \
    .withColumn("day_range", day_range(col("starting_day"), col("ending_day"))) \
    .select(["account_id", "candidate_id", explode("day_range").alias("day_num")])

window = Window.partitionBy("account_id", "candidate_id").orderBy("day_num") \
    .rowsBetween(Window.unboundedPreceding, 0)

all_day_stage_changes = all_journey_days \
    .join(stage_changes, on=["account_id", "candidate_id", "day_num"], how="left") \
    .withColumn("final_level_id", last(col("level_id"), ignorenulls=True).over(window))
This code produces the correct output, but it takes several minutes given the huge data size on my end. Is there any way to optimize this so the whole process finishes faster?
Important points:
1. Every candidate has a different activity period.
2. Level changes are different for every candidate; there is no hidden logic here.
3. There are ~1M candidates, and their average activity period is ~100 days.
4. Every (account_id, candidate_id) pair is unique in the system, not just the candidate_id.
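One likely bottleneck is the Python UDF that builds the day ranges: every journeys row makes a round trip through a Python worker just to materialize a list of integers. If you are on Spark 2.4 or later, the built-in sequence() function generates the same arrays natively on the JVM, and the rest of the pipeline can stay unchanged. A minimal sketch of that variant, assuming the same dataframe and column names:

from pyspark.sql.functions import col, explode, last, sequence
from pyspark.sql.window import Window

# Build one row per (account_id, candidate_id, day_num) without a Python UDF.
all_journey_days = journeys \
    .withColumn("day_num", explode(sequence(col("starting_day"), col("ending_day")))) \
    .select("account_id", "candidate_id", "day_num")

window = Window.partitionBy("account_id", "candidate_id").orderBy("day_num") \
    .rowsBetween(Window.unboundedPreceding, 0)

# The ignorenulls last() still forward-fills the sparse level changes per candidate.
all_day_stage_changes = all_journey_days \
    .join(stage_changes, on=["account_id", "candidate_id", "day_num"], how="left") \
    .withColumn("final_level_id", last(col("level_id"), ignorenulls=True).over(window))

Beyond that, checking the Spark UI for skew on the join keys is worth a look, but removing the UDF is usually the first win.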

Related

How to forecast (or any other function) in Google Sheets with only one cell of data?

My sheet:
+---------+-----------+---------+---------+-----------+
| product | value 1 | value 2 | value 3 | value 4 |
+---------+-----------+---------+---------+-----------+
| name 1 | 700,000 | 500 | 10,000 | 2,000,000 |
+---------+-----------+---------+---------+-----------+
| name 2 | 200,000 | 800 | 20,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 3 | 100,000 | 150 | 6,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 4 | 1,000,000 | 1,000 | 25,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 5 | 2,000,000 | 1,500 | 30,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 6 | 2,500,000 | 3,000 | 65,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 7 | 300,000 | 300 | 12,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 8 | 350,000 | 200 | 9,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 9 | 900,000 | 1,200 | 28,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 10 | 150,000 | 100 | 5,000 | ? |
+---------+-----------+---------+---------+-----------+
What I am attempting to do is predict the empty cells based on the data I do have. Should I use all of the columns that contain data in every row, or should I focus on just one of them?
I have used FORECAST before, but the column I was predicting values for had more data in it; I think that lack of data is my root problem. I'm not sure FORECAST is the best fit here, so recommendations for other functions are most welcome.
The last thing I can add is that the known value in column E (value 4) is a number I'm confident in, and ideally it would be used in whatever formula I end up with (although I am open to other recommendations).
The formula I was using:
=FORECAST(D3,E2,$D$2:$D$11)
I don't think this is possible without more information. If you think about it, value 4 could be a constant (always 2,000,000), depend on only one other value (say 200 times value 3), or follow a more complex formula (say the sum of values 1, 2, and 3 plus a constant). Each of these three models agrees with the values for name 1, yet they generate vastly different value 4 predictions.
In the case of name 2, the models would output the following for value 4:
Constant: 2,000,000
Value 3: 8,000,000
Sum: 2,489,700
Any of those values could be valid; without further constraints (either more data points or a specified kind of model, and probably both), there is no way to choose between them.

In a data warehouse, can a fact table contain two identical records?

Suppose a user ordered the same product under two different order_ids, and the orders were created within the same date-hour granularity, for example:
order#1 2019-05-05 17:23:21
order#2 2019-05-05 17:33:21
In the data warehouse, should we put them into two rows like this (Option 1):
| id | user_key | product_key | date_key | time_key | price | quantity |
|-----|----------|-------------|----------|----------|-------|----------|
| 001 | 1111 | 22 | 123 | 456 | 10 | 1 |
| 002 | 1111 | 22 | 123 | 456 | 10 | 2 |
Or just put them in one row with the aggregated quantity (Option 2):
| id | user_key | product_key | date_key | time_key | price | quantity |
|-----|----------|-------------|----------|----------|-------|----------|
| 001 | 1111 | 22 | 123 | 456 | 10 | 3 |
I know if I put the order_id as a degenerate dimension in the fact table, it should be Option 1. But in our case, we don't really want to keep the order_id.
Also, I once read an article saying that when all dimensions are filtered out, there should be only one row of data in the fact table. If that statement is correct, Option 2 would be the choice.
Is there a principle I can refer to?
Conceptually, fact tables in a data warehouse should be designed at the most detailed grain available. You can always aggregate data from the lower granularity to the higher one, while the opposite is not true - if you combine the records, some information is lost permanently. If you ever need it later (even though you might not see it now), you'll regret the decision.
I would recommend the following approach: in the data warehouse, keep the order number as a degenerate dimension. Then, when you publish a star schema, you can build a pre-aggregated version of the table (skip the order number, group identical records by date/hour). This way you get a smaller, cleaner fact table in your dimensional model while preserving the more detailed data in the DW.
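A minimal sketch of that pre-aggregation step, written here in PySpark since no specific warehouse or SQL dialect is mentioned; the table and column names (dw.fact_sales_detail, mart.fact_sales_hourly) are assumptions, not anything from the question:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Detailed fact kept in the warehouse: one row per order line,
# with the order number as a degenerate dimension (Option 1 grain).
fact_detail = spark.table("dw.fact_sales_detail")

# Published, pre-aggregated fact: drop the order number and roll
# identical rows up by the remaining keys (Option 2 grain).
fact_hourly = (
    fact_detail
    .groupBy("user_key", "product_key", "date_key", "time_key", "price")
    .agg(F.sum("quantity").alias("quantity"))
)

fact_hourly.write.mode("overwrite").saveAsTable("mart.fact_sales_hourly")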

How to import an Apple Core Motion dataset into Turi Create?

I've recently discovered that Apple Core Motion data (accelerometer, gyroscope, etc.) can be used to create learning models. The link below shows an example:
https://github.com/apple/turicreate/blob/master/userguide/activity_classifier/introduction.md
This example uses data from a large dataset (HAPT). In my situation, I have created my own dataset from recordings of Core Motion data while performing different activities (e.g. jumping, walking, sitting). The next step is to import my dataset into Turi Create to build a model. How can this be achieved? Could anyone provide a list of steps to follow?
Thank you
Ideally, you would have recorded your motion data into some standard format. Let's assume it is in CSV format.
walking,jumping,sitting
82,309635,1
82,309635,1
25,18265403,1
30,18527312,8
30,17977769,40
30,18375422,37
30,18292441,38
30,303092,7
85,18449654,3
You can read the file using any file reader; to simplify your life, pandas or SFrame can handle it for you.
In [14]: import turicreate as tc
In [15]: sf = tc.SFrame.read_csv('/tmp/activity.csv')
Finished parsing file /tmp/activity.csv
Parsing completed. Parsed 9 lines in 0.13823 secs.
------------------------------------------------------
Inferred types from first 100 line(s) of file as
column_type_hints=[int,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
Finished parsing file /tmp/activity.csv
Parsing completed. Parsed 9 lines in 0.113868 secs.
In [16]: sf.head()
Out[16]:
Columns:
walking int
jumping int
sitting int
Rows: 9
Data:
+---------+----------+---------+
| walking | jumping | sitting |
+---------+----------+---------+
| 82 | 309635 | 1 |
| 82 | 309635 | 1 |
| 25 | 18265403 | 1 |
| 30 | 18527312 | 8 |
| 30 | 17977769 | 40 |
| 30 | 18375422 | 37 |
| 30 | 18292441 | 38 |
| 30 | 303092 | 7 |
| 85 | 18449654 | 3 |
+---------+----------+---------+
[9 rows x 3 columns]
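From there, the usual next step with Turi Create is to hand the SFrame to the activity classifier. Note that tc.activity_classifier.create() expects per-sample sensor feature columns plus a session id and an activity label, so the column names below (session_id, activity, and the acc_*/gyro_* features) are assumptions about how your own recordings are laid out, not columns from the toy CSV above:

import turicreate as tc

# Load your recordings; the column names are hypothetical and should match
# whatever your export of the Core Motion samples actually contains.
data = tc.SFrame.read_csv('/tmp/activity.csv')

# Keep samples from the same recording session together when splitting.
train, test = tc.activity_classifier.util.random_split_by_session(
    data, session_id='session_id', fraction=0.8)

# Train the activity classifier on the raw sensor features.
model = tc.activity_classifier.create(
    train,
    session_id='session_id',
    target='activity',
    features=['acc_x', 'acc_y', 'acc_z', 'gyro_x', 'gyro_y', 'gyro_z'],
    prediction_window=50)

# Evaluate, then export to Core ML for use in an iOS app.
print(model.evaluate(test))
model.export_coreml('ActivityClassifier.mlmodel')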

Ideal database solution for the following scenario

There are 50 exams to be written online by around a million students. One person may or may not write more than one exam, and a person can also write a single exam more than once (retries).
So which of the solutions below is better for this case? I am also open to a better solution than these two.
Option 1. Store each exam in its own table:
Subject 1
+----------------+---------+
| student id | Marks |
+----------------+---------+
| 1 | 85 |
| 2 | 32 |
| 2 | 60 |
+----------------+---------+
Subject 2
+----------------+---------+
| student id | Marks |
+----------------+---------+
| 1 | 85 |
| 2 | 32 |
| 2 | 60 |
+----------------+---------+
As above, each table will contain a student id only if that particular person has taken that exam, with multiple occurrences of the student id if they have taken it more than once.
Option 2:
+----------------+---------+---------+
| student id | Subject | Marks |
+----------------+---------+---------+
| 1 | Subj1 | 85 |
| 2 | Subj1 | 32 |
| 2 | Subj1 | 60 |
| 1 | Subj2 | 80 |
| 3 | Subj2 | 90 |
+----------------+---------+---------+
with all the values in a single table.
Which is better from a performance and storage perspective?
I think the best design here is the following (sketched below):
A STUDENT table with information about students
An EXAM table with information about exams
An EXAM_TRY table referencing the STUDENT and EXAM tables, with fields DATE_OF_EXAM and RESULT_OF_EXAM
Two indexes on the foreign keys in EXAM_TRY
Depending on the situation, an index on the date field (for example, you would need it for planning work for examiners)
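A minimal sketch of that schema, using Python's built-in sqlite3 just to make the DDL concrete; the table and column names follow the answer, while the types and surrogate keys are assumptions:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);

CREATE TABLE exam (
    exam_id INTEGER PRIMARY KEY,
    subject TEXT NOT NULL
);

-- One row per attempt, so retries are simply additional rows.
CREATE TABLE exam_try (
    exam_try_id    INTEGER PRIMARY KEY,
    student_id     INTEGER NOT NULL REFERENCES student(student_id),
    exam_id        INTEGER NOT NULL REFERENCES exam(exam_id),
    date_of_exam   TEXT NOT NULL,
    result_of_exam INTEGER NOT NULL
);

-- Indexes on the foreign keys, plus an optional one on the exam date.
CREATE INDEX idx_exam_try_student ON exam_try(student_id);
CREATE INDEX idx_exam_try_exam ON exam_try(exam_id);
CREATE INDEX idx_exam_try_date ON exam_try(date_of_exam);
""")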

Iterating over irregular data with Ruby while 'filling in the blanks'

I'm rolling the following:
Rails 3.2.9
Highcharts
State Machine
I've got an irregular set of data that represents the change of state of hundreds of linux boxes. Each box checks into a central ping server every two minutes.
Every time a device heartbeats, the ping server checks whether the device's current state is offline and, if so, changes the state to online, sets the heartbeat table's online column to true, and inserts the time this happened.
On the ping server, we have a cron that runs a rake task every 5 minutes. This finds all devices whose last heartbeat is older than the time now minus 5 minutes.
If it discovers a device is offline, it sets the device state to offline and writes a row to the heartbeat table with the time of the last heartbeat and a 0.
We've been doing this for a while and it seems like an efficient way to store the uptime data without creating a row for 500 devices every 5 minutes.
The table looks a little like this:
+---------------------+--------+--------+
| created_at | dev_id | online |
+---------------------+--------+--------+
| 2012-10-08 16:29:16 | 2345 | 0 |
| 2012-11-21 16:40:22 | 2345 | 1 |
| 2012-11-03 19:15:00 | 2345 | 0 |
| 2012-11-08 09:15:01 | 2345 | 1 |
| 2012-11-08 09:18:03 | 2345 | 0 |
| 2012-11-09 17:57:22 | 2345 | 1 |
| 2012-12-09 13:57:23 | 2345 | 0 |
| 2012-12-09 14:57:25 | 2345 | 1 |
| 2012-12-09 15:00:30 | 2345 | 0 |
| 2012-12-09 15:57:31 | 2345 | 1 |
| 2012-12-09 16:07:35 | 2345 | 0 |
| 2012-12-09 16:37:38 | 2345 | 1 |
| 2012-12-09 17:57:40 | 2345 | 0 |
+---------------------+--------+--------+
Following Ryan Bates's fantastic RailsCast on Highcharts, I can create a line graph of this data with irregular intervals.
The chart and data series
Following this example:
http://www.highcharts.com/demo/spline-irregular-time
And using a data series something like this:
= @devices.heartbeats.map { |o| o.online == true ? 1 : 0 }
It was plotting the line graph pretty nicely.
Where I'm stuck
The graph finishes at the last time it checked in and I need the graph to show a point at Now. In Ryan's example, he maps a zero to a date if there's no value. I can't translate this part.
I'm trying to achieve a graph like the stacked bar chart but can't get the data sorted.
http://www.highcharts.com/demo/bar-stacked
How can I format my query so that I get the data up until now, as well as each individual point, so I can create such a graph?
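The core trick is a forward fill: the series only has points where the state changed, so you append one synthetic point at the current time that repeats the last known state. A minimal, framework-agnostic sketch of that step (shown in Python rather than Rails, purely to illustrate the [epoch-millis, value] shape a Highcharts datetime series expects):

from datetime import datetime, timezone

# Heartbeat rows as (created_at, online) pairs, ordered by created_at.
heartbeats = [
    (datetime(2012, 12, 9, 16, 37, 38, tzinfo=timezone.utc), 1),
    (datetime(2012, 12, 9, 17, 57, 40, tzinfo=timezone.utc), 0),
]

# Highcharts datetime series take [epoch_millis, value] pairs.
series = [[int(ts.timestamp() * 1000), online] for ts, online in heartbeats]

# Append a synthetic point at "now" carrying the last known state,
# so the line extends to the right-hand edge of the chart.
if series:
    now_ms = int(datetime.now(timezone.utc).timestamp() * 1000)
    series.append([now_ms, series[-1][1]])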
