How can I flatten a PySpark dataframe joining on linked ids

I have a PySpark dataframe of ids showing the results of a series of linkage operations; it shows how the records in a series of dataframes are linked across those dataframes. An example is below, where df_a represents the first dataframe in a pairwise comparison and df_b the second. link_a is the id of a record in df_a that links to the entry identified by link_b in df_b.
So the first row says that entry 100 in dataframe 1 links to entry 200 in dataframe 2. The next row says that entry 100 in dataframe 1 links to entry 300 in dataframe 3.
df_a  df_b  link_a  link_b
1     2     100     200
1     3     100     300
1     4     100     400
1     5     100     500
2     3     200     300
2     4     200     400
2     5     200     500
3     4     300     400
4     5     400     500
1     2     101     201
1     3     101     301
2     3     201     301
2     3     202     302
1     3     103     303
1     5     103     503
In the real table there are many dataframe comparisons, and across all the dataframes there are hundreds of thousands of links.
What I am trying to achieve is to reformat this table to show the links across all the dataframes. You can see in the above example that 100 links to 200, 200 to 300, 300 to 400 and 400 to 500, so all 5 records link together. Not every case will have a record in each dataframe, so in some cases you will end up with incomplete chains and empty values.
The end result would look like:
df_1  df_2  df_3  df_4  df_5
100   200   300   400   500
101   201   301   -     -
-     202   302   -     -
103   -     303   -     503
I will then use the above to add a common id to the underlying data.
I have gotten part of the way there using a series of joins, but this seems clumsy and quickly becomes difficult to follow. I have now run out of ideas as to how to solve this, so assistance getting closer, or even a full solution, would be gratefully received.
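One way to avoid the chain of joins is to treat this as a graph problem: every (dataframe, id) pair is a vertex, every row of the links table is an edge, and each chain of linked records is a connected component. Below is a minimal sketch using the GraphFrames package, assuming the table above is a DataFrame called links and the session is called spark (both names, and the checkpoint path, are illustrative):

from pyspark.sql import functions as F
from graphframes import GraphFrame  # assumes the graphframes package is installed

# Each (dataframe, record id) pair seen on either side of a link becomes a vertex.
vertices = (
    links.select(F.concat_ws("_", "df_a", "link_a").alias("id"),
                 F.col("df_a").alias("df"), F.col("link_a").alias("rec"))
         .union(links.select(F.concat_ws("_", "df_b", "link_b").alias("id"),
                             F.col("df_b").alias("df"), F.col("link_b").alias("rec")))
         .distinct()
)

# Each row of the links table becomes an edge.
edges = links.select(F.concat_ws("_", "df_a", "link_a").alias("src"),
                     F.concat_ws("_", "df_b", "link_b").alias("dst"))

spark.sparkContext.setCheckpointDir("/tmp/cc-checkpoint")  # required by connectedComponents
components = GraphFrame(vertices, edges).connectedComponents()

# One row per chain, one column per source dataframe; dataframes with no
# record in a chain come out as null, matching the incomplete chains above.
result = components.groupBy("component").pivot("df").agg(F.first("rec"))
result.show()

The pivoted columns are named after the dataframe numbers (1 to 5), so rename them to df_1 ... df_5 if needed; the component value itself can serve as the common id to add back to the underlying data.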

Related

Splitting an array into equal-sized chunks

I'm trying to create a Google Sheet that lets you enter the number of pages in each chapter of a book like so:
Chapter | # of pages
5 | 75
6 | 88
... | ...
53 | 63
And split it into x chunks of chapters, so that each chunk has about the same number of pages. So say I want to read 300 pages in the next 5 days, and the next 300 pages contain a total of 13 chapters, each of varying length. How can I break those 13 chapters up so I have about the same amount of reading each day?
Edit:
![example of working sheet](https://i.stack.imgur.com/D7lbu.jpg)
The goal is to enter an arbitrary number of chapters and days (in this case, 7) and distribute the chapters between the days so that there's an approximately even number of pages per day, while keeping the chapters in order.
try:
=TEXT(((C1*C3)/TEXT(C2, "[m]"))/24, "[h] \da\y(\s) m \min")
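For the splitting itself, here is a rough greedy sketch in Python rather than a sheet formula (illustrative only; it keeps chapters in order and closes a chunk once it gets close to the per-day page target):

# A rough greedy split: walk the chapters in order and start a new day
# once the running total gets close to the ideal pages-per-day target.
def split_chapters(pages, days):
    target = sum(pages) / days            # ideal number of pages per day
    chunks, current, total = [], [], 0
    for p in pages:
        # close the current chunk if adding this chapter overshoots the
        # target, as long as there are still days left to fill
        if current and total + p / 2 > target and len(chunks) < days - 1:
            chunks.append(current)
            current, total = [], 0
        current.append(p)
        total += p
    chunks.append(current)
    return chunks

print(split_chapters([75, 88, 63, 40, 52, 30, 45], 3))
# -> [[75, 88], [63, 40, 52], [30, 45]]

A single greedy pass like this is not guaranteed to be optimal, but for chapter-sized numbers it lands close to an even split.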

Parsing Text File Using Awk

I would like to parse a text file that has the section of interest as follows:
mesh 0 400 12000
400 300 400
1 0 -1
300 500 600
0 0 1
etc....
12000
1300
1100
etc..
I would like only the rows that immediately follow the row starting with the string mesh, and every other row after that, provided they have 3 columns. I would like this output in a separate text file with a modified name.
So desired output text file:
400 300 400
300 500 600
I tried to do this with Python and loops, but it literally took hours and never finished, as there are thousands to hundreds of thousands of lines in the original text file.
Is there a more efficient way to do this with a bash script using awk?
awk to the rescue!
$ awk '/^mesh/{n=NR;next} NF==3 && n && NR%2==(n+1)%2' file > filtered_file
400 300 400
300 500 600
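If you would rather stay in Python, the same single-pass idea is also fast; reading the file once is not what took hours, so the original slowdown was likely algorithmic. A sketch of that approach, with placeholder file names:

# Stream the file once: remember the line number of the last "mesh" row,
# then keep 3-column rows whose parity matches the line right after it.
mesh_line = None
with open("file") as src, open("file_filtered.txt", "w") as out:
    for line_no, line in enumerate(src, start=1):
        fields = line.split()
        if fields and fields[0] == "mesh":
            mesh_line = line_no
            continue
        if mesh_line and len(fields) == 3 and line_no % 2 == (mesh_line + 1) % 2:
            out.write(line)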

collaborative filtering item-based in mahout - without isolating users

In Mahout there is an implemented method for item-based collaborative filtering called itemsimilarity.
In theory, similarity between items should be calculated only over users who ranked both items. During testing I realized that Mahout works differently.
In the example below the similarity between items 11 and 12 should equal 1, but Mahout's output is 0.36.
Example 1. items are 11-12
Similarity between items:
11 12 0.36602540378443865
Matrix with preferences:
     11   12
1    1
2         1
3    1    1
4         1
It looks like Mahout treats null as 0.
Example 2. items are 101-103.
Similarity between items:
101 102 0.2612038749637414
101 103 0.4340578302732228
102 103 0.2600070276638468
Matrix with preferences:
     101  102  103
1    1         0.1
2    1         0.1
3    1         0.1
4    1    1    0.1
5    1    1    0.1
6         1    0.1
7         1    0.1
8         1    0.1
9         1    0.1
10        1    0.1
Similarity between items 101 and 102 should be calculated using only the ranks from users 4 and 5, and likewise for items 101 and 103 (according to the theory). Here (101,103) comes out more similar than (101,102), and it shouldn't.
Both examples were run without any additional parameters.
Is this problem solved somewhere, somehow? Any ideas?
Source: http://files.grouplens.org/papers/www10_sarwar.pdf
Those users are not identical. Collaborative filtering needs a measure of cooccurrence, and the same items do not cooccur between those users. Likewise the items are not identical; they each have different users who preferred them.
The data is turned into a "sparse matrix" where only non-zero values are recorded. The rest are treated as a 0 value; this is expected and correct. The algorithms treat 0 as no preference, not a negative preference.
It's doing the right thing.
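To see the difference between the two conventions numerically, here is a small plain-Python sketch (not Mahout code), using one possible reading of the Example 1 matrix:

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Example 1 read as: item 11 rated by users 1 and 3, item 12 by users
# 2, 3 and 4; None marks a missing preference.
item_11 = [1, None, 1, None]
item_12 = [None, 1, 1, 1]

# Sparse-matrix convention: missing preferences enter the norms as 0.
print(cosine([x or 0 for x in item_11], [x or 0 for x in item_12]))  # ~0.41

# Co-rated-only convention (the paper's): only user 3 counts here.
co = [(x, y) for x, y in zip(item_11, item_12) if x is not None and y is not None]
print(cosine([x for x, _ in co], [y for _, y in co]))  # 1.0

The exact number Mahout reports depends on which similarity measure it runs, but the pattern is the same: users who rated only one of the two items still enter the norms, which drags the similarity below 1.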

Logical Addresses & Page numbers

I just started learning memory management and have an idea of pages, frames, virtual memory and so on, but I'm not understanding the procedure for converting logical addresses to their corresponding page numbers.
Here is the scenario:
Page size = 100 words / 8000 bits?
The process generates these logical addresses:
10 11 104 170 73 309 185 245 246 434 458 364
The process takes up two page frames, and none of its pages are resident (in page frames) when the process begins execution.
Determine the page number corresponding to each logical address and fill them into a table with one row and 12 columns.
I know the answer is:
0 0 1 1 0 3 1 2 2 4 4 3
But can someone explain how this is done? Is there an equation or something? I remember seeing something with a table, changing things to binary, and putting them in the page table, like 00100 in page 1, but I'm not really sure. Graphical representations of how this works would be more than appreciated. Thanks
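With a page size of 100 words, the page number is simply the logical address integer-divided by the page size; the remainder is the offset within the page. A quick check in Python:

# page number = address // page_size, offset = address % page_size
page_size = 100
addresses = [10, 11, 104, 170, 73, 309, 185, 245, 246, 434, 458, 364]
print([addr // page_size for addr in addresses])
# -> [0, 0, 1, 1, 0, 3, 1, 2, 2, 4, 4, 3]

So address 104 falls on page 104 // 100 = 1 at offset 4, address 309 on page 3 at offset 9, and so on, which reproduces the answer row above.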

iOS Random Numbers in a range

I know I can get a random number, for example from 0 to 361, using:
arc4random() % 362
But how can I get a random number in between for example 200 to 300?
(arc4random() % 100) + 200
Not to be picky, but I am pretty sure you have to add a number...
if you want all the numbers from 200 to and including 300... use
200 + arc4random()%101;
arc4random()%100 would give a random number from 0 to 99 so 300 would never occur...
