asof join with Julia data tools

I am hoping to do something along the lines of pandas merge_asof or QuestDB's ASOF JOIN in Julia. Critically, I also need to apply a group-by operation.
I would be happy to use any of Julia's Tables.jl-respecting tools. DataFrames' leftjoin gets close, but requires exact key matches, and doesn't do grouping (as far as I can tell). SplitApplyCombine.jl's leftgroupjoin allows you to pass in your own comparison function, but I don't quite see how to use that function to specify the "nearest less than" value, or "nearest greater than" value.
For a simple example where group-bys are not necessary, on two tables left and right, each with a column time, I could use a function like
function find_nearest_before(val, data)
    findlast(x -> x <= val, data)
end

[find_nearest_before(t, right.time) for t in left.time]
and this would get me the indices in right to join to left. However, I don't quite see how to put this together with a group-by.
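To make the shape of what I'm after more concrete, here is a rough sketch of the grouped version I have in mind (written against hypothetical tables that also carry an integer key column to group on, with rows sorted by time within each group; I'd hope a table-aware package expresses this more cleanly):

function grouped_asof(left, right)
    # collect, per group key, the row indices of `right` (assumed sorted by time)
    right_groups = Dict{Int, Vector{Int}}()
    for (i, k) in enumerate(right.key)
        push!(get!(right_groups, k, Int[]), i)
    end
    # for each row of `left`, look only at rows of `right` with the same key
    map(eachindex(left.time)) do i
        idxs = get(right_groups, left.key[i], Int[])
        j = findlast(x -> right.time[x] <= left.time[i], idxs)
        j === nothing ? missing : idxs[j]
    end
end

This would give, for each row of left, the matching row index in right (or missing), from which the other columns could then be pulled across.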
EDIT
Adding an example to make the question clearer. The first table, sensor_pings, reports when a sensor sees something. The second table, in_sensor_FOV, tells us what object is actually in a sensor's field of view (FOV) at a given time. Assume a sensor only has one object in its FOV at a time (the opposite is not necessarily true).
julia> using TypedTables
julia> sensor_pings = Table(time=[4,5,7,8,9,10,11,13,15,16], sensor_id=[2,1,1,3,2,3,1,2,3,2])
Table with 2 columns and 10 rows:
time sensor_id
┌────────────────
1 │ 4 2
2 │ 5 1
3 │ 7 1
4 │ 8 3
5 │ 9 2
6 │ 10 3
7 │ 11 1
8 │ 13 2
9 │ 15 3
10 │ 16 2
julia> in_sensor_FOV = Table(time=[1.3,2.6,3.8,5.9,7.3,8.0,12.3,14.7], sensor_id=[3,1,2,3,2,2,3,1], object_in_sensor_FOV=[:a,:b,:c,:b,:c,:a,:c,:b])
Table with 3 columns and 8 rows:
time sensor_id object_in_sensor_FOV
┌──────────────────────────────────────
1 │ 1.3 3 a
2 │ 2.6 1 b
3 │ 3.8 2 c
4 │ 5.9 3 b
5 │ 7.3 2 c
6 │ 8.0 2 a
7 │ 12.3 3 c
8 │ 14.7 1 b
The end result of the desired operation would look like
julia> Table(time=[4,5,7,8,9,10,11,13,15,16], sensor_id=[2,1,1,3,2,3,1,2,3,2], object_in_sensor_FOV=[:c,:b,:b,:b,:a,:b,:b,:a,:c,:a])
Table with 3 columns and 10 rows:
time sensor_id object_in_sensor_FOV
┌──────────────────────────────────────
1 │ 4 2 c
2 │ 5 1 b
3 │ 7 1 b
4 │ 8 3 b
5 │ 9 2 a
6 │ 10 3 b
7 │ 11 1 b
8 │ 13 2 a
9 │ 15 3 c
10 │ 16 2 a

It's rather easy to write something like that; you just need to implement a double cursor:
using TypedTables
using Setfield
sensor_pings = Table(time=[4,5,7,8,9,10,11,13,15,16], sensor_id=[2,1,1,3,2,3,1,2,3,2])
in_sensor_FOV = Table(time=[1.3,2.6,3.8,5.9,7.3,8.0,12.3,14.7], sensor_id=[3,1,2,3,2,2,3,1], object_in_sensor_FOV=[:a,:b,:c,:b,:c,:a,:c,:b])
function mergeasof(t1, t2)
    objects = similar(t2.object_in_sensor_FOV, length(t1.time))
    d = ntuple(_ -> :z, 3) # :z is a sentinel value meaning there were no objects up to this moment; can be anything
    i2 = 1
    # Double cursor
    for i1 in axes(t1, 1)
        tm1 = t1.time[i1]
        # updating `d` to the current time step
        while i2 <= length(t2.time)
            t2.time[i2] > tm1 && break
            @set! d[t2.sensor_id[i2]] = t2.object_in_sensor_FOV[i2]
            i2 += 1
        end
        objects[i1] = d[t1.sensor_id[i1]]
    end
    return Table(time = t1.time, sensor_id = t1.sensor_id, object_in_sensor_FOV = objects)
end
julia> mergeasof(sensor_pings, in_sensor_FOV)
Table with 3 columns and 10 rows:
time sensor_id object_in_sensor_FOV
┌──────────────────────────────────────
1 │ 4 2 c
2 │ 5 1 b
3 │ 7 1 b
4 │ 8 3 b
5 │ 9 2 a
6 │ 10 3 b
7 │ 11 1 b
8 │ 13 2 a
9 │ 15 3 c
10 │ 16 2 a
it should be rather fast and could be adapted for an arbitrary number of columns (it's just more tedious to write).
A few notes, though:
This function expects that both tables are sorted by time.
It can be adapted to a forward search, but that is somewhat more tedious.
I am using the fact that there are 3 sensors. If the number of sensors is known beforehand, it should be used in the ntuple call. If it is unknown, large, or the sensor ids are arbitrary, then instead of ntuple you can use a Dict
d = Dict{Int, Symbol}()
and the @set! should be dropped, leaving a plain assignment
d[t2.sensor_id[i2]] = t2.object_in_sensor_FOV[i2]
and instead of
objects[i1] = d[t1.sensor_id[i1]]
you should use
objects[i1] = get(d, t1.sensor_id[i1], :z)
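Putting those substitutions together, the Dict-based variant would look roughly like this (an untested sketch, under the same assumption that both tables are sorted by time):

function mergeasof_dict(t1, t2)
    objects = similar(t2.object_in_sensor_FOV, length(t1.time))
    d = Dict{Int, Symbol}()   # sensor_id => last object seen in that sensor's FOV
    i2 = 1
    for i1 in axes(t1, 1)
        tm1 = t1.time[i1]
        # advance the right-hand cursor up to the current time
        while i2 <= length(t2.time)
            t2.time[i2] > tm1 && break
            d[t2.sensor_id[i2]] = t2.object_in_sensor_FOV[i2]
            i2 += 1
        end
        objects[i1] = get(d, t1.sensor_id[i1], :z)   # :z is still the "nothing seen yet" sentinel
    end
    return Table(time = t1.time, sensor_id = t1.sensor_id, object_in_sensor_FOV = objects)
end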

Here's one way of doing it in DataFrames - this is certainly not the peak of efficiency, but if your data is small enough that you can afford the first leftjoin it might be good enough.
Start by joining in_sensor_FOV onto sensor_pings:
julia> df = leftjoin(sensor_pings, in_sensor_FOV, on = :sensor_id, makeunique = true);
after that you'll have multiple rows for each sensor in sensor_pings, which is where this approach might fail if your data is large.
Then get the time difference:
julia> transform!(df, [:time, :time_1] => ((x, y) -> x - y) => :time_diff);
Now your findlast approach, if I understand correctly, means we only consider rows with a positive time difference:
julia> df = df[df.time_diff .> 0.0, :];
Then we sort by sensor and time diff and pick the first row for each sensor:
julia> res = combine(groupby(sort(df, [:sensor_id, :time_diff]), [:sensor_id, :time]), names(df[:, Not([:sensor_id, :time])]) .=> first .=> names(df[:, Not([:sensor_id, :time])]));
Result (sorted to produce the same output):
julia> sort(select(res, [:time, :sensor_id, :object_in_sensor_FOV]), :time)
10×3 DataFrame
Row │ time sensor_id object_in_sensor_FOV
│ Int64 Int64 Symbol
─────┼────────────────────────────────────────
1 │ 4 2 c
2 │ 5 1 b
3 │ 7 1 b
4 │ 8 3 b
5 │ 9 2 a
6 │ 10 3 b
7 │ 11 1 b
8 │ 13 2 a
9 │ 15 3 c
10 │ 16 2 a

Related

Remove Element from the linked list

def removeKFromList(l, k):
    pointer = l
    while pointer:
        if pointer.next and pointer.next.value == k:
            # unlink the next node; do NOT advance, so the new
            # successor is also checked on the next iteration
            pointer.next = pointer.next.next
        else:
            pointer = pointer.next
    # finally handle the head node itself
    if l and l.value == k:
        return l.next
    else:
        return l
In this code, why do I need to put
pointer = pointer.next
under else? The code does not work if I don't write this under else, but I don't see why.
First realise that pointer is intended to reference the node that precedes the one that might need to be deleted.
Now if we find that the successor of pointer must be deleted, we enter the if block. In that case, removing the next node makes the node that came after it the new successor of pointer. Since we also want to run the deletion check on that new successor, we should not move pointer; we just leave it where it is. In the next iteration we will then correctly determine whether that new successor node must be deleted or not.
If we had moved pointer to pointer.next instead, pointer would refer to the new successor, but that node would not be checked for removal in the next iteration. It would escape the removal check!
Here is a visualisation of what can go wrong when we do pointer = pointer.next in that case.
Input: l = [1,2,2,3], k = 2
l
pointer
↓
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│ value: 1 │ │ value: 2 │ │ value: 2 │ │ value: 3 │
│ next: ───────> │ next: ───────> │ next: ───────> │ next:None │
└───────────┘ └───────────┘ └───────────┘ └───────────┘
In the first iteration the condition pointer.next.value == k is true, and so the removal is performed with pointer.next = pointer.next.next, resulting in this situation:
l
pointer
↓ ┌────────────────┐
┌───────────┐ │ ┌───────────┐ │ ┌───────────┐ ┌───────────┐
│ value: 1 │ │ │ value: 2 │ └> │ value: 2 │ │ value: 3 │
│ next: ──────┘ │ next: ───────> │ next: ───────> │ next:None │
└───────────┘ └───────────┘ └───────────┘ └───────────┘
If now we would do pointer = pointer.next, then we get this:
l pointer
↓ ┌────────────────┐ ↓
┌───────────┐ │ ┌───────────┐ │ ┌───────────┐ ┌───────────┐
│ value: 1 │ │ │ value: 2 │ └> │ value: 2 │ │ value: 3 │
│ next: ──────┘ │ next: ───────> │ next: ───────> │ next:None │
└───────────┘ └───────────┘ └───────────┘ └───────────┘
...and in the next iteration the if condition will not be about that node, but about the next, the one with value 3, and so the second occurrence of 2 will not be deleted!

Google Sheets: How can I FILTER a range based on a value in the previous row?

I'm working with a golf data set and I'm looking for a way to filter holes based on the result of a previous hole. In the end, I want this range to be able to get the average score of the golfer following a bogey or worse.
I've made a few attempts with FILTER(), OFFSET(), and even INDIRECT(), but I can't figure out how to properly use values from a different row as the condition for my filter.
=FILTER(A2:D10, OFFSET(D2:D10, -1, 0) >= 1, ROW(D2:D10) <> 2) (errors with "FILTER has mismatched range sizes.")
=INDIRECT("D"&FILTER(ROW(A2:D10)+1, D2:D10 >= 1, ROW(D2:D10) <> 2)) (only returns the first value)
Sample Data:
A B C D
-----------------------------
1 | Hole Par Score ScoreDiff
2 | 1 4 5 1
3 | 2 4 4 0
4 | 3 4 3 -1
5 | 4 5 6 1
6 | 5 3 3 0
7 | 6 5 6 1
8 | 7 3 4 1
9 | 8 4 5 1
10 | 9 4 4 0
Desired outcome: only the holes directly following a bogey or worse (where ScoreDiff >= 1)
A B C D
-----------------------------
1 | 2 4 4 0
2 | 5 3 3 0
3 | 7 3 4 1
4 | 8 4 5 1
5 | 9 4 4 0
Simpler option:
=FILTER(A3:D11,D2:D10>=1)
try:
=FILTER(A2:D10, {""; D2:D9} >= 1, ROW(D2:D10) <> 2)

Condition for memory access conflict in memory-banked vector processors

The Hennessy-Patterson book on Computer Architecture (Quantitative Approach 5ed) says that in a vector architecture with multiple memory banks, a bank conflict can happen if the following condition is met (Page 279 in 5ed):
(Number of banks) / LeastCommonMultiple(Number of banks, Stride) < Bank busy time
However, I think it should be GreatestCommonFactor instead of LCM, because memory conflict would occur if the effective number of banks you have is less than the busy time. By effective number of banks I mean this - let's say you have 8 banks, and a stride of 2. Then effectively you have 4 banks, because the memory accesses will be lined up only at four banks (e.g, let's say your accesses are all even numbers, starting from 0, then your accesses will be lined up at banks 0,2,4,6).
In fact, this formula even fails for the example given right below it: "Suppose we have 8 memory banks with a busy time of 6 clock cycles and a total memory latency of 12 clock cycles. How long will it take to complete a 64-element vector load with a stride of 1?" There they calculate the time as 12 + 64 = 76 clock cycles. However, a memory bank conflict would occur according to the condition as given, so we clearly couldn't have one access per cycle (the 64 in that calculation).
Am I getting it wrong, or has the wrong formula managed to survive 5 editions of this book (unlikely)?
GCD(banks, stride) should come into it; your argument about that is correct.
Let's try this for a few different strides and see what we get, for number of banks = b = 8.
# generated with the calc(1) function
define f(s) { print s, " | ", lcm(s,8), " | ", gcd(s,8), " | ", 8/lcm(s,8), " | ", 8/gcd(s,8) }
stride | LCM(s,b) | GCF(s,b) | b/LCM(s,b) | b/GCF(s,b)
     1 |        8 |        1 |      1     |     8          # 8 < 6 = false: no conflict
     2 |        8 |        2 |      1     |     4          # 4 < 6 = true: conflict
     3 |       24 |        1 |     ~0.333 |     8          # 8 < 6 = false: no conflict
     4 |        8 |        4 |      1     |     2          # 2 < 6 = true: conflict
     5 |       40 |        1 |      0.2   |     8
     6 |       24 |        2 |     ~0.333 |     4
     7 |       56 |        1 |     ~0.143 |     8
     8 |        8 |        8 |      1     |     1
     9 |       72 |        1 |     ~0.111 |     8
     x |      >=8 |   2^0..3 |     <=1    |     1, 2, 4, or 8
b/LCM(s,b) is always <=1, so it always predicts conflicts.
I think GCF (aka GCD) looks right for the stride values I've looked at so far. You only have a problem if the stride doesn't distribute the accesses over all the banks, and that's what b/GCF(s,b) tells you.
Stride = 8 should be the worst-case, using the same bank every time. gcd(8,8) = lcm(8,8) = 8. So both expressions give 8/8 = 1 which is less than the bank busy/recovery time, thus correctly predicting conflicts.
Stride = 1 is of course the best case (no conflicts if there are enough banks to hide the busy time). gcd(8,1) = 1 correctly predicts no conflicts: 8/1 = 8, which is not less than 6. lcm(8,1) = 8, and 8/8 < 6 is true, so the LCM form incorrectly predicts conflicts.
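If you want to check both variants mechanically, here is a small sketch (Julia this time; gcd and lcm are in Base) using the book's example of b = 8 banks and a busy time of 6 cycles:

# compare the printed formula (LCM) with the proposed correction (GCD)
b, busy = 8, 6
for s in 1:9
    println("stride=", s,
            "  b/lcm=", b / lcm(b, s), " -> conflict? ", b / lcm(b, s) < busy,
            "  b/gcd=", b / gcd(b, s), " -> conflict? ", b / gcd(b, s) < busy)
end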

How to improve ILIKE Performance?

My application has a page where all the cities of a state that start with a particular letter are shown.
For example:
State: Alabama, Page A
--> All cities in Alabama starting with the letter 'A'
This is my query
City.where(state: 'Alabama').where("name ilike?", "a%")
This query takes ~110 - 140 ms. Is there any way in which I can bring down the query time to <10 ms.
Thanks in advance :)
PostgreSQL doesn't use an ordinary index for the LIKE operator:
postgres=# create index on obce(nazev);
CREATE INDEX
Time: 120.605 ms
postgres=# explain analyze select * from obce where nazev like 'P%';
┌─────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
╞═════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Seq Scan on obce (cost=0.00..137.12 rows=435 width=41) (actual time=0.023..2.345 rows=450 loops=1) │
│ Filter: ((nazev)::text ~~ 'P%'::text) │
│ Rows Removed by Filter: 5800 │
│ Planning time: 0.485 ms │
│ Execution time: 2.413 ms │
└─────────────────────────────────────────────────────────────────────────────────────────────────────┘
(5 rows)
You should use the special syntax with the varchar_pattern_ops operator class:
postgres=# create index on obce(nazev varchar_pattern_ops);
CREATE INDEX
Time: 124.709 ms
postgres=# explain analyze select * from obce where nazev like 'P%';
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Bitmap Heap Scan on obce (cost=12.39..76.39 rows=435 width=41) (actual time=0.291..0.714 rows=450 loops=1) │
│ Filter: ((nazev)::text ~~ 'P%'::text) │
│ Heap Blocks: exact=58 │
│ -> Bitmap Index Scan on obce_nazev_idx1 (cost=0.00..12.28 rows=400 width=0) (actual time=0.253..0.253 rows=450 loops=1) │
│ Index Cond: (((nazev)::text ~>=~ 'P'::text) AND ((nazev)::text ~<~ 'Q'::text)) │
│ Planning time: 0.953 ms │
│ Execution time: 0.831 ms │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(7 rows)
But this doesn't work for ILIKE; the workaround can be a functional index:
create index on obce(upper(nazev) varchar_pattern_ops);
select * from obce where upper(nazev) like upper('P%');
Note: "Nazev" is the name in Czech language
Another possibility is using pg_trgm extension and using trigram index. It is working for both LIKE, ILIKE, but the index is much bigger - it is not problem for relative small static tables.
create extension pg_trgm ;
create index on obce using gin (nazev gin_trgm_ops);
postgres=# explain analyze select * from obce where nazev like 'P%';
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Bitmap Heap Scan on obce (cost=15.37..79.81 rows=435 width=41) (actual time=0.327..0.933 rows=450 loops=1) │
│ Recheck Cond: ((nazev)::text ~~ 'P%'::text) │
│ Rows Removed by Index Recheck: 134 │
│ Heap Blocks: exact=58 │
│ -> Bitmap Index Scan on obce_nazev_idx1 (cost=0.00..15.26 rows=435 width=0) (actual time=0.287..0.287 rows=584 loops=1) │
│ Index Cond: ((nazev)::text ~~ 'P%'::text) │
│ Planning time: 0.359 ms │
│ Execution time: 1.056 ms │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(8 rows)

Clustering unique datasets based on similarities (equality)

I have just entered the space of data mining, machine learning and clustering. I have a particular problem and do not know which technique to use to solve it.
I want to perform clustering of observations (objects or whatever) on a specific data format. All variables in each observation are numeric. My data input looks like this:
1 2 3 4 5 6
1 3 5 7
2 9 10 11 12 13 14
45 1 22 23 24
Let's say that n represents a row (an observation, or 1D vector, ...) and m represents a column (the variable index within each vector). n could be a very large number, and 0 < m < 100. The main point is also that the same observation (row) cannot contain repeated values (within one row, a value can appear only once).
So, I want to somehow perform clustering where observations are put into one cluster based on the number of identical values that the rows share.
If there are two rows like:
1
1 2 3 4 5
They should be put in the same cluster; if there is no match, then definitely not. Also, the number of rows in one cluster should not go above 100.
A sick problem..? If not, just for info, I didn't mention the time dimension. But let's skip that for now.
So, any directions from you guys,
Thanks and best regards,
JDK
It's hard to recommend anything since your problem is totally vague and we have no information on the data. Data mining (and in particular explorative techniques like clustering) is all about understanding the data, so we cannot provide the ultimate answer.
Two things for you to consider:
1. If the data indicates presence of species or traits, Jaccard similarity (and other set-based metrics) is worth a try.
2. If absence is less informative, maybe you should be mining association rules, not clusters.
Either way, without understanding your data these numbers are as good as random numbers. You can easily cluster random numbers, and spend weeks to get the best useless result!
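For what it's worth, if the rows are treated as sets of values, the Jaccard similarity mentioned in point 1 is a one-liner. A minimal sketch (in Julia, applied to the first two rows of your sample data):

# Jaccard similarity of two rows treated as sets of values
jaccard(a, b) = length(intersect(Set(a), Set(b))) / length(union(Set(a), Set(b)))

jaccard([1, 2, 3, 4, 5, 6], [1, 3, 5, 7])   # shares {1, 3, 5} out of 7 distinct values => 3/7 ≈ 0.43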
Can your problem be treated as a Bag-of-words model, where each article (observation row) has no more than 100 terms?
Anyway, I think you have to give more information and examples about "why" and "how" you want to cluster these data. For example, we have:
1 2 3
2 3 4
2 3 4 5
1 2 3 4
3 4 6
6 7 8
9 10
9 11
10 12 13 14
What is your expected clustering? How many clusters are there in this clustering? Only two clusters?
Before you give more information, according to your current description, I think you do not need a clustering algorithm, but a connected-components structure. In the first round you process the dataset to build the connected components, and in the second round you check which connected component each row belongs to. Take the example above; first round:
1 2 3 : 1 <- 1, 1 <- 2, 1 <- 3 (all point linked to the smallest point to
represent they are belong to the same cluster of the smallest point)
2 3 4 : 2 <- 4 (2 and 3 have already linked to 1 which is <= 2, so they do
not need to change)
2 3 4 5 : 2 <- 5
1 2 3 4 : 1 <- 4 (in fact this change are not essential because we have
1 <- 2 <- 4, but change this can speed up the second round)
3 4 6 : 3 <- 6
6 7 8 : 6 <- 7, 6 <- 8
9 10 : 9 <- 9, 9 <- 10
9 11 : 9 <- 11
10 12 13 14 : 10 <- 12, 10 <- 13, 10 <- 14
Now we have a forest structure representing the connected components of the points. In the second round you can easily pick one point in each row (the smallest one is best) and trace its root in the forest. The rows that have the same root are in the same, in your words, cluster. For example:
1 2 3 : 1 <- 1, cluster root 1
2 3 4 5 : 1 <- 1 <- 2, cluster root 1
6 7 8 : 1 <- 1 <- 3 <- 6, cluster root 1
9 10 : 9 <- 9, cluster root 9
10 12 13 14 : 9 <- 9 <- 10, cluster root 9
This process takes O(k) space, where k is the number of points, and O(nm + nh) time, where h is the height of the forest structure (typically h << m).
I am not sure if this is the result you want.
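For concreteness, the two rounds could be sketched with a small union-find like this (illustrative code only; the names are made up, not from any package, and it unions under the smallest root rather than the literal smallest point, which yields the same components):

function find_root(parent, x)
    # walk up the forest until we reach a self-parented root
    while parent[x] != x
        x = parent[x]
    end
    return x
end

function cluster_rows(rows)
    parent = Dict{Int, Int}()
    # first round: union all values of a row under the smallest root
    for row in rows
        roots = [find_root(parent, get!(parent, v, v)) for v in row]
        r = minimum(roots)
        for x in roots
            parent[x] = r
        end
    end
    # second round: a row's cluster is the root of any one of its values
    return [find_root(parent, first(row)) for row in rows]
end

rows = [[1,2,3], [2,3,4], [2,3,4,5], [1,2,3,4], [3,4,6], [6,7,8], [9,10], [9,11], [10,12,13,14]]
cluster_rows(rows)   # rows sharing a root are in the same group; two distinct roots here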
