ClickHouse select count from joined table records (COUNT, JOIN)

Can anyone help with the following question?
I have 2 tables:
"T1" contents:
| EVENT_ID | USER_ID | RECORD_CREATED_DATE |
___________________________________________
| 5f0172 | 111 | 2020.07.13 |
| 5f0173 | 222 | 2020.06.11 |
| 5f0174 | 111 | 2020.08.20 |
"T2" contents:
| ID | USER_ID | RECORD_CREATED_DATE | SAVE_DATE |
___________________________________________________
| 1 | 111 | 2020.05.21 | 2020.05.21 |
| 2 | 222 | 2020.03.18 | 2020.03.18 |
| 3 | 111 | 2020.07.21 | 2020.07.21 |
| 4 | 222 | 2020.08.15 | 2020.08.15 |
For each row of T1, I need the number of T2 records with the same USER_ID whose RECORD_CREATED_DATE is earlier than T1.RECORD_CREATED_DATE, without a GROUP BY USER_ID. The expected result is:
| EVENT_ID | USER_ID | RECORD_CREATED_DATE | COUNT(T2.Id) |
__________________________________________________________
| 5f0172 | 111 | 2020.07.13 | 1 | -> because table "T2" has 1 record for user 111 dated before 2020.07.13
| 5f0173 | 222 | 2020.06.11 | 1 |
| 5f0174 | 111 | 2020.08.20 | 2 |
Any ideas?

Try this query:
SELECT
EVENT_ID,
MAX(T1.USER_ID) AS USER_ID,
MAX(T1.RECORD_CREATED_DATE) AS RECORD_CREATED_DATE,
count() AS cnt
FROM T2
INNER JOIN T1 ON T1.USER_ID = T2.USER_ID
WHERE T2.RECORD_CREATED_DATE < T1.RECORD_CREATED_DATE
GROUP BY EVENT_ID
ORDER BY EVENT_ID ASC
/*
┌─EVENT_ID─┬─USER_ID─┬─RECORD_CREATED_DATE─┬─cnt─┐
│ 5f0172 │ 111 │ 2020-07-13 00:00:00 │ 1 │
│ 5f0173 │ 222 │ 2020-06-11 00:00:00 │ 1 │
│ 5f0174 │ 111 │ 2020-08-20 00:00:00 │ 2 │
└──────────┴─────────┴─────────────────────┴─────┘
*/
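To check the logic of the query above without a ClickHouse server, here is a minimal sketch that runs the same join on the sample data. It assumes SQLite (through Python's standard `sqlite3` module) as a stand-in for ClickHouse and stores the dates as ISO strings so that `<` compares them chronologically; both choices are assumptions for illustration, not part of the original answer.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for ClickHouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE T1 (EVENT_ID TEXT, USER_ID INTEGER, RECORD_CREATED_DATE TEXT);
CREATE TABLE T2 (ID INTEGER, USER_ID INTEGER, RECORD_CREATED_DATE TEXT, SAVE_DATE TEXT);
INSERT INTO T1 VALUES
  ('5f0172', 111, '2020-07-13'),
  ('5f0173', 222, '2020-06-11'),
  ('5f0174', 111, '2020-08-20');
INSERT INTO T2 VALUES
  (1, 111, '2020-05-21', '2020-05-21'),
  (2, 222, '2020-03-18', '2020-03-18'),
  (3, 111, '2020-07-21', '2020-07-21'),
  (4, 222, '2020-08-15', '2020-08-15');
""")

# Same shape as the answer: join on USER_ID, keep only earlier T2 rows,
# then count per EVENT_ID. MAX() carries the T1 columns through the GROUP BY.
rows = conn.execute("""
SELECT T1.EVENT_ID,
       MAX(T1.USER_ID) AS USER_ID,
       MAX(T1.RECORD_CREATED_DATE) AS RECORD_CREATED_DATE,
       COUNT(*) AS cnt
FROM T2
INNER JOIN T1 ON T1.USER_ID = T2.USER_ID
WHERE T2.RECORD_CREATED_DATE < T1.RECORD_CREATED_DATE
GROUP BY T1.EVENT_ID
ORDER BY T1.EVENT_ID
""").fetchall()

for row in rows:
    print(row)
```

Note that with an inner join, a T1 event that has no earlier T2 record for its user does not appear in the result at all, rather than appearing with a count of 0.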

Related

MYSQL joining the sum of matching fields

I record eftpos payments that are paid as a group at the end of each day, but I am having trouble matching individual payments to the daily total.
Payments table:
|id | paymentjobno| paymentamount| paymentdate|paymenttype|
| 1 | 1000 | 10 | 01/01/2000 | 2 |
| 2 | 1001 | 15 | 01/01/2000 | 2 |
| 3 | 1002 | 18 | 01/01/2000 | 2 |
| 4 | 1003 | 10 | 01/01/2000 | 1 |
| 5 | 1004 | 127 | 02/01/2000 | 2 |
I want to return something like this, so I can match it to $43 transactions on the following day and record payment ID numbers against the transaction:
|id | paymentjobno| paymentamount| paymentdate|paymenttype|daytotal|
| 1 | 1000 | 10 | 01/01/2000 | 2 | 43 |
| 2 | 1001 | 15 | 01/01/2000 | 2 | 43 |
| 3 | 1002 | 18 | 01/01/2000 | 2 | 43 |
Below is my current attempt, but I only get one returned row per day even if there are multiple payments, and the daytotal is the same for every returned result, which is also not the value I was expecting. What am I doing wrong?
SELECT
id,
paymentjobno,
paymentamount,
paymentdate,
paymenttype,
t.daytotal
FROM payments
LEFT JOIN (
SELECT SUM(paymentamount) AS daytotal
FROM payments
GROUP BY paymentdate) t ON id = payments.id
WHERE paymenttype = 2 AND paymentdate $dateclause
GROUP BY payments.paymentdate
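One possible reading of the problem: the subquery does not return paymentdate, so there is nothing to join the daily total on, and the outer GROUP BY collapses each day to a single row. The sketch below joins on paymentdate instead and drops the outer GROUP BY. It is only an illustration under assumptions: SQLite (via Python's `sqlite3`) stands in for MySQL, the `$dateclause` placeholder is replaced by a hypothetical equality test on one date, and the total is restricted to paymenttype = 2 because the expected 43 excludes the type 1 payment.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE payments (
  id INTEGER, paymentjobno INTEGER, paymentamount INTEGER,
  paymentdate TEXT, paymenttype INTEGER
);
INSERT INTO payments VALUES
  (1, 1000, 10,  '01/01/2000', 2),
  (2, 1001, 15,  '01/01/2000', 2),
  (3, 1002, 18,  '01/01/2000', 2),
  (4, 1003, 10,  '01/01/2000', 1),
  (5, 1004, 127, '02/01/2000', 2);
""")

# The subquery exposes paymentdate so the outer query can join on it.
# There is no outer GROUP BY, so every payment row is kept.
rows = conn.execute("""
SELECT p.id, p.paymentjobno, p.paymentamount, p.paymentdate,
       p.paymenttype, t.daytotal
FROM payments AS p
JOIN (
  SELECT paymentdate, SUM(paymentamount) AS daytotal
  FROM payments
  WHERE paymenttype = 2
  GROUP BY paymentdate
) AS t ON t.paymentdate = p.paymentdate
WHERE p.paymenttype = 2 AND p.paymentdate = ?
ORDER BY p.id
""", ("01/01/2000",)).fetchall()

for row in rows:
    print(row)
```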

How to add slim to rails statistics (stats) for code statistics?

I searched and experimented, but couldn't figure out how to make rails stats count .slim files in the Views statistics. It only counts .erb templates, but I want .slim files included because they are views too.
% bin/rails stats
+----------------------+--------+--------+---------+---------+-----+-------+
| Name | Lines | LOC | Classes | Methods | M/C | LOC/M |
+----------------------+--------+--------+---------+---------+-----+-------+
| Controllers | 3245 | 1634 | 57 | 218 | 3 | 5 |
| Helpers | 186 | 149 | 0 | 18 | 0 | 6 |
| Jobs | 34 | 20 | 2 | 2 | 1 | 8 |
| Models | 879 | 541 | 25 | 77 | 3 | 5 |
| Mailers | 85 | 53 | 3 | 6 | 2 | 6 |
| Channels | 46 | 28 | 3 | 4 | 1 | 5 |
| Views | 0 | 0 | 0 | 0 | 0 | 0 |
+----------------------+--------+--------+---------+---------+-----+-------+
I could add an extra rule for something like "Slim views", but this would count the .erb templates in views too.

Neo4j Cypher: How to optimize a NOT EXISTS Query when cardinality is high

The query below takes over 1 second and consumes about 7 MB when the cardinality between users and posts is about 8000 (one user views about 8000 posts). It is difficult to scale because latency and memory consumption are high and grow linearly. Is it possible to model this differently and/or optimise the query?
Query
PROFILE MATCH (u:User)-[:CREATED]->(p:Post) WHERE NOT (:User{ID: 2})-[:VIEWED]->(p) RETURN p.ID
Plan
| Plan | Statement | Version | Planner | Runtime | Time | DbHits | Rows | Memory (Bytes) |
+-----------------------------------------------------------------------------------------------------------+
| "PROFILE" | "READ_ONLY" | "CYPHER 4.1" | "COST" | "INTERPRETED" | 1033 | 3721750 | 10 | 6696240 |
+-----------------------------------------------------------------------------------------------------------+
+------------------------------+-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| Operator | Details | Estimated Rows | Rows | DB Hits | Cache H/M | Memory (Bytes) | Ordered by |
+------------------------------+-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| +ProduceResults#neo4j | `p.ID` | 2158 | 10 | 0 | 0/0 | | |
| | +-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| +Projection#neo4j | p.ID AS `p.ID` | 2158 | 10 | 10 | 0/0 | | |
| | +-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| +Filter#neo4j | u:User | 2158 | 10 | 10 | 0/0 | | |
| | +-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| +Expand(All)#neo4j | (p)<-[anon_15:CREATED]-(u) | 2158 | 10 | 20 | 0/0 | | |
| | +-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| +AntiSemiApply#neo4j | | 2158 | 10 | 0 | 0/0 | | |
| |\ +-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| | +Expand(Into)#neo4j | (anon_47)-[anon_61:VIEWED]->(p) | 233 | 0 | 3695819 | 0/0 | 6696240 | anon_47.ID ASC |
| | | +-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| | +NodeUniqueIndexSeek#neo4j | UNIQUE anon_47:User(ID) WHERE ID = $autoint_0 | 8630 | 8630 | 17260 | 0/0 | | anon_47.ID ASC |
| | +-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| +NodeByLabelScan#neo4j | p:Post | 8630 | 8630 | 8631 | 0/0 | | |
+------------------------------+-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+

Yes, this can be improved.
First, let's understand what this is doing.
It starts with a NodeByLabelScan. That makes sense; there's no avoiding that.
But then, for every node of the label (the following executes PER ROW!), it matches to user 2, and expands all :VIEWED relationships from user 2 to see if any of them is the post for that particular row.
Can you see why this is inefficient? There are 8630 post nodes according to the PROFILE plan, so user 2 is looked up by index 8630 times, and their :VIEWED relationships are expanded 8630 times. Why 8630 times? Because this is happening per :Post node.
Instead, try this:
MATCH (:User{ID: 2})-[:VIEWED]->(viewedPost)
WITH collect(viewedPost) as viewedPosts
MATCH (:User)-[:CREATED]->(p:Post)
WHERE NOT p IN viewedPosts
RETURN p.ID
This changes things up a bit.
First it matches to user 2's viewed posts (the lookup and expansion is performed only once), then those viewed posts are collected.
Then it will do a label scan, and filter such that the post isn't in the collection of viewed posts.
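The difference between the two query shapes can be sketched outside the database. The Python below is only an analogy, not a model of how Neo4j executes a plan: the dictionary and list are hypothetical toy data, and using a Python set for the collected posts is a choice made here (Cypher's collect() produces a list).

```python
# Hypothetical toy data standing in for the graph.
viewed_by_user = {2: [101, 103]}        # user ID -> IDs of posts they viewed
created_posts = [101, 102, 103, 104]    # IDs of posts created by some user


def unviewed_per_row(user_id):
    """Original shape: look the user up and expand their views for every post."""
    result = []
    for post in created_posts:
        viewed = list(viewed_by_user.get(user_id, []))  # repeated per post
        if post not in viewed:
            result.append(post)
    return result


def unviewed_collect_once(user_id):
    """Rewritten shape: collect the viewed posts once, then filter the scan."""
    viewed = set(viewed_by_user.get(user_id, []))       # done a single time
    return [post for post in created_posts if post not in viewed]


print(unviewed_per_row(2))
print(unviewed_collect_once(2))
```

Both functions return the same posts; the second simply avoids repeating the lookup and expansion for each post.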

Joining three tables in psql and keeping results according to group membership

I am using psql and have joined three tables A, B and C, starting from table A.
For example, the resulting table is as follows:
+----+------+------+------+
| pk | a_id | b_id | c_id |
+----+------+------+------+
| 1 | 5 | 12 | 16 |
| 2 | 5 | 7 | 8 |
| 3 | 5 | 6 | 21 |
| 4 | 8 | 12 | 16 |
| 5 | 8 | 3 | 9 |
| 6 | 9 | 11 | 32 |
| 7 | 9 | 8 | 2 |
+----+------+------+------+
I am trying to create c_id relations over a_id. In a_id there are three groups [5,8,9]. For example, c_id=16 is related to a_id=[5,8], so the rows with c_id=[8,21,9] must also be kept because they belong to a_id=[5,8]. The resulting table should look as follows:
+----+------+------+------+
| pk | a_id | b_id | c_id |
+----+------+------+------+
| 1 | 5 | 12 | 16 |
| 2 | 5 | 7 | 8 |
| 3 | 5 | 6 | 21 |
| 4 | 8 | 12 | 16 |
| 5 | 8 | 3 | 9 |
+----+------+------+------+
How can I write such a condition in the join statement?

After the join, you can write this query. I created your result table directly, and then I wrote a SQL query.
SELECT * from res
WHERE a_id in (SELECT distinct a_id
FROM res
WHERE c_id=16)
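As a quick check, the same query can be run against the sample rows from the question. This is a minimal sketch that assumes SQLite (via Python's `sqlite3`) in place of PostgreSQL and reuses the table name res from the answer above.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE res (pk INTEGER, a_id INTEGER, b_id INTEGER, c_id INTEGER);
INSERT INTO res VALUES
  (1, 5, 12, 16),
  (2, 5, 7, 8),
  (3, 5, 6, 21),
  (4, 8, 12, 16),
  (5, 8, 3, 9),
  (6, 9, 11, 32),
  (7, 9, 8, 2);
""")

# Keep every row whose a_id group contains at least one row with c_id = 16.
rows = conn.execute("""
SELECT * FROM res
WHERE a_id IN (SELECT DISTINCT a_id FROM res WHERE c_id = 16)
ORDER BY pk
""").fetchall()

for row in rows:
    print(row)
```

The groups a_id=5 and a_id=8 each contain a row with c_id=16, so their five rows are kept and the a_id=9 rows are dropped.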

What is the difference between these two Cypher queries?

I'm a bit stumped.
In my database, I have a relationship like this:
(u:User)-[r1:LISTENS_TO]->(a:Artist)<-[r2:LISTENS_TO]-(u2:User)
I want to perform a query where for a given user, I find the common artists between that user and every other user.
To give an idea of size of my database, I have about 600 users, 47,546 artists, and 184,211 relationships between users and artists.
The first query I was trying was the following:
START me=node(553314), other=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
OPTIONAL MATCH
pMutualArtists=(me:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(other:User)
WHERE
other:User
WITH other, COUNT(DISTINCT pMutualArtists) AS mutualArtists
ORDER BY mutualArtists DESC
LIMIT 10
RETURN other.username, mutualArtists
This was taking around 20 seconds to return. The profile for this query is as follows:
+----------------------+-------+--------+------------------------+------------------------------------------------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------------------+-------+--------+------------------------+------------------------------------------------------------------------------------------------+
| ColumnFilter(0) | 10 | 0 | | keep columns other.username, mutualArtists |
| Extract | 10 | 20 | | other.username |
| ColumnFilter(1) | 10 | 0 | | keep columns other, mutualArtists |
| Top | 10 | 0 | | { AUTOINT0}; Cached( INTERNAL_AGGREGATEb6facb18-1c5d-45a6-83bf-a75c25ba6baf of type Integer) |
| EagerAggregation | 563 | 0 | | other |
| OptionalMatch | 52806 | 0 | | |
| Eager(0) | 563 | 0 | | |
| NodeByIndexQuery(1) | 563 | 564 | other, other | Literal(withinDistance:[38.89037,-77.03196,80.467]); userLocations |
| NodeById(1) | 1 | 1 | me, me | Literal(List(553314)) |
| Eager(1) | 82 | 0 | | |
| ExtractPath | 82 | 0 | pMutualArtists | |
| Filter(0) | 82 | 82 | | (hasLabel(a:Artist(1)) AND NOT(ar1 == ar2)) |
| SimplePatternMatcher | 82 | 82 | a, me, ar2, ar1, other | |
| Filter(1) | 1 | 3 | | ((hasLabel(me:User(3)) AND hasLabel(other:User(3))) AND hasLabel(other:User(3))) |
| NodeByIndexQuery(1) | 563 | 564 | other, other | Literal(withinDistance:[38.89037,-77.03196,80.467]); userLocations |
| NodeById(1) | 1 | 1 | me, me | Literal(List(553314)) |
+----------------------+-------+--------+------------------------+------------------------------------------------------------------------------------------------+
I was frustrated. It didn't seem like this should take 20 seconds.
I came back to the problem later on, and tried debugging it from the start.
I started to break down the query, and I noticed I was getting much faster results. Without the Neo4J Spatial query, I was getting results in about 1.5 seconds.
I finally added things back, and ended up with the following query:
START u=node(553314), u2=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
OPTIONAL MATCH
pMutualArtists=(u:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(u2:User)
WHERE
u2:User
WITH u2, COUNT(DISTINCT pMutualArtists) AS mutualArtists
ORDER BY mutualArtists DESC
LIMIT 10
RETURN u2.username, mutualArtists
This query returns in 4240 ms. A 5X improvement! The profile for this query is as follows:
+----------------------+-------+--------+--------------------+------------------------------------------------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------------------+-------+--------+--------------------+------------------------------------------------------------------------------------------------+
| ColumnFilter(0) | 10 | 0 | | keep columns u2.username, mutualArtists |
| Extract | 10 | 20 | | u2.username |
| ColumnFilter(1) | 10 | 0 | | keep columns u2, mutualArtists |
| Top | 10 | 0 | | { AUTOINT0}; Cached( INTERNAL_AGGREGATEbdf86ac1-8677-4d45-967f-c2dd594aba49 of type Integer) |
| EagerAggregation | 563 | 0 | | u2 |
| OptionalMatch | 52806 | 0 | | |
| Eager(0) | 563 | 0 | | |
| NodeByIndexQuery(1) | 563 | 564 | u2, u2 | Literal(withinDistance:[38.89037,-77.03196,80.467]); userLocations |
| NodeById(1) | 1 | 1 | u, u | Literal(List(553314)) |
| Eager(1) | 82 | 0 | | |
| ExtractPath | 82 | 0 | pMutualArtists | |
| Filter(0) | 82 | 82 | | (hasLabel(a:Artist(1)) AND NOT(ar1 == ar2)) |
| SimplePatternMatcher | 82 | 82 | a, u2, u, ar2, ar1 | |
| Filter(1) | 1 | 3 | | ((hasLabel(u:User(3)) AND hasLabel(u2:User(3))) AND hasLabel(u2:User(3))) |
| NodeByIndexQuery(1) | 563 | 564 | u2, u2 | Literal(withinDistance:[38.89037,-77.03196,80.467]); userLocations |
| NodeById(1) | 1 | 1 | u, u | Literal(List(553314)) |
+----------------------+-------+--------+--------------------+------------------------------------------------------------------------------------------------+
And, to prove that I ran them both in a row and got very different results:
neo4j-sh (?)$ START u=node(553314), u2=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
>
> OPTIONAL MATCH
> pMutualArtists=(u:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(u2:User)
> WHERE
> u2:User
>
> WITH u2, COUNT(DISTINCT pMutualArtists) AS mutualArtists
> ORDER BY mutualArtists DESC
> LIMIT 10
> RETURN u2.username, mutualArtists
> ;
+------------------------------+
| u2.username | mutualArtists |
+------------------------------+
| "573904765" | 644 |
| "28600291" | 601 |
| "1092510304" | 558 |
| "1367963461" | 521 |
| "1508790199" | 455 |
| "1335360028" | 447 |
| "18200866" | 444 |
| "1229430376" | 435 |
| "748318333" | 434 |
| "5612902" | 431 |
+------------------------------+
10 rows
4240 ms
neo4j-sh (?)$ START me=node(553314), other=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
>
> OPTIONAL MATCH
> pMutualArtists=(me:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(other:User)
> WHERE
> other:User
>
> WITH other, COUNT(DISTINCT pMutualArtists) AS mutualArtists
> ORDER BY mutualArtists DESC
> LIMIT 10
> RETURN other.username, mutualArtists;
+--------------------------------+
| other.username | mutualArtists |
+--------------------------------+
| "573904765" | 644 |
| "28600291" | 601 |
| "1092510304" | 558 |
| "1367963461" | 521 |
| "1508790199" | 455 |
| "1335360028" | 447 |
| "18200866" | 444 |
| "1229430376" | 435 |
| "748318333" | 434 |
| "5612902" | 431 |
+--------------------------------+
10 rows
20418 ms
Unless I have gone crazy, the only difference between these two queries is the names of the nodes (I've changed "me" to "u" and "other" to "u2").
Why does that cause a 5X improvement??!?!
If anyone has any insight into this, I would be eternally grateful.
Thanks,
-Adam
EDIT 8.1.14
Based on @ulkas's suggestion, I tried simplifying the query.
The results were:
START u=node(553314), u2=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
OPTIONAL MATCH pMutualArtists=(u:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(u2:User)
RETURN u2.username, COUNT(DISTINCT pMutualArtists) as mutualArtists
ORDER BY mutualArtists DESC
LIMIT 10
~4 seconds
START me=node(553314), other=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
OPTIONAL MATCH pMutualArtists=(me:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(other:User)
RETURN other.username, COUNT(DISTINCT pMutualArtists) as mutualArtists
ORDER BY mutualArtists DESC
LIMIT 10
~20 seconds
So bizarre. It seems as though literally the named nodes of "other" and "me" cause the query time to jump tremendously. I'm very confused.
Thanks,
-Adam

That sounds like you're seeing the effect of caching. Upon the first access the cache is not populated. Subsequent queries hitting the same graph will be much faster since the nodes/relationships are already available in the cache.

Using OPTIONAL MATCH followed by WHERE other:User makes no sense, since the end node other (u2) must be matched anyway. Try running the queries without OPTIONAL MATCH, without the WHERE, and without the final WITH, simply:
START me=node(553314), other=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
MATCH
pMutualArtists=(me:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(other:User)
RETURN other.username, count(DISTINCT pMutualArtists) as mutualArtists
ORDER BY mutualArtists DESC
LIMIT 10
