Linq query with join, group and count (extension format)

I have a table that stores some information along with a reference (column parent_ID) to the parent row in the same table.
I need to get a list of all records (using LINQ extension method syntax) together with the count of child records for each. This SQL query gives me the desired result:
Select a.ID, a.Name, Count(b.ID) from Table a
left join Table b on b.parent_ID=a.ID
group by a.ID, a.Name
Example
| ID | Name | parent_ID |
----------------------------------
| 1 | First | null |
| 2 | Child1 | 1 |
| 3 | Child2 | 1 |
| 4 | Child1.1 | 2 |
The result should be:
| 1 | First | 2 |
| 2 | Child1 | 1 |
| 3 | Child2 | 0 |
| 4 | Child1.1 | 0 |
I tried to get at least the child counting to work, but it doesn't:
var model = _db.Table
.GroupBy(r => new { r.parent_ID })
.Select(r => new {
r.Key.parent_ID,
ChildCount = r.GroupBy(g => g.parent_ID).Count()
});
Shouldn't this query be equivalent to something like this:
select parent_ID, count(parent_ID) from Table group by parent_ID
but it returns count = 1 for each row...
How can I write such a query using LINQ extension method syntax?

What I believe you are looking for is a group join. It can be easier to understand in LINQ query syntax than in extension method syntax, so I will post both.
I'm self-joining the table, matching ID to parent_ID, into a grouped collection so that each parent stays associated with its own children. The result is an anonymous object that holds the parent and all of its children.
This is using straight LINQ query syntax.
var model = from t1 in _db.Test1
join t2 in _db.Test1 on t1.ID equals t2.parent_ID into c1
select new {Parent = t1, Children = c1};
And here is the code using LINQ extensions
var model2 = _db.Test1.GroupJoin(_db.Test1,
t1 => t1.ID,
t2 => t2.parent_ID,
(t1, c1) => new {Parent = t1, Children = c1});
I used a quick test program I threw together to post the results into a textbox for both methods but the code was the same so I'll just post it once.
foreach (var test in model)
{
textBox1.AppendTextAddNewLine(String.Format("{0}: {1}",
test.Parent.Name,
test.Children.Count()));
}
Both tests produced the same results, shown below:
First: 2
Child1: 1
Child2: 0
Child1.1: 0
First: 2
Child1: 1
Child2: 0
Child1.1: 0
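For readers who want to see what the group join computes without a database, here is a minimal in-memory sketch in Python. The rows mirror the example table from the question; the function name count_children is made up for illustration and is not part of LINQ.

```python
from collections import defaultdict

# Rows mirror the example table: (ID, Name, parent_ID)
rows = [
    (1, "First", None),
    (2, "Child1", 1),
    (3, "Child2", 1),
    (4, "Child1.1", 2),
]


def count_children(rows):
    """Pair every row with the number of rows that name it as their parent."""
    children = defaultdict(list)
    for row_id, _name, parent_id in rows:
        if parent_id is not None:
            children[parent_id].append(row_id)
    # Like a group join, every outer row is kept, even with zero matches.
    return [
        (row_id, name, len(children.get(row_id, [])))
        for row_id, name, _parent_id in rows
    ]


result = count_children(rows)
for row_id, name, child_count in result:
    print(f"{name}: {child_count}")
```

The key point is the same as with GroupJoin: rows without children still appear, with a count of 0.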

Related

Apache Beam | Python | Dataflow - How to join BigQuery' collections with different keys?

I've faced the following problem.
I'm trying to use INNER JOIN with two tables from Google BigQuery on Apache Beam (Python) for a specific situation. However, I haven't found a native way to deal with it easily.
I'm going to use the output of this query to fill a third table on Google BigQuery, so I really need to run the query on Google Dataflow. The key of the first table (client) is the "id" column, and the key of the second table (purchase) is the "client_id" column.
1.Tables example (consider 'client_table.id = purchase_table.client_id'):
client_table
| id | name | country |
|----|-------------|---------|
| 1 | first user | usa |
| 2 | second user | usa |
purchase_table
| id | client_id | value |
|----|-------------|---------|
| 1 | 1 | 15 |
| 2 | 1 | 120 |
| 3 | 2 | 190 |
2.Code I'm trying to develop (problem in the second line of 'output'):
options = {'project': PROJECT,
'runner': RUNNER,
'region': REGION,
'staging_location': 'gs://bucket/temp',
'temp_location': 'gs://bucket/temp',
'template_location': 'gs://bucket/temp/test_join'}
pipeline_options = beam.pipeline.PipelineOptions(flags=[], **options)
pipeline = beam.Pipeline(options = pipeline_options)
query_results_1 = (
pipeline
| 'ReadFromBQ_1' >> beam.io.Read(beam.io.ReadFromBigQuery(query="select id as client_id, name from client_table", use_standard_sql=True)))
query_results_2 = (
pipeline
| 'ReadFromBQ_2' >> beam.io.Read(beam.io.ReadFromBigQuery(query="select * from purchase_table", use_standard_sql=True)))
output = ( {'query_results_1':query_results_1,'query_results_2':query_results_2}
| 'join' >> beam.GroupBy('client_id')
| 'writeToBQ' >> beam.io.WriteToBigQuery(
table=TABLE,
dataset=DATASET,
project=PROJECT,
schema=SCHEMA,
create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
pipeline.run()
3.Equivalent desired output in SQL:
SELECT a.name, b.value FROM client_table AS a INNER JOIN purchase_table AS b ON a.id = b.client_id;
You could use either a CoGroupByKey or side inputs (as a broadcast join) depending on your key cardinality. If you have a few keys with many elements each, I suggest the broadcast join.
The first thing you'd need to do is to add a key to your PCollections after the BQ read:
# query_results_1 selects "id as client_id", so both sides are keyed on "client_id"
kv_1 = query_results_1 | Map(lambda x: (x["client_id"], x))
kv_2 = query_results_2 | Map(lambda x: (x["client_id"], x))
Then you can just do the CoGBK or broadcast join. As an example (since it would be easier to understand), I am going to use the code from this session of Beam College. Note that in your example the Value of the KV is a dictionary, so you'd need to make some modifications.
Data
jobs = [
("John", "Data Scientist"),
("Rebecca", "Full Stack Engineer"),
("John", "Data Engineer"),
("Alice", "CEO"),
("Charles", "Web Designer"),
("Ruben", "Tech Writer")
]
hobbies = [
("John", "Baseball"),
("Rebecca", "Football"),
("John", "Piano"),
("Alice", "Photoshop"),
("Charles", "Coding"),
("Rebecca", "Acting"),
("Rebecca", "Reading")
]
Join with CGBK
def inner_join(element):
name = element[0]
jobs = element[1]["jobs"]
hobbies = element[1]["hobbies"]
joined = [{"name": name,
"job": job,
"hobbie": hobbie}
for job in jobs for hobbie in hobbies]
return joined
jobs_create = p | "Create Jobs" >> Create(jobs)
hobbies_create = p | "Create Hobbies" >> Create(hobbies)
cogbk = {"jobs": jobs_create, "hobbies": hobbies_create} | CoGroupByKey()
join = cogbk | FlatMap(inner_join)
Broadcast join with Side Inputs
def broadcast_inner_join(element, side_input):
name = element[0]
job = element[1]
hobbies = side_input.get(name, [])
joined = [{"name": name,
"job": job,
"hobbie": hobbie}
for hobbie in hobbies]
return joined
hobbies_create = (p | "Create Hobbies" >> Create(hobbies)
| beam.GroupByKey()
)
jobs_create = p | "Create Jobs" >> Create(jobs)
broadcast_join = jobs_create | FlatMap(broadcast_inner_join,
side_input=pvalue.AsDict(hobbies_create))
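Since the BigQuery rows are dictionaries rather than plain strings, the join function needs a small change. The sketch below shows one way it could look for the client and purchase tables. It is plain Python so it can be tried without a runner, and the element passed in is hand-built in the shape CoGroupByKey produces for a dict of tagged inputs: (key, {tag: iterable of values}). The tags "clients" and "purchases" and the name inner_join_rows are assumptions made for this example.

```python
def inner_join_rows(element):
    """Join function intended for FlatMap after CoGroupByKey (a sketch)."""
    _client_id, grouped = element
    clients = grouped["clients"]      # rows from the client query
    purchases = grouped["purchases"]  # rows from the purchase query
    # Inner join: a key with no rows on either side yields nothing.
    return [
        {"name": client["name"], "value": purchase["value"]}
        for client in clients
        for purchase in purchases
    ]


# Hand-built element in the shape CoGroupByKey emits for key 1.
element = (
    1,
    {
        "clients": [{"client_id": 1, "name": "first user"}],
        "purchases": [
            {"id": 1, "client_id": 1, "value": 15},
            {"id": 2, "client_id": 1, "value": 120},
        ],
    },
)

joined = inner_join_rows(element)
print(joined)
```

In the pipeline this would be applied along the lines of {"clients": kv_1, "purchases": kv_2} | CoGroupByKey() | FlatMap(inner_join_rows).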

Aggregating over Distinct Nodes

The data model I have is: (Item)-[:HAS_AN]->(ItemType) and both node types have a field called ID. Items can be related to multiple ItemTypes and in some cases, these ItemTypes can have the same IDs. I'm trying to populate a structure {ID:..., SameType:...}, where SameType = 1 if a node has the same item type as some node (with ID = 1234), and 0 otherwise.
First, I get the candidate list of nodes qList and the source node's ItemType:
MATCH (p:Item{ID:1234})-[:HAS_AN]->(i)
WITH i as pItemType, qList
Then, I go through qList, comparing each node's ItemType to pItemType (which is the ItemType of the source node):
UNWIND qList as q
MATCH (q)-[:HAS_AN]->(i)
WITH q.ID as qID, pItemType, i,
CASE i
WHEN pItemType THEN 1
ELSE 0
END as SameType
RETURN DISTINCT i, qID, pItemType, SameType
The problem I have is when some nodes have two ItemTypes with the same ID. This gives results where some of the entries are duplicates:
{ | | { |
"ID": 18 | 35258417 | "ID": 71 | 0
} | | } |
{ | | { |
"ID": 18 | 35258417 | "ID": 71 | 0
} | | } |
while I'd like to only take one such row, if more than one exists.
Placing DISTINCT where I have in the last part of the query doesn't seem to work. What's the best way to filter out such duplicates?
Update:
Here is a sample data subset: http://console.neo4j.org/r/f74pdq
Here are the queries that I'm running
MATCH (q:Item) WHERE q.ID <> 1234 WITH COLLECT(DISTINCT(q)) as qList
MATCH (p:Item{ID:1234})-[:HAS_AN]->(i:ItemType) WITH i as pItemType, qList
UNWIND qList as q
MATCH (q)-[:HAS_AN]->(i:ItemType) WITH q.ID as qID, pItemType, i,
CASE i
WHEN pItemType THEN 1
ELSE 0
END as SameType
RETURN DISTINCT i, qID, pItemType, SameType
In this example, Item with ID = 2 has two HAS_AN relations with 2 ItemType nodes with the same ID. I would like only one of them to be returned.
I've tried to simplify your query. Take a look:
MATCH (:Item {ID : 1234})-[:HAS_AN]->(target:ItemType)
MATCH (item:Item)-[:HAS_AN]->(itemType:ItemType)
WHERE item.ID <> 1234
WITH
itemType.ID AS i,
item.ID AS qID,
collect({
pItemType : target,
SameType : CASE exists((item)-[:HAS_AN]-(target))
WHEN true THEN 1 ELSE 0 END
})[0] as Item
RETURN i, qID, Item.pItemType AS pItemType, Item.SameType AS SameType
The trick is in the two lines after WITH. At this point I'm grouping by itemType.ID and item.ID (and not by the itemType and item nodes themselves). In your original query you are using pItemType to group. This does not work because the two ItemTypes with ID = 34 are different nodes although they have the same ID.
The output from your console:
+-------------------------------------+
| i | qID | pItemType | SameType |
+-------------------------------------+
| 31 | 4 | Node[2]{ID:5} | 0 |
| 5 | 3 | Node[2]{ID:5} | 1 |
| 31 | 5 | Node[2]{ID:5} | 0 |
| 45 | 5 | Node[2]{ID:5} | 0 |
| 5 | 1 | Node[2]{ID:5} | 1 |
| 34 | 2 | Node[2]{ID:5} | 0 |
+-------------------------------------+
6 rows
33 ms
Thanks to @Bruno's solution, I was able to get the right answers. However, the original solution did not work right off the bat for me, for two reasons: I needed the qList since I was referring to it later, and it used approximately 4 times as many DB hits as the query in my question. So I tried a few optimizations that cut the number of DB hits in half, and I am sharing the result here for posterity.
MATCH (q:Item) WHERE q.ID <> 1234 WITH COLLECT(DISTINCT(q)) as qList
MATCH (p:Item{ID:1234})-[:HAS_AN]->(i:ItemType) WITH i as pItemType, qList
UNWIND qList as item
MATCH (item)-[:HAS_AN]->(i)
WITH
i, pItemType,
item.ID AS qID,
collect({
pItemType : pItemType,
SameType : CASE i.ID
WHEN pItemType.ID THEN 1 ELSE 0 END
})[0] as Item
RETURN i, qID, Item.pItemType AS pItemType, Item.SameType AS SameType
Turns out running MATCH (item:Item)-[:HAS_AN]->(itemType:ItemType) was adding a Filter operation that took almost as many DB hits as it had matches.
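The difference between grouping by nodes and grouping by their ID properties can be illustrated outside the database. The following Python sketch uses made-up tuples to stand in for matched (item, itemType) pairs; the internal node ids and the helper name first_per_group are hypothetical. Two distinct ItemType nodes share ID 34, as described in the question.

```python
# Each match: (item ID, internal node id of the ItemType, ItemType ID property).
# Internal node ids 10, 11 and 12 are hypothetical; nodes 10 and 11 both carry ID = 34.
matches = [
    (2, 10, 34),
    (2, 11, 34),
    (3, 12, 5),
]


def first_per_group(matches, key):
    """Keep the first match per grouping key, similar to collect(...)[0] per group."""
    seen = {}
    for match in matches:
        seen.setdefault(key(match), match)
    return list(seen.values())


# Grouping by node identity keeps both rows for item 2.
by_node = first_per_group(matches, key=lambda m: (m[0], m[1]))
# Grouping by the ID property collapses them into one row.
by_id = first_per_group(matches, key=lambda m: (m[0], m[2]))

print(len(by_node), len(by_id))
```

Grouping on the ID values rather than on the nodes is what removes the duplicate row.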

How to get the index of FOREACH iterations

Within a FOREACH statement [e.g. day in range(dayX, dayY)], is there an easy way to find out the index of the iteration?
Yes, you can.
Here is an example query that creates 8 Day nodes that contain an index and day:
WITH 5 AS day1, 12 AS day2
FOREACH (i IN RANGE(0, day2-day1) |
CREATE (:Day { index: i, day: day1+i }));
This query prints out the resulting nodes:
MATCH (d:Day)
RETURN d
ORDER BY d.index;
and here is an example result:
+--------------------------+
| d |
+--------------------------+
| Node[54]{day:5,index:0} |
| Node[55]{day:6,index:1} |
| Node[56]{day:7,index:2} |
| Node[57]{day:8,index:3} |
| Node[58]{day:9,index:4} |
| Node[59]{day:10,index:5} |
| Node[60]{day:11,index:6} |
| Node[61]{day:12,index:7} |
+--------------------------+
FOREACH does not yield the index during iteration. If you want the index you can use a combination of range and UNWIND like this:
WITH ["some", "array", "of", "things"] AS things
UNWIND range(0,size(things)-2) AS i
// Do something for each element in the array. In this case connect two Things
MERGE (t1:Thing {name:things[i]})-[:RELATED_TO]->(t2:Thing {name:things[i+1]})
This example iterates a counter i, which you can use to access the item at index i in the array.
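For comparison, the same index-driven pairing looks like this in Python. Note that Cypher's range(0, n) includes n, whereas Python's range(0, n) stops before n, so range(0, size(things)-2) in Cypher corresponds to range(0, len(things) - 1) here. The name pairs is only illustrative.

```python
things = ["some", "array", "of", "things"]

# i runs over 0..len(things)-2 so that things[i + 1] is always a valid index.
pairs = [(i, things[i], things[i + 1]) for i in range(0, len(things) - 1)]

for i, left, right in pairs:
    print(f"{i}: {left} -> {right}")
```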

How to query a table which has a parent child relation

Situation:
I have a table called "word" which contains a word with the associated translations.
| ID | name | lang_id | parent_id |
|----|----------|---------|-----------|
| 1 | screw | 1 | null |
| 2 | schraube | 2 | 1 |
| 3 | vis | 3 | 1 |
So screw is the main word which has no parent. The other data sets have an association to the parent with the parent_id.
What I want:
I need a query which displays the word I searched for and the word which I typed in.
I want to get the datasets 2 and 3, if I query the word "schraube" from german to french.
I want to get the datasets 1 and 3, if I query the word "screw" from english to french.
...
What I tried:
select word.id, word.name, word.lang_id, word.parent_id
from word
left join word w2 on word.parent_id = w2.parent_id
WHERE w2.name = 'screw';
-- and word.lang_id = 2
Unfortunately the result doesn't contain the word I typed. Also this displays all datasets, not only the ones with the specific language.
You can modify the query below to get your answer.
DECLARE @FromLanguageId SMALLINT = 2; --german
DECLARE @ToLanguageId SMALLINT = 3; --french
DECLARE @NAME NVARCHAR(300) = 'schraube';
--DECLARE @FromLanguageId SMALLINT = 1; --english
--DECLARE @ToLanguageId SMALLINT = 3; --french
--DECLARE @NAME NVARCHAR(300) = 'screw';
--Get the matching record
;
WITH ctematch
AS (
--gets the matching record (child or parent)
SELECT match.*
FROM [word] match
WHERE match.NAME LIKE @NAME),
--Join its siblings, parent and children
ctefamilydata
AS (SELECT *
FROM ctematch match
UNION
--Parent
SELECT parent.*
FROM ctematch match
INNER JOIN [word] parent
ON match.[parent_id] = parent.[id]
UNION
--Child
SELECT child.*
FROM ctematch match
INNER JOIN [word] child
ON child.[parent_id] = match.[id]
UNION
--Siblings
SELECT siblings.*
FROM ctematch match
INNER JOIN [word] siblings
ON match.[parent_id] = siblings.[parent_id])
--Filter and get the data
SELECT *
FROM ctefamilydata Cte
WHERE Cte.[lang_id] = @ToLanguageId
OR Cte.[lang_id] = @FromLanguageId
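To try the idea without a SQL Server instance, here is a rough equivalent run against an in-memory SQLite database from Python. This is a sketch under assumptions: SQLite is swapped in for SQL Server, named parameters replace the T-SQL variables, and the CTE names, short aliases, and the function name find_translations are made up for the example. The table contents are the three rows from the question.

```python
import sqlite3


def find_translations(conn, name, from_lang, to_lang):
    """Return the matched word and its relatives in the two given languages."""
    sql = """
    WITH matched AS (
        SELECT * FROM word WHERE name LIKE :name
    ),
    family AS (
        SELECT * FROM matched
        UNION  -- parent of the match
        SELECT p.* FROM matched m JOIN word p ON m.parent_id = p.id
        UNION  -- children of the match
        SELECT c.* FROM matched m JOIN word c ON c.parent_id = m.id
        UNION  -- siblings of the match
        SELECT s.* FROM matched m JOIN word s ON m.parent_id = s.parent_id
    )
    SELECT id, name, lang_id, parent_id
    FROM family
    WHERE lang_id IN (:from_lang, :to_lang)
    ORDER BY id
    """
    params = {"name": name, "from_lang": from_lang, "to_lang": to_lang}
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE word (id INTEGER PRIMARY KEY, name TEXT, "
    "lang_id INTEGER, parent_id INTEGER)"
)
conn.executemany(
    "INSERT INTO word VALUES (?, ?, ?, ?)",
    [(1, "screw", 1, None), (2, "schraube", 2, 1), (3, "vis", 3, 1)],
)

german_to_french = find_translations(conn, "schraube", 2, 3)
english_to_french = find_translations(conn, "screw", 1, 3)
print(german_to_french)
print(english_to_french)
```

The sibling branch never matches a root word, because comparing a NULL parent_id to anything yields no rows; the root word's translations come from the children branch instead.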

Grouping in MDX Query

I am very new to the MDX world.
I want to group the columns based on only 3 rows, but I also need the join for the 4th row.
My query is :
SELECT
( {
[Measures].[Live Item Count]
}
) DIMENSION PROPERTIES parent_unique_name ON COLUMNS,
Crossjoin(
crossjoin(
[Item].[Class].&[Light],
[Item].[Style].&[Fav],
[Item].[Season Year].members),
[Item].[Count].children ) on rows
FROM Cube
Output comes as :
Light(Row) | FAV(Row) | ALL(Row) | 16(Row) | 2(col)
Light(Row) | FAV(Row) | ALL(Row) | 7(Row) | 1(col)
Light(Row) | FAV(Row) | 2012(Row) | 16(Row)| 2(col)
Light(Row) | FAV(Row) | 2011(Row) | 7(Row) | 1(col)
But, I want my output to be displayed as:
Light(Row) | FAV(Row) | ALL(Row) | | 3(col)
Light(Row) | FAV(Row) | 2012(Row) | 16(Row)| 2(col)
Light(Row) | FAV(Row) | 2011(Row) | 7(Row) | 1(col)
i.e., I want to group my first two rows such that there is no duplicate 'ALL' in 3rd column..
Thanks in advance
Try this. Using the level name Season Year together with the attribute name Season Year picks up every member except the ALL member:
SELECT
( {
[Measures].[Live Item Count]
}
) DIMENSION PROPERTIES parent_unique_name ON COLUMNS,
Crossjoin(
crossjoin(
[Item].[Class].&[Light],
[Item].[Style].&[Fav],
[Item].[Season Year].[Season Year].members),
[Item].[Count].children ) on rows
FROM Cube
You can use this query if there is an All member on the [Item].[Count] hierarchy:
SELECT {[Measures].[Live Item Count]} DIMENSION PROPERTIES parent_unique_name ON COLUMNS,
Crossjoin(
Crossjoin([Item].[Class].&[Light], [Item].[Style].&[Fav]),
Union(
Crossjoin({"All member of [Item].[Season Year]"}, {"All member of [Item].[Count]"}),
Crossjoin(Except([Item].[Season Year].members, {"All member of [Item].[Season Year]"}), [Item].[Count].children)
)
) ON ROWS
FROM Cube

Resources