Kusto Query: Join tables with different datatypes

How would you join two tables based on two columns that have the same names but different datatypes?
In this example, phone_number is a string in table_1 and an int64 in table_2. When I try to change the datatype from string to int, it changes the values!
table_1
| project name, phone_number
| join kind=fullouter table_2 on $left.name==$right.name and $left.phone_number==$right.phone_number
Thanks

You have issues with your data to begin with.
A phone number is of type string, not an integer.
A phone number might have a leading zero, e.g., 050123456, or non-digit characters, e.g., +972123456 or *1234.
If you try to convert those strings to integers, you will get nulls.
If you convert your integers to strings, you will discover that some of the values are missing a leading zero.
That said, in this specific case I would recommend converting string to integer, perhaps after removing any non-digit characters.
let table_1 = datatable(name:string, phone_number:string)
[
"John" ,"050123456"
,"Linda" ,"+972123456"
,"Ben" ,"*1234"
,"Pam" ,"012-333-444"
];
let table_2 = datatable(name:string, phone_number:long)
[
"John" ,50123456
,"Linda" ,972123456
,"Ben" ,1234
,"Pam" ,12333444
];
table_1
| project name, phone_number = tolong(replace_regex(phone_number, @"\D+", ""))
| join kind=fullouter table_2 on $left.name==$right.name and $left.phone_number==$right.phone_number
name    phone_number   name1   phone_number1
-----   ------------   -----   -------------
John    50123456       John    50123456
Linda   972123456      Linda   972123456
Ben     1234           Ben     1234
Pam     12333444       Pam     12333444
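Not part of the original answer, but since kind=fullouter returns both sets of key columns (name/name1, phone_number/phone_number1), you may want to merge them afterwards; a minimal sketch using Kusto's coalesce() on the output above:
table_1
| project name, phone_number = tolong(replace_regex(phone_number, @"\D+", ""))
| join kind=fullouter table_2 on $left.name==$right.name and $left.phone_number==$right.phone_number
| project name = coalesce(name, name1), phone_number = coalesce(phone_number, phone_number1)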

Related

Kusto join two tables based on latest available date

Is there a way to join two tables in Kusto, taking the values from the second table based on the latest available date?
Let's say we get distinct names from the first table, and want to join values from the second table based on the latest available dates.
I would also like to keep only the matches from the left table.
table1
names
-----
Alex
John
Mary
table2
name    weight   date
-----   ------   ----------
Alex    160      2023-01-20
Alex    168      2023-01-28
Mary    120      2022-08-28
Mary    140      2020-09-17
Sample code:
table1
| distinct names
| join kind=inner table2 on $left.names==$right.name
let table1 = datatable(names:string)
[
"Alex"
,"John"
,"Mary"
];
let table2 = datatable(name:string, weight:real ,["date"]:datetime)
[
"Alex" ,160 ,datetime(2023-01-20)
,"Alex" ,168 ,datetime(2023-01-28)
,"Mary" ,120 ,datetime(2022-08-28)
,"Mary" ,140 ,datetime(2020-09-17)
];
table1
| distinct names
| join kind=inner (table2 | summarize arg_max(['date'], *) by name) on $left.names==$right.name
names   name   date                   weight
-----   ----   ----                   ------
Mary    Mary   2022-08-28T00:00:00Z   120
Alex    Alex   2023-01-28T00:00:00Z   168

Many-to-many SELECT divided into multiple rows for some reason

I have two tables joined via a third in a many-to-many relationship. To simplify:
Table A
ID-A (int)
Name (varchar)
Score (numeric)
Table B
ID-B (int)
Name (varchar)
Table AB
ID-AB (int)
A (foreign key ID-A)
B (foreign key ID-B)
What I want is to display the B-Name and a sum of the "Score" values of all the As belonging to the given B. However, the following code:
WITH "Data" AS(
SELECT "B."."Name" As "BName", "A"."Name", "Score"
FROM "AB"
LEFT OUTER JOIN "A" ON "AB"."A" = "A"."ID-A"
LEFT OUTER JOIN "B" ON "AB"."B" = "B"."ID-B")
SELECT "BName", SUM("Score") AS "Total"
FROM "Data"
GROUP BY "Name", "Score"
ORDER BY "Total" DESC
The results display several rows for every "BName", with the "Score" divided into seemingly random increments between these rows. For example, if the desired result for Johnny is 12 and for April it's 25, the query may show something like:
Johnny | 7
Johnny | 3
Johnny | 2
April | 19
April | 5
April | 1
etc.
Even after trying to nest the query and doing another SELECT with SUM("Score"), the results are the same. I'm not sure what I'm doing wrong.
Remove Score from the GROUP BY clause:
SELECT BName, SUM(Score) AS Total
FROM Data
GROUP BY BName
ORDER BY Total DESC;
The purpose of your query is to summarize by name, so name alone should appear in the GROUP BY clause. By also including the score, you will get a record in the output for each unique name/score combination.
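Putting that fix back into the query from the question (everything else unchanged):
WITH "Data" AS (
SELECT "B"."Name" AS "BName", "A"."Name", "Score"
FROM "AB"
LEFT OUTER JOIN "A" ON "AB"."A" = "A"."ID-A"
LEFT OUTER JOIN "B" ON "AB"."B" = "B"."ID-B")
SELECT "BName", SUM("Score") AS "Total"
FROM "Data"
GROUP BY "BName"
ORDER BY "Total" DESC;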
Okay, I figured out my problem. Indeed, I had to GROUP BY "Name" only, but I thought Firebird wasn't letting me do that. Turns out it was just a typo. Oops.

complex db2/sql query with time-sampling, group, map, join and csv export

I have data in a table (named TESTING) in a dashDB instance on IBM Bluemix (Db2 Warehouse on Cloud) that looks like this:
ID TIMESTAMP NAME VALUE
abc 2017-12-21 19:55:38.762 test1 123
abc 2017-12-21 19:55:42.762 test2 456
abc 2017-12-21 19:57:38.762 test1 789
abc 2017-12-21 19:58:38.762 test3 345
def 2017-12-21 19:59:38.762 test1 678
I am looking for a query that:
1. samples the data (for each NAME) to a given time format (e.g., to a 1-minute-based timestamp)
2. VALUES in the same time range (in the same minute) should be averaged; empty times should be NULL
For 1. and 2., something like this (working only for one NAME):
with dummy(temporaer) as (
select TIMESTAMP('2017-12-01') from SYSIBM.SYSDUMMY1
union all
select temporaer + 1 MINUTES from dummy where temporaer < TIMESTAMP('2018-02-01')
)
select temporaer, avg(VALUE) as test1 from dummy
LEFT OUTER JOIN TESTING ON temporaer=date_trunc('minute', TIMESTAMP) and ID='abc' and NAME='test1'
group by temporaer
ORDER BY temporaer ASC;
3. join all the different NAMEs column-wise into a matrix, like:
TIMESTAMP test1 test2 test3
2017-12-01 00:00:00 null null null
...
2017-12-21 19:55:00 123 456 null
2017-12-21 19:56:00 null null null
2017-12-21 19:57:00 789 null null
2017-12-21 19:58:00 678 null 345
...
2018-01-31 23:59:00 null null null
4. the query result should be exported as a CSV, or returned as a CSV string
Does anybody know how this could be done in one query, or in some other simple and fast way? Or is it necessary to save the data in a different table format? Can you give me a hint?
Here is a code snippet that does the job, but takes a very long time:
WITH
-- get all distinct names in table:
header(names) AS (SELECT DISTINCT name
FROM FIELDTEST
WHERE ID='7b9bbe44d45d8f2ac324849a4951da54' AND REGEXP_LIKE(trim(VALUE),'^\d+(\.\d*)?$') AND DATE(TIMESTAMP)>='2017-12-19' AND DATE(TIMESTAMP)<'2017-12-24'),
-- select data (names, values without stringvalues) from table dedicated by timestamp to bigger timeinterval (here minutes):
dummie(time, names, values) AS (SELECT date_trunc('minute', TIMESTAMP), NAME, VALUE
FROM FIELDTEST
WHERE ID='7b9bbe44d45d8f2ac324849a4951da54' AND REGEXP_LIKE(trim(VALUE),'^\d+(\.\d*)?$')),
-- generate a range of times from date to date in defined steps:
dummy(time, rangeEnd) AS (SELECT a, a + 1 MINUTE
FROM (VALUES(TIMESTAMP('2017-12-19'))) D(a)
UNION ALL
SELECT rangeEnd, rangeEnd + 1 MINUTE
FROM dummy
WHERE rangeEnd < TIMESTAMP('2017-12-24')),
-- add each name (from header) to each time/row (in dummy):
dumpy(time, names) AS (SELECT Dummy.time, Header.names
FROM Dummy
LEFT OUTER JOIN Header
ON Dummy.time IS NOT NULL),
-- averages values by name and timeinterval and sorts result to dummy:
dummj(time, names, avgvalues) AS (SELECT Dummy.time, Dummie.names, AVG(Dummie.values)
FROM Dummy
LEFT OUTER JOIN Dummie
ON Dummie.time = Dummy.time
GROUP BY Dummie.names, Dummy.time),
-- joins the averages (by time, name) values to the times and names in dumpy (on empty value use -9999):
testo(time, names, avgvalues) AS (SELECT Dumpy.time, Dumpy.names, COALESCE(Dummj.avgvalues,-9999)
FROM Dumpy
LEFT OUTER JOIN Dummj
ON Dummj.time = Dumpy.time AND Dummj.names = Dumpy.names),
-- converts the high amount of rows to less rows with delimited strings:
test(time, names, avgvalues) AS (SELECT time, LISTAGG(names,';') WITHIN GROUP(ORDER BY names), LISTAGG(avgvalues,';') WITHIN GROUP(ORDER BY names)
FROM Testo
GROUP BY time)
SELECT * FROM test ORDER BY time ASC, names ASC;
The performance problem is in the "testo" subquery. Does anybody have an idea what the problem is here, or know how to improve the query?
Well, one problem I see is that you keep applying functions to columns, but that shouldn't be too big a drain if id is reasonably unique. If this query is run often, it may also be worth permanently building and indexing the range table. Hmm, you probably need several indices (starting with FieldTest.id), but you might also try this version:
-- let's name things properly, too, to keep them straight.
WITH
-- generate a range of times from date to date in defined steps:
Range (rangeStart, rangeEnd) AS (SELECT a, a + 1 MINUTE
FROM (VALUES(TIMESTAMP('2017-12-19'))) D(a)
UNION ALL
SELECT rangeEnd, rangeEnd + 1 MINUTE
FROM Range
WHERE rangeEnd < TIMESTAMP('2017-12-24')),
-- get all distinct names in table:
Header(names) AS (SELECT DISTINCT name
FROM FieldTest
WHERE ID = '7b9bbe44d45d8f2ac324849a4951da54'
-- just make the white space check part of the regex
AND REGEXP_LIKE(VALUE, '^\s*\d+(\.\d*)?\s*$')
AND timestamp >= TIMESTAMP('2017-12-19')
AND timestamp < TIMESTAMP('2017-12-24')),
-- I'm assuming the (id, name) tuple is unique, which means we don't need to repeat the regex later
Data (rangeStart, name, averaged) AS (SELECT Range.rangeStart, Header.names, COALESCE(AVG(FieldTest.value), -9999)
FROM Range
CROSS JOIN Header
LEFT JOIN FieldTest
ON FieldTest.id = '7b9bbe44d45d8f2ac324849a4951da54'
AND FieldTest.name = Header.names
AND FieldTest.timestamp >= Range.rangeStart
AND FieldTest.timestamp < Range.rangeEnd
GROUP BY Range.rangeStart, Header.names)
-- I can't recall if DB2 allows using the new column name this way, you may need to wrap this again
SELECT rangeStart,
-- converts the high amount of rows to less rows with delimited strings:
LISTAGG(name, ';') WITHIN GROUP (ORDER BY name) AS names,
LISTAGG(averaged, ';') WITHIN GROUP (ORDER BY name) AS avgvalues
FROM Data
GROUP BY rangeStart
ORDER BY rangeStart, names
(not tested)
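For the indices mentioned above, one possible starting point (hypothetical index name, standard Db2 CREATE INDEX syntax) is a composite index covering the filter and join columns the query actually touches:
-- id is an equality filter, name an equality join, timestamp a range condition:
CREATE INDEX fieldtest_id_name_ts ON FieldTest (id, name, timestamp);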
The CROSS JOIN was definitely a nice hint. I was not able to implement the following LEFT JOIN quite as you suggested, but I found a workaround which, I am sure, still leaves room for improvement, but is acceptable for me for now (a time saving of about a factor of 30 compared to my first query). Here is the actual code:
WITH
-- generate a range of times from date to date in defined steps:
TimeRange(rangeStart, rangeEnd) AS (SELECT a, a + 1 MINUTE
FROM (VALUES(TIMESTAMP('2017-12-19'))) D(a)
UNION ALL
SELECT rangeEnd, rangeEnd + 1 MINUTE
FROM TimeRange
WHERE rangeEnd < TIMESTAMP('2017-12-24')),
-- get all distinct names in table:
Header(names) AS (SELECT DISTINCT name
FROM FIELDTEST
WHERE ID = '7b9bbe44d45d8f2ac324849a4951da54'
AND REGEXP_LIKE(VALUE, '^\s*\d+(\.\d*)?\s*$')
AND timestamp >= TIMESTAMP('2017-12-19')
AND timestamp < TIMESTAMP('2017-12-24')),
-- select data (names, values without stringvalues) from table dedicated by timestamp to bigger timeinterval (here minutes):
rawData(time, names, values) AS (SELECT date_trunc('minute', TIMESTAMP), NAME, VALUE
FROM FIELDTEST
WHERE ID = '7b9bbe44d45d8f2ac324849a4951da54'
AND REGEXP_LIKE(VALUE, '^\s*\d+(\.\d*)?\s*$')),
-- I'm assuming the (id, name) tuple is unique, which means we don't need to repeat the regex later
Data(rangeStart, name, averaged) AS (SELECT TimeRange.rangeStart, Header.names, COALESCE(AVG(rawData.values), -9999)
FROM TimeRange
CROSS JOIN Header
LEFT JOIN rawData
ON rawData.names = Header.names
AND rawData.time = TimeRange.rangeStart
GROUP BY TimeRange.rangeStart, Header.names),
test(time, names, avgvalues) AS (SELECT Data.rangeStart,
LISTAGG(Data.name,';') WITHIN GROUP(ORDER BY name),
LISTAGG(Data.averaged,';') WITHIN GROUP(ORDER BY name)
FROM Data
GROUP BY Data.rangeStart)
-- build my own delimited export-string:
SELECT CONCAT(CONCAT(SUBSTR(REPLACE(time,'.',':'),1,19),';'), REPLACE(CAST(avgvalues AS VARCHAR(3980)),'-9999',''))
FROM test
UNION ALL
SELECT CONCAT(CAST('TIME;' AS VARCHAR(5)), CAST(LISTAGG(names,';') WITHIN GROUP(ORDER BY names) AS VARCHAR(3980)))
FROM Header;

ActiveRecord query that groups by ID and sums rows

My PostgreSQL DB has this structure:
TABLE Orders
id (string)
userId (string)
prodId (string)
value (integer)
This is an example of data:
id userId prodId value
1 a#a.aaa prod1 5
2 b#b.bbb prod1 -1
3 a#a.aaa prod1 -4
4 a#a.aaa prod2 9
I want to write an ActiveRecord query that sums all the values for a specific userId, so the query for a#a.aaa would return a list like this:
prod1 1
prod2 9
My first approach is this one, but it doesn't work:
orderList = Orders.select("SUM(orders.value) AS num_prods").where(:userId => HERE_USER_ID).group(:prodId)
EDIT: rephrased thanks to feedback
Order.where(userId: id).group(:prodId).sum(:value) # replace `id` with the user id you want
This should give you a hash, like so:
{1=>10, 2=>20, 5=>20}
the keys 1,2,5 represent the product id, and the values 10,20,20 represent the sum values.
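If you want the prodId/total pairs printed as lines (as in the question) rather than returned as a hash, a minimal sketch using the sample userId from the question:
# group/sum returns a hash of { prodId => summed value }
sums = Order.where(userId: "a#a.aaa").group(:prodId).sum(:value)
sums.each { |prod_id, total| puts "#{prod_id} #{total}" }
# prod1 1
# prod2 9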

Core data fetch request with GROUP BY

I have a table in my application which consists of names, phone numbers, orderIDs, and dates (4 columns).
I want to get an array of all distinct phone numbers (note that someone with a given phone number may have several orderIDs).
Test case: suppose these are the current records of my table.
Name phoneNumber orderID date
John 1234 101 2014-12-12
Susan 9876 102 2014-12-08
John 1234 103 2014-12-17
I want an array of distinct phone numbers only, something like: {1234, 9876}
How can I perform such a fetch in Core Data?
Any help would be much appreciated. Thank you.
P.S.: I know that in SQL I could do something like:
SELECT phoneNumber FROM orders
GROUP BY phoneNumber
You can use the DISTINCT keyword, so your query becomes:
SELECT DISTINCT phoneNumber FROM orders
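On the Core Data side, the counterpart of that DISTINCT is a dictionary-result fetch with returnsDistinctResults; a minimal sketch, assuming the entity is named Order with a phoneNumber string attribute and that context is your NSManagedObjectContext:
import CoreData

// Fetch only the phoneNumber attribute, as dictionaries, with duplicates collapsed.
let request = NSFetchRequest<NSDictionary>(entityName: "Order")
request.resultType = .dictionaryResultType   // required for returnsDistinctResults
request.propertiesToFetch = ["phoneNumber"]  // fetch just this one attribute
request.returnsDistinctResults = true        // the DISTINCT part

let rows = try context.fetch(request)
let phoneNumbers = rows.compactMap { $0["phoneNumber"] as? String }
// phoneNumbers is ["1234", "9876"] for the sample records above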
