Many-to-many SELECT divided into multiple rows for some reason - join

I have two tables joined via third in a many-to-many relationship. To simplify:
Table A
  ID-A (int)
  Name (varchar)
  Score (numeric)
Table B
  ID-B (int)
  Name (varchar)
Table AB
  ID-AB (int)
  A (foreign key to ID-A)
  B (foreign key to ID-B)
What I want is to display the B name and the sum of the "Score" values of all the As belonging to the given B. However, I wrote the following query:
WITH "Data" AS(
SELECT "B."."Name" As "BName", "A"."Name", "Score"
FROM "AB"
LEFT OUTER JOIN "A" ON "AB"."A" = "A"."ID-A"
LEFT OUTER JOIN "B" ON "AB"."B" = "B"."ID-B")
SELECT "BName", SUM("Score") AS "Total"
FROM "Data"
GROUP BY "Name", "Score"
ORDER BY "Total" DESC
The results show several rows for every "BName", with the "Score" split into seemingly random increments across these rows. For example, if the desired result for Johnny is 12 and for April it's 25, the query may show something like:
Johnny | 7
Johnny | 3
Johnny | 2
April | 19
April | 5
April | 1
etc.
Even after nesting the query and wrapping it in another SELECT with SUM("Score"), the results are the same. What am I doing wrong?

Remove Score from the GROUP BY clause:
SELECT BName, SUM(Score) AS Total
FROM Data
GROUP BY BName
ORDER BY Total DESC;
The purpose of your query is to summarize by name, so name alone should appear in the GROUP BY clause. By also including the score, you will get a record in the output for each unique name/score combination.
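Applied to your CTE, the corrected query would look something like this (a sketch against the table definitions above; the A-side name can also be dropped from the CTE, since nothing groups on it anymore):
WITH "Data" AS (
SELECT "B"."Name" AS "BName", "Score"
FROM "AB"
LEFT OUTER JOIN "A" ON "AB"."A" = "A"."ID-A"
LEFT OUTER JOIN "B" ON "AB"."B" = "B"."ID-B")
SELECT "BName", SUM("Score") AS "Total"
FROM "Data"
GROUP BY "BName"
ORDER BY "Total" DESC;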

Okay, I figured out my problem. Indeed, I had to GROUP BY "Name" only, but I thought Firebird wasn't letting me do that. Turns out it was just a typo. Oops.

Related

Kusto join two tables based on latest available date

Is there a way to join two tables in Kusto, joining values based on the latest available date from the second table?
Let's say we get the distinct names from the first table, and want to join values from the second table based on the latest available dates.
I would also only keep matches from the left column.
table1
names
-----
Alex
John
Mary
table2
name  weight  date
----  ------  ----------
Alex  160     2023-01-20
Alex  168     2023-01-28
Mary  120     2022-08-28
Mary  140     2020-09-17
Sample code:
table1
| distinct names
| join kind=inner table2 on $left.names == $right.name
let table1 = datatable(names:string)
[
"Alex"
,"John"
,"Mary"
];
let table2 = datatable(name:string, weight:real ,["date"]:datetime)
[
"Alex" ,160 ,datetime(2023-01-20)
,"Alex" ,168 ,datetime(2023-01-28)
,"Mary" ,120 ,datetime(2022-08-28)
,"Mary" ,140 ,datetime(2020-09-17)
];
table1
| distinct names
| join kind=inner (table2 | summarize arg_max(['date'], *) by name) on $left.names==$right.name
The inner summarize arg_max(['date'], *) by name keeps, for each name, the row with the latest date; the inner join then drops John, who has no rows in table2. Output:
names  name  date                  weight
-----  ----  --------------------  ------
Mary   Mary  2022-08-28T00:00:00Z  120
Alex   Alex  2023-01-28T00:00:00Z  168

In Rails- how do you get the distinct rows from X column and from that sum Y column

I have a model-view-controller setup in my Rails app which pulls user-entered data from the view and places it into a database table. The table, named "table", looks like this:
__X__|__Y__|__Created_At__
A | 1 | 2021-01-02
B | 5 | 2021-01-02
C | 3 | 2021-01-02
A | 4 | 2021-01-01
What I need is a function that finds the unique values in the X column of the table (i.e. a single A, B, C, ...), keeping for each value the most recently entered row (so for A that would be the first row above, since it was created most recently), and then sums the Y values of those rows.
This is what I was trying but it didn't work:
Table.distinct(:x).sum(:y)
The issue is that this seems to just sum all the values in Y, instead of disregarding the bottom duplicate A row.
If it makes any difference, I'm using Rails 6.0.0.
I'm not sure I'm understanding this perfectly: you have a Table model with an X:string column. You want to get all instances of Table
Table.all
and return them as an array
Table.all.to_a
You then want to sort them by created_at, most recent first,
Table.all.to_a.sort_by {|t| t.created_at }.reverse
and keep only the first (most recent) row per X. This should do everything so far:
Table.all.to_a.sort_by {|t| t.created_at }.reverse.uniq {|t| t.x }
(uniq needs the block here; without it, it compares whole records and removes nothing)
What do you mean by "take these unique valued rows and pull the Y values and sum them together"? Sum all the Y values? Or sum the Xs and the Ys? Are the X integers to be added?
If you're using Postgres, it supports DISTINCT ON, which makes it relatively easy to do this directly in the database. That would look like:
SELECT SUM(distinct_on_x.Y)
FROM (
  SELECT DISTINCT ON (X) Y
  FROM "table"  -- TABLE is a reserved word, so the name needs quoting
  ORDER BY X, created_at DESC
) AS distinct_on_x
You can execute this directly with select_values, something like:
select_distinct_count_sql = <<-SQL
  SELECT SUM(distinct_on_x.Y)
  FROM (
    SELECT DISTINCT ON (X) Y
    FROM "table"
    ORDER BY X, created_at DESC
  ) AS distinct_on_x
SQL
Table.connection.select_values(select_distinct_count_sql)
If you just want the distinct rows so you can sum them in memory you can do:
Table.select("DISTINCT ON (X) *").order("X, created_at")

complex db2/sql query with time-sampling, group, map, join and csv export

I have data in a table (named TESTING) in a dashDB2 on IBM Bluemix (Db2 Warehouse on Cloud) which looks like this:
ID   TIMESTAMP                NAME   VALUE
abc  2017-12-21 19:55:38.762  test1  123
abc  2017-12-21 19:55:42.762  test2  456
abc  2017-12-21 19:57:38.762  test1  789
abc  2017-12-21 19:58:38.762  test3  345
def  2017-12-21 19:59:38.762  test1  678
I am looking for a query that:
1. samples the data (for each NAME) to a given time format (e.g. to 1-minute timestamps);
2. averages the VALUEs that fall into the same time range (the same minute); empty times should be NULL.
For 1. and 2., something like this (working only for one NAME):
with dummy(temporaer) as (
select TIMESTAMP('2017-12-01') from SYSIBM.SYSDUMMY1
union all
select temporaer + 1 MINUTES from dummy where temporaer < TIMESTAMP('2018-02-01')
)
select temporaer, avg(VALUE) as test1 from dummy
LEFT OUTER JOIN TESTING ON temporaer=date_trunc('minute', TIMESTAMP) and ID='abc' and NAME='test1'
group by temporaer
ORDER BY temporaer ASC;
3. joins all the different NAMEs column-wise into a matrix, like:
TIMESTAMP test1 test2 test3
2017-12-01 00:00:00 null null null
...
2017-12-21 19:55:00 123 456 null
2017-12-21 19:56:00 null null null
2017-12-21 19:57:00 789 null null
2017-12-21 19:58:00 678 null 345
...
2018-01-31 23:59:00 null null null
4. exports the query result as a CSV, or returns it as a CSV string.
Does anybody know how this could be done in one query, or in a simple and fast way? Or is it necessary to store the data in another table format? Can you give me a hint?
Here is a code snippet that does the job, but takes a very long time:
WITH
-- get all distinct names in table:
header(names) AS (SELECT DISTINCT name
FROM FIELDTEST
WHERE ID='7b9bbe44d45d8f2ac324849a4951da54' AND REGEXP_LIKE(trim(VALUE),'^\d+(\.\d*)?$') AND DATE(TIMESTAMP)>='2017-12-19' AND DATE(TIMESTAMP)<'2017-12-24'),
-- select data (names, values without stringvalues) from table dedicated by timestamp to bigger timeinterval (here minutes):
dummie(time, names, values) AS (SELECT date_trunc('minute', TIMESTAMP), NAME, VALUE
FROM FIELDTEST
WHERE ID='7b9bbe44d45d8f2ac324849a4951da54' AND REGEXP_LIKE(trim(VALUE),'^\d+(\.\d*)?$')),
-- generate a range of times from date to date in defined steps:
dummy(time, rangeEnd) AS (SELECT a, a + 1 MINUTE
FROM (VALUES(TIMESTAMP('2017-12-19'))) D(a)
UNION ALL
SELECT rangeEnd, rangeEnd + 1 MINUTE
FROM dummy
WHERE rangeEnd < TIMESTAMP('2017-12-24')),
-- add each name (from header) to each time/row (in dummy):
dumpy(time, names) AS (SELECT Dummy.time, Header.names
FROM Dummy
LEFT OUTER JOIN Header
ON Dummy.time IS NOT NULL),
-- averages values by name and timeinterval and sorts result to dummy:
dummj(time, names, avgvalues) AS (SELECT Dummy.time, Dummie.names, AVG(Dummie.values)
FROM Dummy
LEFT OUTER JOIN Dummie
ON Dummie.time = Dummy.time
GROUP BY Dummie.names, Dummy.time),
-- joins the averages (by time, name) values to the times and names in dumpy (on empty value use -9999):
testo(time, names, avgvalues) AS (SELECT Dumpy.time, Dumpy.names, COALESCE(Dummj.avgvalues,-9999)
FROM Dumpy
LEFT OUTER JOIN Dummj
ON Dummj.time = Dumpy.time AND Dummj.names = Dumpy.names),
-- converts the high amount of rows to less rows with delimited strings:
test(time, names, avgvalues) AS (SELECT time, LISTAGG(names,';') WITHIN GROUP(ORDER BY names), LISTAGG(avgvalues,';') WITHIN GROUP(ORDER BY names)
FROM Testo
GROUP BY time)
SELECT * FROM test ORDER BY time ASC, names ASC;
The performance problem is in the "testo" subquery. Does anybody have an idea what the failure is here, or how to improve the query?
Well, one problem I see is that you keep using functions on columns, but that shouldn't be too big a drain if id is reasonably unique. If this query is very common, it may also be worth it to permanently build and index the range table. Hmm, you probably need several indices (starting with FieldTest.id), but you might also try this version:
-- let's name things properly, too, to keep them straight.
WITH
-- generate a range of times from date to date in defined steps:
Range (rangeStart, rangeEnd) AS (SELECT a, a + 1 MINUTE
FROM (VALUES(TIMESTAMP('2017-12-19'))) D(a)
UNION ALL
SELECT rangeEnd, rangeEnd + 1 MINUTE
FROM Range
WHERE rangeEnd < TIMESTAMP('2017-12-24')),
-- get all distinct names in table:
Header(names) AS (SELECT DISTINCT name
FROM FieldTest
WHERE ID = '7b9bbe44d45d8f2ac324849a4951da54'
-- just make the white space check part of the regex
AND REGEXP_LIKE(VALUE, '^\s*\d+(\.\d*)?\s*$')
AND timestamp >= TIMESTAMP('2017-12-19')
AND timestamp < TIMESTAMP('2017-12-24')),
-- I'm assuming the (id, name) tuple is unique, which means we don't need to repeat the regex later
Data (rangeStart, name, averaged) AS (SELECT Range.rangeStart, Header.names, COALESCE(AVG(FieldTest.value), -9999)
FROM Range
CROSS JOIN Header
LEFT JOIN FieldTest
ON FieldTest.id = '7b9bbe44d45d8f2ac324849a4951da54'
AND FieldTest.name = Header.names
AND FieldTest.timestamp >= Range.rangeStart
AND FieldTest.timestamp < Range.rangeEnd
GROUP BY Range.rangeStart, Header.names)
-- converts the high amount of rows to fewer rows with delimited strings.
-- I can't recall if DB2 allows using the new column names this way, you may need to wrap this again:
SELECT rangeStart,
       LISTAGG(name, ';') WITHIN GROUP (ORDER BY name) AS names,
       LISTAGG(averaged, ';') WITHIN GROUP (ORDER BY name) AS avgvalues
FROM Data
GROUP BY rangeStart
ORDER BY rangeStart, names
(not tested)
The CROSS JOIN was definitely a nice hint. However, I was not able to implement the LEFT JOIN the way you suggested, so I found a workaround which, I am sure, still leaves room for improvement, but is acceptable for me at the moment (about a factor of 30 faster than my first query). Here is the actual code:
WITH
-- generate a range of times from date to date in defined steps:
TimeRange(rangeStart, rangeEnd) AS (SELECT a, a + 1 MINUTE
FROM (VALUES(TIMESTAMP('2017-12-19'))) D(a)
UNION ALL
SELECT rangeEnd, rangeEnd + 1 MINUTE
FROM TimeRange
WHERE rangeEnd < TIMESTAMP('2017-12-24')),
-- get all distinct names in table:
Header(names) AS (SELECT DISTINCT name
FROM FIELDTEST
WHERE ID = '7b9bbe44d45d8f2ac324849a4951da54'
AND REGEXP_LIKE(VALUE, '^\s*\d+(\.\d*)?\s*$')
AND timestamp >= TIMESTAMP('2017-12-19')
AND timestamp < TIMESTAMP('2017-12-24')),
-- select data (names, values without stringvalues) from table dedicated by timestamp to bigger timeinterval (here minutes):
rawData(time, names, values) AS (SELECT date_trunc('minute', TIMESTAMP), NAME, VALUE
FROM FIELDTEST
WHERE ID = '7b9bbe44d45d8f2ac324849a4951da54'
AND REGEXP_LIKE(VALUE, '^\s*\d+(\.\d*)?\s*$')),
-- I'm assuming the (id, name) tuple is unique, which means we don't need to repeat the regex later
Data(rangeStart, name, averaged) AS (SELECT TimeRange.rangeStart, Header.names, COALESCE(AVG(rawData.values), -9999)
FROM TimeRange
CROSS JOIN Header
LEFT JOIN rawData
ON rawData.names = Header.names
AND rawData.time = TimeRange.rangeStart
GROUP BY TimeRange.rangeStart, Header.names),
test(time, names, avgvalues) AS (SELECT Data.rangeStart,
LISTAGG(Data.name,';') WITHIN GROUP(ORDER BY name),
LISTAGG(Data.averaged,';') WITHIN GROUP(ORDER BY name)
FROM Data
GROUP BY Data.rangeStart)
-- build my own delimited export-string:
SELECT CONCAT(CONCAT(SUBSTR(REPLACE(time,'.',':'),1,19),';'), REPLACE(CAST(avgvalues AS VARCHAR(3980)),'-9999',''))
FROM test
UNION ALL
SELECT CONCAT(CAST('TIME;' AS VARCHAR(5)), CAST(LISTAGG(names,';') WITHIN GROUP(ORDER BY names) AS VARCHAR(3980)))
FROM Header;
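As for item 4 (CSV export): if your environment lets the database write server-side files, Db2's EXPORT utility can be called through the ADMIN_CMD procedure. A minimal sketch (the output path and the inner SELECT are placeholders):
CALL SYSPROC.ADMIN_CMD(
  'EXPORT TO /tmp/result.csv OF DEL MODIFIED BY NOCHARDEL
   SELECT * FROM FIELDTEST');
OF DEL writes comma-delimited text, and NOCHARDEL drops the quotation marks around character values. On Db2 Warehouse on Cloud, where direct file access is restricted, fetching the delimited string as above (or exporting from the web console) may be the practical route.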

How to join two tables and return one row in PL/SQL?

I have two tables, named FIRST and USERS. In the FIRST table I have USERID1, USERID2 and USERID3 columns, and in the USERS table I store each user id's name and surname. If I join the tables:
SELECT f.userid1, f.userid2, f.userid3, u.name, u.surname
FROM users u, first f
WHERE f.userid1=u.userid AND f.userid2=u.userid AND f.userid3=u.userid
I want the result returned in one row, like:
121314 name surname 131415 name surname 141516 name surname
Example row from the FIRST table:
no  no2        userid1  userid2  userid3  ...
7   100000545  121314   131415   141516   ...
USERS table:
id name surname
121314 black smoke
131415 jack shephard
141516 john locke
I want the result to look like the USERS table, but I have to join because the result must follow the records in the FIRST table.
Generally, to retrieve the result in one row, you will need to use an aggregate function. You might try adapting the following query to your requirements:
WITH first AS
(
SELECT
7 AS no
,100000545 AS no2
,121314 AS userid1
,131415 AS userid2
,141516 AS userid3
FROM
dual
)
, users AS
(
SELECT
DECODE(LEVEL
,1,121314
,2,131415
,3,141516
) AS id
,DECODE(LEVEL
,1,'black'
,2,'jack'
,3,'john'
) AS name
,DECODE(LEVEL
,1,'smoke'
,2,'shephard'
,3,'locke'
) AS surname
FROM
dual
CONNECT BY LEVEL < 4
)
SELECT
TRIM(XMLAGG(XMLELEMENT(e, s.id||' '||s.name||' '||s.surname||' ')).EXTRACT('//text()').getClobVal()) AS one_row_result
FROM
first f
,users s
WHERE
s.id IN (f.userid1, f.userid2, f.userid3)
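If the concatenated result fits in a VARCHAR2 (the XMLAGG/getClobVal combination above is mainly a way around that 4000-byte limit), a plain LISTAGG is simpler. A sketch against the same sample data:
SELECT
  LISTAGG(s.id || ' ' || s.name || ' ' || s.surname, ' ')
    WITHIN GROUP (ORDER BY s.id) AS one_row_result
FROM
  first f
  ,users s
WHERE
  s.id IN (f.userid1, f.userid2, f.userid3)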
This is not necessarily about PL/SQL, but SQL in general.
In your query, "u.userid" will always be a field from a single row, so you won't get any result unless f.userid1, f.userid2 and f.userid3 all hold the same value. What I think you want is to get three users whose ids are stored in the FIRST table, so you have to reference the USERS table three times.
To get a row as in your example, a query would look like:
SELECT f.userid1, u_first.name , u_first.surname,
f.userid2, u_second.name, u_second.surname,
f.userid3, u_third.name, u_third.surname
FROM users u_first, users u_second, users u_third, first f
WHERE f.userid1=u_first.userid
AND f.userid2=u_second.userid
AND f.userid3=u_third.userid
Someone will surely provide a correct join here.
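In ANSI join syntax, that would look something like this (untested sketch):
SELECT f.userid1, u_first.name, u_first.surname,
       f.userid2, u_second.name, u_second.surname,
       f.userid3, u_third.name, u_third.surname
FROM first f
JOIN users u_first ON u_first.userid = f.userid1
JOIN users u_second ON u_second.userid = f.userid2
JOIN users u_third ON u_third.userid = f.userid3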
Edit
For your example data this select will result in
121314 black smoke 131415 jack shephard 141516 john locke
Note, however, that you don't specify which row of the FIRST table you want this join for, so you will get a result row for each row of the FIRST table whose users exist in the USERS table (I hope you have constraints set properly).
If you want to fetch only one row, for one specific FIRST row, add
AND f.no=7
at the end of the query
If the output you wish to receive is:
121314 black smoke
131415 jack shephard
141516 john locke
try:
SELECT u.userid, u.name, u.surname
FROM users u, first f
WHERE (f.userid1 = u.userid
    OR f.userid2 = u.userid
    OR f.userid3 = u.userid)
  AND f.no = 7
Note the parentheses around the ORs: without them, AND binds tighter than OR, so the f.no filter would apply only to the last condition.
This is a good way to test whether all users with those IDs actually exist in the database.
But remember that the order of the result will match the order of rows in the USERS table, so it won't necessarily be userid1, then userid2, then userid3.

Hive outer join: how to change the default NULL value

For a Hive outer join, if a joining key does not exist in one table, Hive will put NULL. Is it possible to use another value instead? For example:
Table1:
user_id, name, age
1 Bob 23
2 Jim 43
Table2:
user_id, txn_amt, date
1 20.00 2013-12-10
1 10.00 2014-07-01
If I do a LEFT OUTER JOIN on user_id:
INSERT INTO TABLE user_txn
SELECT
Table1.user_id,
Table1.name,
Table2.txn_amt,
Table2.date
FROM
Table1
LEFT OUTER JOIN
Table2
ON
Table1.user_id = Table2.user_id;
I want the output be like this:
user_id, name, txn_amt, date
1 Bob 20.00 2013-12-10
1 Bob 10.00 2014-07-01
2 Jim 0.00 2099-12-31
Note the txn_amt and date columns for Jim. Is there any way in hive to define default values like this?
You can use COALESCE for this; instead of just Table2.txn_amt, select:
COALESCE(Table2.txn_amt, 0.0)
COALESCE returns the first value in its argument list that is not null. So if txn_amt is null, it moves on to the second value in the list; 0.0 is never null, so it picks that. If txn_amt has a value, that value is returned.
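Applied to the whole query, with the date defaulted the same way (a sketch; the string literal assumes the date column is a string or that Hive will cast it, otherwise wrap it in CAST):
INSERT INTO TABLE user_txn
SELECT
Table1.user_id,
Table1.name,
COALESCE(Table2.txn_amt, 0.0),
COALESCE(Table2.date, '2099-12-31')
FROM
Table1
LEFT OUTER JOIN
Table2
ON
Table1.user_id = Table2.user_id;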
