I have one table (Table1) with some info and a string ID.
I have another table (Table2) with some more info and a similar string ID (it is missing an extra char in the middle).
I was originally joining the tables on
t2.StringID = substring(t1.StringID,0,2)+substring(t1.StringID,4,7)
But that was too slow, so I decided to create a new column on Table1 which is already mapped to the PrimaryID of Table2, and then index that col.
So, to update that new column I do this:
select distinct PrimaryID,
substring(t2.StringID,0,2)+
substring(t2.StringID,4,7) as StringIDFixed
into #temp
from Table2 t2
update Table1 t1
set t1.T2PrimaryID = isnull(t.PrimaryID, 0)
from Table1 t1, #temp t
where t1.StringID = t.StringIDFixed
and t1.T2PrimaryID is null
It creates the temp table in a few seconds, but the update has been running for 25 minutes now, and I don't know if it will ever finish.
Table 1 has 45MM rows, Table 2 has 1.5MM
I know that's a chunky amount of data, but still, I feel like this shouldn't be that hard.
It's Sybase IQ 12.7
Any ideas?
Thanks.
I created an index on the temp table, which took a few seconds, and then reran the same update, which took only 7 seconds.
create index idx_temp_temp on #temp (StringIDFixed)
I hate Sybase.
select distinct isnull(t2.PrimaryID, 0),
substring(t2.StringID,0,2)+
substring(t2.StringID,4,7) as StringIDFixed
into #temp
from Table2 t2
create HG index idx_temp_temp_HG on #temp (StringIDFixed)
or
create LF index idx_temp_temp_LF on #temp (StringIDFixed)
-- check whether Table1 already has an HG or LF index on StringID; if not, create one
update Table1 t1
set t1.T2PrimaryID = t.PrimaryID
from Table1 t1, #temp t
where t1.StringID = t.StringIDFixed
-- add this filter only if it is necessary:
-- and t1.T2PrimaryID is null
Consider replacing your update with an inner join to avoid the isnull() function on a big dataset.
update Table1
set a.T2PrimaryID = b.PrimaryID
from Table1 a
inner join #temp b
on a.StringID = b.StringIDFixed
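The approach in this thread (materialize the derived key once, index it, then update by plain equality) can be sketched in a small runnable form. This is only an illustration under assumptions: SQLite (through Python's `sqlite3`) stands in for Sybase IQ, the sample IDs and the table name `temp_fixed` are made up, `substr` is 1-based in SQLite so the offsets are chosen for the sample data, and a correlated subquery replaces the `update ... from` join because older SQLite builds lack that syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table Table1 (StringID text, T2PrimaryID integer);
    create table Table2 (PrimaryID integer primary key, StringID text);

    -- Made-up sample data: Table2 IDs carry an extra third character here.
    insert into Table1 (StringID) values ('ABCDEFGHI'), ('ZZ1234567'), ('NOMATCH00');
    insert into Table2 values (1, 'ABXCDEFGHI'), (2, 'ZZX1234567');

    -- Step 1: compute the derived key once, instead of once per comparison.
    create table temp_fixed as
        select distinct PrimaryID,
               substr(StringID, 1, 2) || substr(StringID, 4, 7) as StringIDFixed
        from Table2;

    -- Step 2: index the derived key so the update can look matches up directly.
    create index idx_temp_fixed on temp_fixed (StringIDFixed);

    -- Step 3: fill the mapping column by equality on the indexed key.
    update Table1
    set T2PrimaryID = (select f.PrimaryID
                       from temp_fixed f
                       where f.StringIDFixed = Table1.StringID)
    where T2PrimaryID is null;
""")

rows = conn.execute(
    "select StringID, T2PrimaryID from Table1 order by StringID"
).fetchall()
print(rows)
```

Rows without a match keep a null `T2PrimaryID` in this sketch; wrapping the subquery in `coalesce(..., 0)` would mimic the `isnull(..., 0)` from the question.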
I'm trying to left join two stored procedures in a Firebird query.
In my example data the first returns 70 records, the second just 1 record.
select
--...
from MYSP1('ABC', 123) s1
left join MYSP2('DEF', 456) s2
on s1.FIELDA = s2.FIELDA
and s1.FIELDB = s2.FIELDB
The problem is performance: the query takes 10 seconds, while each procedure takes less than 1 second on its own. I suspect that the procedures are run multiple times instead of just once. It would make sense to execute them just once, because I pass fixed parameters to them.
Is there a way to force Firebird to execute each procedure once and then join their results?
Since it seems there is no way, I solved this issue running this query inside a new stored procedure, where I cache all results from MYSP2 into a global temporary table and make the join between MYSP1 and the temporary table.
This is the temporary table definition:
create global temporary table MY_TEMP_TABLE
(
FIELDA varchar(3) not null,
FIELDB smallint not null,
FIELDC varchar(10) not null
);
This is the stored procedure body:
--cache MYSP2 results
delete from MY_TEMP_TABLE;
insert into MY_TEMP_TABLE
select *
from MYSP2('DEF', 456)
;
--join data
for
select
--...
from MYSP1('ABC', 123) s1
left join MY_TEMP_TABLE s2
on s1.FIELDA = s2.FIELDA
and s1.FIELDB = s2.FIELDB
into
--...
do
suspend;
But if there is another solution without temporary tables it would be great!
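The idea behind the workaround (run each procedure exactly once, cache the smaller result, then join against the cache) can be shown outside the database. The sketch below is a Python analogy, not Firebird code: `mysp1`, `mysp2`, the `calls` counter and the sample rows are hypothetical stand-ins for the stored procedures and their output.

```python
from collections import defaultdict

calls = {"MYSP1": 0, "MYSP2": 0}  # counts how often each "procedure" runs


def mysp1(code, number):
    """Hypothetical stand-in for MYSP1: yields (FIELDA, FIELDB, payload)."""
    calls["MYSP1"] += 1
    for i in range(3):
        yield ("A%d" % i, i, "left-%d" % i)


def mysp2(code, number):
    """Hypothetical stand-in for MYSP2: yields (FIELDA, FIELDB, FIELDC)."""
    calls["MYSP2"] += 1
    yield ("A1", 1, "right-1")


def left_join_cached(left_rows, right_rows):
    """Left join on (FIELDA, FIELDB), reading the right side only once."""
    cache = defaultdict(list)
    for fielda, fieldb, fieldc in right_rows:  # plays the role of MY_TEMP_TABLE
        cache[(fielda, fieldb)].append(fieldc)
    for fielda, fieldb, payload in left_rows:
        matches = cache.get((fielda, fieldb))
        if matches:
            for fieldc in matches:
                yield (fielda, fieldb, payload, fieldc)
        else:
            yield (fielda, fieldb, payload, None)  # unmatched left row is kept


result = list(left_join_cached(mysp1("ABC", 123), mysp2("DEF", 456)))
print(result, calls)
```

The cache here does the same job as the global temporary table: the right-hand result is produced once, and every left row probes the stored copy instead of triggering another execution.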
Maybe this can help:
with MYSP2W as (select * from MYSP2('DEF', 456))
select
--...
from MYSP1('ABC', 123) s1
left join MYSP2W s2
on s1.FIELDA = s2.FIELDA
and s1.FIELDB = s2.FIELDB
Sorry for the late response.
For a key in table A, there may be 2 or more records present in tables B and C. That is, another column in these tables holds a date value that makes the keys unique. I want to extract the record that has the maximum date value, and that's why I am using the max function. I know that the subquery I have coded should not be included in the ON clause and that it would do the filtering before the join. So ultimately I want to know how to express the max condition in the query.
Example:
Table A
Key - AAAAA
Table B:
Record 1
Key - AAAAA
Date - 2017-10-01
Record 2
Key - AAAAA
Date - 2017-10-05
I want only the record AAAAA/2017-10-05 to be selected from table B.
Basically records from table A where A.c3 = 'Y' should be extracted first (assume it gives 500 records)
Then join these 500 records with tables B and C (left outer, to have all the matching records and the non-matching records should have nulls in the columns from the tables B and C)
In tables B and C, if more than 1 record present with different dates, the maximum date field should be extracted.
Hence final output should contain 500 records.
This is all you need for what you describe
SELECT A.A1, A.A2, B.B1, B.B2, C.C1, C.C2
FROM TABLE1 A
LEFT OUTER JOIN TABLE2 B
ON A.A1 = B.B1
LEFT OUTER JOIN TABLE3 C
ON A.A1 = C.C1
WHERE A.C3 = 'Y'
These lines are causing your problem: they basically turn your outer joins into inner joins.
AND B.C3 = (SELECT MAX(B3) FROM TABLE2 T1
WHERE T1.B1 = B.B1)
AND C.C3 = (SELECT MAX(C3) FROM TABLE3 T1
WHERE T1.C1 = C.C1)
If there's no match in B or C , then B.C3 and/or C.C3 will be NULL and NULL can't be = to anything (or <> to anything for that matter)
What are you trying to accomplish with the above that you've not included in the question?
Just do it?
SELECT A.A1, A.A2, B.B1, B.B2, C.C1, C.C2
FROM TABLE1 A
LEFT OUTER JOIN TABLE2 B
ON A.A1 = B.B1
LEFT OUTER JOIN TABLE3 C
ON A.A1 = C.C1
WHERE A.C3 = 'Y' and (B.B1 is null or C.C1 is null)
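To make the difference concrete, here is a small sketch of the "latest-dated row per key, but keep every A row" requirement. It is an illustration under assumptions: SQLite (via Python's `sqlite3`) stands in for the asker's database, and the table and column names (`a`, `b`, `c`, `k`, `flag`, `dt`, `val`) and the rows are simplified, made-up versions of the ones in the question. The same max-date predicate is placed once in the WHERE clause and once in the ON clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table a (k text, flag text);
    create table b (k text, dt text, val text);
    create table c (k text, dt text, val text);
    insert into a values ('AAAAA', 'Y'), ('BBBBB', 'Y'), ('CCCCC', 'N');
    insert into b values ('AAAAA', '2017-10-01', 'old'), ('AAAAA', '2017-10-05', 'new');
    insert into c values ('BBBBB', '2017-09-30', 'only');
""")

# Max-date predicate in WHERE: a NULL from an unmatched outer join never
# equals anything, so rows lacking a match in b or c are filtered out.
in_where = conn.execute("""
    select a.k, b.dt, b.val, c.dt, c.val
    from a
    left join b on b.k = a.k
    left join c on c.k = a.k
    where a.flag = 'Y'
      and b.dt = (select max(b2.dt) from b b2 where b2.k = b.k)
      and c.dt = (select max(c2.dt) from c c2 where c2.k = c.k)
    order by a.k
""").fetchall()

# Same predicate in ON: it only decides which b/c row is attached,
# so every flagged row of a survives, padded with NULLs where needed.
in_on = conn.execute("""
    select a.k, b.dt, b.val, c.dt, c.val
    from a
    left join b on b.k = a.k
               and b.dt = (select max(b2.dt) from b b2 where b2.k = a.k)
    left join c on c.k = a.k
               and c.dt = (select max(c2.dt) from c c2 where c2.k = a.k)
    where a.flag = 'Y'
    order by a.k
""").fetchall()

print(in_where)
print(in_on)
```

With this sample data the WHERE variant returns nothing, while the ON variant returns one row per flagged key, each carrying only the most recent `b` and `c` record where one exists.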
I have the need to join a huge table (10 million plus rows) to a lookup table (15k plus rows) with an OR condition. Something like:
SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t2.d)
FROM table1 t1
JOIN table2 t2 ON t1.c = t2.c OR t1.d = t2.d;
This is because table1 can have c or d as NULL, and I'd like to join on whichever is available, leaving out the rest. The query plan says there is a Nested Loop, which I realize is because of the OR condition. Is there a clean, efficient way of solving this problem? I'm using Redshift.
EDIT: I am trying to run this with a UNION, but it doesn't seem to be any faster than before.
If you have a preferred column you can NVL() (aka COALESCE()) them and join on that.
SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t2.d)
FROM table1 t1
JOIN table2 t2
ON t1.c = NVL(t2.c,t2.d);
I'd also suggest that you should set the lookup table to DISTSTYLE ALL to ensure that the larger table is not redistributed.
[ Also, 10 million rows isn't big for Redshift. Not trying to be snotty just saying that we get excellent performance on Redshift even when querying (and joining) tables with hundreds of billions of rows. ]
How about doing two (left) joins? With the small lookup table performance shouldn't be too bad even.
SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t3.d)
FROM table1 t1
LEFT JOIN table2 t2 ON t1.d = t2.d and t1.c is null
LEFT JOIN table2 t3 ON t1.c = t3.c and t1.d is null
Your original query only returns rows that match at least one of c or d in the lookup table. If that's not guaranteed you may need to add filters...for example rows in t1 where both c and d are null or have values not present in table2.
Don't really need the null checks in the joins, but might be slightly faster.
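A quick way to check that the two-left-join rewrite returns the same rows as the OR join is to run both on a tiny data set. The sketch below assumes SQLite (via Python's `sqlite3`) in place of Redshift, uses `coalesce` where the thread uses `nvl`, and uses made-up rows in which exactly one of `c` and `d` is null. The extra `where` clause is the filter mentioned above that drops rows matching nothing in the lookup table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table table1 (a integer, b text, c text, d text);
    create table table2 (c text, d text);
    insert into table2 values ('c1', 'd1'), ('c2', 'd2');
    insert into table1 values
        (1, 'x', 'c1', null),   -- matches on c
        (2, 'y', null, 'd2'),   -- matches on d
        (3, 'z', 'c9', null);   -- matches nothing
""")

# Original form: a single join with an OR condition.
or_join = conn.execute("""
    select t1.a, t1.b, coalesce(t1.c, t2.c), coalesce(t1.d, t2.d)
    from table1 t1
    join table2 t2 on t1.c = t2.c or t1.d = t2.d
    order by t1.a
""").fetchall()

# Rewrite: one equality join per column, then keep rows that matched either.
two_left_joins = conn.execute("""
    select t1.a, t1.b, coalesce(t1.c, t2.c), coalesce(t1.d, t3.d)
    from table1 t1
    left join table2 t2 on t1.d = t2.d and t1.c is null
    left join table2 t3 on t1.c = t3.c and t1.d is null
    where t2.d is not null or t3.c is not null
    order by t1.a
""").fetchall()

print(or_join)
print(two_left_joins)
```

Each join condition in the rewrite is a plain equality, which is what lets a planner pick something other than a nested loop. Whether that happens on Redshift for a given table layout is worth confirming with the query plan.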
I use a MYSQL command as follows:
UPDATE TABLE1 JOIN TABLE2 USING (COLUMN1)
SET TABLE1.AMOUNT = TABLE1.AMOUNT * TABLE2.FACTOR
According to this JOIN, 3 rows should be returned from TABLE2 (say with factors 2, 3 and 4), but TABLE1.AMOUNT is only multiplied by the FACTOR in the first row and not by those in the 2nd and 3rd rows.
I expect to get the original AMOUNT x (2 x 3 x 4), but I get AMOUNT x 2.
How do I solve this? Thanks for your help.
An UPDATE statement only updates a given row once. You need to replace TABLE2 with a subquery that produces the right multiplier. Unfortunately, MySQL doesn't have any multiplicative counterpart to SUM for multiplying a group of values together, but if you can accept some extra roundoff error, I suppose you could write:
UPDATE table1
JOIN ( SELECT column1,
EXP(SUM(LN(table2.factor))) AS total_factor
FROM table2
GROUP
BY column1
) subquery2
USING (column1)
SET table1.amount = table1.amount * subquery2.total_factor
;
(using the fact that $\prod_k a_k = e^{\sum_k \ln a_k}$).
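The identity can be tried out end to end on a toy data set. The following is a sketch under assumptions rather than MySQL code: it uses SQLite through Python's `sqlite3`, registers `ln` and `exp` from Python's `math` module because not every SQLite build ships them, and expresses the update with a correlated subquery instead of MySQL's multi-table `UPDATE ... JOIN`. The sample amounts and factors are made up, and the trick only applies to strictly positive factors.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Register ln() and exp() explicitly so the sketch does not depend on
# SQLite's optional built-in math functions.
conn.create_function("ln", 1, math.log)
conn.create_function("exp", 1, math.exp)

conn.executescript("""
    create table table1 (column1 integer primary key, amount real);
    create table table2 (column1 integer, factor real);
    insert into table1 values (1, 10.0), (2, 5.0);
    insert into table2 values (1, 2.0), (1, 3.0), (1, 4.0), (2, 1.5);
""")

# Collapse all factors of a key into one multiplier: exp(sum(ln(f))) = product(f).
conn.execute("""
    update table1
    set amount = amount * (select exp(sum(ln(t2.factor)))
                           from table2 t2
                           where t2.column1 = table1.column1)
    where column1 in (select column1 from table2)
""")

amounts = dict(conn.execute("select column1, amount from table1"))
print(amounts)
```

Because the product goes through floating-point logarithms, the result can differ from the exact product in the last digits, which is the roundoff error mentioned above.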
I have a nightly job that runs and computes some data in hive. It is partitioned by day.
Fields:
id bigint
rank bigint
Yesterday
output/dt=2013-10-31
Today
output/dt=2013-11-01
I am trying to figure out if there is an easy way to get the incremental changes between today and yesterday.
I was thinking about doing a left outer join, but I'm not sure what that looks like since it's the same table.
This is what it might look like when there are different tables:
SELECT * FROM a LEFT OUTER JOIN b
ON (a.id=b.id AND a.dt='2013-11-01' and b.dt='2013-10-31' ) WHERE a.rank!=b.rank
But on the same table it is
SELECT * FROM a LEFT OUTER JOIN a
ON (a.id=a.id AND a.dt='2013-11-01' and a.dt='2013-10-31' ) WHERE a.rank!=a.rank
Suggestions?
This would work
SELECT a.*
FROM A a LEFT OUTER JOIN A b ON a.id = b.id
WHERE a.dt='2013-11-01' AND b.dt='2013-10-31' AND <your-rank-conditions>;
Efficiently, this would span 1 MapReduce job only.
So I figured it out... Using Subqueries and Joins
select * from (select * from table where dt='2013-11-01') a
FULL OUTER JOIN
(select * from table where dt='2013-10-31') b
on (a.id=b.id)
where a.rank!=b.rank or a.rank is null or b.rank is null
The above will give you the diff..
You can take the diff and figure out what you need to ADD/UPDATE/REMOVE
UPDATE if a.rank is not null and b.rank is not null, i.e. the rank changed
DELETE if a.rank is null and b.rank is not null, i.e. the user is no longer ranked
ADD if a.rank is not null and b.rank is null, i.e. this is a new user
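The three rules above can be sketched outside Hive to make the classification explicit. This is a plain-Python illustration, not HiveQL: each partition is modelled as a hypothetical `{id: rank}` dictionary, with `today` playing the role of `a` and `yesterday` the role of `b`. A missing key corresponds to the null side of the full outer join, and the function name `diff_ranks` and the sample ranks are made up.

```python
def diff_ranks(today, yesterday):
    """Classify ids as ADD, UPDATE or DELETE between two {id: rank} snapshots."""
    changes = {"ADD": [], "UPDATE": [], "DELETE": []}
    # The union of ids mirrors the row set produced by the full outer join.
    for id_ in sorted(today.keys() | yesterday.keys()):
        new_rank = today.get(id_)
        old_rank = yesterday.get(id_)
        if old_rank is None:
            changes["ADD"].append(id_)      # present today only
        elif new_rank is None:
            changes["DELETE"].append(id_)   # present yesterday only
        elif new_rank != old_rank:
            changes["UPDATE"].append(id_)   # present in both, rank changed
    return changes


yesterday = {1: 10, 2: 20, 3: 30}
today = {1: 10, 2: 25, 4: 40}
changes = diff_ranks(today, yesterday)
print(changes)
```

Ids whose rank is unchanged fall through all three branches, just as the `where` clause of the full outer join excludes them from the diff.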