Hive: Create big table from small tables - join

I have 1000 tables in Hive, all with the same columns, and they will be incrementally updated. The columns are ID, name, dno, loc, sal, …..
I want to create a big table by selecting only ID, name and sal from each table.
Table 1:
ID name dno loc sal ………
1 sam 201 HYD 2000 ………
Table 2:
ID name dno loc sal ………
2 Ram 203 BAN 3000 ………
Table 3:
ID name dno loc sal ………
3 Bam 301 NY 4000 ………
And so on….
Big table:
ID name sal
1 sam 2000
2 Ram 3000
3 Bam 4000
And so on
This is what I want to achieve.
Say a new record is inserted into Table 3 tomorrow, with ID 100, name Jack, ….
Table 3 with the new record:
ID name dno loc sal ………
3 Bam 301 NY 4000 ………
100 Jack 101 LA 5000 ……….
The new big table should be:
ID name sal
1 sam 2000
2 Ram 3000
3 Bam 4000
100 Jack 5000
This is what I want to achieve, without deleting the big table every time a new record is inserted into one of the original 1000 tables.

A slightly modified version of ravinder's answer.
Create your child external tables as below:
create external table table1 (
  column1 String,
  column2 String
)
row format delimited
fields terminated by ','
LOCATION '/user/cloudera/data_p/table_name=data1/';
Now your parent table will be created with a partition column, table_name:
create external table parent_table (
  column1 String,
  column2 String
)
partitioned by (table_name String)
row format delimited
fields terminated by ','
LOCATION '/user/cloudera/data_p/';
msck repair table parent_table;
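A quick hedged usage sketch, using the generic columns above: files added to an existing child directory become visible through the parent immediately; the repair only needs to be re-run when a brand-new table_name=... directory appears.
-- assumption: a new child directory /user/cloudera/data_p/table_name=data2/ has appeared
msck repair table parent_table;             -- registers the new partition
select column1, column2 from parent_table; -- one scan over all child tables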

If the ID is unique across all your tables, you can update your big table like this:
with q1 as (
  select * from T1 where ID not in (select ID from BIGTABLE)
  union
  select * from T2 where ID not in (select ID from BIGTABLE)
  .... and so on
)
from q1
insert into table BIGTABLE select *;
If the same ID can appear in different tables (for example, ID=1 in both T1 and T4), I would suggest adding an extra column to the big table to identify the record source (which table it came from), or partitioning the data (depending on the size of the data); a sketch of the extra-column idea follows. Post a comment if something doesn't make sense.
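For illustration, a hedged sketch of the extra-column idea (src_table is a hypothetical column added to BIGTABLE, and the table names are placeholders):
insert into table BIGTABLE
select t.*, 't4' as src_table
from T4 t
where t.ID not in (select b.ID from BIGTABLE b where b.src_table = 't4');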
Regards!
EDIT: As I said in the comment, if it is not possible to do a union over all the tables, I would recommend creating a partitioned table. The idea is to add a partition to this table pointing to the location of each of the other tables; that way you should be able to apply the logic I wrote before. This works if you are using the Avro/Parquet format with schema evolution.
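A hedged sketch of that partition idea (the src_table partition column and the path are illustrative assumptions, not your real names):
-- point one partition of the big table at another table's directory
alter table BIGTABLE add if not exists
partition (src_table='table3')
location '/user/hive/warehouse/table3';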

You can do this a couple of ways, but I like the one below if there are no constraints. Make sure all the small tables' locations are under one directory.
If the following does not work, I need more info (the naming criteria for the small tables, the row format of the tables). The other way is to write a script that reads the small table names from the metastore, builds a UNION over the small tables, and inserts the result into the big table.
Small tables r1 & r2 and big table dept:
create table r1
(dept int
,dept_name string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/apps/hive/warehouse/dept/r1';
create table r2
(dept int
,dept_name string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/apps/hive/warehouse/dept/r2';
[root@sandbox ~]# hadoop fs -ls -R /apps/hive/warehouse/dept
drwxrwxrwx - root hdfs 0 2017-01-25 17:43 /apps/hive/warehouse/dept/r1
-rwxrwxrwx 3 root hdfs 105 2017-01-25 17:43 /apps/hive/warehouse/dept/r1/000000_0
-rwxrwxrwx 3 root hdfs 0 2017-01-25 17:43 /apps/hive/warehouse/dept/r1/000001_0
drwxrwxrwx - root hdfs 0 2017-01-25 17:44 /apps/hive/warehouse/dept/r2
-rwxrwxrwx 3 root hdfs 105 2017-01-25 17:44 /apps/hive/warehouse/dept/r2/000000_0
-rwxrwxrwx 3 root hdfs 0 2017-01-25 17:44 /apps/hive/warehouse/dept/r2/000001_0
create external table dept
(dept int
,dept_name string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/apps/hive/warehouse/dept/';
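One caveat worth hedging: depending on your Hive version, reading the r1/ and r2/ subdirectories through dept may require recursive-input settings, for example:
set hive.mapred.supports.subdirectories=true;
set mapreduce.input.fileinputformat.input.dir.recursive=true;
select * from dept;  -- should now return the rows from both r1 and r2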

Related

Rolling sum of all values associated with the ids from two tables BigQuery

I have two tables in BigQuery. One table has date, id and name columns. For each name, there are a couple of ids associated with it. The data looks like this:
date id name
7/11 1 A
7/11 2 A
7/11 3 B
7/11 4 B
The other table has date, id, comments and shares columns. The id in this table comes without a name associated with it. The table looks like this:
date id comments shares
7/11 1 2 null
7/11 2 4 2
7/11 3 1 1
7/11 4 5 3
The end goal is to grab all the ids associated with a specific name (table 1) and sum up the comments and shares for that name, or rather for its list of ids (table 2).
The desired output would look like this:
date name comments shares
7/11 A 6 2
7/11 B 6 4
You need a join of the 2 tables and aggregation:
SELECT t1.date, t1.name,
       COALESCE(SUM(t2.comments), 0) AS comments,
       COALESCE(SUM(t2.shares), 0) AS shares
FROM table1 t1
LEFT JOIN table2 t2
  ON t2.date = t1.date AND t2.id = t1.id
GROUP BY t1.date, t1.name;
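To trace the arithmetic on the sample data: name A owns ids 1 and 2, so comments = 2 + 4 = 6 and shares = NULL + 2 = 2 (SUM ignores NULLs, and the COALESCE turns an all-NULL sum into 0 rather than NULL).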

Need help to update record with foreach loop in SSIS

I'm working on a data cleansing project where I execute a first stored procedure to get all the data which has issues and store it into a staging table (table1) with ID, IND_REF, CODE.
Table structure is:
ID | IND_REF|CODE
12 | 2333 |ABC
13 | 1222 |EFG
Now each code's associated IND_REF is the primary key of table2 and of the email table, where the data will be updated.
Next I wrote another stored procedure with an IF statement stating:
If code = ABC, then update the school email as the main email where emailtable_ID = staging table IND_REF.
Once it has updated all the rows of the email table by reference to the staging table IND_REF, I used another IF statement:
IF code = 'EFG' do that.... where table2_ID = staging table IND_REF...
and so on..
Basically I want to update the rows of the live table by referencing the CODE associated with each IND_REF.
Can I achieve this with an SSIS package? Can I loop through the staging table to update the live table? Any help would be much appreciated. I am new to the SQL world, so I find it difficult to loop through each record with a counter to update the live table; any help with a script would be very helpful.
I don't understand your issue but let me show you an example:
If we have a table like this:
TABLE1
ID ind_ref code
1 1 ABC
2 15 DEF
3 17 GHI
and a table like this:
TABLE2
ind_ref2 code
1 ZZZ
2 XXX
3 DDD
4 ZZZ
5 XXX
15 FFF
17 GGG
Then if we run this query:
UPDATE TABLE2
SET Code = TABLE1.Code
FROM TABLE1
WHERE TABLE1.ind_ref = TABLE2.ind_ref2;
Table 2 will end up like this:
TABLE2
ind_ref2 code
1 ABC <= I got updated
2 XXX
3 DDD
4 ZZZ
5 XXX
15 DEF <= me too
17 GHI <= and me
If this is not your data or your requirement, please take the time to lay out examples as I have: explain the data that you have and what you want it to look like.
Note: SSIS is not required here and neither is looping.
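If it helps, here is a hedged sketch of how the per-code rules could collapse into one set-based update; every table and column name below is a guess from the question's wording, not your real schema:
UPDATE e
SET e.main_email = CASE s.CODE
                     WHEN 'ABC' THEN e.school_email
                     ELSE e.main_email
                   END
FROM email_table AS e
JOIN staging_table1 AS s
  ON e.emailtable_ID = s.IND_REF;
A similar single statement against table2 would cover the 'EFG' rule, with no loop required.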

Delphi sort by sum of three fields - delphi

I have a database (*.mdb); the connection scheme I use in my program is:
TADOConnection -> TADOTable
The DB has a table named Table1, which is connected via the ADOTable. In Table1 there are fields A, B, C: floating-point values. I need to sort the table by the sum of these numbers.
For example:
Name A B C
------ --- --- ---
John 1 2 5
Nick 1 5 3
Qwert 1 5 2
Yuiop 2 3 1
I need to sort them so that the name whose A+B+C is biggest comes first.
Sorted variant:
Name A B C
------ --- --- ---
Nick 1 5 3
John 1 2 5
Qwert 1 5 2
Yuiop 2 3 1
How can I do this?
While writing this, I understood what to do: I need a calculated field in the table, equal to A+B+C, and I must sort the table using it.
I do not have MS Access, but with other database systems I would use SQL to achieve this.
There are several SO answers along these lines for MS Access (try Microsoft Access - grand total adding multiple fields together).
So start with something like this:
Select Name, (A+B+C) as total, A, B, C
from table1
order by total desc
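If the engine complains about ordering by the alias, ordering by the expression itself should work just as well:
Select Name, A, B, C
from table1
order by (A+B+C) desc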

Delphi Table sorting

I have a simple problem which gave me a headache.
I need to sort integers in a database table shown in a TDBGrid (it's an ABS Database from ComponentAce) in the following order:
0
1
11
111
121
2
21
211
22
221
and so on
which means every number starting with 1 should come under 1:
1
11
111
5
55
Can anyone help me?
Thanks.
This should work to get stuff in the right order:
Convert the original number to a string;
Right-pad with zeroes until you have a string 3 characters wide;
(optional) Convert back to integer.
Then sorting should always work the way you want. Probably it's best to let the database do that for you. In MySQL you'd do something like this:
select RPAD(orderid,3,'0') as TheOrder
from MyTable
order by 1
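To see why this works, the padding maps the sample values like this (a worked illustration, not query output):
0   -> '000'
1   -> '100'
11  -> '110'
111 -> '111'
2   -> '200'
A plain string sort of the padded values then reproduces the desired 0, 1, 11, 111, 121, 2, 21, ... order.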
I just ran this in SQL Server Management Studio - note I mixed up the rows in the input so they were not in sorted order:
create table #temp( ID Char(3));
insert into #temp (ID)
select '111' union
select '221' union
select '0' union
select '21' union
select '1' union
select '11' union
select '211' union
select '121' union
select '2' union
select '22';
select * from #temp order by ID;
I got the following output:
ID
----
0
1
11
111
121
2
21
211
22
221
(10 row(s) affected)
If you're getting different results, you're doing something wrong. However, it's hard to say what because you didn't post anything about how you're retrieving the data from the database.
Edit: Some clarification by the poster indicates that the display is in a TDBGrid attached to a table using Component Ace ABS Database. If that indeed is the case, then the answer is to create an index on the indicated column, and then set the table's IndexName property to use that index.
select cast(INT_FIELD as varchar(9)) as I
from TABxxx
order by 1

sqlite2: Joining max values per column from another table (subquery reference)?

I'm using the following database:
CREATE TABLE datas (d_id INTEGER PRIMARY KEY, name_id numeric, countdata numeric);
INSERT INTO datas VALUES(1,1,20); -- (NULL,1,20);
INSERT INTO datas VALUES(2,1,47); -- (NULL,1,47);
INSERT INTO datas VALUES(3,2,36); -- (NULL,2,36);
INSERT INTO datas VALUES(4,2,58); -- (NULL,2,58);
INSERT INTO datas VALUES(5,2,87); -- (NULL,2,87);
CREATE TABLE names (n_id INTEGER PRIMARY KEY, name text);
INSERT INTO names VALUES(1,'nameA'); -- (NULL,'nameA');
INSERT INTO names VALUES(2,'nameB'); -- (NULL,'nameB');
What I would like to do is select all rows of names, with all columns of datas appended, for the row where datas.countdata is at its maximum for that n_id (and of course, where name_id = n_id).
I can somewhat get there with the following query:
sqlite> .header ON
sqlite> SELECT * FROM names AS n1
LEFT OUTER JOIN (
SELECT d_id, name_id, countdata FROM datas AS d1
WHERE d1.countdata IN (
SELECT MAX(countdata) FROM datas
WHERE name_id=1
)
) AS p1 ON n_id=name_id;
n1.n_id|n1.name|p1.d_id|p1.name_id|p1.countdata
1|nameA|2|1|47
2|nameB|||
... however - obviously - it only works for a single row (the one explicitly set by name_id=1).
The problem is, the SQL query fails whenever I try to somehow reference the "current" n_id:
sqlite> SELECT * FROM names AS n1
LEFT OUTER JOIN (
SELECT d_id, name_id, countdata FROM datas AS d1
WHERE d1.countdata IN (
SELECT MAX(countdata) FROM datas
WHERE name_id=n1.n_id
)
) AS p1 ON n_id=name_id;
SQL error: no such column: n1.n_id
Is there any way of achieving what I want in Sqlite2?
Thanks in advance,
Cheers!
Oh, well - that wasn't trivial at all, but here is a solution:
sqlite> SELECT * FROM names AS n1
LEFT OUTER JOIN (
SELECT d1.*
FROM datas AS d1, (
SELECT max(countdata) as countdata,name_id
FROM datas
GROUP BY name_id
) AS ttemp
WHERE d1.name_id = ttemp.name_id AND d1.countdata = ttemp.countdata
) AS p1 ON n1.n_id=p1.name_id;
n1.n n1.name p1.d_id p1.name_id p1.countdata
---- ------------ ---------- ---------- -----------------------------------
1 nameA 2 1 47
2 nameB 5 2 87
Well, hope this ends up helping someone, :)
Cheers!
Note that just calling max(countdata) completely screws up d_id:
sqlite> select d_id,name_id,max(countdata) as countdata from datas group by name_id;
d_id name_id countdata
---- ------------ ----------
3 2 87
1 1 47
So to get the correct corresponding d_id, we must run max() on datas separately, and then perform a sort of intersect with the full datas. (An actual INTERSECT in SQLite requires the same number of columns in both datasets, which is not the case here; and even if we arranged that, d_id would be wrong as seen above, so INTERSECT will not work.)
One way to do that is to use a temporary table, and then a multi-table SELECT query to set conditions between the full datas and the subset returned via max(countdata), as shown below:
sqlite> CREATE TABLE ttemp AS SELECT max(countdata) as countdata,name_id FROM datas GROUP BY name_id;
sqlite> SELECT d1.*, ttemp.* FROM datas AS d1, ttemp WHERE d1.name_id = ttemp.name_id AND d1.countdata = ttemp.countdata;
d1.d d1.name_id d1.countda ttemp.coun ttemp.name_id
---- ------------ ---------- ---------- -----------------------------------
2 1 47 47 1
5 2 87 87 2
sqlite> DROP TABLE ttemp;
or we can rewrite the above so that a SELECT subquery (a sub-select) is used, like this:
sqlite> SELECT d1.* FROM datas AS d1, (
SELECT max(countdata) as countdata, name_id
FROM datas
GROUP BY name_id
) AS ttemp
WHERE d1.name_id = ttemp.name_id AND d1.countdata = ttemp.countdata;
d1.d d1.name_id d1.countda
---- ------------ ----------
2 1 47
5 2 87
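For what it's worth, a hedged alternative that avoids the inline subquery entirely is an anti-join on "no larger countdata exists for the same name_id"; it should work on old SQLite versions too, though I have not verified it on sqlite2:
SELECT n1.*, d1.*
FROM names AS n1
LEFT OUTER JOIN datas AS d1 ON d1.name_id = n1.n_id
LEFT OUTER JOIN datas AS d2
  ON d2.name_id = d1.name_id AND d2.countdata > d1.countdata
WHERE d2.d_id IS NULL;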
