(Hortonworks Sandbox) Pig Join operation duplicate primary key columns - join

I have two tables that I want to join.
table1 has id and value columns.
table2 has id and color columns.
final = join table1 by id, table2 by id;
dump final;
I received a table whose columns are id, value, id, color, but I want a table with the columns id, value and color. How can I remove the duplicate id column from this table?

If you do DESCRIBE final; you will see that the schema looks something like this:
final: {table1::id: chararray,table1::value: chararray,table2::id: chararray,table2::color: chararray}
To distinguish between the two ID columns, you can use table1::id or table2::id. So to remove one of the duplicate columns, you can do:
A = FOREACH final GENERATE
table1::id AS id,
table1::value AS value,
table2::color AS color;
(I've also renamed the fields to get rid of the table1:: and table2:: prefixes since they're no longer necessary.)
I could have also done:
A = FOREACH final GENERATE
table1::id AS id,
value AS value,
color AS color;
This would not have given me an error since value and color are unambiguous names.

Execute your final PIG script:
grunt> table1 = LOAD 'table1_input_path' USING PigStorage(',') as (id:int, value:int);
grunt> table2= LOAD 'table2_input_path' USING PigStorage(',') as (id:int, color:chararray);
grunt> joinlevel = JOIN table1 BY id, table2 BY id;
grunt> final = FOREACH joinlevel generate table1::id as id, table1::value as value, table2::color as color;
grunt> dump final;

Related

Replacing NULL values in Table1.Column1 with values from Table2.Column2 where Table1 has multiple rows of same values

Let me begin by apologizing for what may be a confusing title. I am just beginning my data analyst journey. I am working in BigQuery with an Extreme Storm dataset (TABLE1) that has fields for LAT, LONG, and STATE. There are null values in the latitude and longitude fields that I want to replace with general LAT/LONG values from a State Information dataset (TABLE2) that also contains LAT, LONG and STATE values. In TABLE1 each record is given a unique EVENT_ID and there are 1.4m rows. In TABLE2 each STATE is a unique record.
I've tried:
Update TABLE1
SET TABLE1.BEGIN_LAT=TABLE2.latitude
From TABLE1
INNER JOIN TABLE2
ON TABLE1.STATE = TABLE2.STATE
WHERE TABLE1.BEGIN_LAT IS NULL
I am getting an error because TABLE1 contains multiple rows with the same STATE and I am trying to use it as my primary key. I know what I am doing wrong but can't figure out how to do it the correct way. Is what I am trying to do possible in BigQuery?
Any help would be appreciated. Even advice on how to ask questions! :)
Thank you.
I assume your actual query uses different aliases for TABLE1 in the UPDATE and in the FROM. In that case you can add a condition to the WHERE clause that also matches on EVENT_ID, like this:
UPDATE TABLE1 TABLE1_U
SET TABLE1_U.BEGIN_LAT=TABLE2.latitude
FROM TABLE1 TABLE1_F
INNER JOIN TABLE2
ON TABLE1_F.STATE = TABLE2.STATE
WHERE TABLE1_U.BEGIN_LAT IS NULL AND TABLE1_U.EVENT_ID = TABLE1_F.EVENT_ID
Also, I would prefer to run a SELECT query instead of an UPDATE and save the query results to a new table.

How do I maintain a Set of event properties in a window?

Is it possible to create a table with a primary key and a Set as a secondary column that would be like a list in a value of a hashtable?
something like this:
create table T (id int primary key, list HashSet )
where the list would hold all properties related to the primary key that happened over a window size.
EDIT:
This is the output I get. What I want is to keep a count of the unique Occurrences arriving at id 1, 2 and 3.
If Occurrence 2 arrived 3 times at ID 1, I still only want to count it once, not 3 times.
{unique=3, id=1}
{unique=3, id=2}
{unique=4, id=3}
****************
In Java this is no problem, but I don't understand how to implement it in Esper. I'm not even sure whether using tables is the correct approach.
Tables can have aggregation-state-type columns. So the "window" aggregation is available. For example like this:
create table MyTable (id int primary key, theWindow window(*) @type(MyEvent))
into table MyTable select window(*) as theWindow from MyEvent group by id
Or the table could declare a list-type column "create table MyTable (id int primary key, somelist java.util.List)" and it is up to you to maintain the list via function calls in EPL.
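For reference, the semantics asked for in the question (count each Occurrence at most once per id) can be sketched in plain Java. This is only an illustration of the target behaviour, not Esper code: the class and method names are hypothetical, and the sketch ignores window expiry, which in Esper is handled by the window itself.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Esper: one Set of occurrences per id.
class UniquePerId {
    private final Map<Integer, Set<Integer>> seen = new HashMap<>();

    // Record that an occurrence arrived at the given id; repeats are absorbed by the Set.
    void onEvent(int id, int occurrence) {
        seen.computeIfAbsent(id, k -> new HashSet<>()).add(occurrence);
    }

    // Number of distinct occurrences recorded for the id (0 if the id was never seen).
    int uniqueCount(int id) {
        Set<Integer> occurrences = seen.get(id);
        return occurrences == null ? 0 : occurrences.size();
    }

    public static void main(String[] args) {
        UniquePerId tracker = new UniquePerId();
        tracker.onEvent(1, 2);
        tracker.onEvent(1, 2);
        tracker.onEvent(1, 2);
        tracker.onEvent(1, 5);
        System.out.println(tracker.uniqueCount(1)); // 2: occurrence 2 is counted once
    }
}
```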

Please help with a CCJSqlParser issue

I use the code below to get the selected columns. But for each column item, why does table.getName() return the alias (t1 or t2) while table.getAlias() is null?
Is there any sample code to get the table name (Spark_Test_1, Spark_Test_2) and the table alias (t1, t2) at the same time?
String sql = "SELECT t1.AsOfD,t1.ValidD,t1.urn,t1.Money FROM Spark_Test_1 as t1 join Spark_Test_2 as t2 on ( t1.AsOfD = t2.AsOfD)";
Statement statement = CCJSqlParserUtil.parse(sql);
Select selectStatement = (Select) statement;
// Assumes a plain SELECT (no UNION etc.), so the body can be cast to PlainSelect.
PlainSelect plainSelect = (PlainSelect) selectStatement.getSelectBody();
List<SelectItem> selectitems = plainSelect.getSelectItems();
for (int i = 0; i < selectitems.size(); i++) {
    Expression expression = ((SelectExpressionItem) selectitems.get(i))
            .getExpression();
    //System.out.println("Expression:" + expression);
    if (expression instanceof Column) {
        Column col = (Column) expression;
        Table table = col.getTable();
        logger.info(table.getFullyQualifiedName());
        logger.info(table.getAlias());
        logger.info(table.getName());
    }
}
This is not an issue but normal JSqlParser behaviour. JSqlParser gives you a structured way to look at your SQL but does no semantic processing. It is a parser.
Therefore, for a column, the table name in your example is indeed the alias. JSqlParser does not resolve this alias to the real table name. You have to process the FROM items to get the table names and their aliases and map them to your columns.
IMHO you should follow the approach of TablesNamesFinder and build a visitor that extracts your columns and additionally collects your tables, including name and alias. Be careful to use only the tables that are valid within your column's context, e.g.
select data.a from (select a from mydata) as data
Here data is an alias for a subselect and not for a table.
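The mapping step described above can be sketched without any parser at all. The following is a minimal illustration in plain Java, assuming you have already collected each FROM/JOIN item's table name and alias from the parsed statement. AliasResolver and its methods are hypothetical names, not part of JSqlParser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of JSqlParser: maps a column qualifier to the real table name.
class AliasResolver {
    private final Map<String, String> aliasToTable = new LinkedHashMap<>();

    // Record one FROM/JOIN item. When the table has no alias, it is addressed by its own name.
    void addFromItem(String tableName, String alias) {
        aliasToTable.put(alias != null ? alias : tableName, tableName);
    }

    // Resolve the qualifier a column carries (e.g. "t1") to the real table name.
    // Returns null for an unknown qualifier, e.g. an alias that names a subselect.
    String resolve(String qualifier) {
        return aliasToTable.get(qualifier);
    }

    public static void main(String[] args) {
        AliasResolver resolver = new AliasResolver();
        resolver.addFromItem("Spark_Test_1", "t1");
        resolver.addFromItem("Spark_Test_2", "t2");
        System.out.println(resolver.resolve("t1"));   // Spark_Test_1
        System.out.println(resolver.resolve("data")); // null
    }
}
```

In a real visitor you would keep one such map per SELECT scope, so that an alias defined in an inner query is not used to resolve columns of an outer one.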

Emulating an interval join in hive

I am using hive 0.13.
I have two tables:
data table. columns: id, time. 1E10 rows.
mymap table. columns: id, name, start_time, end_time. 1E6 rows.
For each row in the data table I want to get the name from the mymap table matching the id and the time interval. So I want to do a join like:
select data.id, time, name from data left outer join mymap on data.id = mymap.id and time>=start_time and time<end_time
It is known that for every row in data there are 0 or 1 matches in mymap.
The above query is not supported in Hive because it is a non-equi-join. Moving the inequality conditions into a where filter does not work because the join explodes before the filter is applied:
select data.id, time, name from data left outer join mymap on data.id = mymap.id where mymap.id is null or (time>=start_time and time<end_time)
(I am aware that the queries are not exactly equivalent due to cases where there is a match for id but no matching interval. This can be solved as I describe here: Hive: work around for non equi left join)
How can I go about this?
You could perform your join and then query from that table. I didn't test this code, but it would read something like
select id
,time
,name
from (
select d.id
,d.time
,m.name
,m.start_time
,m.end_time
from data as d LEFT OUTER JOIN mymap as m
ON d.id = m.id
) x
where time>=start_time
AND time<end_time
You could potentially get around this issue by flattening the mymap table into one row per id and using a UDF to process the joined records.
select
id,
time,
nameFinderUDF(b.name_list, time) as name
from
data a
LEFT OUTER JOIN
(
select
id,
collect_set(array(name,cast(start_time as string),cast(end_time as string))) as name_list
from
mymap
group by
id
) b
ON (a.id=b.id)
With a UDF that does something like:
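public String evaluate(ArrayList<ArrayList<String>> name_list, Long time) {
    if (name_list == null || time == null) return null;  // no mymap rows for this id
    for (ArrayList<String> entry : name_list) {
        // entry = [name, start_time, end_time]; half-open interval, as in the question
        if (time >= Long.parseLong(entry.get(1)) && time < Long.parseLong(entry.get(2))) {
            return entry.get(0);
        }
    }
    return null;
}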
This approach should make the merge 1 to 1, but it could create a fairly large data structure repeated many times. It is still quite a bit more efficient than a straight join.
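To check the lookup logic outside of Hive, it can be sketched as a stand-alone Java class. This is a minimal sketch, not a complete UDF: IntervalLookup and findName are hypothetical names, the Hive UDF wrapper is omitted, and the interval is treated as half-open (start_time <= time < end_time) to match the condition in the question.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone version of the lookup; a real Hive UDF would wrap this logic.
class IntervalLookup {
    // Each entry is [name, start_time, end_time], mirroring the collect_set(array(...)) rows.
    static String findName(List<List<String>> nameList, Long time) {
        if (nameList == null || time == null) {
            return null; // the left outer join found no mymap rows for this id
        }
        for (List<String> entry : nameList) {
            long start = Long.parseLong(entry.get(1));
            long end = Long.parseLong(entry.get(2));
            if (time >= start && time < end) { // half-open: start <= time < end
                return entry.get(0);
            }
        }
        return null; // the id matched, but no interval contains the time
    }

    public static void main(String[] args) {
        List<List<String>> intervals = Arrays.asList(
            Arrays.asList("alpha", "0", "100"),
            Arrays.asList("beta", "100", "200"));
        System.out.println(findName(intervals, 50L));  // alpha
        System.out.println(findName(intervals, 100L)); // beta
        System.out.println(findName(intervals, 200L)); // null
    }
}
```

Because at most one interval matches per row, returning on the first hit is enough; the cost per row is linear in the number of intervals collected for that id.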

Stored procedure in Oracle with Case and When

I have a scenario with 5 different tables:
Table 1 - Product, Columns - ProductId, BatchNummer, Status, GroupId, OrderNummer
Table 2 - ProductGroup, Columns - GroupId, ProductType, Description
Table 3 - Electronics, Columns - EId, Description, BatchNummer, OrderNummer, OrderData
Table 4 - Manual, Columns - MId, Description, Status, OrderNummer, ProcessStep
Table 5 - ProcessedProduct, columns same as Product with one extra column of datetime
Now, according to the business flow, I need to read all the rows from the Product table and check whether the underlying table (Electronics or Manual, depending on the ProductType column of ProductGroup) has a matching OrderNummer value. If it does, I insert a record into table 5, "ProcessedProduct"; otherwise I skip the record.
For this requirement, I want to create a procedure, but I am stuck on how to determine which underlying table (Electronics/Manual) I have to refer to.
Also, how should I write the loop for inserting the records?
Note: I cannot change the tables schema.
With a PL/SQL procedure you could branch inside a LOOP, but you don't need an imperative algorithm if all you need is to check whether OrderNummer exists in either Electronics or Manual.
Supposing the detail table is chosen by a ProductType value of either "Electronics" or "Manuals", you could:
INSERT INTO ProcessedProduct (ProductId, BatchNummer, Status, GroupId, OrderNummer, TS)
SELECT ProductId, BatchNummer, Status, GroupId, OrderNummer, SYSDATE
FROM Product p
INNER JOIN ProductGroup pg USING (GroupId)
WHERE EXISTS (
SELECT NULL FROM Electronics e
WHERE p.OrderNummer = e.OrderNummer
AND pg.ProductType = 'Electronics'
UNION
SELECT NULL FROM Manual m
WHERE p.OrderNummer = m.OrderNummer
AND pg.ProductType = 'Manuals')
Plain SQL is usually faster than a row-by-row PL/SQL loop, and "WHERE EXISTS" is usually an efficient way to express this kind of check.

Resources