Multipy after joining data in PIG - join

I am trying to multiply two fields and take their sum after joining three tables in Pig. However I keep on getting this error:
<file loyalty_program.pig, line 30, column 74> (Name: Multiply Type: null Uid: null)incompatible types in Multiply Operator left hand side:bag :tuple(new_details1::new_details::potential_customers::num_of_orders:long) right hand side:bag :tuple(products::price:int)
-- load the data sets
orders = LOAD '/dualcore/orders' AS (order_id:int,
cust_id:int,
order_dtm:chararray);
details = LOAD '/dualcore/order_details' AS (order_id:int,
prod_id:int);
products = LOAD '/dualcore/products' AS (prod_id:int,
brand:chararray,
name:chararray,
price:int,
cost:int,
shipping_wt:int);
recent = FILTER orders by order_dtm matches '2012-.*$';
customer = GROUP recent by cust_id;
cust_orders = FOREACH customer GENERATE group as cust_id, (int)COUNT(recent) as num_of_orders;
potential_customers = FILTER cust_orders by num_of_orders>=5;
new_details = join potential_customers by cust_id, recent by cust_id;
new_details1 = join new_details by order_id, details by order_id;
new_details2 = join new_details1 by prod_id, products by prod_id;
--DESCRIBE new_details2;
final_details = FOREACH new_details2 GENERATE potential_customers::cust_id, potential_customers::num_of_orders as num_of_orders,recent::order_id as order_id,recent::order_dtm,details::prod_id,products::brand,products::name,products::price as price,products::cost,products::shipping_wt;
grouped_data = GROUP final_details by cust_id;
member = FOREACH grouped_data GENERATE SUM(final_details.num_of_orders * final_details.price) ;
lim = limit member 10;
dump lim;
I even casted the result of count to int. It still keeps on throwing this error at me. I have no clue how to go about it.

Ok.. I think at first, you want to multiply no.of purchases with the price of each product and then you need total SUM of that multiplied value..
Even though this is a strange requirement, but you can go with below approach..
All you need to do is calculate the multiplication in final_details Foreach statement itself and simply apply the SUM for that multiplied amount..
Based on your load statements I created the below input files
main_orders.txt
6666,100,2012-01-01
7777,101,2012-09-02
8888,100,2012-01-09
9999,101,2012-12-08
6666,101,2012-09-02
9999,100,2012-07-12
9999,100,2012-08-01
6666,100,2012-01-02
7777,100,2012-09-09
orders_details.txt
6666,6000
7777,7000
8888,8000
9999,9000
main_products.txt
6000,Nike,Shoes,3000,3000,1
7000,Adidas,Cap,1000,1000,1
8000,Rebook,Shoes,4000,4000,1
9000,Puma,Shoes,25000,2500,1
Below is the code
orders = LOAD '/user/cloudera/inputfiles/main_orders.txt' USING PigStorage(',') AS (order_id:int,cust_id:int,order_dtm:chararray);
details = LOAD '/user/cloudera/inputfiles/orders_details.txt' USING PigStorage(',') AS (order_id:int,prod_id:int);
products = LOAD '/user/cloudera/inputfiles/main_products.txt' USING PigStorage(',') AS(prod_id:int,brand:chararray,name:chararray,price:int,cost:int,shipping_wt:int);
recent = FILTER orders by order_dtm matches '2012-.*';
customer = GROUP recent by cust_id;
cust_orders = FOREACH customer GENERATE group as cust_id, (int)COUNT(recent) as num_of_orders;
potential_customers = FILTER cust_orders by num_of_orders>=5;
new_details = join potential_customers by cust_id, recent by cust_id;
new_details1 = join new_details by order_id, details by order_id;
new_details2 = join new_details1 by prod_id, products by prod_id;
DESCRIBE new_details2;
final_details = FOREACH new_details2 GENERATE potential_customers::cust_id, potential_customers::num_of_orders as num_of_orders,recent::order_id as order_id,recent::order_dtm,details::prod_id,products::brand,products::name,products::price as price,products::cost,products::shipping_wt, (potential_customers::num_of_orders * products::price ) as multiplied_price;// multiplication is achived in last variable
dump final_details;
grouped_data = GROUP final_details by cust_id;
member = FOREACH grouped_data GENERATE SUM(final_details.multiplied_price) ;
lim = limit member 10;
dump lim;
Just for clarity I am dumping the output of final_details foreach statement as well.
(100,6,6666,2012-01-01,6000,Nike,Shoes,3000,3000,1,18000)
(100,6,6666,2012-01-02,6000,Nike,Shoes,3000,3000,1,18000)
(100,6,7777,2012-09-09,7000,Adidas,Cap,1000,1000,1,6000)
(100,6,8888,2012-01-09,8000,Rebook,Shoes,4000,4000,1,24000)
(100,6,9999,2012-07-12,9000,Puma,Shoes,25000,2500,1,150000)
(100,6,9999,2012-08-01,9000,Puma,Shoes,25000,2500,1,150000)
final output is below
(366000)
This code may help you, but Please clarify your requirement again

Related

Count After A join relation in Pig

I am trying to load two files from hdfs to pig.
After I join Driver Relation with Truck Relation, I would like to count.
How can I count the rows in relation ?
I tried this but it gives me count with group not a single count:
truck_temp = FOREACH (GROUP truck_join BY drivers_info::driverId) { GENERATE group, COUNT(truck_join); };
drivers_load = LOAD '/Pig-Practice/drivers.csv' USING PigStorage(',') AS (driverId:int,name:chararray,ssn:biginteger,location:chararray,certified:chararray,wageplan:chararray);
drivers_info = FOREACH ( GROUP drivers_load BY (driverId,name)) GENERATE group.driverId,group.name;
event_load = LOAD '/Pig-Practice/truck_event_text_partition.csv' USING PigStorage(',') AS (driverId:int, truckId:int, eventTime:chararray,
eventType:chararray, longitude:double, latitude:double,
eventKey:chararray, correlationId:long, driverName:chararray,
routeId:long,routeName:chararray,eventDate:chararray);
truck_events1 = FILTER event_load BY $0 >1;
truck_events2 = FOREACH (GROUP truck_events1 BY (driverId,driverName,routeId,routeName) ) GENERATE group.driverId,group.driverName,group.routeId,group.routeName;
truck_join = JOIN drivers_info BY driverId, truck_events2 BY driverId;
For getting the total count after a join, you need to group all.
COUNT requires a preceding GROUP ALL statement for global counts and a GROUP BY statement for group counts.
Reference: COUNT
truck_temp = FOREACH (GROUP truck_join ALL)
{
GENERATE COUNT(truck_join);
};

JOIN two data set on the basis of string matching condition in Pig

I am new in Pig and I have two data sets, "highspender" and "feedback".
Highspender:
Price,fname,lname
$50,Jack,Brown
$30,Rovin,Pall
Feedback:
date,Name,rate
2015-01-02,Jack B Brown,5
2015-01-02,Pall,4
Now I have to join these two datasets on the basis of their name. My condition should be fname or lname of Highspender should match with the Name of feedback. How to join these two datasets? Any idea?
You can try below script to do the same all you need is to replace the names according to your data
highs = LOAD 'highs' using PigStorage(',') as (Price:chararray,fname:chararray,lname:chararray);
feedback = LOAD 'feeds' using PigStorage(',') as (date:chararray,Name:chararray,rate:chararray);
out = JOIN highs BY fname, feedback BY Name;
out1 = JOIN highs BY lname, feedback BY Name;
final_out = UNION out,out1;
For further help you can refer this Pig Reference manual
EDIT
As per the comment script for joining data with string function is as bellow:
highs = LOAD 'highs' using PigStorage(',') as (Price:chararray,fname:chararray,lname:chararray);
feedback = LOAD 'feeds' using PigStorage(',') as (date:chararray,Name:chararray,rate:chararray);
crossout = cross highs, feedback;
final_lname = filter crossout by ( REPLACE (feedback::Name,highs::lname ,'') != feedback::Name);
final_fname = filter crossout by ( REPLACE (feedback::Name,highs::fname ,'') != feedback::Name);
final = UNION final_lname, final_fname;

How can this SQL subquery be expressed using Squeel/ActiveRecord?

I'm having a bit of brain fade today and can't figure out how I should express this SQL query correctly using ActiveRecord/Squeel/ARel:
SELECT `d1`.* FROM `domain_names` d1
WHERE `d1`.`created_at` = (
SELECT MAX(`d2`.`created_at`)
FROM `domain_names` d2
WHERE `d2`.`owner_type` = `d1`.`owner_type`
AND `d2`.`owner_id` = `d1`.`owner_id`
AND `d2`.`key` = `d1`.`key`
)
Any ideas?
Background: The DomainName model has a polymorphic owner as well as a "key" field that allows owners to have many different types of domain name. The query above fetches the latest domain name for each unique [owner_type, owner_id, key] tuple.
Edit:
Here's the same query using JOIN:
SELECT `d1`.* FROM `domain_names` d1
JOIN (
SELECT `owner_type`, `owner_id`, `key`, MAX(`created_at`) max_created_at
FROM `domain_names`
GROUP BY `owner_type`, `owner_id`, `key`
) d2
ON `d2`.`owner_type` = `d1`.`owner_type`
AND `d2`.`owner_id` = `d1`.`owner_id`
AND `d2`.`key` = `d1`.`key`
WHERE `d1`.`created_at` = `d2`.`max_created_at`

Strange execution time for summary query

I am giving here part of the query I am executing:
SELECT SUM(ParentTable.Field1),
(SELECT SUM(ChildrenTable.Field1)
FROM ChildrenRable INNER JOIN
GrandChildrenTable ON ChildrenTable.Id = GrandChildrenTable.ChildrenTableId INNER JOIN
AnotherTable ON GrandChildrenTable.AnotherTableId = AnotherTable.Id
WHERE ChildrenTable.ParentBaleId = ParentTable.Id
AND AnotherTable.Type=1),
----
FROM ParentTable
WHERE some_conditions
Relationships:
ParentTable -> ChildrenTable = 1-to-many
ChildrenTable -> GrandChildrenTable = 1-to-many
GrandChildrenTable -> AnotherTable = 1-to-1
I am executing this query three times, while changing only the Type condition, and here are the results:
Number of records that are returned:
Condition Total execution time (ms)
Type = 1 : 973
Type = 2 : 78810
Type = 3 : 648318
If I execute just the inner join query, here is the count of joined records:
SELECT p.Type, COUNT(*)
FROM CycleActivities ca INNER JOIN
CycleActivityProducts cap ON ca.Id = CAP.CycleActivityId INNER JOIN
Products p ON cap.ProductId = p.Id
GROUP BY p.Type
Type
---- -----------
1 55152
2 13401
4 102730
So, why would the query with Type = 1 condition execute much faster than the query with Type = 2, although it is querying 4x larger resultset (Type is tinyint)?
The way your query is written instructs SQL Server to execute the sub-query with JOIN for every row of the output.
This way it should be faster, if I understand what you want correctly (UPDATED):
with cte_parent as (
select
Id,
SUM (ParentTable.Field1) as Parent_Sum
from ParentTable
group by Id
),
cte_child as (
SELECT
Id,
SUM (ChildrenTable.Field1) as as Child_Sum
FROM ChildrenRable
INNER JOIN
GrandChildrenTable ON ChildrenTable.Id = GrandChildrenTable.ChildrenTableId
INNER JOIN
AnotherTable ON GrandChildrenTable.AnotherTableId = AnotherTable.Id
WHERE
AnotherTable.Type=1
AND
some_conditions
GROUP BY Id
)
select cte_parent.id, Parent_Sum, Child_Sum
from parent_cte
join child_cte on parent_cte.id = child_cte.id

I want to union together four queries and set this to be the repeater's data source

Based on user design I have to union together four queries and put them in a repeater.
var qryIssuer = from l in dbRRSP.LOA
join lrb in dbRRSP.LOAOrReferredBy on l.LOAOrReferredById equals lrb.LoaOrReferredById
join lat in dbRRSP.LOAAccessType on l.LOAAccessTypeId equals lat.LOAAccessTypeId
join iss in dbRRSP.Issuer on l.IssuerId equals iss.IssuerId
where
l.PersonId == personId
select new
{
LOAOrReferredByDescription = lrb.LoaOrReferredByDescription,
lat.LOAAccessTypeDescription,
PersonType = "Issuer",
LOAName = iss.CompanyName,
l.DateAdded
};
var qryEMD = from l in dbRRSP.LOA
join lrb in dbRRSP.LOAOrReferredBy on l.LOAOrReferredById equals lrb.LoaOrReferredById
join lat in dbRRSP.LOAAccessType on l.LOAAccessTypeId equals lat.LOAAccessTypeId
join emd in dbRRSP.Agent on l.AgentId equals emd.AgentId
where
l.PersonId == personId
select new
{
LOAOrReferredByDescription = lrb.LoaOrReferredByDescription,
lat.LOAAccessTypeDescription,
PersonType = "EMD",
LOAName = emd.CompanyName,
l.DateAdded
};
var qryEmdRep = from l in dbRRSP.LOA
join lrb in dbRRSP.LOAOrReferredBy on l.LOAOrReferredById equals lrb.LoaOrReferredById
join lat in dbRRSP.LOAAccessType on l.LOAAccessTypeId equals lat.LOAAccessTypeId
join ar in dbRRSP.AgentRepresentative on l.EMDRepresentativeId equals ar.AgentRepresentativeId
join arp in dbRRSP.Person on ar.PersonId equals arp.PersonId
where
l.PersonId == personId
select new
{
LOAOrReferredByDescription = lrb.LoaOrReferredByDescription,
lat.LOAAccessTypeDescription,
PersonType = "EMD Rep",
LOAName = arp.FirstName + ' ' + arp.LastName, l.DateAdded
};
var qryLOAPerson = from l in dbRRSP.LOA
join lrb in dbRRSP.LOAOrReferredBy on l.LOAOrReferredById equals lrb.LoaOrReferredById
join lat in dbRRSP.LOAAccessType on l.LOAAccessTypeId equals lat.LOAAccessTypeId
join lp in dbRRSP.LOAPerson on l.LOAPersonId equals lp.LOAPersonId
where
l.PersonId == personId
select new
{
LOAOrReferredByDescription = lrb.LoaOrReferredByDescription, lat.LOAAccessTypeDescription,
PersonType = "Person",
LOAName = lp.LOAPersonName,
l.DateAdded
};
This is the four queries. And the trickiest part is that the last field is a datetime, which is causing me some issues. I know how to union two of them together like this:
var qryMultipleLOA = qryIssuer.Union(qryEMD).ToList().Select(loa => new ExtendedLOA
{
LOAOrReferredByDescription = loa.LOAOrReferredByDescription,
LOAAccessTypeDescription = loa.LOAAccessTypeDescription,
PersonType = loa.PersonType,
LOAName = loa.LOAName,
DateAdded = DateTime.Parse(loa.DateAdded.ToString()).ToString("MM/dd/yyyy")
});
But I'm at a loss on how to add the last two queries - first I tried wrapping it in brackets and adding a .Union which didn't work, and then when I tried to nest them with appropriate .ToLists, that didn't work either.
Below is the code to bind it to the repeater.
rptLOA.DataSource = qryMultipleLOA;
rptLOA.DataBind();
Suggestions would be greatly appreciated.
Did you try something like?
var qryMultipleLOA = qryIssuer.Union(qryEMD).Union(qryEmdRep).Union(qryLOAPerson).ToList();
Provided your queries' footprints are the same, this shouldn't be an issue to chain them upon each other.
Edit:
I would also recommend the following:
Create a class to hold an instance of the resultant data.
Instead of creating lists of dynamic variables generated from Linq and hoping they all match, funnel the linq results into a List. That way you can tell immediately if you have a type mismatch.
Once you have four lists of the same List, Unions as per my syntax above will be a snap.
Dynamic Linq lists can be a pain, unwieldy and a single property type change can throw of your code at runtime rather than design time. If you follow the steps above, your code will be much more maintainable and clear to you and others.
I hope this helps in some way.

Resources