Multiply after joining data in Pig - join
I am trying to multiply two fields and take their sum after joining three tables in Pig. However, I keep getting this error:
<file loyalty_program.pig, line 30, column 74> (Name: Multiply Type: null Uid: null)incompatible types in Multiply Operator left hand side:bag :tuple(new_details1::new_details::potential_customers::num_of_orders:long) right hand side:bag :tuple(products::price:int)
-- load the data sets
orders = LOAD '/dualcore/orders' AS (order_id:int,
cust_id:int,
order_dtm:chararray);
details = LOAD '/dualcore/order_details' AS (order_id:int,
prod_id:int);
products = LOAD '/dualcore/products' AS (prod_id:int,
brand:chararray,
name:chararray,
price:int,
cost:int,
shipping_wt:int);
recent = FILTER orders by order_dtm matches '2012-.*$';
customer = GROUP recent by cust_id;
cust_orders = FOREACH customer GENERATE group as cust_id, (int)COUNT(recent) as num_of_orders;
potential_customers = FILTER cust_orders by num_of_orders>=5;
new_details = join potential_customers by cust_id, recent by cust_id;
new_details1 = join new_details by order_id, details by order_id;
new_details2 = join new_details1 by prod_id, products by prod_id;
--DESCRIBE new_details2;
final_details = FOREACH new_details2 GENERATE potential_customers::cust_id, potential_customers::num_of_orders as num_of_orders,recent::order_id as order_id,recent::order_dtm,details::prod_id,products::brand,products::name,products::price as price,products::cost,products::shipping_wt;
grouped_data = GROUP final_details by cust_id;
member = FOREACH grouped_data GENERATE SUM(final_details.num_of_orders * final_details.price) ;
lim = limit member 10;
dump lim;
I even cast the result of COUNT to int, but it still throws this error. I have no clue how to go about it.
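OK, I think you first want to multiply the number of purchases by the price of each product, and then take the total SUM of those multiplied values.
Although this is a strange requirement, you can use the approach below. The error occurs because, after the GROUP, final_details.num_of_orders and final_details.price are bags rather than scalar values (the error message reports a bag on both sides), and the Multiply operator cannot be applied to two bags.
All you need to do is calculate the multiplication in the final_details FOREACH statement itself, while the fields are still scalars, and then simply apply SUM to that multiplied amount.
Based on your LOAD statements, I created the input files below.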
main_orders.txt
6666,100,2012-01-01
7777,101,2012-09-02
8888,100,2012-01-09
9999,101,2012-12-08
6666,101,2012-09-02
9999,100,2012-07-12
9999,100,2012-08-01
6666,100,2012-01-02
7777,100,2012-09-09
orders_details.txt
6666,6000
7777,7000
8888,8000
9999,9000
main_products.txt
6000,Nike,Shoes,3000,3000,1
7000,Adidas,Cap,1000,1000,1
8000,Rebook,Shoes,4000,4000,1
9000,Puma,Shoes,25000,2500,1
Below is the code
orders = LOAD '/user/cloudera/inputfiles/main_orders.txt' USING PigStorage(',') AS (order_id:int,cust_id:int,order_dtm:chararray);
details = LOAD '/user/cloudera/inputfiles/orders_details.txt' USING PigStorage(',') AS (order_id:int,prod_id:int);
products = LOAD '/user/cloudera/inputfiles/main_products.txt' USING PigStorage(',') AS(prod_id:int,brand:chararray,name:chararray,price:int,cost:int,shipping_wt:int);
recent = FILTER orders by order_dtm matches '2012-.*';
customer = GROUP recent by cust_id;
cust_orders = FOREACH customer GENERATE group as cust_id, (int)COUNT(recent) as num_of_orders;
potential_customers = FILTER cust_orders by num_of_orders>=5;
new_details = join potential_customers by cust_id, recent by cust_id;
new_details1 = join new_details by order_id, details by order_id;
new_details2 = join new_details1 by prod_id, products by prod_id;
DESCRIBE new_details2;
final_details = FOREACH new_details2 GENERATE potential_customers::cust_id, potential_customers::num_of_orders as num_of_orders,recent::order_id as order_id,recent::order_dtm,details::prod_id,products::brand,products::name,products::price as price,products::cost,products::shipping_wt, (potential_customers::num_of_orders * products::price ) as multiplied_price; -- the multiplication is computed in the last field
dump final_details;
grouped_data = GROUP final_details by cust_id;
member = FOREACH grouped_data GENERATE SUM(final_details.multiplied_price) ;
lim = limit member 10;
dump lim;
Just for clarity, I am dumping the output of the final_details FOREACH statement as well:
(100,6,6666,2012-01-01,6000,Nike,Shoes,3000,3000,1,18000)
(100,6,6666,2012-01-02,6000,Nike,Shoes,3000,3000,1,18000)
(100,6,7777,2012-09-09,7000,Adidas,Cap,1000,1000,1,6000)
(100,6,8888,2012-01-09,8000,Rebook,Shoes,4000,4000,1,24000)
(100,6,9999,2012-07-12,9000,Puma,Shoes,25000,2500,1,150000)
(100,6,9999,2012-08-01,9000,Puma,Shoes,25000,2500,1,150000)
The final output is below:
(366000)
This code may help you, but please clarify your requirement again.
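The arithmetic behind the final output can be checked outside Pig. The following is a minimal plain-Python model, not Pig code: it assumes the six final_details rows dumped above, reduced to (cust_id, num_of_orders, price), and the function name sum_multiplied_price is made up for illustration. It multiplies per row first and only then sums per customer, which is the order of operations the answer relies on.

```python
from collections import defaultdict

# Rows of final_details from the sample run above,
# reduced to (cust_id, num_of_orders, price).
final_details = [
    (100, 6, 3000),
    (100, 6, 3000),
    (100, 6, 1000),
    (100, 6, 4000),
    (100, 6, 25000),
    (100, 6, 25000),
]


def sum_multiplied_price(rows):
    """Multiply per row (scalar * scalar), then sum per customer."""
    totals = defaultdict(int)
    for cust_id, num_of_orders, price in rows:
        totals[cust_id] += num_of_orders * price
    return dict(totals)


print(sum_multiplied_price(final_details))  # {100: 366000}
```

The total for customer 100 matches the (366000) that Pig dumps for the sample data.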
Related
Count After A join relation in Pig
I am trying to load two files from HDFS into Pig. After I join the driver relation with the truck relation, I would like to count. How can I count the rows in the relation? I tried this, but it gives me a count per group, not a single count:
truck_temp = FOREACH (GROUP truck_join BY drivers_info::driverId) { GENERATE group, COUNT(truck_join); };
drivers_load = LOAD '/Pig-Practice/drivers.csv' USING PigStorage(',') AS (driverId:int,name:chararray,ssn:biginteger,location:chararray,certified:chararray,wageplan:chararray);
drivers_info = FOREACH ( GROUP drivers_load BY (driverId,name)) GENERATE group.driverId,group.name;
event_load = LOAD '/Pig-Practice/truck_event_text_partition.csv' USING PigStorage(',') AS (driverId:int, truckId:int, eventTime:chararray, eventType:chararray, longitude:double, latitude:double, eventKey:chararray, correlationId:long, driverName:chararray, routeId:long,routeName:chararray,eventDate:chararray);
truck_events1 = FILTER event_load BY $0 >1;
truck_events2 = FOREACH (GROUP truck_events1 BY (driverId,driverName,routeId,routeName) ) GENERATE group.driverId,group.driverName,group.routeId,group.routeName;
truck_join = JOIN drivers_info BY driverId, truck_events2 BY driverId;
To get the total count after a join, you need to group all. COUNT requires a preceding GROUP ALL statement for global counts and a GROUP BY statement for group counts.
Reference: COUNT
truck_temp = FOREACH (GROUP truck_join ALL) { GENERATE COUNT(truck_join); };
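The difference between a per-group count and a single global count can be modelled in plain Python. This is only a sketch of the semantics, not Pig code: the truck_join rows below are hypothetical (driverId, routeId) pairs, the function names are made up, and the model assumes no null fields.

```python
from collections import Counter

# Hypothetical joined rows: (driverId, routeId).
truck_join = [(10, 1), (10, 2), (11, 1)]


def count_by_driver(rows):
    """Model of GROUP ... BY driverId followed by COUNT: one count per key."""
    return dict(Counter(driver_id for driver_id, _ in rows))


def count_all(rows):
    """Model of GROUP ... ALL followed by COUNT: a single global count."""
    return len(rows)


print(count_by_driver(truck_join))  # {10: 2, 11: 1}
print(count_all(truck_join))        # 3
```

Grouping by a key produces one output row per key, while grouping everything into a single group produces exactly one row, which is what the question asks for.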
JOIN two data set on the basis of string matching condition in Pig
I am new to Pig and I have two data sets, "highspender" and "feedback".
Highspender:
Price,fname,lname
$50,Jack,Brown
$30,Rovin,Pall
Feedback:
date,Name,rate
2015-01-02,Jack B Brown,5
2015-01-02,Pall,4
Now I have to join these two datasets on the basis of their name. My condition is that the fname or lname of Highspender should match the Name of Feedback. How do I join these two datasets? Any idea?
You can try the script below to do this; all you need to do is replace the names according to your data:
highs = LOAD 'highs' using PigStorage(',') as (Price:chararray,fname:chararray,lname:chararray);
feedback = LOAD 'feeds' using PigStorage(',') as (date:chararray,Name:chararray,rate:chararray);
out = JOIN highs BY fname, feedback BY Name;
out1 = JOIN highs BY lname, feedback BY Name;
final_out = UNION out,out1;
For further help you can refer to the Pig Reference manual.
EDIT: As per the comment, the script for joining the data with a string function is as below:
highs = LOAD 'highs' using PigStorage(',') as (Price:chararray,fname:chararray,lname:chararray);
feedback = LOAD 'feeds' using PigStorage(',') as (date:chararray,Name:chararray,rate:chararray);
crossout = cross highs, feedback;
final_lname = filter crossout by ( REPLACE (feedback::Name,highs::lname ,'') != feedback::Name);
final_fname = filter crossout by ( REPLACE (feedback::Name,highs::fname ,'') != feedback::Name);
final = UNION final_lname, final_fname;
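The CROSS plus filter idea amounts to keeping every pair in which the feedback Name contains the fname or the lname. A minimal plain-Python model of that matching rule follows. It is not Pig code, it uses the sample rows from the question, and match_by_name is a made-up name. It also differs from the script in two ways that are assumptions of this sketch: it uses plain substring matching (Pig's REPLACE treats its second argument as a regular expression), and it applies one combined OR condition instead of a UNION of two filters.

```python
# Sample rows from the question.
highs = [("$50", "Jack", "Brown"), ("$30", "Rovin", "Pall")]
feedback = [("2015-01-02", "Jack B Brown", "5"), ("2015-01-02", "Pall", "4")]


def match_by_name(highs, feedback):
    """Cross every pair and keep those where Name contains fname or lname."""
    matches = []
    for high in highs:
        _, fname, lname = high
        for fb in feedback:
            name = fb[1]
            if fname in name or lname in name:
                matches.append(high + fb)
    return matches


for row in match_by_name(highs, feedback):
    print(row)
```

Because the two conditions are combined with OR, a pair that matches on both fname and lname (such as Jack Brown and "Jack B Brown") is emitted once. With a UNION of two separate filters it would appear twice, since Pig's UNION does not remove duplicates.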
How can this SQL subquery be expressed using Squeel/ActiveRecord?
I'm having a bit of brain fade today and can't figure out how I should express this SQL query correctly using ActiveRecord/Squeel/ARel:
SELECT `d1`.* FROM `domain_names` d1
WHERE `d1`.`created_at` = (
  SELECT MAX(`d2`.`created_at`)
  FROM `domain_names` d2
  WHERE `d2`.`owner_type` = `d1`.`owner_type`
  AND `d2`.`owner_id` = `d1`.`owner_id`
  AND `d2`.`key` = `d1`.`key`
)
Any ideas?
Background: The DomainName model has a polymorphic owner as well as a "key" field that allows owners to have many different types of domain name. The query above fetches the latest domain name for each unique [owner_type, owner_id, key] tuple.
Edit: Here's the same query using JOIN:
SELECT `d1`.* FROM `domain_names` d1
JOIN (
  SELECT `owner_type`, `owner_id`, `key`, MAX(`created_at`) max_created_at
  FROM `domain_names`
  GROUP BY `owner_type`, `owner_id`, `key`
) d2
ON `d2`.`owner_type` = `d1`.`owner_type`
AND `d2`.`owner_id` = `d1`.`owner_id`
AND `d2`.`key` = `d1`.`key`
WHERE `d1`.`created_at` = `d2`.`max_created_at`
Strange execution time for summary query
Here is part of the query I am executing:
SELECT SUM(ParentTable.Field1),
  (SELECT SUM(ChildrenTable.Field1)
   FROM ChildrenTable
   INNER JOIN GrandChildrenTable ON ChildrenTable.Id = GrandChildrenTable.ChildrenTableId
   INNER JOIN AnotherTable ON GrandChildrenTable.AnotherTableId = AnotherTable.Id
   WHERE ChildrenTable.ParentBaleId = ParentTable.Id
   AND AnotherTable.Type=1),
  ----
FROM ParentTable
WHERE some_conditions
Relationships:
ParentTable -> ChildrenTable = 1-to-many
ChildrenTable -> GrandChildrenTable = 1-to-many
GrandChildrenTable -> AnotherTable = 1-to-1
I am executing this query three times, changing only the Type condition, and here are the results:
Number of records that are returned:
Condition   Total execution time (ms)
Type = 1 : 973
Type = 2 : 78810
Type = 3 : 648318
If I execute just the inner join query, here is the count of joined records:
SELECT p.Type, COUNT(*)
FROM CycleActivities ca
INNER JOIN CycleActivityProducts cap ON ca.Id = CAP.CycleActivityId
INNER JOIN Products p ON cap.ProductId = p.Id
GROUP BY p.Type
Type
---- -----------
1    55152
2    13401
4    102730
So, why would the query with the Type = 1 condition execute much faster than the query with Type = 2, although it is querying a 4x larger result set (Type is tinyint)?
The way your query is written instructs SQL Server to execute the sub-query with the JOIN for every row of the output. Written this way it should be faster, if I understand what you want correctly (UPDATED):
with cte_parent as (
  select Id, SUM (ParentTable.Field1) as Parent_Sum
  from ParentTable
  group by Id
),
cte_child as (
  SELECT Id, SUM (ChildrenTable.Field1) as Child_Sum
  FROM ChildrenTable
  INNER JOIN GrandChildrenTable ON ChildrenTable.Id = GrandChildrenTable.ChildrenTableId
  INNER JOIN AnotherTable ON GrandChildrenTable.AnotherTableId = AnotherTable.Id
  WHERE AnotherTable.Type=1 AND some_conditions
  GROUP BY Id
)
select cte_parent.id, Parent_Sum, Child_Sum
from cte_parent
join cte_child on cte_parent.id = cte_child.id
I want to union together four queries and set this to be the repeater's data source
Based on user design I have to union together four queries and put them in a repeater.
var qryIssuer = from l in dbRRSP.LOA
  join lrb in dbRRSP.LOAOrReferredBy on l.LOAOrReferredById equals lrb.LoaOrReferredById
  join lat in dbRRSP.LOAAccessType on l.LOAAccessTypeId equals lat.LOAAccessTypeId
  join iss in dbRRSP.Issuer on l.IssuerId equals iss.IssuerId
  where l.PersonId == personId
  select new { LOAOrReferredByDescription = lrb.LoaOrReferredByDescription, lat.LOAAccessTypeDescription, PersonType = "Issuer", LOAName = iss.CompanyName, l.DateAdded };
var qryEMD = from l in dbRRSP.LOA
  join lrb in dbRRSP.LOAOrReferredBy on l.LOAOrReferredById equals lrb.LoaOrReferredById
  join lat in dbRRSP.LOAAccessType on l.LOAAccessTypeId equals lat.LOAAccessTypeId
  join emd in dbRRSP.Agent on l.AgentId equals emd.AgentId
  where l.PersonId == personId
  select new { LOAOrReferredByDescription = lrb.LoaOrReferredByDescription, lat.LOAAccessTypeDescription, PersonType = "EMD", LOAName = emd.CompanyName, l.DateAdded };
var qryEmdRep = from l in dbRRSP.LOA
  join lrb in dbRRSP.LOAOrReferredBy on l.LOAOrReferredById equals lrb.LoaOrReferredById
  join lat in dbRRSP.LOAAccessType on l.LOAAccessTypeId equals lat.LOAAccessTypeId
  join ar in dbRRSP.AgentRepresentative on l.EMDRepresentativeId equals ar.AgentRepresentativeId
  join arp in dbRRSP.Person on ar.PersonId equals arp.PersonId
  where l.PersonId == personId
  select new { LOAOrReferredByDescription = lrb.LoaOrReferredByDescription, lat.LOAAccessTypeDescription, PersonType = "EMD Rep", LOAName = arp.FirstName + ' ' + arp.LastName, l.DateAdded };
var qryLOAPerson = from l in dbRRSP.LOA
  join lrb in dbRRSP.LOAOrReferredBy on l.LOAOrReferredById equals lrb.LoaOrReferredById
  join lat in dbRRSP.LOAAccessType on l.LOAAccessTypeId equals lat.LOAAccessTypeId
  join lp in dbRRSP.LOAPerson on l.LOAPersonId equals lp.LOAPersonId
  where l.PersonId == personId
  select new { LOAOrReferredByDescription = lrb.LoaOrReferredByDescription, lat.LOAAccessTypeDescription, PersonType = "Person", LOAName = lp.LOAPersonName, l.DateAdded };
These are the four queries. The trickiest part is that the last field is a datetime, which is causing me some issues. I know how to union two of them together like this:
var qryMultipleLOA = qryIssuer.Union(qryEMD).ToList().Select(loa => new ExtendedLOA { LOAOrReferredByDescription = loa.LOAOrReferredByDescription, LOAAccessTypeDescription = loa.LOAAccessTypeDescription, PersonType = loa.PersonType, LOAName = loa.LOAName, DateAdded = DateTime.Parse(loa.DateAdded.ToString()).ToString("MM/dd/yyyy") });
But I'm at a loss on how to add the last two queries. First I tried wrapping it in brackets and adding a .Union, which didn't work, and then when I tried to nest them with appropriate .ToLists, that didn't work either. Below is the code to bind it to the repeater.
rptLOA.DataSource = qryMultipleLOA;
rptLOA.DataBind();
Suggestions would be greatly appreciated.
Did you try something like this?
var qryMultipleLOA = qryIssuer.Union(qryEMD).Union(qryEmdRep).Union(qryLOAPerson).ToList();
Provided your queries' footprints are the same, chaining them like this shouldn't be an issue.
Edit: I would also recommend the following:
Create a class to hold an instance of the resultant data.
Instead of creating lists of dynamic variables generated from LINQ and hoping they all match, funnel the LINQ results into a List of that class. That way you can tell immediately if you have a type mismatch.
Once you have four lists of the same type, unions as per my syntax above will be a snap.
Dynamic LINQ lists can be a pain and unwieldy, and a single property type change can throw off your code at runtime rather than at design time. If you follow the steps above, your code will be much more maintainable and clear to you and others. I hope this helps in some way.
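The recommendation to give all four result sets one explicit record type before chaining the unions can be sketched in plain Python. This is a model of the idea, not C# or LINQ: the field names of ExtendedLOA below are simplified, and union_all and the sample rows are hypothetical. Like LINQ's Union, the helper keeps only distinct rows, in order of first appearance.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ExtendedLOA:
    """One shared, hashable record type for every query result."""
    description: str
    access_type: str
    person_type: str
    loa_name: str
    date_added: date


def union_all(*queries):
    """Chain any number of result lists, dropping duplicate rows."""
    seen = set()
    result = []
    for query in queries:
        for row in query:
            if row not in seen:
                seen.add(row)
                result.append(row)
    return result


# Hypothetical sample rows standing in for two of the four queries.
issuers = [ExtendedLOA("LOA", "Full", "Issuer", "Acme", date(2020, 1, 1))]
emds = [ExtendedLOA("LOA", "Full", "EMD", "Beta", date(2020, 1, 2))]

combined = union_all(issuers, emds, issuers)
print(len(combined))  # 2
```

With a single named type, a mismatch in any field shows up where the record is constructed rather than when the combined list is consumed.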