Count after a join relation in Pig

I am trying to load two files from HDFS into Pig.
After I join the driver relation with the truck relation, I would like to count the rows.
How can I count the rows in a relation?
I tried this, but it gives me a count per group rather than a single count:
truck_temp = FOREACH (GROUP truck_join BY drivers_info::driverId) { GENERATE group, COUNT(truck_join); };
drivers_load = LOAD '/Pig-Practice/drivers.csv' USING PigStorage(',') AS (driverId:int,name:chararray,ssn:biginteger,location:chararray,certified:chararray,wageplan:chararray);
drivers_info = FOREACH ( GROUP drivers_load BY (driverId,name)) GENERATE group.driverId,group.name;
event_load = LOAD '/Pig-Practice/truck_event_text_partition.csv' USING PigStorage(',') AS (driverId:int, truckId:int, eventTime:chararray,
eventType:chararray, longitude:double, latitude:double,
eventKey:chararray, correlationId:long, driverName:chararray,
routeId:long,routeName:chararray,eventDate:chararray);
truck_events1 = FILTER event_load BY $0 >1;
truck_events2 = FOREACH (GROUP truck_events1 BY (driverId,driverName,routeId,routeName) ) GENERATE group.driverId,group.driverName,group.routeId,group.routeName;
truck_join = JOIN drivers_info BY driverId, truck_events2 BY driverId;

To get the total count after a join, you need to GROUP ALL first.
COUNT requires a preceding GROUP ALL statement for global counts and a GROUP BY statement for group counts.
Reference: COUNT
truck_temp = FOREACH (GROUP truck_join ALL)
{
GENERATE COUNT(truck_join);
};
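To make the difference concrete, here is a small Python sketch (not Pig) that mimics the two groupings. The rows in `truck_join` and the helper names `count_per_driver` and `count_all` are made up for illustration and are not taken from the question's files.

```python
from collections import Counter

# Hypothetical joined rows: (driverId, routeId). Not taken from the real files.
truck_join = [(10, 1), (10, 2), (11, 1), (12, 3), (12, 4), (12, 5)]


def count_per_driver(rows):
    """Mimics GROUP ... BY driverId followed by COUNT: one count per key."""
    return dict(Counter(driver_id for driver_id, _ in rows))


def count_all(rows):
    """Mimics GROUP ... ALL followed by COUNT: a single count for the relation."""
    return len(rows)


per_driver = count_per_driver(truck_join)
total = count_all(truck_join)
print(per_driver)  # one entry per driverId
print(total)       # a single number
```

Grouping by a key gives one row per key, which is what the question's first attempt produced; grouping everything into one bag gives the single total.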

Related

JOIN two data set on the basis of string matching condition in Pig

I am new to Pig and I have two data sets, "highspender" and "feedback".
Highspender:
Price,fname,lname
$50,Jack,Brown
$30,Rovin,Pall
Feedback:
date,Name,rate
2015-01-02,Jack B Brown,5
2015-01-02,Pall,4
Now I have to join these two datasets on the basis of name. My condition is that the fname or lname of Highspender should match the Name of Feedback. How can I join these two datasets?
You can try the script below; all you need to do is replace the names according to your data.
highs = LOAD 'highs' using PigStorage(',') as (Price:chararray,fname:chararray,lname:chararray);
feedback = LOAD 'feeds' using PigStorage(',') as (date:chararray,Name:chararray,rate:chararray);
out = JOIN highs BY fname, feedback BY Name;
out1 = JOIN highs BY lname, feedback BY Name;
final_out = UNION out,out1;
For further help you can refer to the Pig reference manual.
EDIT
As per the comment, the script for joining the data with a string function is as below:
highs = LOAD 'highs' using PigStorage(',') as (Price:chararray,fname:chararray,lname:chararray);
feedback = LOAD 'feeds' using PigStorage(',') as (date:chararray,Name:chararray,rate:chararray);
crossout = cross highs, feedback;
final_lname = filter crossout by ( REPLACE (feedback::Name,highs::lname ,'') != feedback::Name);
final_fname = filter crossout by ( REPLACE (feedback::Name,highs::fname ,'') != feedback::Name);
final = UNION final_lname, final_fname;
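As a rough cross-check of the matching logic, here is a Python sketch (not Pig) run over the sample rows from the question. The helper name `name_matches` is hypothetical, and it uses a plain substring test, which is an assumption about what "match" should mean here.

```python
# Sample rows copied from the question.
highs = [("$50", "Jack", "Brown"), ("$30", "Rovin", "Pall")]
feedback = [("2015-01-02", "Jack B Brown", "5"), ("2015-01-02", "Pall", "4")]


def name_matches(high, fb):
    """True when fname or lname occurs as a substring of the feedback Name."""
    _, fname, lname = high
    return fname in fb[1] or lname in fb[1]


# Equivalent of CROSS followed by one FILTER with an OR condition.
matched = [h + f for h in highs for f in feedback if name_matches(h, f)]

for row in matched:
    print(row)
```

Note that the two-filter script emits the Jack Brown row twice, because both the first and the last name match and UNION in Pig does not remove duplicates; a single filter with OR (as sketched here) or a DISTINCT afterwards avoids that. Also, as far as I know, Pig's REPLACE treats its second argument as a regular expression, so names containing regex metacharacters may behave differently from the plain substring test used in this sketch.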

Multiply after joining data in Pig

I am trying to multiply two fields and take their sum after joining three tables in Pig. However I keep on getting this error:
<file loyalty_program.pig, line 30, column 74> (Name: Multiply Type: null Uid: null)incompatible types in Multiply Operator left hand side:bag :tuple(new_details1::new_details::potential_customers::num_of_orders:long) right hand side:bag :tuple(products::price:int)
-- load the data sets
orders = LOAD '/dualcore/orders' AS (order_id:int,
cust_id:int,
order_dtm:chararray);
details = LOAD '/dualcore/order_details' AS (order_id:int,
prod_id:int);
products = LOAD '/dualcore/products' AS (prod_id:int,
brand:chararray,
name:chararray,
price:int,
cost:int,
shipping_wt:int);
recent = FILTER orders by order_dtm matches '2012-.*$';
customer = GROUP recent by cust_id;
cust_orders = FOREACH customer GENERATE group as cust_id, (int)COUNT(recent) as num_of_orders;
potential_customers = FILTER cust_orders by num_of_orders>=5;
new_details = join potential_customers by cust_id, recent by cust_id;
new_details1 = join new_details by order_id, details by order_id;
new_details2 = join new_details1 by prod_id, products by prod_id;
--DESCRIBE new_details2;
final_details = FOREACH new_details2 GENERATE potential_customers::cust_id, potential_customers::num_of_orders as num_of_orders,recent::order_id as order_id,recent::order_dtm,details::prod_id,products::brand,products::name,products::price as price,products::cost,products::shipping_wt;
grouped_data = GROUP final_details by cust_id;
member = FOREACH grouped_data GENERATE SUM(final_details.num_of_orders * final_details.price) ;
lim = limit member 10;
dump lim;
I even cast the result of COUNT to int. It still keeps throwing this error. I have no clue how to go about it.
OK, I think you first want to multiply the number of purchases by the price of each product, and then take the total SUM of those multiplied values.
This is a strange requirement, but you can go with the approach below.
All you need to do is calculate the multiplication in the final_details FOREACH statement itself and then simply apply SUM to that multiplied amount.
Based on your load statements, I created the input files below.
main_orders.txt
6666,100,2012-01-01
7777,101,2012-09-02
8888,100,2012-01-09
9999,101,2012-12-08
6666,101,2012-09-02
9999,100,2012-07-12
9999,100,2012-08-01
6666,100,2012-01-02
7777,100,2012-09-09
orders_details.txt
6666,6000
7777,7000
8888,8000
9999,9000
main_products.txt
6000,Nike,Shoes,3000,3000,1
7000,Adidas,Cap,1000,1000,1
8000,Rebook,Shoes,4000,4000,1
9000,Puma,Shoes,25000,2500,1
Below is the code
orders = LOAD '/user/cloudera/inputfiles/main_orders.txt' USING PigStorage(',') AS (order_id:int,cust_id:int,order_dtm:chararray);
details = LOAD '/user/cloudera/inputfiles/orders_details.txt' USING PigStorage(',') AS (order_id:int,prod_id:int);
products = LOAD '/user/cloudera/inputfiles/main_products.txt' USING PigStorage(',') AS(prod_id:int,brand:chararray,name:chararray,price:int,cost:int,shipping_wt:int);
recent = FILTER orders by order_dtm matches '2012-.*';
customer = GROUP recent by cust_id;
cust_orders = FOREACH customer GENERATE group as cust_id, (int)COUNT(recent) as num_of_orders;
potential_customers = FILTER cust_orders by num_of_orders>=5;
new_details = join potential_customers by cust_id, recent by cust_id;
new_details1 = join new_details by order_id, details by order_id;
new_details2 = join new_details1 by prod_id, products by prod_id;
DESCRIBE new_details2;
final_details = FOREACH new_details2 GENERATE potential_customers::cust_id, potential_customers::num_of_orders as num_of_orders,recent::order_id as order_id,recent::order_dtm,details::prod_id,products::brand,products::name,products::price as price,products::cost,products::shipping_wt, (potential_customers::num_of_orders * products::price ) as multiplied_price; -- the multiplication is done in the last field
dump final_details;
grouped_data = GROUP final_details by cust_id;
member = FOREACH grouped_data GENERATE SUM(final_details.multiplied_price) ;
lim = limit member 10;
dump lim;
Just for clarity, I am also dumping the output of the final_details FOREACH statement.
(100,6,6666,2012-01-01,6000,Nike,Shoes,3000,3000,1,18000)
(100,6,6666,2012-01-02,6000,Nike,Shoes,3000,3000,1,18000)
(100,6,7777,2012-09-09,7000,Adidas,Cap,1000,1000,1,6000)
(100,6,8888,2012-01-09,8000,Rebook,Shoes,4000,4000,1,24000)
(100,6,9999,2012-07-12,9000,Puma,Shoes,25000,2500,1,150000)
(100,6,9999,2012-08-01,9000,Puma,Shoes,25000,2500,1,150000)
The final output is below:
(366000)
This code may help you, but please clarify your requirement again.
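The error in the question comes from multiplying two bag projections (final_details.num_of_orders and final_details.price) inside the grouped FOREACH; the fix is to multiply the scalar fields row by row before grouping. The Python sketch below (not Pig) replays that order of operations on the sample input files above. Using dictionaries for the two lookups is my own simplification and assumes one product per order, as in the sample data.

```python
from collections import Counter

# Sample rows copied from the input files above: (order_id, cust_id, order_dtm).
orders = [
    (6666, 100, "2012-01-01"), (7777, 101, "2012-09-02"),
    (8888, 100, "2012-01-09"), (9999, 101, "2012-12-08"),
    (6666, 101, "2012-09-02"), (9999, 100, "2012-07-12"),
    (9999, 100, "2012-08-01"), (6666, 100, "2012-01-02"),
    (7777, 100, "2012-09-09"),
]
details = {6666: 6000, 7777: 7000, 8888: 8000, 9999: 9000}  # order_id -> prod_id
prices = {6000: 3000, 7000: 1000, 8000: 4000, 9000: 25000}  # prod_id -> price

# FILTER orders BY order_dtm MATCHES '2012-.*'
recent = [o for o in orders if o[2].startswith("2012-")]

# GROUP recent BY cust_id, then COUNT
num_orders = Counter(cust_id for _, cust_id, _ in recent)

# Multiply per row first (scalar * scalar), then sum per customer.
totals = Counter()
for order_id, cust_id, _ in recent:
    if num_orders[cust_id] >= 5:  # potential_customers filter
        totals[cust_id] += num_orders[cust_id] * prices[details[order_id]]

print(dict(totals))
```

Only customer 100 has at least five orders, so only that customer contributes to the result.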

Strange execution time for summary query

I am giving here part of the query I am executing:
SELECT SUM(ParentTable.Field1),
(SELECT SUM(ChildrenTable.Field1)
FROM ChildrenTable INNER JOIN
GrandChildrenTable ON ChildrenTable.Id = GrandChildrenTable.ChildrenTableId INNER JOIN
AnotherTable ON GrandChildrenTable.AnotherTableId = AnotherTable.Id
WHERE ChildrenTable.ParentTableId = ParentTable.Id
AND AnotherTable.Type=1),
----
FROM ParentTable
WHERE some_conditions
Relationships:
ParentTable -> ChildrenTable = 1-to-many
ChildrenTable -> GrandChildrenTable = 1-to-many
GrandChildrenTable -> AnotherTable = 1-to-1
I am executing this query three times, while changing only the Type condition, and here are the results:
Number of records that are returned:
Condition Total execution time (ms)
Type = 1 : 973
Type = 2 : 78810
Type = 3 : 648318
If I execute just the inner join query, here is the count of joined records:
SELECT p.Type, COUNT(*)
FROM CycleActivities ca INNER JOIN
CycleActivityProducts cap ON ca.Id = CAP.CycleActivityId INNER JOIN
Products p ON cap.ProductId = p.Id
GROUP BY p.Type
Type  Count
----  ------
1     55152
2     13401
4     102730
So, why would the query with Type = 1 condition execute much faster than the query with Type = 2, although it is querying 4x larger resultset (Type is tinyint)?
The way your query is written instructs SQL Server to execute the sub-query with JOIN for every row of the output.
This way it should be faster, if I understand what you want correctly (UPDATED):
with cte_parent as (
select
Id,
SUM (ParentTable.Field1) as Parent_Sum
from ParentTable
group by Id
),
cte_child as (
SELECT
ChildrenTable.ParentTableId as Id, -- group child rows by their parent
SUM (ChildrenTable.Field1) as Child_Sum
FROM ChildrenTable
INNER JOIN
GrandChildrenTable ON ChildrenTable.Id = GrandChildrenTable.ChildrenTableId
INNER JOIN
AnotherTable ON GrandChildrenTable.AnotherTableId = AnotherTable.Id
WHERE
AnotherTable.Type=1
AND
some_conditions
GROUP BY ChildrenTable.ParentTableId
)
select cte_parent.Id, Parent_Sum, Child_Sum
from cte_parent
join cte_child on cte_parent.Id = cte_child.Id
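To check that pre-aggregating the child sums gives the same numbers as the per-row sub-query, here is a small sketch using Python's built-in sqlite3 module. It runs on SQLite, not SQL Server, and the rows are invented for illustration; the table and column names follow the question, with ParentTableId assumed to be the foreign key on ChildrenTable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ParentTable (Id INTEGER PRIMARY KEY, Field1 INTEGER);
CREATE TABLE ChildrenTable (Id INTEGER PRIMARY KEY, ParentTableId INTEGER, Field1 INTEGER);
CREATE TABLE GrandChildrenTable (Id INTEGER PRIMARY KEY, ChildrenTableId INTEGER, AnotherTableId INTEGER);
CREATE TABLE AnotherTable (Id INTEGER PRIMARY KEY, Type INTEGER);
-- Made-up rows, only for illustration.
INSERT INTO ParentTable VALUES (1, 10), (2, 20);
INSERT INTO ChildrenTable VALUES (1, 1, 5), (2, 1, 7), (3, 2, 9);
INSERT INTO AnotherTable VALUES (1, 1), (2, 2), (3, 1);
INSERT INTO GrandChildrenTable VALUES (1, 1, 1), (2, 2, 2), (3, 3, 3);
""")

# Correlated sub-query: evaluated once per parent row.
correlated = conn.execute("""
SELECT p.Id,
       (SELECT SUM(c.Field1)
        FROM ChildrenTable c
        JOIN GrandChildrenTable g ON c.Id = g.ChildrenTableId
        JOIN AnotherTable a ON g.AnotherTableId = a.Id
        WHERE c.ParentTableId = p.Id AND a.Type = 1) AS Child_Sum
FROM ParentTable p
ORDER BY p.Id
""").fetchall()

# Pre-aggregated: sum the children once per parent, then join.
pre_aggregated = conn.execute("""
WITH cte_child AS (
    SELECT c.ParentTableId AS Id, SUM(c.Field1) AS Child_Sum
    FROM ChildrenTable c
    JOIN GrandChildrenTable g ON c.Id = g.ChildrenTableId
    JOIN AnotherTable a ON g.AnotherTableId = a.Id
    WHERE a.Type = 1
    GROUP BY c.ParentTableId
)
SELECT p.Id, cc.Child_Sum
FROM ParentTable p
JOIN cte_child cc ON cc.Id = p.Id
ORDER BY p.Id
""").fetchall()

print(correlated)
print(pre_aggregated)
```

One difference to keep in mind: the inner join drops parents that have no matching children, while the correlated sub-query returns NULL for them, so a LEFT JOIN is the closer equivalent if those parents must appear.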

Translating SQL statement with dates to Linq-to-Sql for use with EF4

I have the following SQL command:
SELECT CONVERT(varchar, Logged, 103) AS Visited, COUNT(ID) AS Totals
FROM tblStats
GROUP BY CONVERT(varchar, Logged, 103)
ORDER BY Visited DESC
I want to translate this into a L2S statement that can be used with the Entity Framework, but in working with datetime types, I'm getting various errors depending on how I try to attack the problem.
Approach:
var results = from s in db.Stats
group s by (s.Logged.Date.ToString()) into Grp
select new { Day = Grp.Key, Total = Grp.Count() };
Error:
LINQ to Entities does not recognize
the method 'System.String ToString()'
method, and this method cannot be
translated into a store expression.
Approach:
var results = from s in db.Stats
group s by (s.Logged.Date) into Grp
select new { Day = Grp.Key, Total = Grp.Count() };
Error:
The specified type member 'Date' is
not supported in LINQ to Entities.
Only initializers, entity members, and
entity navigation properties are
supported.
What syntax do I need to make the query work?
Try using the EntityFunctions.TruncateTime method:
var results = from s in db.Stats
group s by EntityFunctions.TruncateTime(s.Logged) into Grp
select new { Day = Grp.Key, Total = Grp.Count() };
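For intuition, grouping by the truncated time is the same as dropping the time-of-day part and counting rows per calendar day. A minimal Python sketch of that idea (not LINQ), with made-up `logged` values standing in for the Logged column:

```python
from collections import Counter
from datetime import datetime

# Hypothetical Logged values; not taken from the question's table.
logged = [
    datetime(2011, 3, 1, 9, 15),
    datetime(2011, 3, 1, 17, 40),
    datetime(2011, 3, 2, 8, 5),
]

# Truncate each timestamp to its date, then count rows per day.
totals = Counter(ts.date() for ts in logged)

# Newest day first, ordering by the date value itself.
for day, total in sorted(totals.items(), reverse=True):
    print(day.isoformat(), total)
```

Because the key stays a date rather than a dd/mm/yyyy string (style 103 in the original SQL), ordering by it is chronological rather than alphabetical.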
Do you need the Date section in s.Logged.Date?
Have you tried this:
var results = from s in db.Stats
group s by (s.Logged) into Grp
select new { Day = Grp.Key, Total = Grp.Count() };
I'm assuming that Logged is the property (column in the table).
EDIT: you guys are way too quick.

ASP.Net MVC Linq Grouping Query

I am getting the last 20 updated records in the database by using the following
var files = (from f in filesContext.Files
join ur in filesContext.aspnet_Roles on f.Authority equals ur.RoleId
join u in filesContext.aspnet_Users on f.Uploader equals u.UserId
orderby f.UploadDate descending
select new FileInfo { File = f, User = u, UserRole = ur }).Take(20);
I am then splitting the results in my view:
<%foreach(var group in Model.GroupBy(f => f.UserRole.RoleName)) {%>
//output table here
This is fine, as a table is rendered for each of my roles. However, as expected, I get the last 20 records overall. How could I get the last 20 records per role?
So I end up with:
UserRole1
//Last 20 Records relating to this UserRole1
UserRole2
//Last 20 Records relating to this UserRole2
UserRole3
//Last 20 Records relating to this UserRole3
I can think of three possible ways to do this. First, get all the roles, then perform a Take(20) query per role, aggregating the results into your model. This may or may not be a lot of different queries depending on the number of roles you have. Second, get all the results, then filter the last 20 per role in your view. This could be a very large query, taking lots of time. Third, get some large number of results that will likely have at least 20 entries per role (but is not guaranteed) and then filter the last 20 per role in your view. I would probably use the first or third options depending how important it is to get 20 results.
var files = (from f in filesContext.Files
join ur in filesContext.aspnet_Roles on f.Authority equals ur.RoleId
join u in filesContext.aspnet_Users on f.Uploader equals u.UserId
orderby f.UploadDate descending
select new FileInfo { File = f, User = u, UserRole = ur })
.Take(2000);
<% foreach (var group in Model.GroupBy( f => f.UserRole.RoleName,
(role,infos) =>
new {
Key = role, // the group key is already the role name
Selected = infos.Take(20)
} )) { %>
<%= group.Key %>
<% foreach (var selection in group.Selected)
{ %>
...
You could either count the elements and skip the first (length - 20) elements or just reverse/take 20/reverse.
foreach (var group in Model.GroupBy(f => f.UserRole.RoleName))
{
// draw table header
foreach (var item in group.Reverse().Take(20).Reverse())
{
// draw item
}
// Or
int skippedElementCount = group.Count() - 20;
if (skippedElementCount < 0) skippedElementCount = 0;
foreach (var item in group.Skip(skippedElementCount))
{
// draw item
}
// draw table footer
}
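The underlying pattern in all of these is "top N per group". Here is a language-neutral sketch in Python (not LINQ) of one way to do it in a single pass; the function name `latest_per_role`, the tuple layout, and the sample files are all hypothetical.

```python
from collections import defaultdict


def latest_per_role(files, n=20):
    """Return {role: newest n files}, given (role, upload_date, name) tuples."""
    by_role = defaultdict(list)
    # Sort newest first once, so each role's list is already in display order.
    for role, uploaded, name in sorted(files, key=lambda f: f[1], reverse=True):
        if len(by_role[role]) < n:
            by_role[role].append((uploaded, name))
    return dict(by_role)


# Hypothetical data: (role, upload date as ISO string, file name).
files = [
    ("Admin", "2010-01-03", "c.txt"),
    ("Admin", "2010-01-01", "a.txt"),
    ("Admin", "2010-01-02", "b.txt"),
    ("Guest", "2010-01-05", "e.txt"),
    ("Guest", "2010-01-04", "d.txt"),
]
top2 = latest_per_role(files, n=2)
for role, items in top2.items():
    print(role, items)
```

This still reads every row, so it corresponds to the "get all the results, then filter per role" option; the trade-off against one query per role depends on how many rows and roles there are.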
