PIG parsing string inputs

PIG parsing string inputs - parsing

I have two files,
one is titles.csv and has a movie ID and title with this format:
999: Title
734: Another_title
the other is a list of user IDs who link to the movie
categoryID: user1_id, ....
222: 120
227: 414 551
249: 555
Of different sizes (minimum is one user per genre category)
The goal is to first parse the strings so that they are each split into two (for both files), everything before the ':' and everything after.
I have tried doing this
movies = LOAD .... USING PigStorage('\n') AS (line: chararray)
users = LOAD .... USING PigStorage('\n') AS (line: chararray)
-- parse 'users'/outlinks, make a list and count fields
tokenized = FOREACH users GENERATE FLATTEN(TOKENIZE(line, ':')) AS parameter;
filtered = FILTER tokenized BY INDEXOF(parameter, ' ') != -1;
result = FOREACH filtered GENERATE SUBSTRING(parameter, 2, (int)SIZE(parameter)) AS number;
But this is where I got stuck/confused. Thoughts?
I'm also supposed to output the top 10 entries who have the most user IDs in the second part of the string.

try like this
movies = LOAD 'file1' AS titleLine;
A = FOREACH movies GENERATE FLATTEN(REGEX_EXTRACT_ALL(titleLine,'^(.*):\\s+(.*)$')) AS (movieId:chararray,title:chararray);
users = LOAD 'file2' AS userLine;
B = FOREACH users GENERATE FLATTEN(REGEX_EXTRACT_ALL(userLine,'^(.*):\\s+(.*)$')) AS (categoryId:chararray,userId:chararray);
Output1:
(999,Title)
(734,Another_title)
Output2:
(222,120)
(227,414 551)
(249,555 )

Related

Lua how to access specific data in a nested table

I parsed some data from a CSV file to a Lua table.
Lets say the table looks like this just bigger
tab {
{ id = 1761, anotherID=2, ping=pong}
{ id = 2071, anotherID=4, ping=notpong}
}
Now I want to know every ID (without displaying any other data yet) to store them in another table for some time.
I am completely lost here for now..
Using what you wrote I rewrote it a bit and went to have:
minitab = {}
for i, value in ipairs(tab) do
local id = value.id
local anotherID = value.anotherID
minitab[id] = anotherID
end
Would that work? In fact i later want to get just 2 values of a way larger array (around 30 datas) - but I can only push a single array to a GUI dropdown. I want to save the ID as a key and the "anotherID" value wich will be a text after that key so if a ask for the 2071st value it displays the "name" 4

The code below stores the ids as keys in another table:
id={}
for k,v in ipairs(tab) do
id[v.id]=true
end
You can then traverse id with pairs to list the ids.
If you want to remember where each id came from, use id[v.id]=k in the loop.

Based on your question, you can use this code to traverse your data table tab and get minitab to be used for your GUI array:
--data
tab = {
{id = "4204", label = "2", desc = "Roancyme"},
{id = "5517", label = "9", desc = "Bicktuft"},
{id = "1035", label = "3", desc = "Pipyalum"},
}
--temporary table
local minitab = {}
for i, option in ipairs(tab) do
minitab[option.id] = option.label
end
--print minitab
print('<select>')
for id, label in pairs(minitab) do
print(string.format('<option value="%s">%s</option>', id, label)) --> <option value="1035">3</option>
end
print('</select>')
print()
However, I don't think it is necessary to create a temporary table to store those values, because you can easily traverse your original table tab and directly pull out the output you need; like this:
--print directly from tab
print('<select>')
for i, option in ipairs(tab) do
print(string.format('<option value="%s">%s</option>', option.id, option.label)) --> <option value="1035">3</option>
end
print('</select>')
print()
Unless you need to work with it before displaying the list on the drop down (e.g. add some prefix to the label, sort minitab by the label, etc); but you don't want to disturb the original data table tab. In this case, it would make sense to use the temporary table.
--format values in temporary table
local minitab = {}
for i, option in ipairs(tab) do
local minitabID = option.id
local minitabLabel = string.format('Item %s - %s', option.label, option.desc)
table.insert(minitab, {id = minitabID, label = minitabLabel})
end
--sort temporary table
table.sort(minitab, function (o1, o2) return o2.label > o1. label end)
--print formatted values from temporary table
print('<select>')
for i, option in ipairs(minitab) do
print(string.format('<option value="%s">%s</option>', option.id, option.label)) --> <option value="4204">Item 2 - Roancyme</option>
end
print('</select>')
NB: Please take a note on which table iteration uses ipairs and which one uses pairs. See the complete code snippet here.

How to display list_filter with count of related objects in django admin?

How can I display the count of related objects after each filter in list_filter in django admin?
class Application(TimeStampModel):
name = models.CharField(verbose_name='CI Name', max_length=100, unique=True)
description = models.TextField(blank=True, help_text="Business application")
class Server(TimeStampModel):
name = models.CharField(max_length=100, verbose_name='Server Name', unique=True)
company = models.CharField(max_length=3, choices=constants.COMPANIES.items())
online = models.BooleanField(default=True, blank=True, verbose_name='OnLine')
application_members = models.ManyToManyField('Application',through='Rolemembership',
through_fields = ('server', 'application'),
)
class Rolemembership(TimeStampModel):
server = models.ForeignKey(Server, on_delete = models.CASCADE)
application = models.ForeignKey(Application, on_delete = models.CASCADE)
name = models.CharField(verbose_name='Server Role', max_length=50, choices=constants.SERVER_ROLE.items())
roleversion = models.CharField(max_length=100, verbose_name='Version', blank=True)
Admin.py
#admin.register(Server)
class ServerAdmin(admin.ModelAdmin):
save_on_top = True
list_per_page = 30
list_max_show_all = 500
inlines = [ServerInLine]
list_filter = (
'region',
'rolemembership__name',
'online',
'company',
'location',
'updated_on',
)
i.e After each filter in list filter, I want to show the count of related objects.
Now it only shows the list of filter
i.e location filter list
Toronto
NY
Chicago
I want the filter to show the count like below:
Toronto(5)
NY(3)
Chicago(2)
And if the filter has 0 related objects, don't display the filter.

This is possible with a custom list filter by combining two ideas.
One: the lookups method lets you control the value used in the query string and the text displayed as filter text.
Two: you can inspect the data set when you build the list of filters. The docs at https://docs.djangoproject.com/en/1.11/ref/contrib/admin/#django.contrib.admin.ModelAdmin.list_filter shows examples for a start decade list filter (always shows «80s» and «90s») and a dynamic filter (shows «80s» if there are matching records, same for «90s»).
Also as a convenience, the ModelAdmin object is passed to the lookups
method, for example if you want to base the lookups on the available
data
This is a filter I wrote to filter data by language:
class BaseLanguageFilter(admin.SimpleListFilter):
title = _('language')
parameter_name = 'lang'
def lookups(self, request, model_admin):
# Inspect the existing data to return e.g. ('fr', 'français (11)')
# Note: the values and count are computed from the full data set,
# ignoring currently applied filters.
qs = model_admin.get_queryset(request)
for lang, name in settings.LANGUAGES:
count = qs.filter(language=lang).count()
if count:
yield (lang, f'{name} ({count})')
def queryset(self, request, queryset):
# Apply the filter selected, if any
lang = self.value()
if lang:
return queryset.filter(language=lang)
You can start from that and adapt it for your cities by replacing the part with settings.LANGUAGES with a queryset aggregation + values_list that will return the distinct values and counts for cities.

Éric's code got me 80% of what I needed. To address the comment he left in his code (about ignoring currently applied filters), I ended up using the following for my use case:
from django.db.models import Count
class CountAnnotatedFeedFilter(admin.SimpleListFilter):
title = 'feed'
parameter_name = 'feed'
def lookups(self, request, model_admin):
qs = model_admin.get_queryset(request).filter(**request.GET.dict())
for pk, name, count in qs.values_list('feed__feed_id', 'feed__feed_name').annotate(total=Count('feed')).order_by('-total'):
if count:
yield pk, f'{name} ({count})'
def queryset(self, request, queryset):
feed_id = self.value()
if feed_id:
return queryset.filter(feed_id=feed_id)
And then, in the admin model:
class FeedEntryAdmin(admin.ModelAdmin):
list_filter = (CountAnnotatedFeedFilter,)
Note: As Éric also mentioned, this can impact the speed of the admin panel quite heavily, as it may have to perform expensive queries.

Multipy after joining data in PIG

I am trying to multiply two fields and take their sum after joining three tables in Pig. However I keep on getting this error:
<file loyalty_program.pig, line 30, column 74> (Name: Multiply Type: null Uid: null)incompatible types in Multiply Operator left hand side:bag :tuple(new_details1::new_details::potential_customers::num_of_orders:long) right hand side:bag :tuple(products::price:int)
-- load the data sets
orders = LOAD '/dualcore/orders' AS (order_id:int,
cust_id:int,
order_dtm:chararray);
details = LOAD '/dualcore/order_details' AS (order_id:int,
prod_id:int);
products = LOAD '/dualcore/products' AS (prod_id:int,
brand:chararray,
name:chararray,
price:int,
cost:int,
shipping_wt:int);
recent = FILTER orders by order_dtm matches '2012-.*$';
customer = GROUP recent by cust_id;
cust_orders = FOREACH customer GENERATE group as cust_id, (int)COUNT(recent) as num_of_orders;
potential_customers = FILTER cust_orders by num_of_orders>=5;
new_details = join potential_customers by cust_id, recent by cust_id;
new_details1 = join new_details by order_id, details by order_id;
new_details2 = join new_details1 by prod_id, products by prod_id;
--DESCRIBE new_details2;
final_details = FOREACH new_details2 GENERATE potential_customers::cust_id, potential_customers::num_of_orders as num_of_orders,recent::order_id as order_id,recent::order_dtm,details::prod_id,products::brand,products::name,products::price as price,products::cost,products::shipping_wt;
grouped_data = GROUP final_details by cust_id;
member = FOREACH grouped_data GENERATE SUM(final_details.num_of_orders * final_details.price) ;
lim = limit member 10;
dump lim;
I even casted the result of count to int. It still keeps on throwing this error at me. I have no clue how to go about it.

Ok.. I think at first, you want to multiply no.of purchases with the price of each product and then you need total SUM of that multiplied value..
Even though this is a strange requirement, but you can go with below approach..
All you need to do is calculate the multiplication in final_details Foreach statement itself and simply apply the SUM for that multiplied amount..
Based on your load statements I created the below input files
main_orders.txt
6666,100,2012-01-01
7777,101,2012-09-02
8888,100,2012-01-09
9999,101,2012-12-08
6666,101,2012-09-02
9999,100,2012-07-12
9999,100,2012-08-01
6666,100,2012-01-02
7777,100,2012-09-09
orders_details.txt
6666,6000
7777,7000
8888,8000
9999,9000
main_products.txt
6000,Nike,Shoes,3000,3000,1
7000,Adidas,Cap,1000,1000,1
8000,Rebook,Shoes,4000,4000,1
9000,Puma,Shoes,25000,2500,1
Below is the code
orders = LOAD '/user/cloudera/inputfiles/main_orders.txt' USING PigStorage(',') AS (order_id:int,cust_id:int,order_dtm:chararray);
details = LOAD '/user/cloudera/inputfiles/orders_details.txt' USING PigStorage(',') AS (order_id:int,prod_id:int);
products = LOAD '/user/cloudera/inputfiles/main_products.txt' USING PigStorage(',') AS(prod_id:int,brand:chararray,name:chararray,price:int,cost:int,shipping_wt:int);
recent = FILTER orders by order_dtm matches '2012-.*';
customer = GROUP recent by cust_id;
cust_orders = FOREACH customer GENERATE group as cust_id, (int)COUNT(recent) as num_of_orders;
potential_customers = FILTER cust_orders by num_of_orders>=5;
new_details = join potential_customers by cust_id, recent by cust_id;
new_details1 = join new_details by order_id, details by order_id;
new_details2 = join new_details1 by prod_id, products by prod_id;
DESCRIBE new_details2;
final_details = FOREACH new_details2 GENERATE potential_customers::cust_id, potential_customers::num_of_orders as num_of_orders,recent::order_id as order_id,recent::order_dtm,details::prod_id,products::brand,products::name,products::price as price,products::cost,products::shipping_wt, (potential_customers::num_of_orders * products::price ) as multiplied_price;// multiplication is achived in last variable
dump final_details;
grouped_data = GROUP final_details by cust_id;
member = FOREACH grouped_data GENERATE SUM(final_details.multiplied_price) ;
lim = limit member 10;
dump lim;
Just for clarity I am dumping the output of final_details foreach statement as well.
(100,6,6666,2012-01-01,6000,Nike,Shoes,3000,3000,1,18000)
(100,6,6666,2012-01-02,6000,Nike,Shoes,3000,3000,1,18000)
(100,6,7777,2012-09-09,7000,Adidas,Cap,1000,1000,1,6000)
(100,6,8888,2012-01-09,8000,Rebook,Shoes,4000,4000,1,24000)
(100,6,9999,2012-07-12,9000,Puma,Shoes,25000,2500,1,150000)
(100,6,9999,2012-08-01,9000,Puma,Shoes,25000,2500,1,150000)
final output is below
(366000)
This code may help you, but Please clarify your requirement again

Pig Script to parse and compute cumulative value

Input file with two column:
Visit ProductString
101 ;Cross Trainers;1;69.95,;Athletic Socks;10;29.99
102 ;Amplifier;1;120.90,;Headphone;2;59.99;leather wallet;1;99.99;
I am looking for Pig script that can parse "ProductString" value in each row and provide cumulative revenue.
ie.,Output:
69.95+29.99+120.90+59.99+99.99=380.82

I'll assume there should be a , after 59.99 and that there shouldn't be a ; after 99.99. If so, you need to tokenize and flatten on the , to extract products and then split on the ; to get item prices and qty.
Query:
data = LOAD 'db.table';
A = FOREACH data GENERATE visit, FLATTEN(TOKENIZE(product_string, ',')) AS tmp_col;
B = FOREACH A GENERATE visit, STRSPLIT(tmp_col, ';') AS prod;
C = FOREACH B GENERATE visit, prod.$1 AS item:chararray
, (int)prod.$2 AS qty:int, (double)prod.$3 AS revenue:double;
grpd = GROUP C all;
D = FOREACH grpd GENERATE SUM(C.revenue);
DUMP D;
Output:
(380.82)

ASP.Net MVC Linq Grouping Query

I am getting the last 20 updated records in the database by using the following
var files = (from f in filesContext.Files
join ur in filesContext.aspnet_Roles on f.Authority equals ur.RoleId
join u in filesContext.aspnet_Users on f.Uploader equals u.UserId
orderby f.UploadDate descending
select new FileInfo { File = f, User = u, UserRole = ur }).Take(20);
I am then splitting the results in my view:
<%foreach(var group in Model.GroupBy(f => f.UserRole.RoleName)) {%>
//output table here
This is fine as a table is rendered for each of my roles. However as expected I get the last 20 records overall, how could I get the last 20 records per role?
So I end up with:
UserRole1
//Last 20 Records relating to this UserRole1
UserRole2
//Last 20 Records relating to this UserRole2
UserRole3
//Last 20 Records relating to this UserRole3

I can think of three possible ways to do this. First, get all the roles, then perform a Take(20) query per role, aggregating the results into your model. This may or may not be a lot of different queries depending on the number of roles you have. Second, get all the results, then filter the last 20 per role in your view. This could be a very large query, taking lots of time. Third, get some large number of results that will likely have at least 20 entries per role (but is not guaranteed) and then filter the last 20 per role in your view. I would probably use the first or third options depending how important it is to get 20 results.
var files = (from f in filesContext.Files
join ur in filesContext.aspnet_Roles on f.Authority equals ur.RoleId
join u in filesContext.aspnet_Users on f.Uploader equals u.UserId
orderby f.UploadDate descending
select new FileInfo { File = f, User = u, UserRole = ur })
.Take(2000);
<% foreach (var group in Model.GroupBy( f => f.UserRole.RoleName,
(role,infos) =>
new {
Key = role.RoleName,
Selected = infos.Take(20)
} )) { %>
<%= group.Key %>
<% foreach (var selection in group.Selected)
{ %>
...

You could either count the elements and skip the first (length - 20) elements or just reverse/take 20/reverse.
foreach (var group in Model.GroupBy(f => f.UserRole.RoleName))
{
// draw table header
foreach (item in group.Reverse().Take(20).Reverse())
{
// draw item
}
// Or
int skippedElementCount = group.Count() - 20;
if (skippedElementCount < 0) skippedElementCount = 0;
foreach (item in group.Skip(skippedElementCount))
{
// draw item
}
// draw table footer
}

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

PIG parsing string inputs - parsing

Related

Lua how to access specific data in a nested table

How to display list_filter with count of related objects in django admin?

Multipy after joining data in PIG

Pig Script to parse and compute cumulative value

ASP.Net MVC Linq Grouping Query

Categories

Resources