How does foreach work in Pig?

I have sample data that looks like this:
1950,0,1
1950,22,1
1950,-11,1
1949,111,1
1949,78,1
and I used the following commands:
A = load 'path/to/the/sample';
B = foreach A generate $0,$1;
which should generate only the first two columns of A.
Then I used
describe B
to check how it works. It returns B: {a: bytearray,b: bytearray}, which is correct.
However, when I run the command
dump B
why does it return:
(1950,0,1,)
(1950,22,1,)
(1950,-11,1,)
(1949,111,1,)
(1949,78,1,)
as the result? It's so weird. I've tried it several times, but I still get the same result.

This happens because Pig by default splits each line of your data on tabs. So when you pass it a line like
1950,0,1
it thinks it has found just a single field, 1950,0,1. Since you indicated that each line has two fields, the second field is just set to NULL.
So when you GENERATE the two fields you loaded, it prints out the tuple
(1950,0,1,)
If you were to STORE this instead of DUMPing it you would see it more clearly. Pig would store the data separated by tabs (again, the default), and your output file would look like
1950,0,1
1950,22,1
1950,-11,1
1949,111,1
1949,78,1
That's not very enlightening, so look instead at what happens if you do this (note that Pig string constants take single quotes):
B = foreach A generate $0, 'test';
store B into 'output';
Now the data in output would be
1950,0,1 test
1950,22,1 test
1950,-11,1 test
1949,111,1 test
1949,78,1 test
You can control what Pig uses as the field separator for both LOAD and STORE by using the clause USING PigStorage(','). The argument to PigStorage can be whatever character you like. One other common one is USING PigStorage('\n'), which will load in each line as a whole.
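To make the effect of the delimiter concrete, here is a minimal sketch in Python rather than Pig. It only imitates the behaviour described above; the helper names load and project are made up for illustration and are not part of Pig or PigStorage.

```python
# Rough Python imitation of the behaviour described above (not Pig itself).
# `load` and `project` are hypothetical helpers invented for this sketch.

def load(lines, delimiter="\t"):
    """Split each line into fields on the delimiter (tab by default)."""
    return [tuple(line.split(delimiter)) for line in lines]

def project(rows, *positions):
    """Pick fields by position; a missing field becomes None (a null)."""
    return [
        tuple(row[p] if p < len(row) else None for p in positions)
        for row in rows
    ]

sample = ["1950,0,1", "1950,22,1"]

# Default tab delimiter: the whole line is one field, the second is null.
by_tab = project(load(sample), 0, 1)
print(by_tab)    # [('1950,0,1', None), ('1950,22,1', None)]

# Comma delimiter: the first two columns come out as expected.
by_comma = project(load(sample, ","), 0, 1)
print(by_comma)  # [('1950', '0'), ('1950', '22')]
```

The first result corresponds to the puzzling (1950,0,1,) tuples in the question; the second corresponds to what loading with a comma separator produces.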

Use a PigStorage clause in your LOAD statement:
A = load 'path/to/the/sample' using PigStorage(',');
B = foreach A generate $0,$1;
dump B
Now you will get the result you expect:
(1950,0)
(1950,22)
(1950,-11)
(1949,111)
(1949,78)


We were given a task on Lua tables, but our code is not working as expected

Our task is to create a table and read values into it using a loop, then print the values after the process is complete.
- Create a table.
- Read the number of values to be read into the table.
- Read the values into the table using a loop.
- Print the values in the table using another loop.
For this we wrote the following code:
local table = {}
for value in ipairs(table) do
io.read()
end
for value in ipairs(table) do
print(value)
end
We are not sure where we went wrong; please help us. Our result is:
Input (stdin)
3
11
22
abc
Your Output (stdout)
~ no output ~
Expected Output
11
22
abc
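The correct code is:
local table1 = {}
-- The first input line holds the number of values; read it so it is not stored
local x = io.read()
-- Read every remaining line of input into the table
for line in io.lines() do
table.insert(table1, line)
end
-- ipairs yields an index and a value; print only the value
for K, value in ipairs(table1) do
print(value)
end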
Let's walk through this step-by-step.
Create a table.
Though the syntax is correct, table is a pre-defined global name in Lua, and thus should not be declared as a variable name, to avoid future issues. Instead, you'll want to use a different name. If you're insistent on using the word table, you'll have to distinguish it from the global table. The easiest way to do this is to change it to Table, as Lua is a case-sensitive language. Therefore, your table creation should look something like:
local Table = {}
Read values to the table using a loop.
Though Table is now established as a table, your for loop is only iterating through an empty table. It seems your goal is to iterate through the io.read() instead. But io.read() is probably not what you want here, though you can utilize a repeat loop if you wish to use io.read() via table.insert. However, repeat requires a condition that must be met for it to terminate, such as the length of the table reaching a certain amount (in your example, it would be until (#Table == 4)). Since this is a task you are given, I will not provide an example, but allow you to research this method and use it to your advantage.
Print the values after the process is complete.
You are on the right track with your printing loop. However, note that iterating through a table with ipairs always yields two results per step, an index and a value. Your loop captures only the first of these, the index, so your output would simply be:
1
2
3
4
If you are wanting the actual values, you'll need a placeholder for the index. Oftentimes, the placeholder for an unneeded variable in Lua is the underscore (_). Modify your for loop to account for the index, and you should be set.
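The index-and-value point is not specific to Lua. As a hedged illustration only, here is the same task sketched in Python, where enumerate plays a role similar to ipairs. It is not the Lua solution to the task; the read_values helper is a made-up name, and an in-memory stream stands in for stdin.

```python
import io

def read_values(stream):
    """Read a count from the first line, then that many values."""
    count = int(stream.readline())
    values = []
    for _ in range(count):
        # Strip the trailing newline so only the value itself is stored.
        values.append(stream.readline().rstrip("\n"))
    return values

# Stand-in for stdin, using the sample input from the question.
fake_stdin = io.StringIO("3\n11\n22\nabc\n")
values = read_values(fake_stdin)

# enumerate yields (index, value) pairs; the underscore marks the
# index as unused, so only the values are printed.
for _, value in enumerate(values, start=1):
    print(value)
```

If the loop captured only a single variable from enumerate, it would receive the whole pair rather than the value, which is the same kind of mistake as capturing only the index from ipairs.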
Try modifying your code with the suggestions I've given and see if you can figure out how to achieve your end result.
Edited:
Thanks, Piglet, for corrections on the insight! I'd forgotten table itself wasn't a function, and wasn't reserved, but still bad form to use it as a variable name whether local or global. At least, it's how I was taught, but your comment is correct!

Iterating through CSV::Rows

I'll preface this by saying that I'm still learning Ruby.
I'm writing a script to parse a .csv and identify possible duplicate records in the data-set.
I have a .csv file with headers, so I'm parsing the data so that I can access each row using a header title as such:
@contact_table = CSV.parse(File.read("app/data/file.csv"), headers: true)
# Prints all last names in table
puts @contact_table['last_name']
I'm trying to iterate over each row in the table and identify if the last name I'm currently iterating over is similar to the next last name, but I'm having trouble doing this. I guess the way I'm handling it is as if it's an array, but I checked the type and it's a CSV::Row.
example (this doesn't work):
@contact_table.each_with_index do |c, i|
puts "first contact is #{c['last_name']}, second contact is #{c[i + 1]['last_name']}"
end
I realized this doesn't work like this because the table isn't an array, it's a CSV::Row like I previously mentioned. Is there any method that can achieve this? I'm really blanking right now.
My csv looks something like this:
id,first_name,last_name,company,email,address1,address2,zip,city,state_long,state,phone
1,Donalt,Canter,Gottlieb Group,dcanter0@nydailynews.com,9 Homewood Alley,,50335,Des Moines,Iowa,IA,515-601-4495
2,Daphene,McArthur,"West, Schimmel and Rath",dmcarthur1@twitter.com,43 Grover Parkway,,30311,Atlanta,Georgia,GA,770-271-7837
@contact_table should be a CSV::Table, which is a collection of CSV::Rows, so in this:
@contact_table.each_with_index do |c, i|
...
end
c is a CSV::Row. That's why c['last_name'] works. The problem is that here:
c[i + 1]['last_name']
you're looking at c (a single row) instead of @contact_table. If you said:
@contact_table[i + 1]['last_name']
then you'd get the next last name or, when c is the last row, an exception because @contact_table[i+1] will be nil.
Also, inside the iteration, c is the current (or (i+1)th) row and won't always be the first.
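One way to avoid the out-of-range lookup on the last row is to pair each row with its successor up front. The following is a minimal sketch of that idea in Python rather than Ruby, using the standard csv module. The sample text is trimmed to three columns of the data from the question, and the variable names are illustrative.

```python
import csv
import io

# Three columns taken from the sample CSV in the question.
CSV_TEXT = """id,first_name,last_name
1,Donalt,Canter
2,Daphene,McArthur
"""

# Each row becomes a dict keyed by header name.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# zip(rows, rows[1:]) pairs every row with the one after it and stops
# before the last row, so there is never a lookup past the end.
pairs = [
    (current["last_name"], following["last_name"])
    for current, following in zip(rows, rows[1:])
]

for current_name, next_name in pairs:
    print(f"current contact is {current_name}, next contact is {next_name}")
```

With two data rows there is exactly one pair, so the loop body runs once and the last row is never asked for a successor.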
What is your use case for this? Seems like a school project?
I recommend CSV.foreach instead of parse (see this comparison). I would probably use a Set for this.
1. Create a Set outside of the scope of parsing the file (i.e., above the parsing code). Let's call it rows.
2. Call rows.include?(row) during each iteration while parsing the file.
   - If true, then you know you have a duplicate.
   - If false, then call rows.add(row) to add the new row to the set.
You could also just fill your set with an individual value from a column that must be distinct (e.g., row.field(:some_column_name)), such as email or phone number, and do the same inclusion check for that.
(If this is for a real app, please don't do this. Use model validations instead.)
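The set-based check described in this answer can be sketched as follows. This is a Python illustration of the idea, not the Ruby code the answer has in mind. The find_duplicates helper is a made-up name, and the third data row is a hypothetical duplicate added only so the check has something to find.

```python
import csv
import io

# First two rows come from the question (trimmed to three columns);
# the third row is a hypothetical duplicate invented for this sketch.
CSV_TEXT = (
    "id,last_name,email\n"
    "1,Canter,dcanter0@nydailynews.com\n"
    "2,McArthur,dmcarthur1@twitter.com\n"
    "3,Canter,dcanter0@nydailynews.com\n"
)

def find_duplicates(text, column):
    """Return the ids of rows whose `column` value was already seen."""
    seen = set()          # created once, outside the parsing loop
    duplicates = []
    for row in csv.DictReader(io.StringIO(text)):
        key = row[column]
        if key in seen:   # inclusion check on every iteration
            duplicates.append(row["id"])
        else:
            seen.add(key)
    return duplicates

duplicate_ids = find_duplicates(CSV_TEXT, "email")
print(duplicate_ids)  # ['3']
```

Keying the set on a single column such as email, rather than on the whole row, catches records that differ only in incidental fields like the id.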
I would use #read instead of #parse and do something like this:
require 'csv'
LASTNAME_INDEX = 2
data = CSV.read('data.csv')
data[1..-1].each_with_index do |row, index|
puts "Contact number #{index + 1} has the following last name : #{row[LASTNAME_INDEX]}"
end
#~> Contact number 1 has the following last name : Canter
#~> Contact number 2 has the following last name : McArthur

How to read out a list of cases in one variable in SPSS and use that to add data?

To explain my problem I use this example data set:
SampleID  Date         Project  Problem
03D00173  03-Dec-2010           1,00
03D00173  03-Dec-2010           1,00
03D00173  28-Sep-2009  YNTRAD
03D00173  28-Sep-2009  YNTRAD
Now, the problem is that I need to replace the text "YNTRAD" with "YNTRAD_PILOT", but only for the cases with Date = 28-Sep-2009.
This example is part of a much larger database, with many more cases having Project=YNTRAD and Date=28-Sep-2009, so I cannot simply select all cases with 28-Sep-2009 first, then check which of these cases have Project=YNTRAD, and then replace. Instead, what I need to do is:
- Look at each case that has a 1,00 in Problem (these are problem cases)
- Then find the SampleID that corresponds with that sample
- Then find all other cases with the same SampleID BUT WITH Date=28-Sep-2009 (this is needed because only those samples are part of a pilot study) and then replace YNTRAD in Project with YNTRAD_PILOT.
I read a lot about:
- LOOP
- DO REPEAT
- DO IF
but I don't know how to use these in solving this problem.
I first tried making a list containing only the sample ID's that need eventually to be changed (again, this is part of a much larger database).
STRING SampleID2 (A20).
IF (Problem=1) SampleID2=SampleID.
EXECUTE.
AGGREGATE
/OUTFILE=*
/BREAK=SampleID2
/n_SampleID2=N.
This gives a dataset with only the SampleIDs for which a change should be made. However, I don't know how to read out this dataset case by case, look up each SampleID in the overall file with all the data, and then change only those cases where Date = 28-Sep-2009.
It sounds like once we can identify the IDs that need to be changed we've done the tricky part here. We can use AGGREGATE with MODE=ADDVARIABLES to add a problem Id counter variable to our dataset. From there, it's as you'd expect.
* Add var IdProblemCnt to your database . Stores # of times a given Id had a record with Problem = 1.
AGGREGATE
/OUTFILE=* MODE=ADDVARIABLES
/BREAK=SampleId
/IdProblemCnt=CIN(Problem, 1, 1) .
EXE .
* once we've identified the "problem" Ids we can use `RECODE` Project var.
DO IF (IdProblemCnt>0 AND Date = DATE.MDY(9,28,2009)) .
RECODE Project ('YNTRAD' = 'YNTRAD_PILOT') .
END IF .
EXE .
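For readers who want to check the logic outside SPSS, here is a hedged Python sketch of the same two steps: first collect the Ids that have at least one problem record, then rename the project only on matching rows with the pilot date. The first four cases mirror the example data; the last case, with the made-up Id HYPO0001, is hypothetical and shows a row that must stay unchanged. The flag_pilot name is also invented for this sketch.

```python
TARGET_DATE = "28-Sep-2009"

cases = [
    {"SampleID": "03D00173", "Date": "03-Dec-2010", "Project": "", "Problem": 1.0},
    {"SampleID": "03D00173", "Date": "03-Dec-2010", "Project": "", "Problem": 1.0},
    {"SampleID": "03D00173", "Date": "28-Sep-2009", "Project": "YNTRAD", "Problem": None},
    {"SampleID": "03D00173", "Date": "28-Sep-2009", "Project": "YNTRAD", "Problem": None},
    # Hypothetical case: same date and project, but no problem record.
    {"SampleID": "HYPO0001", "Date": "28-Sep-2009", "Project": "YNTRAD", "Problem": None},
]

def flag_pilot(cases):
    """Rename YNTRAD to YNTRAD_PILOT for problem Ids on the target date."""
    # Step 1: plays the role of the AGGREGATE step, per-Id problem flag.
    problem_ids = {c["SampleID"] for c in cases if c["Problem"] == 1.0}
    # Step 2: plays the role of the DO IF / RECODE step.
    for c in cases:
        if (c["SampleID"] in problem_ids
                and c["Date"] == TARGET_DATE
                and c["Project"] == "YNTRAD"):
            c["Project"] = "YNTRAD_PILOT"
    return cases

flag_pilot(cases)
print([c["Project"] for c in cases])
```

Only the two 28-Sep-2009 rows of sample 03D00173 are renamed; the hypothetical sample keeps YNTRAD because none of its records has Problem = 1.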

Java 8- forEach method iterator behaviour

I recently started checking new Java 8 features.
I've come across the forEach method, which iterates over a Collection.
Let's say I have an ArrayList of type <Integer> with the values {1,2,3,4,5}:
list.forEach(i -> System.out.println(i));
This statement iterates over the list and prints the values inside it.
I'd like to know how to specify that I want it to iterate over some specific values only.
For example, I want it to start from the 2nd value and iterate up to the 2nd-to-last value, or over alternate elements, or something like that.
How can I do that?
To iterate on a section of the original list, use the subList method:
list.subList(1, list.size()-1)
.stream() // This line is optional since List already has a forEach method taking a Consumer as parameter
.forEach(...);
This is the concept of streams. After one operation, the results of that operation become the input for the next.
So for your specific example, you can follow @Joni's answer. But if you're asking in general, then you can create a filter to only get the values you want to loop over.
For example, if you only wanted to print the even numbers, you could apply a filter to the stream before calling forEach on it. Like this:
List<Integer> intList = Arrays.asList(1,2,3,4,5);
intList.stream()
.filter(e -> (e & 1) == 0)
.forEach(System.out::println);
You can similarly pick out the stuff you want to loop over before reaching your terminal operation (in your case the forEach) on the stream. I suggest you read this stream tutorial to get a better idea of how they work: http://winterbe.com/posts/2014/07/31/java8-stream-tutorial-examples/
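The three selections discussed in this thread (a middle section, alternate elements, and a filtered subset) are easy to compare side by side. The following is only a Python analogy of the idea, using slicing and a comprehension instead of Java's subList and streams; it is not Java code and the variable names are illustrative.

```python
values = [1, 2, 3, 4, 5]

# From the 2nd value up to the 2nd-to-last value,
# comparable to subList(1, size - 1) in the answer above.
inner = values[1:-1]
print(inner)      # [2, 3, 4]

# Alternate elements, by position (every second element from the start).
alternate = values[::2]
print(alternate)  # [1, 3, 5]

# Even numbers only, by value; the same test as the filter above.
evens = [v for v in values if (v & 1) == 0]
print(evens)      # [2, 4]
```

The distinction worth noticing is that the first two select by position while the last selects by value, which is why a value filter alone cannot express "alternate elements".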

How do I use TADOQuery.Parameters with integer parameter types that have to be put in two or more places in a query?

I have a complex query that contains more than one place where the same primary key value must be substituted. It looks like this:
select Foo.Id,
Foo.BearBaitId,
Foo.LinkType,
Foo.BugId,
Foo.GooNum,
Foo.WorkOrderId,
(case when Goo.ZenID is null or Goo.ZenID=0 then
IsNull(dbo.EmptyToNull(Bar.FanName),dbo.EmptyToNull(Bar.BazName))+' '+Bar.Strength else
'#'+BarZen.Description end) as Description,
Foo.Init,
Foo.DateCreated,
Foo.DateChanged,
Bug.LastName,
Bug.FirstName,
Goo.BarID,
(case when Goo.ZenID is null or Goo.ZenID=0 then
IsNull(dbo.EmptyToNull(Bar.BazName),dbo.EmptyToNull(Bar.FanName))+' '+Bar.Strength else
'#'+BarZen.Description end) as BazName,
GooTracking.Status as GooTrackingStatus
from
Foo
inner join Bug on (Foo.BugId=Bug.Id)
inner join Goo on (Foo.GooNum=Goo.GooNum)
left join Bar on (Bar.Id=Goo.BarID)
left join BarZen on (Goo.ZenID=BarZen.ID)
inner join GooTracking on(Goo.GooNum=GooTracking.GooNum )
where (BearBaitId = :aBaitid)
UNION
select Foo.Id,
Foo.BearBaitId,
Foo.LinkType,
Foo.BugId,
Foo.GooNum,
Foo.WorkOrderId,
Foo.Description,
Foo.Init,
Foo.DateCreated,
Foo.DateChanged,
Bug.LastName,
Bug.FirstName,
0,
NULL,
0
from Foo
inner join Bug on (Foo.BugId=Bug.Id)
where (LinkType=0) and (BearBaitId= :aBaitid )
order by BearBaitId,LinkType desc, GooNum
When I try to use an integer parameter on this non-trivial query, it seems impossible to me. I get this error:
Error
Incorrect syntax near ':'.
The query works fine if I take out the :aBaitid and substitute a literal 1.
Is there something else I can do to this query above? When I test with simple tests like this:
select * from foo where id = :anid
These simple cases work fine. The component is TADOQuery, and it works fine until you add any :parameters to the SQL string.
Update: when I use the following code at runtime, the parameter substitutions are actually done (some glitch in the ADO components is worked around) and a different error surfaces:
adoFooContentQuery.Parameters.FindParam('aBaitId').Value := 1;
adoFooContentQuery.Active := true;
Now the error changes to:
Incorrect syntax near the keyword 'inner'.
Note again, that this error goes away if I simply stop using the parameter substitution feature.
Update2: The accepted answer suggests I have to find two different copies of the parameter with the same name, which bothered me so I reworked the query like this:
DECLARE @aVar int;
SET @aVar = :aBaitid;
SELECT ....(long query here)
Then I used @aVar throughout the script where needed, to avoid the repeated use of :aBaitId. (If the number of times the parameter value is used changes, I don't want to have to find all parameters matching a name, and replace them).
I suppose a helper-function like this would be fine too: SetAllParamsNamed(aQuery:TAdoQuery; aName:String;aValue:Variant)
FindParam only finds one parameter, while you have two with the same name. Delphi dataset adds each parameter as a separate one to its collection of parameters.
It should work if you loop through all parameters, check if the name matches, and set the value of each one that matches, although I normally choose to give each repeated parameter a follow-up number to distinguish between them.
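The loop described here is simple enough to sketch. The following is a language-neutral illustration in Python, not Delphi and not the ADO API: the Param class and set_all_params_named function are hypothetical stand-ins for the query's parameter collection and for the helper the asker proposes. The case-insensitive name comparison is an assumption made for this sketch.

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Param:
    """Hypothetical stand-in for one entry in a query's parameter list."""
    name: str
    value: Any = None

def set_all_params_named(params: List[Param], name: str, value: Any) -> int:
    """Set every parameter whose name matches; return how many were set."""
    matched = 0
    for param in params:
        # Assumption: names are compared case-insensitively.
        if param.name.lower() == name.lower():
            param.value = value
            matched += 1
    return matched

# The query in the question uses :aBaitid twice, so the collection
# holds two separate parameters with the same name.
params = [Param("aBaitid"), Param("aBaitid")]
matched = set_all_params_named(params, "aBaitId", 1)
print(matched, [p.value for p in params])  # 2 [1, 1]
```

Returning the match count makes it easy to detect a misspelled parameter name, since a result of zero means nothing was set.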
