Filtering out based on count using Apache Beam - google-cloud-dataflow

I am using Dataflow and Apache Beam to process a dataset and store the result in a headerless csv file with two columns, something like this:
A1,a
A2,a
A3,b
A4,a
A5,c
...
I want to filter out certain entries based on the following two conditions:
1. In the second column, if the number of occurrences of a certain value is less than N, then remove all such rows. For instance, if N=10 and c only appears 7 times, then I want all of those rows to be filtered out.
2. In the second column, if the number of occurrences of a certain value is more than M, then only keep M of those rows and filter out the rest. For instance, if M=1000 and a appears 1200 times, then I want 200 of those entries to be filtered out, and the other 1000 to be stored in the CSV file.
In other words, I want to make sure every value in the second column appears at least N and at most M times.
My question is whether this is possible by using some filter in Beam? Or should it be done as a post-process step once the csv file is created and saved?

You can use beam.Filter to collect, into a separate PCollection, the second-column values whose counts satisfy your lower-bound condition.
Then correlate that PCollection (as a side input) with your original PCollection to filter out all the lines that need to be excluded.
As for the upper bound, since you want to keep up to M elements rather than excluding the value completely, you will need some post-processing or a combine transform to do that.
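For the upper bound, one option (a sketch only, not part of the original answer) is beam.combiners.Sample.FixedSizePerKey, which keeps at most M elements per key, with the caveat that which M rows survive is chosen by sampling rather than deterministically. This assumes the rows are already parsed into (first_column, second_column) tuples in a PCollection called rows:
import apache_beam as beam

M = 1000  # keep at most M rows per second-column value

capped = (rows  # assumed PCollection of (first_column, second_column) tuples
          | 'KeyBySecondColumn' >> beam.Map(lambda row: (row[1], row[0]))
          | 'CapAtM' >> beam.combiners.Sample.FixedSizePerKey(M)
          | 'BackToRows' >> beam.FlatMap(lambda kv: [(v, kv[0]) for v in kv[1]]))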
Here is an example of the lower-bound filtering with the Python SDK, using word count.
import re
import apache_beam as beam

class ReadWordsFromText(beam.PTransform):
    def __init__(self, file_pattern):
        self._file_pattern = file_pattern

    def expand(self, pcoll):
        return (pcoll.pipeline
                | beam.io.ReadFromText(self._file_pattern)
                | beam.FlatMap(lambda line: re.findall(r"[\w\']+", line.strip(), re.UNICODE)))
p = beam.Pipeline()
words = (p
         | 'read' >> ReadWordsFromText('gs://apache-beam-samples/shakespeare/kinglear.txt')
         | 'lower' >> beam.Map(lambda word: word.lower()))

import random
# Assume this is the data PCollection you want to filter on.
data = words | beam.Map(lambda word: (word, random.randint(1, 101)))

counts = (words
          | 'count' >> beam.combiners.Count.PerElement())

words_with_counts_bigger_than_100 = (counts
                                     | beam.Filter(lambda count: count[1] > 100)
                                     | beam.Map(lambda count: count[0]))
Now you get a PCollection of just the words whose counts are bigger than 100.
def cross_join(left, rights):
    for x in rights:
        if left[0] == x:
            yield (left, x)

data_with_word_counts_bigger_than_100 = data | beam.FlatMap(
    cross_join, rights=beam.pvalue.AsIter(words_with_counts_bigger_than_100))
Now you have filtered the elements below the lower bound out of the data set and get pairs such as (('king', 66), 'king').
Note: the 66 in ('king', 66) is the fake random data I put in.
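As a side note, the same correlation can also be written with beam.Filter and a side input instead of a FlatMap cross join; this keeps the original (word, value) elements rather than producing pairs. A sketch reusing the names from the example above:
data_above_lower_bound = data | 'KeepFrequentWords' >> beam.Filter(
    lambda kv, frequent: kv[0] in frequent,
    frequent=beam.pvalue.AsList(words_with_counts_bigger_than_100))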
To debug with such visualizations, you can use Interactive Beam. You can set up your own notebook runtime following the instructions, or you can use the hosted solution provided by Google, Dataflow Notebooks.
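A minimal notebook sketch of that setup (assuming a Jupyter environment with the interactive runner available); ib.show() renders the contents of a PCollection inline:
import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
from apache_beam.runners.interactive import interactive_beam as ib

p = beam.Pipeline(InteractiveRunner())
pcoll = p | beam.Create([('king', 66), ('lear', 12)])  # toy data for illustration
ib.show(pcoll)  # displays the elements of the PCollection in the notebook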

Related

SPSS - Filter columns based on specific criteria

I have a dataset (see below) where I want to filter out any observations where there is only a 1 in the McDonalds column, such as for ID #3 (I do not want McDonalds in my analyses). I want to keep any observations where there is a 1 in other columns (even though there is a 1 in the McDonalds column, such as IDs #1-2). I have tried using the Select Cases option and just putting McDonalds=0, but this filters out any observations where there are 1s in the other columns as well. Below is a sample of my dataset; I actually have many more columns and was trying to avoid having to individually name every other column in the "Select Cases" option in SPSS. Would anyone be able to help me please? Thanks.
Data:
To avoid naming each of the other columns separately you can use the to keyword in the syntax. Also, basically, you want to keep lines that have a 1 in any of the other columns regardless of the value in the McDonalds column, so there is no need to mention it in the syntax.
So say for example that your column names are McDonalds, RedBull, var3, var4, var5, TacoBell; you could use either of the following options:
select if any(1, RedBull to TacoBell).
or this :
select if sum(RedBull to TacoBell)>=1.
Note: using the to convention requires that the relevant variables be contiguous in the data.
You just need to add the "OR" operator (which is the vertical bar: |) between all the mentioned conditions.
So basically, you want to keep the cases when McDonalds = 0 | RedBull = 1 | TacoBell = 1.
You can either copy the above line into the Select Cases -> If option, or write the following lines into the SPSS syntax file, replacing DataSet1 with the name of your dataset:
DATASET ACTIVATE DataSet1.
USE ALL.
COMPUTE filter_$=(McDonalds = 0 | RedBull = 1 | TacoBell = 1).
VARIABLE LABELS filter_$ 'McDonalds = 0 | RedBull = 1 | TacoBell = 1 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.

How to aggregate data using apache beam api with multiple keys

I am new to the Google Cloud data platform as well as to the Apache Beam API. I would like to aggregate data based on multiple keys. In my requirement I will get a transaction feed having fields like customer id, customer name, transaction amount and transaction type. I would like to aggregate the data based on customer id & transaction type. Here is an example.
customer id,customer name,transaction amount,transaction type
cust123,ravi,100,D
cust123,ravi,200,D
cust234,Srini,200,C
cust444,shaker,500,D
cust123,ravi,100,C
cust123,ravi,300,C
O/p should be
cust123,ravi,300,D
cust123,ravi,400,C
cust234,Srini,200,C
cust444,shaker,500,D
In Google most of the examples are based on a single key, like grouping by a single key. Can anyone please help me with what my PTransform should look like for my requirement and how to produce the aggregated data along with the rest of the fields?
Regards,
Ravi.
Here is an easy way. I concatenated all the keys together to form a single key, then did the sum, and after that split the key to organize the output the way you wanted. Please let me know if you have any questions.
The code does not expect a header in the CSV file. I just kept it short to show the main point you are asking about.
import apache_beam as beam
import sys

class Split(beam.DoFn):
    def process(self, element):
        """
        Splits each row on commas and returns a tuple representing the row to process
        """
        customer_id, customer_name, transaction_amount, transaction_type = element.split(",")
        return [
            (customer_id + "," + customer_name + "," + transaction_type, float(transaction_amount))
        ]

if __name__ == '__main__':
    p = beam.Pipeline(argv=sys.argv)
    input = 'aggregate.csv'
    output_prefix = 'C:\\pythonVirtual\\Mycodes\\output'
    (p
     | 'ReadFile' >> beam.io.ReadFromText(input)
     | 'parse' >> beam.ParDo(Split())
     | 'sum' >> beam.CombinePerKey(sum)
     | 'convertToString' >> beam.Map(
         lambda kv: '%s,%s,%s,%s' % (kv[0].split(",")[0], kv[0].split(",")[1], kv[1], kv[0].split(",")[2]))
     | 'write' >> beam.io.WriteToText(output_prefix)
    )
    p.run().wait_until_finish()
It will produce output as below:
cust234,Srini,200.0,C
cust444,shaker,500.0,D
cust123,ravi,300.0,D
cust123,ravi,400.0,C
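A variant of the same idea, shown only as a sketch: using a tuple as the key instead of a concatenated string avoids re-splitting the key when formatting the output (file paths here are placeholders).
import apache_beam as beam

def parse(line):
    customer_id, customer_name, amount, tx_type = line.split(',')
    return ((customer_id, customer_name, tx_type), float(amount))

with beam.Pipeline() as p:
    (p
     | 'ReadFile' >> beam.io.ReadFromText('aggregate.csv')
     | 'Parse' >> beam.Map(parse)
     | 'Sum' >> beam.CombinePerKey(sum)
     | 'Format' >> beam.Map(lambda kv: '%s,%s,%s,%s' % (kv[0][0], kv[0][1], kv[1], kv[0][2]))
     | 'Write' >> beam.io.WriteToText('output'))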

How to apply content-based filtering in Neo4j

I have data in the below format, where the 1st column represents the product nodes and all the following columns represent properties of the products. I want to apply a content-based filtering algorithm using cosine similarity in Neo4j. For that, I believe, I need to define the fx columns as properties of each product node, then call these properties as a vector and apply cosine similarity between the products. I am having trouble doing two things:
1. How to define these columns as properties in one go (as there could be more than 100 columns).
2. How to call all the property values as a vector, to be able to apply cosine similarity.
Product f1 f2 f3 f4 f5
P1 0 1 0 1 1
P2 1 0 1 1 0
P3 1 1 1 1 1
P4 0 0 0 1 0
You can use LOAD CSV to input your data.
For example, this query will read in your data file and output for each input line (ignoring the header line) a name string and a props collection:
LOAD CSV FROM 'file:///data.csv' AS line FIELDTERMINATOR ' '
WITH line SKIP 1
RETURN HEAD(line) AS name, [p IN TAIL(line) | TOFLOAT(p)] AS props
Even though your data has a header line, the above query skips over it, as it is not needed. In fact, we don't want to use the WITH HEADERS option of LOAD CSV, since that would convert each data line into a map, whereas it is more convenient for our current purposes to get each data line as a collection of values.
The above query assumes that all the columns are space-separated, that the first column will always contain a name string, and that all other columns contain the numeric values that should be put into the same collection (named props).
If you replace RETURN with WITH, you can append additional clauses to the query that make use of the name and props values.

How to get the first elements of COLLECT without limiting the global query?

In a Twitter-like app, I would like to get only the last 3 USERS which have PUBLISHed a tweet for a particular HASHTAG (A,B,C,D,E).
START me=node(X), hashtag=node(A,B,C,D,E)
MATCH n-[USED_IN]->tweet<-[p:PUBLISH]-user-[FRIEND_OF]->me
WITH p.date? AS date,hashtag,user ORDER BY date DESC
WITH hashtag, COLLECT(user.name) AS users
RETURN hashtag._id, users;
This is the result I get with this query. This is good but if the friend list is big, I could have a very large array in the second column.
+-------------------------------------------+
| hashtag | users |
+-------------------------------------------+
| "paradis" | ["Alexandre","Paul"] |
| "hello" | ["Paul"] |
| "public" | ["Alexandre"] |
+-------------------------------------------+
If I add a LIMIT clause, at the end of the query, the entire result set is limited.
Because a user can have a very large number of friends, I do not want to get back all those USERs, but only the last 2 or 3 who have published in those hashtags.
Is there any solution with filter/reduce to get what I expect?
Running neo4j 1.8.2
Accessing sub-collections will be worked on;
meanwhile you can use this workaround: http://console.neo4j.org/r/f7lmtk
start n=node(*)
where has(n.name)
with collect(n.name) as names
return reduce(a=[], x in names : a + filter(y in [x] : length(a)<2)) as two_names
Reduce is used to build up the result list in the accumulator,
and filter is used instead of the conditional case ... when ..., which is only available in 2.0.
filter(y in [x] : length(a)<2) returns a list with the element when the condition is true, and an empty list when the condition is false.
Adding that result to the accumulator with reduce builds up the list incrementally.
Be careful, the new filter syntax is:
filter(x IN a.array WHERE length(x) = 3)

Is it possible to make a nested FOREACH without COGROUP in PigLatin?

I want to use the FOREACH like:
a:{a_attr:chararray}
b:{b_attr:int}
FOREACH a {
    res = CROSS a, b;
    -- some processing
    GENERATE res;
}
By this I mean to make for each element of a a cross-product with all the elements of b, then perform some custom filtering and return tuples.
==EDIT==
Custom filtering = res_filtered = FILTER res BY ...;
GENERATE res_filtered.
==EDIT-2==
How can I do it with a nested CROSS, no more no less, inside a FOREACH without a prior GROUP or COGROUP?
Depending on the specifics of your filtering, you may be able to design a limited set of disjoint classes of elements in a and b, and then JOIN on those. For example:
If your filtering rules are
if a_attr starts with "Foo" and b is 4, accept
if a_attr starts with "Bar" and b is greater than 17, accept
if a_attr begins with a letter in [m-z] and b is less than 0, accept
otherwise, reject
Then you can write a UDF that will return 1 for items satisfying the first rule, 2 for the second, 3 for the third, and NULL otherwise. Your CROSS/FILTER then becomes
res = JOIN a BY myUDF(a), b BY myUDF(b);
Pig drops null values in JOINs, so only pairs satisfying your filtering criteria will be passed.
CROSS generates a cross-product of all the tuples in each relation. So there is no need to have a nested FOREACH. Just do the CROSS and then FILTER:
a: {a_attr: chararray}
b: {b_attr: int}
crossed = CROSS a, b;
crossed: {a::a_attr: chararray,b::b_attr: int}
res = FILTER crossed BY ... -- your custom filtering
If you have the FILTER immediately after the CROSS, you should not have (unnecessary) excessive IO trouble from the CROSS writing the entire cross-product to disk before filtering. Records that get filtered will never be written at all.
