I am new to Dataflow. I have the following logic:
An event is added to Pub/Sub.
Dataflow reads from Pub/Sub and gets the event.
For each event, I query MySQL to find the segments the event is related to; this step returns a list of relations. The segments are independent of one another.
Each segment's MySQL results can be split into two tables, one for email and one for mobile, and these are independent as well.
Each segment has 1 to n rules. I would like to process the rules in parallel and collect all the results. I have tried to use windows, but I am not sure how to write the logic so that the combined results from all rules within one segment are collected in a final function, which then writes to MySQL depending on the rule results (booleans).
Here is what I have so far:
testP = beam.Pipeline(options=options)
ReadData = (
    testP
    | 'ReadData' >> beam.io.ReadFromPubSub(subscription=str(options.pubsubsubscriber.get())).with_output_types(bytes)
    | 'Decode' >> beam.Map(lambda x: x.decode('utf-8'))
    | 'GetSegments' >> beam.ParDo(getsegments(options))
)
processEmails = (
    ReadData
    | 'GetSubscribersWithRulesForEmails' >> beam.ParDo(GetSubscribersWithRules(options, 'email'))
    | 'ProcessSubscribersSegmentsForEmails' >> beam.ParDo(ProcessSubscribersSegments(options, 'email'))
)
processMobiles = (
    ReadData
    | 'GetSubscribersWithRulesForMobiles' >> beam.ParDo(GetSubscribersWithRules(options, 'mobile'))
    | 'ProcessSubscribersSegmentsForMobiles' >> beam.ParDo(ProcessSubscribersSegments(options, 'mobile'))
)
# For the sake of testing, only the window for email is written.
windowThis = (
    processEmails
    | beam.WindowInto(
        beam.window.FixedWindows(1),
        trigger=beam.transforms.trigger.Repeatedly(
            beam.transforms.trigger.AfterProcessingTime(1 * 10)),
        accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING)
    | beam.CombinePerKey(beam.combiners.ToListCombineFn())
    | beam.ParDo(print_windows)
)
In this case, because all of your elements have exactly the same timestamp, I would use their message ID as the key and their timestamp to group them with session windows. It would be something like this:
testP = beam.Pipeline(options=options)
ReadData = (
    testP
    | 'ReadData' >> beam.io.ReadFromPubSub(subscription=str(options.pubsubsubscriber.get())).with_output_types(bytes)
    | 'Decode' >> beam.Map(lambda x: x.decode('utf-8'))
    | 'GetSegments' >> beam.ParDo(getsegments(options))
)
# At this point, ReadData contains (key, value) pairs with a timestamp.
# Now we perform all of the processing.
processEmails = (ReadData | ....)
processMobiles = (ReadData | .....)
# Now we window by sessions with a 1-second gap. This is okay because all of
# the elements for any given key have the exact same timestamp.
windowThis = (
    processEmails
    | beam.WindowInto(beam.window.Sessions(1))  # Default trigger is fine
    | beam.CombinePerKey(beam.combiners.ToListCombineFn())
    | beam.ParDo(print_windows)
)
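To make the session-window idea concrete without a running pipeline, here is a minimal pure-Python sketch of what windowing by sessions and then combining per key into a list does conceptually. It is not Beam code: the helper name group_into_sessions, the (key, timestamp, value) tuple layout, and the sample message IDs are all assumptions made up for illustration.

```python
from collections import defaultdict


def group_into_sessions(elements, gap_seconds=1.0):
    """Group (key, timestamp, value) tuples into per-key sessions.

    Two consecutive elements with the same key share a session when their
    timestamps are less than `gap_seconds` apart, which mimics how session
    windows with that gap are merged.
    """
    by_key = defaultdict(list)
    for key, timestamp, value in elements:
        by_key[key].append((timestamp, value))

    sessions = defaultdict(list)
    for key, items in by_key.items():
        items.sort(key=lambda item: item[0])  # order by timestamp only
        current = [items[0][1]]
        last_ts = items[0][0]
        for timestamp, value in items[1:]:
            if timestamp - last_ts < gap_seconds:
                current.append(value)  # still inside the same session
            else:
                sessions[key].append(current)  # gap exceeded: close session
                current = [value]
            last_ts = timestamp
        sessions[key].append(current)
    return dict(sessions)


# Hypothetical rule results: (message id, event timestamp, rule outcome).
rule_results = [
    ("msg-1", 10.0, True),
    ("msg-1", 10.0, False),
    ("msg-2", 10.0, True),
    ("msg-1", 25.0, True),
]
sessions = group_into_sessions(rule_results)
print(sessions)
```

Because every rule result derived from one message carries the same timestamp, all of them fall into one session for that key, so the step after the combine sees the complete list of booleans for the message.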
I have a query that reads a set of IDs from a CSV file, searches for those nodes in the database and writes the results to a CSV file. I'm trying to get this query to run as quickly as possible and was wondering if I could parallelise the read operation using apoc.periodic.iterate:
http://neo4j-contrib.github.io/neo4j-apoc-procedures/3.5/cypher-execution/commit-batching/
I've written a query that does what I need, but I want to find out how to run it as quickly as possible.
Here's the current version of the query:
CALL apoc.export.csv.query('CALL apoc.load.csv(\'file:///edge.csv\') YIELD map as edge
MATCH (n:paper)
WHERE n.paper_id = edge.`From` OR n.paper_id = edge.`To`
RETURN n.paper_title',
'node.csv', {});
This query creates the resulting node.csv file that I want but as edge.csv grows in size the operation can slow down considerably.
What I was hoping to do was something like this:
CALL apoc.periodic.iterate(
'LOAD CSV WITH HEADERS FROM \'file:///edge.csv\' as row RETURN row',
'CALL apoc.export.csv.query(\'MATCH (n:paper) WHERE n.paper_id = row.`From` OR n.paper_id = row.`To` RETURN DISTINCT(n.paper_id) AS paper_id\', \'nodePar.csv\', {})'
, {batchSize:10, iterateList:true, parallel:true, failedParams:0})
;
This query runs but produces no output file, only the following message:
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| batches | total | timeTaken | committedOperations | failedOperations | failedBatches | retries | errorMessages | batch | operations | wasTerminated | failedParams |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 14463 | 144629 | 0 | 144629 | 0 | 0 | 0 | {} | {total: 14463, committed: 14463, failed: 0, errors: {}} | {total: 144629, committed: 144629, failed: 0, errors: {}} | FALSE | {} |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
My main question is: can apoc.periodic.iterate be used in this way to accelerate this query and, if so, how?
Alternatively, is there any other way to speed up this query as the edge.csv file grows in size?
Using Google Sheets, I want to automatically number rows like so:
The key is that I want this to use built-in functions only.
I have an implementation working where child items are in separate columns (e.g. "Foo" is in column B, "Bar" is in column C, and "Baz" is in column D). However, it uses a custom JavaScript function, and the slow way that custom JavaScript functions are evaluated, combined with the dependencies, possibly combined with a slow Internet connection, means that my solution can take over one second per row (!) to calculate.
For reference, here's my custom function (that I want to abandon in favor of native code):
/**
 * Calculate the Work Breakdown Structure id for this row.
 *
 * @param {range} priorIds IDs that precede this one.
 * @param {range} names The names for this row.
 * @return A WBS string id (e.g. "2.1.5") or an empty string if there are no names.
 * @customfunction
 */
function WBS_ID(priorIds,names){
  if (Array.isArray(names[0])) names = names[0];
  if (!names.join("")) return "";
  var lastId,pieces=[];
  for (var i=priorIds.length;i-- && !lastId;) lastId=priorIds[i][0];
  if (lastId) pieces = (lastId+"").split('.').map(function(s){ return s*1 });
  for (var i=0;i<names.length;i++){
    if (names[i]){
      var s = pieces.concat();
      pieces.length=i+1;
      pieces[i] = (pieces[i]||0) + 1;
      return pieces.join(".");
    }
  }
}
For example, cell A7 would use the formula:
=WBS_ID(A$2:A6,B7:D7)
...to produce the result "1.3.2"
Note that in the above example blank rows are skipped during numbering. An answer that does not honor this (one where the ID is calculated deterministically from ROW()) is acceptable (and possibly even desirable).
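For readers who want the numbering rule spelled out independently of the spreadsheet, here is a minimal Python sketch of the same idea: the first non-empty name column decides the level, deeper counters are dropped, and the counter at that level is incremented. The function name wbs_ids and the sample rows are hypothetical, and padding skipped levels with zeros is an assumption of this sketch rather than behavior taken from the custom function.

```python
def wbs_ids(rows):
    """Return one WBS id per row; rows with no names get an empty string.

    Each row is a list of cells, one per hierarchy level. The first
    non-empty cell decides the level of that row.
    """
    counters = []  # counters[i] is the current number at depth i
    ids = []
    for row in rows:
        level = next((i for i, cell in enumerate(row) if cell), None)
        if level is None:
            ids.append("")  # blank rows are skipped during numbering
            continue
        # Assumption: pad with zeros if a level was skipped, then drop
        # any counters that belong to deeper levels.
        counters = (counters + [0] * (level + 1))[: level + 1]
        counters[level] += 1
        ids.append(".".join(str(n) for n in counters))
    return ids


# Hypothetical sheet contents for columns B, C and D.
sample_rows = [
    ["Foo", "", ""],
    ["", "Bar", ""],
    ["", "", "Baz"],
    ["", "", ""],
    ["", "Bar", ""],
    ["Foo", "", ""],
]
sample_ids = wbs_ids(sample_rows)
print(sample_ids)  # ['1', '1.1', '1.1.1', '', '1.2', '2']
```

The sketch keeps a running list of counters instead of re-parsing the previous ID on every row, which is the part a formula-only solution has to emulate with SPLIT and INDEX.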
Edit: Yes, I've tried to do this myself. I have a solution that uses three extra columns which I chose not to include in the question. I have been writing equations in Excel for at least 25 years (and Google Spreadsheets for 1 year). I have looked through the list of functions for Google Spreadsheets and none of them jumps out to me as making possible something that I didn't think of before.
When the question is a programming problem and the problem is an inability to see how to get from point A to point B, I don't know that it's useful to "show what I've done". I've considered splitting by periods. I've looked for a map equivalent function. I know how to use isblank() and counta().
Lol, this is hilariously the longest (and very likely the most unnecessarily complicated) way to combine formulas, but I thought it was interesting that it does in fact work. Just add a 1 in the first row, then in the second row add:
=if(row()=1,1,if(and(istext(D2),counta(split(A1,"."))=3),left(A1,4)&n(right(A1,1)+1),if(and(isblank(B2),isblank(C2),isblank(D2)),"",if(and(isblank(B2),isblank(C2),isnumber(indirect(address(row()-1,column())))),indirect(address(row()-1,column()))&"."&if(istext(D2),round(max(indirect(address(1,column())&":"&address(row()-1,column())))+0.1,)),if(and(isblank(B2),istext(C2)),round(max(indirect(address(1,column())&":"&address(row()-1,column())))+0.1,2),if(istext(B2),round(max(indirect(address(1,column())&":"&address(row()-1,column())))+1,),))))))
In my defense, I've had a very long day at work - complicating what should be a simple thing seems to be my thing today :)
Foreword
Spreadsheet built-in functions don't include an equivalent to JavaScript's .map. The alternative is to use the spreadsheet's array-handling features and iteration patterns.
A "complete solution" could include the use of built-in functions to automatically transform the user input into a simple table and return the Work Breakdown Structure (WBS) number. Some people refer to transforming the user input into a simple table as "normalization", but including this would make this post too long for the Stack Overflow format, so it focuses on presenting a short formula to obtain the WBS.
It's worth noting that using formulas to transform large data sets into a simple table as part of the continuous spreadsheet calculations (in this case, of the WBS) will make the spreadsheet slow to refresh.
Short answer
To keep the WBS formula short and simple, first transform the user input into a simple table including task name, id and parent id columns, then use a formula like the following:
=ArrayFormula(
IFERROR(
INDEX($D$2:$D,MATCH($C2,$B$2:$B,0))
&"."
&COUNTIF($C$2:$C2,C2),
RANK($B2,FILTER($B$2:B,LEN($C$2:$C)=0),TRUE)&"")
)
Explanation
First, prepare your data
Put each task in one row. Include a General task / project to be used as the parent of all the root level tasks.
Add an ID to each task.
Add a reference to the ID of the parent task for each task. Leave it blank for the General task / project.
After the above steps the data should look like the following:
+---+--------------+----+-----------+
|   | A            | B  | C         |
+---+--------------+----+-----------+
| 1 | Task         | ID | Parent ID |
| 2 | General task | 1  |           |
| 3 | Subtask 1    | 2  | 1         |
| 4 | Subtask 2    | 3  | 1         |
| 5 | Subsubtask 1 | 4  | 2         |
| 6 | Subsubtask 2 | 5  | 2         |
+---+--------------+----+-----------+
Remark: This could also help to reduce the processing time required by a custom function.
Second, add the formula below to D2, then fill down as needed:
=ArrayFormula(
IFERROR(
INDEX($D$2:$D,MATCH($C2,$B$2:$B,0))
&"."
&COUNTIF($C$2:$C2,C2),
RANK($B2,FILTER($B$2:B,LEN($C$2:$C)=0),TRUE)&"")
)
The result should look like the following:
+---+--------------+----+-----------+----------+
|   | A            | B  | C         | D        |
+---+--------------+----+-----------+----------+
| 1 | Task         | ID | Parent ID | WBS      |
| 2 | General task | 1  |           | 1        |
| 3 | Subtask 1    | 2  | 1         | 1.1      |
| 4 | Subtask 2    | 3  | 1         | 1.2      |
| 5 | Subsubtask 1 | 4  | 2         | 1.1.1    |
| 6 | Subsubtask 2 | 5  | 2         | 1.1.2    |
+---+--------------+----+-----------+----------+
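The logic of the formula (the parent's WBS, a period, then the running count of siblings seen so far) can be sketched outside the spreadsheet as follows. This is only an illustration in Python: the name wbs_from_parents is hypothetical, and the sketch assumes that every parent row appears before its children and that root tasks are numbered in order of appearance, which matches the RANK-based branch only when IDs increase down the sheet.

```python
def wbs_from_parents(tasks):
    """Map each task id to its WBS string.

    `tasks` is a list of (task_id, parent_id) pairs in sheet order, with
    parent_id set to None for root-level rows.
    """
    wbs = {}
    child_count = {}  # parent_id -> number of children seen so far
    for task_id, parent_id in tasks:
        child_count[parent_id] = child_count.get(parent_id, 0) + 1
        position = child_count[parent_id]  # same role as the COUNTIF term
        if parent_id is None:
            wbs[task_id] = str(position)
        else:
            # Same role as INDEX/MATCH: look up the parent's WBS first.
            wbs[task_id] = f"{wbs[parent_id]}.{position}"
    return wbs


# The ID and Parent ID columns from the table above.
table = [(1, None), (2, 1), (3, 1), (4, 2), (5, 2)]
result = wbs_from_parents(table)
print(result)  # {1: '1', 2: '1.1', 3: '1.2', 4: '1.1.1', 5: '1.1.2'}
```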
Here's an answer that does not allow a blank line between items, and requires that you manually type "1" into the first cell (A2). This formula is applied to cell A3, with the assumption that there are at most three levels of hierarchy in columns B, C, and D.
=IF(
COUNTA(B3), // If there is a value in the 1st column
INDEX(SPLIT(A2,"."),1)+1, // find the 1st part of the prior ID, plus 1
IF( // ...otherwise
COUNTA(C3), // If there's a value in the 2nd column
INDEX(SPLIT(A2,"."),1) // find the 1st part of the prior ID
& "." // add a period and
& IFERROR(INDEX(SPLIT(A2,"."),2),0)+1, // add the 2nd part of the prior ID (or 0), plus 1
INDEX(SPLIT(A2,"."),1) // ...otherwise find the 1st part of the prior ID
& "." // add a period and
& IFERROR(INDEX(SPLIT(A2,"."),2),1) // add the 2nd part of the prior ID or 1 and
& "." // add a period and
& IFERROR(INDEX(SPLIT(A2,"."),3)+1,1) // add the 3rd part of the prior ID (or 0), plus 1
)
) & "" // Ensure the result is a string ("1.2", not 1.2)
Without comments:
=IF(COUNTA(B3),INDEX(SPLIT(A2,"."),1)+1,IF(COUNTA(C3),INDEX(SPLIT(A2,"."),1)& "."& IFERROR(INDEX(SPLIT(A2,"."),2),0)+1,INDEX(SPLIT(A2,"."),1)& "."& IFERROR(INDEX(SPLIT(A2,"."),2),1)& "."& IFERROR(INDEX(SPLIT(A2,"."),3)+1,1))) & ""
Desired output
Each User has child Plan which has child PlanDate objects. PlanDate has an attribute ddate which is just a date. Plan has an attribute mtype that can either be M, V, or C (haha MVC, subconscious techy much?). For a given week (let's just say the current week), I'd like to print out a table that looks like this:
----------------------------------------------------------------------------
| User | Mon | Tue | Wed | Thu | Fri | Other attributes of User
----------------------------------------------------------------------------
| Eric | M | | M | | M | ...
----------------------------------------------------------------------------
| Erin | V | V | V | V | V | ...
----------------------------------------------------------------------------
| Jace | | C | C | | | ...
----------------------------------------------------------------------------
| Kris | C | | | | | ...
----------------------------------------------------------------------------
| Tina | V | | V | | V | ...
----------------------------------------------------------------------------
| Lily | M | M | M | M | M | ...
----------------------------------------------------------------------------
The order of the Users on the rows doesn't really matter to me; I may add Ransack gem to make it ordered, but for now ignore. A given User may not have PlanDates with a ddate for every day in a given week, and certainly there's no relationship between the PlanDates across Users.
Proposed options
I feel like there are two options:
In the view, print the column headers with a data-attribute of the day in question, and print the row headers with a data-attribute of the user id in question (will have to first select all the users who do have a grandchild PlanDate with a ddate somewhere in the current week). Then in the intersection, use the two data-attributes to query ActiveRecord.
In the model, generate a data hash that can create the table, and then pass that hash to the view via the controller.
Option 1 makes more intuitive sense to me having been a lifelong Excel user, but it breaks MVC entirely, so I'm trying to go with Option 2, and the challenges below are related to Option 2. That said if you can make a case for Option 1 go for it! I feel like if you can convince me to do Option 1, I can implement it without the same problems...
Challenges with Option 2
I can build a hash with one dimension as the keys and the other dimension as an array of values. For example, if the days of the current week were used as the keys:
{
  Mon => [Eric, Erin, Kris, Tina, Lily],
  Tue => [Erin, Jace, Lily],
  Wed => [Eric, Erin, Jace, Kris, Lily],
  Thu => [Erin, Lily],
  Fri => [Eric, Erin, Tina, Lily]
}
But the problem is once I get there, I'm not sure how to deal with the fact that there are blanks in the data... If I were to convert the hash above into a table, I would only know how to make the Users appear as a list under each date; but then that wouldn't look like my desired output at all, because there wouldn't be gaps in the data. For example on Monday, there's no Jace, but there needs to be a blank space for Jace so that it's easy for the Viewer to look across and see, ah there's no Jace on Monday, but there is a Jace on Tuesday and Wednesday.
Oh actually I just needed a minute to think logically about this... I just need a nested hash, one with the first dimension, one with the second. So a method like this:
def plan_table
  # 1: get an array of the current week's dates
  week = (Date.today.at_beginning_of_week..(Date.today.at_end_of_week-2.days)).map { |d| d }
  # 2: find all the users (plucking the id) that have a plan date in the current week
  current_week_users = User.select { |u| u.plans.select { |p| p.plan_dates.select { |pd| week.include? pd.ddate.to_date }.count > 0 }.count > 0 }.map(&:id)
  # 3: build the hash
  @week_table = Hash.new
  week.each do |day|
    @week_table[day] = {}
    current_week_users.each do |user_id|
      # For each user in the array we built already, we have to check if that user has a pd that: 1) falls on this date, 2) has a non-canceled plan, 3) has a user that matches this current user. The extra checks for user_id and plan_id are my ole' paranoia about objects being created without parents.
      potential_pd = PlanDate.select { |pd| pd.ddate.to_date == day && pd.plan_id != nil && pd.plan.status != "canceled" && pd.plan.user_id != nil && pd.plan.user.id == user_id }
      if potential_pd == []
        @week_table[day][user_id] = ""
      else
        @week_table[day][user_id] = potential_pd.first.plan.mtype
      end
    end
  end
  return @week_table
end
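The nested-hash idea is independent of Rails, so here it is as a small Python sketch: every (day, user) cell is pre-filled with a blank and then overwritten where a plan date exists, which is what keeps the gaps visible in the table. The function name build_week_table, the flat (user, day, mtype) record layout, and the sample data are assumptions for illustration; unlike the method above, this version makes a single pass over the records and lets the last record for a cell win.

```python
from datetime import date, timedelta


def build_week_table(plan_dates, week_start):
    """Pivot (user, day, mtype) records into {day: {user: mtype_or_blank}}."""
    # Monday to Friday of the requested week.
    week = [week_start + timedelta(days=offset) for offset in range(5)]
    # Only users with at least one plan date inside the week get a row.
    users = sorted({user for user, day, _ in plan_dates if day in week})
    # Pre-fill every cell with a blank so missing days stay visible.
    table = {day: {user: "" for user in users} for day in week}
    for user, day, mtype in plan_dates:
        if day in table:
            table[day][user] = mtype
    return table


# Hypothetical records; 2024-01-01 is a Monday.
monday = date(2024, 1, 1)
records = [
    ("Eric", date(2024, 1, 1), "M"),
    ("Jace", date(2024, 1, 2), "C"),
    ("Eric", date(2024, 1, 3), "M"),
    ("Kris", date(2024, 1, 6), "C"),  # Saturday, outside the Mon-Fri range
]
week_table = build_week_table(records, monday)
for day, row in week_table.items():
    print(day.strftime("%a"), row)
```

A view can then loop over the users for the rows and over the days for the columns, printing week_table[day][user] in each cell.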
I'm writing a hook to be run before the execution of every step. The hook function basically manipulates the arguments given to the step.
Here is the code I'm using (the last two lines are for testing):
/** @BeforeStep */
public function beforeStep($event) {
    $step_node = $event->getStep();
    $args = $step_node->getArguments();
    print_r($args);
    die();
}
$step_node is an instance of StepNode
$args is supposed to be an array of arguments relating to that step.
For any given step I test this on, the argument array is always empty. I also tried printing out the arguments using the AfterStep hook and the array is still empty.
Am I missing something about how Behat grabs arguments and deals with steps?
getArguments() returns an array of Behat\Gherkin\Node\TableNode objects, allowing access to table rows. For example:
Given the following users:
| name | followers |
| everzet | 147 |
| avalanche123 | 142 |
| kriswallsmith | 274 |
| fabpot | 962 |
You can try parsing the arguments from $step_node->getText(), but it would probably be better to use a transformation. This will allow you to process any arguments before the step is run.
One example from the Behat Mink documentation:
/**
 * @Transform /^user (.*)$/
 */
public function castUsernameToUser($username)
{
    return new User($username);
}