I'm looking for a plugin that would give me an aggregated view of settings and results for many cases, the same way a multi-branch pipeline does. But instead of varying by branch, I want to stay on one branch and vary by parameters. The picture below is from the multi-branch pipeline I mentioned; instead of "Branches" I'm looking for "Cases", and instead of the "Name" column I need a configurable parameter.
In addition, I need to have various periodic build triggers along these lines:
H 22 * * 5 %param1=value1 %param2=value3
H 22 * * 5 %param1=value2 %param2=value3
The second part could be done in a standard job, but there will be many such cases launched periodically, every week, two weeks, or month, and the difference in param1 is crucial: it has to be readable and easily visible so I can quickly tell which case has failed.
I have looked for such a plugin but couldn't find anything like this. Maybe someone knows of such a plugin, or another way to solve it.
I have an alternative: create a "super" job whose build steps launch my current job with specific parameters. But then readability would shift from many rows to many columns, and since the number of cases is over 20, that would IMHO significantly decrease readability (in the column layout). Additionally, not all cases would be launched with the same periodicity, so I would need some predefined sets selected by a parameter, and most of the super-build runs would consist largely of skipped cases. As a result, one might not see the last result for some of the cases.
Note that param2 always has the same value for periodic launches; other values are used only with manual triggers. Param2 can, but doesn't have to, be visible in the "multi-branch pipeline"-like view.
I hope my explanation of the issue is clear. Looking forward to answers/suggestions etc. :)
I have an SPSS Modeler stream that is used and updated every week to generate a certain dataset. The raw data for this stream is also renewed on a weekly basis.
In one part of this stream there is a chunk of nodes that has to be modified and updated manually every week; the sequence of this part is: Type Node => Restructure Node => Aggregate Node
To simplify the explanation of those nodes' role, I drew an image of them, shown below.
Because the original raw data changes on a weekly basis, the range of the Unit value above always varies, sometimes more than 6 (maybe 100), sometimes fewer than 6 (maybe 3). That is why, until now, somebody has had to modify and update that chunk of nodes every week. *The Unit value has a certain upper limit (300 for now).
However, we are now aiming to run this stream automatically, without any human intervention, so we need to customize that part to work perfectly on its own. Please help; your efforts will be appreciated, thanks!
To automate this, I suggest trying Global nodes combined with CLEM scripts inside the execution (default script). I have a stream that calculates the first date and the last date, and those variables are used to rename files at the end of execution. I think you could use something similar, as explained here:
1) Create Derive nodes to bring in the unit values used in the weekly stream.
2) Save this information in a table named 'count_variable'
3) Use a Global node named Global with a query similar to this:
@GLOBAL_MAX(variable created in (2)) (only to record the number of variables; step 2 created a table with only 1 value, so @GLOBAL_MAX will simply return the number of variables).
4) The query inside the execution tab will be similar to this:
execute count_variable
var tabledata
var fn
set tabledata = count_variable.output
set count_variable = value tabledata at 1 1
execute Global
5) You can now use the number of variables by referring to the already created "count_variable".
It's not easy to explain just by typing, but I hope this has been helpful.
Please mark this answer with a +1 if it was relevant.
I think there is a better, simpler, and more effective (yet risky, due to the node's requirements on its input data) solution to your problem. It is called the Transpose node and it does exactly that - it pivots your table. It is only available from version 18.1 on, though. Here's an example:
https://developer.ibm.com/answers/questions/389161/how-does-new-feature-partial-transpose-work-in-sps/
I have a pipeline that looks like
pipeline.apply(PubsubIO.Read.subscription("some subscription"))
    .apply(Window.<String>into(
            SlidingWindows.of(Duration.standardMinutes(10))
                .every(Duration.standardSeconds(20)))
        .triggering(AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardSeconds(20)))
        .withAllowedLateness(Duration.ZERO)
        .accumulatingFiredPanes())
    .apply(RemoveDuplicates.<String>create())
    // discarding fired panes is suggested in the warnings under
    // https://cloud.google.com/dataflow/model/triggers#window-accumulation-modes
    .apply(Window.<String>discardingFiredPanes())
    .apply(Count.<String>globally().withoutDefaults())
This pipeline overcounts distinct values significantly (about 20x the expected value). Initially I suspected that the default trigger may have caused this issue, so I tweaked it to use triggers that allow no lateness, discard fired panes, or use processing time, all of which show similar overcounting.
I've also tried ApproximateUnique.globally: it failed during pipeline construction with an exception that looks like:
Default values are not supported in Combine.globally() if the output PCollection is not windowed by GlobalWindows.
There seems to be no way to add withoutDefaults to it (like we did with Count.globally).
Is there a recommended way to do COUNT(DISTINCT) in a Dataflow/Beam streaming pipeline with reasonable precision?
P.S. I'm using Java Dataflow SDK 1.9.0.
Your code looks OK; it shouldn't overcount. Note that you are placing each element into 30 windows (a 10-minute window sliding every 20 seconds), so if you have a window-unaware sink (equivalent to collapsing all the sliding windows) you would expect precisely 30 times as many elements. If you could show a bit more of the pipeline or how you are observing the counts, that might help.
Beyond that, I have a few suggestions for the pipeline:
I suggest changing your trigger for RemoveDuplicates to AfterPane.elementCountAtLeast(1); this will get you the same result at lower latency, since later elements arriving will have no impact. This trigger, and your current trigger, will never fire repeatedly. So it does not actually matter whether you set accumulatingFiredPanes() or discardingFiredPanes(). This is good, because neither one would work with the rest of your pipeline.
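For concreteness, here is a minimal sketch of that change in Dataflow SDK 1.x style; the variable names (input, deduped) are placeholders rather than code from the question, and the rest of the pipeline is assumed unchanged:
// Sketch of the suggested trigger change for the window feeding RemoveDuplicates.
// elementCountAtLeast(1) fires as soon as the first element for a key/window
// arrives; since it never fires again, the accumulation mode is irrelevant here.
PCollection<String> deduped = input  // placeholder for the PubsubIO source above
    .apply(Window.<String>into(
            SlidingWindows.of(Duration.standardMinutes(10))
                .every(Duration.standardSeconds(20)))
        .triggering(AfterPane.elementCountAtLeast(1))
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes())
    .apply(RemoveDuplicates.<String>create());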
I'd install a new trigger prior to the Count. The reason is a bit technical, but I'll try to describe it:
In your current pipeline, the trigger installed there (the "continuation trigger" of the trigger for RemoveDuplicates) notes the arrival time of the first element and waits until it has received all elements that were produced at or before that processing time, as measured by the upstream worker. There is some nondeterminism because it conflates the local processing time with the processing time of other workers.
If you take my advice and switch the trigger for RemoveDuplicates, then the continuation trigger will be AfterPane.elementCountAtLeast(1) so it will always emit a count as soon as possible and then discard further data, which is very wrong.
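The answer above does not spell out which trigger to install before the Count, so the choice below is purely an assumption for illustration: re-windowing with an explicit AfterWatermark trigger (firing when each sliding window closes) replaces the continuation trigger with one you pick yourself. Continuing the sketch above, where deduped is the output of RemoveDuplicates:
// Illustrative only: the specific trigger here is an assumption, not part of the answer.
deduped
    .apply(Window.<String>triggering(AfterWatermark.pastEndOfWindow())
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes())
    .apply(Count.<String>globally().withoutDefaults());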
I want to de-dupe a stream of data based on an ID in a windowed fashion. The stream we receive carries IDs, and we want to remove data with matching IDs within N-hour time windows. A straightforward approach is to use an external key store (Bigtable or something similar) where we look up keys and write them if required, but our QPS is extremely large, which makes maintaining such a service pretty hard. The alternative approach I came up with is to group by key within a time window, so that all data for a user within that window falls into the same group, and then, within each group, use a separate key-store service to look up duplicates by key. So, I have a few questions about this approach:
[1] If I run a GroupBy transform, is there any guarantee that each group will be processed on the same worker? If that is guaranteed, we can group by the user ID and then, within each group, compare the session ID for each user.
[2] If that is feasible, my next question is whether we can run such other services on each of the worker machines that run the job - in the example above, I would like to have a local Redis running which each group can then use to look up or write an ID.
The idea seems a bit off from what Dataflow is supposed to do, but I believe such use cases should be common - so if there is a better model for approaching this problem, I am looking forward to that too. We essentially want to avoid external lookups as much as possible given the amount of data we have.
1) In the Dataflow model, there is no guarantee that the same machine will see all the groups across windows for the key. Imagine that a VM dies or new VMs are added and work is split across them for scaling.
2) You're welcome to run other services on the Dataflow VMs since they are general purpose, but note that you will have to contend with the resource requirements of the other applications on the host, potentially causing out-of-memory issues.
Note that you may want to take a look at RemoveDuplicates and use that if it fits your use case.
It also seems like you might want to be using session windows to dedupe elements. You would call:
PCollection<T> pc = ...;
PCollection<T> windowed_pc = pc.apply(
    Window.<T>into(Sessions.withGapDuration(Duration.standardHours(N))));
Each new element will keep extending the length of the window, so it won't close until the gap elapses. If you also apply a speculative early firing of one element (AfterPane.elementCountAtLeast(1)) together with an AfterWatermark trigger on a downstream GroupByKey, the trigger will fire as soon as it can: once after it has seen at least one element, and then once more when the session closes. After the GroupByKey you would have a DoFn that filters out any element which isn't part of an early firing, based upon the pane information ([3], [4]); a sketch of such a DoFn follows the diagram below.
DoFn(T -> KV<session key, T>)
|
\|/
Window.into(Session window)
|
\|/
Group by key
|
\|/
DoFn(Filter based upon pane information)
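A minimal sketch of that final pane-filtering DoFn, in Dataflow SDK 1.x style; the EventInfo element type and the assumption that the GroupByKey output is KV<String, Iterable<EventInfo>> are illustrative placeholders, not part of the answer:
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.windowing.PaneInfo;
import com.google.cloud.dataflow.sdk.values.KV;

// Keeps elements only from speculative (EARLY) panes; the later ON_TIME pane
// repeats data for the same session and is dropped.
class KeepOnlyEarlyFirings
    extends DoFn<KV<String, Iterable<EventInfo>>, EventInfo> {
  @Override
  public void processElement(ProcessContext c) {
    if (c.pane().getTiming() == PaneInfo.Timing.EARLY) {
      // All elements in the group share the dedupe key, so emitting one of
      // them is enough to represent the whole session.
      for (EventInfo e : c.element().getValue()) {
        c.output(e);
        break;
      }
    }
  }
}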
It is sort of unclear from your description, can you provide more details?
Sorry for not being clear. I gave the setup you mentioned a try, except for the early and late firings part, and it is working on smaller samples. I have a couple of follow up questions, related to scaling this up. Also, I was hoping I could give you more information on what the exact scenario is.
So, we have an incoming data stream, each item of which can be uniquely identified by a combination of three of its fields (a user ID, a session ID, and a sequence ID). We also know that duplicates occur pretty far apart, and for now we care about those within a 6-hour window. Regarding the volume of data, we have at least 100K events every second, spanning a million different users - so within this 6-hour window we could get a few billion events into the pipeline.
Given this background, my questions are
[1] For the sessioning to happen by key, I should run it on something like
PCollection<KV<key, T>> windowed_pc = pc.apply(
    Window.<KV<key, T>>into(Sessions.withGapDuration(Duration.standardHours(6))));
where key is a combination of the 3 IDs I mentioned earlier. Based on the definition of Sessions, only if I run it on this KV would I be able to manage sessions per key. This would mean that Dataflow would have a huge number of open sessions at any given time waiting to close, and I was worried whether it would scale or whether I would run into bottlenecks.
[2] Once I perform Sessioning as above, I have already removed the duplicates based on the firings since I will only care about the first firing in each session which already destroys duplicates. I no longer need the RemoveDuplicates transform which I found was a combination of (WithKeys, Combine.PerKey, Values) transforms in order, essentially performing the same operation. Is this the right assumption to make?
[3] If the solution in [1] is going to be a problem, the alternative is to reduce the sessioning key to just user-id and session-id, ignoring the sequence-id, and then run a RemoveDuplicates by sequence-id on top of each resulting window. This might reduce the number of open sessions, but it would still leave a lot of them (#users * #sessions per user), which can easily run into the millions. FWIW, I don't think we can session only by user-id, since then the session might never close, as different sessions for the same user could keep coming in, and determining the session gap in that scenario becomes infeasible.
I hope my problem is a little clearer this time. Please let me know whether any of my approaches make the best use of Dataflow, or if I am missing something.
Thanks
I tried out this solution at a larger scale and, as long as I provide sufficient workers and disks, the pipeline scales well, although I am seeing a different problem now.
After this sessionization, I run a Combine.perKey on the key and then a ParDo which looks at c.pane().getTiming() and rejects anything other than an EARLY firing. I tried counting both EARLY and ON_TIME firings in this ParDo, and it looks like the on-time panes are actually deduped more precisely than the early ones. That is, the count of early firings still contains some duplicates, whereas the count of on-time firings is lower, i.e. more duplicates have been removed. Is there any reason this could happen? Also, is my approach of deduping with a Combine + ParDo the right one, or could I do something better? My windowing code is below; a sketch of the Combine + pane-filter step follows it.
events.apply(
        WithKeys.<String, EventInfo>of(new SerializableFunction<EventInfo, String>() {
          @Override
          public String apply(EventInfo input) {
            return input.getUniqueKey();
          }
        }))
    .apply(
        Window.named("sessioner").<KV<String, EventInfo>>into(
                Sessions.withGapDuration(mSessionGap))
            .triggering(
                AfterWatermark.pastEndOfWindow()
                    .withEarlyFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.ZERO)
            .accumulatingFiredPanes());
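For reference, a hedged sketch of the Combine + ParDo step described above, which the snippet does not show; the variable names (windowed, combined, deduped) and the pick-one combine function are placeholders, and imports are omitted as in the snippets above:
// Combine.perKey keeps one representative EventInfo per unique key,
// then the ParDo keeps only the speculative (EARLY) pane of each session.
PCollection<KV<String, EventInfo>> combined = windowed.apply(
    Combine.<String, EventInfo>perKey(
        new SerializableFunction<Iterable<EventInfo>, EventInfo>() {
          @Override
          public EventInfo apply(Iterable<EventInfo> input) {
            return input.iterator().next(); // any one element per key will do
          }
        }));

PCollection<KV<String, EventInfo>> deduped = combined.apply(
    ParDo.of(new DoFn<KV<String, EventInfo>, KV<String, EventInfo>>() {
      @Override
      public void processElement(ProcessContext c) {
        if (c.pane().getTiming() == PaneInfo.Timing.EARLY) {
          c.output(c.element());
        }
      }
    }));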
I have 2 files stored on an HDFS filesystem:
tbl_userlog: <website url (non canonical)> <tab> <username> <tab> <timestamp>
example: www.website.com, foobar87, 201101251456
tbl_websites: <website url (canonical)> <tab> <total hits>
example: website.com, 25889
I have written a sequence of Hadoop jobs which joins the 2 files on the website, filters on the number of total hits > n per website, and then counts, for each user, the number of websites he has visited which have > n total hits. The details of the sequence are as follows:
A map-only job which canonicalizes the URL in tbl_userlog (i.e. removes www, http:// and https:// from the url field); a sketch of such a mapper is given after this list
A Map-only job which sorts tbl_websites on the url
An identity Map-Reduce job which takes the output of the 2 previous jobs as KeyValueTextInputFormat and feeds them to a CompositeInputFormat in order to make use of Hadoop's native join feature, configured with jobConf.set("mapred.join.expr", CompositeInputFormat.compose("inner", (...)))
A Map-and-Reduce job which filters the result of the previous job on total hits > n in its map phase, groups the results by user in the shuffle phase, and counts the number of websites for each user in the reduce phase
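As referenced in the first step, here is a minimal sketch of what that canonicalizing mapper could look like with the old org.apache.hadoop.mapred API available in 0.20.2; the class name and field handling are illustrative assumptions, not code from the question:
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Illustrative mapper for job 1: strips the scheme and "www." prefix and
// re-keys each tbl_userlog record by the canonical URL (the join key).
public class CanonicalizeUrlMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  @Override
  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    String[] fields = line.toString().split("\t", 2);
    if (fields.length < 2) {
      return; // skip malformed lines
    }
    String canonicalUrl = fields[0]
        .replaceFirst("^https?://", "")
        .replaceFirst("^www\\.", "");
    // Key by the canonical URL; keep the rest of the record (user, timestamp) as the value.
    out.collect(new Text(canonicalUrl), new Text(fields[1]));
  }
}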
In order to chain these steps, I just call the jobs sequentially in the described order. Each individual job outputs its results into HDFS which the following job in the chain then retrieves and processes in turn.
As I am new to Hadoop, I would like to ask for your advice:
Is there a better way to chain these jobs? In this configuration all intermediate results are written to HDFS and then read back.
Do you see any design flaw in this job, or could it be written more elegantly by making use of some Hadoop feature that I have missed?
I am using Apache Hadoop 0.20.2 and using higher-level frameworks such as Pig or Hive is not possible in the scope of the project.
Thanks in advance for your replies!
I think what you have will work, with a couple of caveats. Before I start listing them, I want to make two definitions clear. A map-only job is a job that has a defined Mapper and runs with 0 reducers. If the job runs with > 0 IdentityReducers, then it is not a map-only job. A reduce-only job is a job that has a defined Reducer and runs with an IdentityMapper.
Your first job can be a map-only job, since all you're doing is canonicalizing URLs. But if you want to use CompositeInputFormat, you should run it with an IdentityReducer and more than 0 reducers.
For your second job, I don't know what you mean by a map-only job that sorts. Sorting, by its very nature, is a reduce-side task. You probably mean that it has a defined Mapper but no Reducer. But in order for the URLs to be sorted, you should run it with an IdentityReducer and more than 0 reducers.
Your third job is an interesting idea, but you have to be careful with CompositeInputFormat. There are two conditions that must be met for you to be able to use this input format. The first is that there has to be the same number of files in both input directories; this can be achieved by setting the same number of reducers for Job 1 and Job 2. The second is that the input files CANNOT be splittable; this can be achieved by using a non-splittable compression such as bzip.
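For reference, a hedged sketch of how that join might be wired up with the old mapred API; the job class and input paths below are placeholders, not from the question:
// Illustrative configuration for the join job (step 3); both inputs must hold
// the same number of sorted, equally-partitioned, non-splittable files.
JobConf joinConf = new JobConf(UserlogWebsiteJoin.class); // placeholder job class
joinConf.setInputFormat(CompositeInputFormat.class);
joinConf.set("mapred.join.expr", CompositeInputFormat.compose(
    "inner", KeyValueTextInputFormat.class,
    new Path("/data/userlog_canonical"), new Path("/data/websites_sorted"))); // placeholder paths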
This job sounds good, although you could filter out websites that have < n hits in the reducer of the previous job and save yourself some I/O.
There's obviously more than one solution to a problem in software, so while your solution would work, I wouldn't recommend it. Having 4 MapReduce jobs for this task is a bit expensive IMHO. The implementation I have in mind is an M-R-R workflow that uses a secondary sort.
As far as chaining jobs is concerned, you should have a look at Oozie, which is a workflow manager. I have yet to use it, but that's where I'd start.
We have a Current branch where the main development happens. For a while I have been working on something kind of experimental in a separate branch; in other words, I branched what I needed from the Current branch into an Experimental branch. While working, I have regularly merged Current into Experimental so that I pick up the changes others have made and can be sure that what I build works with their changes.
I now want to merge back into Current. First I merged Current into Experimental, compiled and made sure everything was working. So in my head, Experimental and Current should be "in sync". But when I try to merge Experimental back into Current, I get a whole bunch of conflicts. But I thought I had already kind of solved those when I merged Current into Experimental.
What is going on? Have I totally misunderstood something? How can I do this smoothly? Really don't want to go through all of those conflicts...
When you click Resolve on an individual conflict, what does the summary message say? If your merges from Current -> Experimental were completed without major manual work, it should be something like "X source, 0 target, Y both, 0 conflicting." In other words, there are no content blocks in the target (Current) file that aren't already in the source branch's copy (Experimental). You can safely use the AutoMerge All button.
Note: AutoMerge should be safe regardless. It's optimized to be conservative about early warnings, not for the ability to solve every case. But I recognize that many of us -- myself included -- like to fire up the merge tool when there's any question. In the scenario described, IMO, even the most skittish can rest easy.
Why is there a conflict at all? And what if the summary message isn't so cut and dried? Glad you asked :) Short answer - because the calculation that determines the common ancestor ("base") of related files depends heavily on how prior merge conflicts between them were resolved. Simple example:
1) Set up two branches, A and B.
2) Make edits to A\foo.cs and B\foo.cs in separate parts of the file.
3) Merge A -> B.
4) AutoMerge the conflict.
5) Merge B -> A.
TFS must flag this sequence of events as conflicting. The closest common ancestor between B\foo.cs;4 and A\foo.cs;2 lies all the way back at step 1, and both sides have obviously changed since then.
It's tempting to say that A & B are in sync after step 4. (More precisely: that the common ancestor for step 5's merge is version #2). Surely a successful content merge implies that B\foo.cs contains all the changes made to date? Unfortunately there are a number of reasons you cannot assume this:
Generality: not all conflicts can be AutoMerged. You need criteria that apply to both scenarios.
Correctness: even when AutoMerge succeeds, it doesn't always generate valid code. A classic example arises when two people add the same field to different parts of a class definition.
Flexibility: every source control user has their own favorite merge tools. And they need the ability to continue development/testing between the initial Resolve decision ["need to merge the contents somehow, someday"] and the final Checkin ["here, this works"].
Architecture: in a centralized system like TFS, the server simply can't trust anything but its own database + the API's validation requirements. So long as the input meets spec, the server shouldn't try to distinguish how various types of content merges were performed. (If you think the scenarios so far are easily distinguished, consider: what if the AutoMerge engine has a bug? What if a rogue client calls the webservice directly with arbitrary file contents? Only scratching the surface here...servers have to be skeptical for a reason!) All it can safely calculate is you sent me a resulting file that doesn't match the source or target.
Putting these requirements together, you end up with a design that lumps our actions in step 4 into a fairly broad category that also includes manual merges resulting from overlapping edits, content merges [auto or not] provided by 3rd party tools, and files hand-edited after the fact. In TFS terminology this is an AcceptMerge resolution. Once recorded as such, the Rules of Merge(tm) have to assume the worst in pursuit of historical integrity and the safety of future operations. In the process your semantic intentions for Step 4 ("fully incorporate into B every change that was made to A in #2") were dumbed down to a few bytes of pure logic ("give B the following new contents + credit for handling #2"). While unfortunate, it's "just" a UX / education problem. People get far angrier when the Rules of Merge make bad assumptions that lead to broken code and data loss. By contrast, all you have to do is click a button.
FWIW, there are many other endings to this story. If you chose Copy From Source Branch [aka AcceptTheirs] in step 4, there would be no conflict in step 5. Ditto if you chose an AcceptMerge resolution but happened to commit a file with the same MD5 hash as A\foo.cs;2. If you chose Keep Target [aka AcceptYours] instead, the downstream consequences change yet again, though I can't remember the details right now. All of the above get quite complex when you add other changetypes (especially Rename), merge branches that are far more out of sync than in my example, cherry pick certain version ranges and deal with the orphans later, etc....
EDIT: as fate would have it, someone else just asked the exact same question on the MSDN forum. As tends to be my nature, I wrote them another long answer that came out completely different! (though obviously touching on the same key points) Hope this helps: http://social.msdn.microsoft.com/Forums/en-US/tfsversioncontrol/thread/e567b8ed-fc66-4b2b-a330-7c7d3a93cf1a
This has happened to me before. When TFS merges Experimental into Current, it does so using the workspaces on your hard drive. If your Current workspace is out of date on your local computer, TFS will get merge conflicts.
(Experimental on HD) != (Current in TFS) != (Old Current on HD)
Try doing a forced get of Current to refresh your local copy of Current and try the merge again.
You probably have lines like this before you start the merge...
Main branch - Contains code A, B, C
Current branch - Contains code A, B, C, D, E
Experimental branch - Contains code A, B, C, D, F, G, H
When you push from Current to Exp, you are merging feature E into the experimental branch.
When you then push from Exp to Current, you still have to merge F, G, and H. This is where your conflicts are likely rooted.
----Response to 1st comment----
Do you auto merge, or use the merge tool?
What is an example of something that is "in conflict"?