I've created a widget for TFS/VSTS which allows you to see the number of failing builds. This number is based on the last build's result for each build definition. I've read the REST API documentation, but the only way I can see to get this result is:
Get the list of definitions
Get the list of builds, filtered by definitions=[allIds], maxBuildsPerDefinition=1, resultFilter=failed
This is actually pretty slow (two calls, lots of response data) and I thought it should be possible in a single query. One of the problems is that maxBuildsPerDefinition doesn't work without the definitions filter. Does anyone have an idea how to load this data more efficiently?
I'm afraid the answer is no. The approach you're using is the most efficient one available for now.
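For reference, a minimal sketch of that two-call approach against the Build REST API, written in Python with requests; the account, project, API version, and token below are placeholders rather than values from the question:

import requests

# Placeholders: substitute your own account, project, and personal access token.
BASE = "https://myaccount.visualstudio.com/DefaultCollection/MyProject/_apis/build"
AUTH = ("", "my-personal-access-token")  # basic auth: empty user name plus a PAT

# Call 1: list all build definitions to collect their ids.
defs = requests.get(BASE + "/definitions", params={"api-version": "2.0"}, auth=AUTH).json()
definition_ids = ",".join(str(d["id"]) for d in defs["value"])

# Call 2: at most one (the most recent) failed build per definition.
builds = requests.get(
    BASE + "/builds",
    params={
        "api-version": "2.0",
        "definitions": definition_ids,
        "maxBuildsPerDefinition": 1,
        "resultFilter": "failed",
    },
    auth=AUTH,
).json()

print("failing builds:", builds["count"])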
Setup:
read from pubsub -> window of 30s -> group by user -> combine -> write to cloud datastore
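In Beam's Python SDK that shape is roughly the following; this is only a sketch, and parse_event, combine_events, to_datastore_entity, pipeline_options, subscription, and project_id are placeholders rather than my actual code:

import apache_beam as beam
from apache_beam.transforms.window import FixedWindows
from apache_beam.io.gcp.datastore.v1new.datastoreio import WriteToDatastore

with beam.Pipeline(options=pipeline_options) as p:       # streaming options assumed
    (p
     | "Read" >> beam.io.ReadFromPubSub(subscription=subscription)
     | "Parse" >> beam.Map(parse_event)                  # -> (user_id, event)
     | "Window" >> beam.WindowInto(FixedWindows(30))     # 30 second fixed windows
     | "Combine" >> beam.CombinePerKey(combine_events)   # group by user + combine
     | "ToEntity" >> beam.Map(to_datastore_entity)       # build Datastore entities
     | "Write" >> WriteToDatastore(project_id))          # write to Cloud Datastore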
Problem:
I'm seeing DatastoreIO writer errors because objects with the same key are present in the same transaction.
Question:
I want to understand how my pipeline combines results into bundles after the group by/combine operation. I would expect a bundle to be created for every window after the combine. But apparently, a bundle can contain two or more occurrences of the same user?
Can re-execution (retries) of bundles cause this behavior?
Is this bundling dependent of the runner?
Is deduplication an option? If so, how would I best approach that?
Note that I'm not looking for a replacement for the datastore writer at the end of the pipeline, I already know that we can use a different strategy. I'm merely trying to understand how the bundling happens.
There are two answers to your question. One is specific to your use case, and the other is in general about bundling / windowing in streaming.
Specific to your pipeline
I am assuming that the 'key' for Datastore is the User ID? In that case, if you have events from the same user in more than one window, your GroupByKey or Combine operations will have one separate element for every pair of user+window.
So the question is: What are you trying to insert into datastore?
An individual user's resulting aggregate over all time? In that case, you'd need to use a Global Window.
A user's resulting aggregate for every 30 seconds in time? Then you need to use the window as part of the key you use to insert to datastore. Does that help / make sense?
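If you go with the second option, one way to get at the window is the DoFn window parameter; here is a minimal sketch in the Python SDK (the composite key format is just an illustration):

import apache_beam as beam

class AddWindowToKey(beam.DoFn):
    """Turns (user_id, aggregate) into (user_id + window, aggregate)."""

    def process(self, element, window=beam.DoFn.WindowParam):
        user_id, aggregate = element
        # Use the window's end timestamp as part of the Datastore key, so the
        # same user in different 30s windows maps to different entities.
        window_end = window.end.to_utc_datetime().isoformat()
        yield ("%s|%s" % (user_id, window_end), aggregate)

Anything that makes the user+window pair unique will do as the key.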
Happy to help you design your pipeline to do what you want. Chat with me in the comments or via SO chat.
The larger question about bundling of data
Bundling strategies will vary by runner. In Dataflow, you should consider the following two factors:
Every worker is assigned a key range. Elements for the same key will be processed by the same worker.
Windowing is a property of individual elements, not of bundles: a bundle may contain elements from multiple windows. As an example, if the data freshness metric makes a big jump*, a number of windows may be triggered - and elements with the same key in different windows would be processed in the same bundle.
*- When can data freshness jump suddenly? A stream with a single element that has a very old timestamp and is very slow to process may hold the watermark back for a long time. Once this element is processed, the watermark may jump a lot, to the next-oldest element (check out this lecture on watermarks ;) ).
To our streaming pipeline, we want to submit unique GCS files, each file containing multiple events, and each event containing a key (for example, device_id). As part of the processing, we want to shuffle by this device_id so as to achieve some form of worker-to-device_id affinity (more background on why we want to do this is in this other SO question). Once all events from the same file are complete, we want to reduce (GroupBy) by their source GCS file (which we will make a property of the event itself, something like file_id) and finally write the output to GCS (could be multiple files).
The reason we want to do the final GroupBy is that we want to notify an external service once a specific input file has completed processing. The only problem with this approach is that, since the data is shuffled by device_id and then grouped at the end by file_id, there is no way to guarantee that all data from a specific file_id has completed processing.
Is there something we could do about it? I understand that Dataflow provides exactly-once guarantees, which means all the events will eventually be processed, but is there a way to set a deterministic trigger to say all data for a specific key has been grouped?
EDIT
I wanted to highlight the broader problem we are facing here. The ability to mark file-level completeness would help us checkpoint different stages of the data as seen by external consumers. For example, this would allow us to trigger per-hour or per-day completeness markers, which are critical for us to generate reports for that window. Given that these stages/barriers (hour/day) are clearly defined on the input (GCS files are date/hour partitioned), it is only natural to expect the same of the output. But with Dataflow's model, this seems impossible.
Similarly, although Dataflow guarantees exactly-once processing, there will be cases where the entire pipeline needs to be restarted because something went horribly wrong - in those cases, it is almost impossible to restart from the correct input marker, since there is no guarantee that what was already consumed has been completely flushed out. The DRAIN mode tries to achieve this, but as mentioned, if the entire pipeline is messed up and draining itself cannot make progress, there is no way to know which part of the source should be the starting point.
We are considering using Spark, since its micro-batch-based streaming model seems to fit better. We would still like to explore Dataflow if possible, but it seems that we won't be able to achieve this without storing these checkpoints externally from within the application. If there is an alternative way of providing these guarantees from Dataflow, that would be great. The idea behind broadening this question was to see if we are missing an alternate perspective that would solve our problem.
Thanks
This is actually tricky. Neither Beam nor Dataflow has a notion of a per-key watermark, and it would be difficult to implement that level of granularity.
One idea would be to use a stateful DoFn instead of the second shuffle. This DoFn would need to receive the number of elements expected in the file (from either a side-input or some special value on the main input). Then it could count the number of elements it had processed, and only output that everything has been processed once it had seen that number of elements.
This would be assuming that the expected number of elements can be determined ahead of time, etc.
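A rough sketch of that idea in the Beam Python SDK, assuming the input is keyed by file_id and that each element carries the expected count (all names here are hypothetical):

import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import ReadModifyWriteStateSpec

class FileCompletionDoFn(beam.DoFn):
    """Counts elements per file_id and emits the file_id once all have been seen."""

    SEEN = ReadModifyWriteStateSpec('seen', VarIntCoder())

    def process(self, element, seen=beam.DoFn.StateParam(SEEN)):
        # State is per key, so the counter is scoped to a single file_id.
        file_id, (expected_count, _event) = element
        count = (seen.read() or 0) + 1
        seen.write(count)
        if count == expected_count:
            # Everything for this file has been processed; downstream, this
            # signal can be used to notify the external service.
            yield file_id

If the input can contain duplicates, you would probably want to track distinct element ids in state rather than a bare count.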
I need to maintain a list of all of a user's incomplete tasks with the Asana API.
Right now, the best solution I've come up with is polling Asana every X minutes and using /tasks with the completed_since filter. However, this is inefficient, since I have to perform exactly one call for every workspace.
The next thing I tried was looking into the /events API, but events are generated only for projects and tasks. I have about 25 projects, so it isn't the best solution either.
Is there any way I could check for updates efficiently?
Thanks.
Actually, "exactly one call per workspace" is as good as it's gonna get - we scope each request to a workspace (in fact, it's likely that in the future each API call will need to be explicitly scoped to a workspace). It's a hard IP boundary, so basically we never "mix" data from different workspaces (except for certain exceptions, like "listing the workspaces I'm in").
If you're specifically only looking for updates to tasks, you could also use modified_since.
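A minimal sketch of that polling call in Python with requests; the token, workspace id, and timestamp handling are placeholders:

import requests

TOKEN = "my-personal-access-token"  # placeholder
WORKSPACE_ID = "1234567890"         # placeholder

def fetch_updated_incomplete_tasks(since_iso8601, assignee="me"):
    """One call per workspace: incomplete tasks modified since the last poll."""
    resp = requests.get(
        "https://app.asana.com/api/1.0/tasks",
        headers={"Authorization": "Bearer " + TOKEN},
        params={
            "workspace": WORKSPACE_ID,
            "assignee": assignee,
            "completed_since": "now",         # only incomplete tasks
            "modified_since": since_iso8601,  # only tasks changed since the last poll
        },
    )
    resp.raise_for_status()
    return resp.json()["data"]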
How do I retrieve the tasks for a project under a priority heading?
For example, I have a recruitment project and I want to retrieve the tasks under the "Interviewed" heading (a priority heading).
Thanks
There isn't currently a way to only get tasks in a given section, so the only way to do this at the moment is to fetch all tasks for the project and then filter on your side. Fortunately, the API will return the tasks in the appropriate order such that all the tasks in a given section appear after it.
It's clunky, and we do intend to provide better support for sections at some point in the future, but it's not on our immediate roadmap so I'd definitely recommend this workaround for now. If the response is simply too large, one hack could be to get the ID of the "Interviewed:" task, then fetch only the IDs from the project (GET /projects/.../tasks?opt_fields=id), and then iterate over the tasks by ID. I'd only recommend this approach if the project is genuinely too big to fetch at once, though.
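A small sketch of the first workaround in Python with requests; the project id and token are placeholders, and the section name comes from the example above:

import requests

TOKEN = "my-personal-access-token"  # placeholder
PROJECT_ID = "1234567890"           # placeholder

def tasks_in_section(section_name="Interviewed:"):
    """Fetch all tasks in the project, then keep the ones under one section header."""
    resp = requests.get(
        "https://app.asana.com/api/1.0/projects/%s/tasks" % PROJECT_ID,
        headers={"Authorization": "Bearer " + TOKEN},
        params={"opt_fields": "id,name"},
    )
    resp.raise_for_status()
    tasks = resp.json()["data"]

    # Tasks come back in order; a section is a task whose name ends with ':',
    # and its members follow it until the next section header.
    in_section, members = False, []
    for task in tasks:
        if task["name"].endswith(":"):
            in_section = (task["name"] == section_name)
        elif in_section:
            members.append(task)
    return members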
I am trying to retrieve a list of opened work items for a given project programmatically. In searching through the web, the only way that I can see to do this is to use the WorkItemStore API and execute a query.
The major issue that I am having is that retrieving the WorkItemStore takes almost 2 minutes. I subsequently cache it, but the first hit is unacceptable. Beyond that, my application needs to refresh it every X minutes in case new work items are added.
Is there any way to get a list of opened work items associated with a project without using the WorkItemStore? I only need the work item number and, optionally, the title. I don't need any other information.
If not, is there something that I am doing wrong, or something wrong with the TFS server (an index missing, perhaps), that makes the performance so slow? I have tried different ways of getting it, by the way; they are all extremely slow.
WorkItemStore store = (WorkItemStore)tfs.GetService(typeof(WorkItemStore)); // from an existing TfsTeamProjectCollection service provider
or
workItemStore = new WorkItemStore(tfsTeamProjectCollection); // from a TfsTeamProjectCollection instance
or
workItemStore = new WorkItemStore(tfsServerName); // from the server name/URI
Any help in this matter would be greatly appreciated.
Even with an incredibly large DB you shouldn't experience two-minute delays.
I would load up SQL Profiler and take a look at the query generated to get work items. From there, you can probably identify what part of the query is causing the delay.
You can also consider running the query on the same box that the TFS DBs are on and see if that is the issue. As the comment above points out, remote connections can certainly cause delays.
If none of this resolves the issue, then hopefully you can provide some more information, like the size of the project (shouldn't matter), the TFS installation configuration (where your servers are and how they are set up), and what hardware it is running on.