How to track DORA's four key metrics for teams delivering libraries instead of a deployable?

The four key metrics as defined by the DORA research program are often used by software delivery teams to measure their performance and find out where they fall on the scale from “low performers” to “elite performers”.
The four key metrics indicating the performance of a software development team are:
Deployment Frequency — How often a team successfully releases to production
Lead Time for Changes — The amount of time it takes a commit to get into production
Change Failure Rate — The percentage of deployments causing a failure in production
Time to Restore Service — How long it takes a team to recover from a failure in production
Tracking those metrics for teams delivering a library (either open source or for internal use) is not a simple one-to-one mapping, since some of the metrics are not applicable: those teams don't deliver a deployable.
While I think we can easily map the first 2 metrics (Deployment Frequency and Lead Time for Changes) to:
Release Frequency — How often a team successfully releases a library
Lead Time for Changes — The amount of time it takes a commit to get into a library release
I still struggle to see whether there is a straightforward way to map the last 2 metrics (Change Failure Rate and Time to Restore Service) to metrics that are meaningful for a library.
Any thoughts?
Any other metric you track for measuring software delivery performance in teams delivering libraries?
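To make the mapped Lead Time for Changes concrete, here is a rough sketch of how it could be measured for a library straight from git, as the time from each commit to the earliest release tag that contains it. The v* tag scheme, the 90-day window, and the use of the release tag's commit timestamp as the release time are assumptions, not part of the question.

import subprocess
from statistics import median
from typing import Optional

def git(repo: str, *args: str) -> str:
    # Run a git command in the given repository and return its output.
    return subprocess.check_output(["git", "-C", repo, *args], text=True).strip()

def lead_time_seconds(repo: str, commit: str) -> Optional[int]:
    # Seconds from a commit to the earliest v* release tag that contains it.
    tags = git(repo, "tag", "--list", "v*", "--contains", commit).splitlines()
    if not tags:
        return None  # the change has not been released yet
    commit_ts = int(git(repo, "log", "-1", "--format=%ct", commit))
    release_ts = min(int(git(repo, "log", "-1", "--format=%ct", tag)) for tag in tags)
    return release_ts - commit_ts

# Median lead time (in days) over the commits of the last 90 days on HEAD.
repo = "."
commits = git(repo, "rev-list", "--since=90.days", "HEAD").splitlines()
samples = [s for s in (lead_time_seconds(repo, c) for c in commits) if s is not None]
print("median lead time (days):", median(samples) / 86400 if samples else "n/a")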

Related

ordering message pub/sub GCP

I am new to Dataflow and pub-sub tools in GCP.
I need to migrate a current on-prem process to GCP.
The current process is as follows:
We have two types of data feeds:
Full Feed – an ad hoc job. The full XML is ~100 GB (a single, very complex XML containing the complete data; an ETL job processes this XML and loads it into ~60 tables).
Separate ETL jobs process the full feed: they create load-ready files, and all tables are truncated and reloaded.
Delta Feed – every 30 minutes we need to process delta files (XML files containing only the changes from the last 30 minutes).
The source system pushes XML files every 30 minutes (more than one; each file has a timestamp). A scheduled ETL process picks up all the files produced by the source system, processes them, and creates three load-ready files (insert, delete and update) for each table.
Schedule – the ETL jobs are scheduled to run every 5 minutes; if the current run takes longer than 5 minutes, the next run does not trigger until the current one completes.
The order of file processing is very important (the ETL job takes care of this); all files must be processed in sequence.
At the end of the ETL process, the load-ready files are loaded into the tables (on the mainframe).
I was asked to propose a design to migrate this to GCP. We need two processes in GCP as well, full and delta, and my proposed solution should be suitable for both feeds.
Initially I thought of the design below:
Pub/Sub -> Dataflow -> MySQL/BigQuery
Then I came to know that Pub/Sub does not guarantee that files are processed in sequence/order. After doing some research I learned that Google recently introduced an ordering-key concept for Pub/Sub, which makes sure messages are processed in order. The Google Cloud docs mention that this feature is in beta.
I have two questions:
Has anyone used the ordering-key concept in Pub/Sub in a production environment? If yes, did you face any challenges while implementing it?
Is this design suitable for the above requirements, or is there a better solution in GCP?
Is there any alternative to Dataflow?
I also came to know that Pub/Sub can handle messages of at most 10 MB, while for us each XML is more than ~5 GB.
As mentioned by @guillaume blaquiere, the beta launch phase brings some restrictions, but they are mostly related to product support:
At beta, products or features are ready for broader customer testing
and use. Betas are often publicly announced. There are no SLAs or
technical support obligations in a beta release unless otherwise
specified in product terms or the terms of a particular beta program.
The average beta phase lasts about six months.
Commonly, the Cloud Pub/Sub message ordering feature works as intended; if you come across something that needs the developers' attention, it is highly appreciated if you send a report via the Google Issue Tracker.
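For reference, a minimal sketch of publishing with an ordering key using the google-cloud-pubsub Python client; the project and topic names are placeholders, and the subscription must also be created with message ordering enabled:

from google.cloud import pubsub_v1

# Message ordering has to be enabled on the publisher client; the
# subscription must be created with message ordering enabled as well.
publisher = pubsub_v1.PublisherClient(
    publisher_options=pubsub_v1.types.PublisherOptions(enable_message_ordering=True)
)
topic_path = publisher.topic_path("my-project", "delta-feed")  # placeholders

# Messages sharing the same ordering key are delivered to subscribers
# in the order in which they were published.
for payload in [b"delta-001.xml", b"delta-002.xml", b"delta-003.xml"]:
    future = publisher.publish(topic_path, payload, ordering_key="delta-feed")
    print(future.result())  # message ID; raises if publishing failed

Note that the payloads above are file names rather than file contents: given the 10 MB message limit mentioned in the question, a message would typically carry a reference to the XML (for example its GCS path) rather than the XML itself.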

Beam Runner hooks for Throughput-based autoscaling

I'm curious if anyone can point me towards greater visibility into how various Beam Runners manage autoscaling. We seem to be experiencing hiccups during both the 'spin up' and 'spin down' phases, and we're left wondering what to do about it. Here's the background of our particular flow:
1- Binary files arrive on gs://, and object notification duly notifies a PubSub topic.
2- Each file requires about 1 minute of parsing on a standard VM to emit about 30K records to downstream areas of the Beam DAG.
3- 'Downstream' components include things like inserts to BigQuery, storage in GS:, and various sundry other tasks.
4- The files in step 1 arrive intermittently, usually in batches of 200-300 every hour, making this - we think - an ideal use case for autoscaling.
What we're seeing, however, has us a little perplexed:
1- It looks like when 'workers=1', Beam bites off a little more than it can chew, eventually causing some out-of-RAM errors, presumably as the first worker tries to process a few of the PubSub messages which, again, take about 60 seconds per message to complete, because the 'message' in this case is that a binary file needs to be deserialized from GS.
2- At some point, the runner (in this case, Dataflow with jobId 2017-11-12_20_59_12-8830128066306583836), gets the message additional workers are needed and real work can now get done. During this phase, errors decrease and throughput rises. Not only are there more deserializers for step1, but the step3/downstream tasks are evenly spread out.
3- Alas, the previous step gets cut short when Dataflow senses (I'm guessing) that enough of the PubSub messages are 'in flight' to begin cooling down a little. That seems to come a little too soon, and workers are getting pulled while they are still chewing through the PubSub messages themselves - even before the messages are ACK'd.
We're still thrilled with Beam, but I'm guessing the less-than-optimal spin-up/spin-down phases are resulting in 50% more VM usage than what is needed. What do the runners look for besides PubSub consumption? Do they look at RAM/CPU/etc.? Is there anything a developer can do, besides ACKing a PubSub message, to provide feedback to the runner that more/fewer resources are required?
Incidentally, in case anyone doubted Google's commitment to open-source, I spoke about this very topic with an employee there yesterday, and she expressed interest in hearing about my use case, especially if it ran on a non-Dataflow runner! We hadn't yet tried our Beam work on Spark (or elsewhere), but would obviously be interested in hearing if one runner has superior abilities to accept feedback from the workers for THROUGHPUT_BASED work.
Thanks in advance,
Peter
CTO,
ATS, Inc.
Generally, streaming autoscaling in Dataflow works like this:
Upscale: If the pipeline's backlog is more than a few seconds based on current throughput, the pipeline is upscaled. Here CPU utilization does not directly affect the amount of upscaling. Knowing the CPU (say it is at 90%) does not help answer the question 'how many more workers are required?'. CPU does have an indirect effect, since pipelines fall behind when they don't have enough CPU, thus increasing the backlog.
Downscale: When the backlog is low (i.e. < 10 seconds), the pipeline is downscaled based on current CPU consumption. Here, CPU does directly influence the downscale size.
I hope the above basic description helps.
Due to inherent delays involved in starting up new GCE VMs, the pipeline pauses for a minute or two during resizing events. This is expected to improve in near future.
I will ask specific questions about the job you mentioned in the description.
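As the answer implies, the runner scales on backlog and CPU rather than on explicit feedback from user code; what a developer can control directly are the bounds set through the pipeline options. A minimal sketch assuming the Beam Python SDK on the Dataflow runner, with placeholder project, region and bucket names:

from apache_beam.options.pipeline_options import PipelineOptions

# Throughput-based autoscaling with an explicit worker cap. The runner still
# decides when to scale, but max_num_workers bounds the spin-up.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                 # placeholder
    region="us-central1",                 # placeholder
    temp_location="gs://my-bucket/tmp",   # placeholder
    streaming=True,
    autoscaling_algorithm="THROUGHPUT_BASED",
    max_num_workers=10,
)

Capping max_num_workers does not fix an early downscale, but it does limit how much VM time an over-eager spin-up can consume.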

travis/coverity: automatically re-schedule a build after given time

I'm using GitHub's integration of Travis CI with Coverity Scan (the free versions of all these services) to test my FLOSS code.
The problem I'm facing is that when I work on the code continuously, I hit the Coverity quota pretty soon.
Since I'm working on multiple projects simultaneously, it can well happen that I switch away from a given project before I'm allowed to submit to Coverity again, thus potentially leaving flaws in the code for weeks although they would have been caught easily by Coverity.
I would like to avoid this.
The first measure to prevent hitting the quota too frequently, is by using a dedicated branch (usually coverity_scan) which does not receive pushes as often as the master and/or feature branches.
However, this puts cognitive load on the user (me), which I also like to avoid.
Also, sometimes I still hit the quota (some of my projects are in the 100k-500k lines-of-code range, so they have a lower threshold than usual).
What I would like to have is being able to automatically re-trigger a coverity-scan once the quota has expired, if (and only if) the current build did hit the quota.
Is something like this possible with plain Travis CI/Coverity features?
Or would I have to set up a separate hook that monitors the Coverity quota and the Travis CI builds?
You don't need to run Coverity on every check-in. It's just too slow.
You should configure your (Coverity build) system to poll your repo for changes, but check only infrequently - something like a few times per day.
This will trigger a build when things have changed, but not on every single change that is detected.
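A rough sketch of that "poll infrequently, trigger on change" idea, meant to be run from cron a few times per day. The repository slug, branch name and token are placeholders; the trigger uses the documented Travis API v3 'requests' endpoint, and the script does not inspect the Coverity quota itself - it only checks whether there is anything new to scan:

import json
import pathlib
import subprocess
import urllib.parse

import requests

REPO = "owner/project"        # placeholder repository slug
BRANCH = "coverity_scan"      # the dedicated Coverity branch
TOKEN = "TRAVIS_API_TOKEN"    # read from a secret store in practice
STATE = pathlib.Path.home() / ".coverity_poll_head"

def remote_head() -> str:
    # Commit hash of the branch on the remote, via plain git (no clone needed).
    out = subprocess.check_output(
        ["git", "ls-remote", f"https://github.com/{REPO}.git", BRANCH], text=True
    )
    return out.split()[0]

def trigger_build() -> None:
    # Ask Travis to build the Coverity branch via the API v3 'requests' endpoint.
    slug = urllib.parse.quote(REPO, safe="")
    resp = requests.post(
        f"https://api.travis-ci.com/repo/{slug}/requests",
        headers={
            "Travis-API-Version": "3",
            "Authorization": f"token {TOKEN}",
            "Content-Type": "application/json",
        },
        data=json.dumps({"request": {"branch": BRANCH}}),
        timeout=30,
    )
    resp.raise_for_status()

head = remote_head()
if not STATE.exists() or STATE.read_text().strip() != head:
    trigger_build()
    STATE.write_text(head)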

Tool for monitoring QOS

In my project
We crawl x number of servers.
The number of users for each server varies from 1 to n.
We crawl 1 to z items for each user.
Currently we are monitoring QoS using Graphite; we store the time taken to crawl each item:
x.time_taken
The problem with this approach is that if only a single user is affected, we get a false alert about QoS.
What would be the correct tool/technique to answer/monitor the following points:
Alert only if at least k users are affected (not a number of events).
A list of the users that were affected.
I think Graphite and StatsD are not the correct tools for this. What would be a better tool for answering those two questions?
What you are asking for is often called Service Monitoring. For very good reasons you want to know the service impact of an event, rather than just that an event has happened.
The advantage of this approach is exactly as you state in your requirements - you can focus on events which impact a large part of your user base and you have a list of the users affected right away.
The main drawback, IMHO, is that Service Monitoring is usually much more complex than simple performance or event/alert monitoring. It also often relies on a service model, which in my experience is something that is hard to build and even harder to keep up to date.
For example if a server in your system shows a significant slow down or failure, depending on your architecture this may impact all users who use a service that relies on that server, or it may impact a very small subset, or even none at all initially, if there is a load balancing mechanism or redundancy mechanism in place.
You would need to reflect this architecture in your service monitoring model, and also change it every time you update your system architecture or deployment.
If your system is static enough or critical enough to warrant the investment, then this may be worth your while. If not, a simple compromise may be to update the graphing and alerting you are doing so that you alert when the average response time over a set number of users, or over all users on a server, increases by a significant amount.
This may give you most of the benefits you are after without having to invest in the extra complexity of a service monitoring solution.
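As a concrete illustration of that compromise and of the "at least k users" requirement, here is a minimal sketch that is independent of Graphite/StatsD. It assumes crawl time is reported per user and that a user counts as affected once their crawl time exceeds a threshold; the names and thresholds are illustrative:

from typing import Dict, List, Tuple

def qos_alert(
    crawl_times: Dict[str, float],   # user -> seconds taken to crawl their items
    threshold_s: float = 30.0,       # per-user QoS threshold (illustrative)
    min_affected_users: int = 5,     # the "k" from the question
) -> Tuple[bool, List[str]]:
    # Alert only when at least k users are affected, and report who they are.
    affected = [user for user, t in crawl_times.items() if t > threshold_s]
    return len(affected) >= min_affected_users, affected

# A single slow user no longer raises an alert, avoiding the false positive.
should_alert, affected = qos_alert({"alice": 120.0, "bob": 3.2, "carol": 4.1})
print(should_alert, affected)   # False ['alice']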
If you definitely are looking to expand your monitoring approach and want to stick with open-source tools, then I would start by looking at Nagios if your focus is on infrastructure; there are also quite a few web-service monitoring solutions with free tiers, such as Pingdom:
http://www.nagios.org
https://www.pingdom.com

Cost of continuous replications vs one-shot replications (using TouchDB and Cloudant)

We have an app that uses Cloudant as a remote server. However, from previous experience, Cloudant is not completely compatible with TouchDB's continuous replications. So our alternative for now is to manually trigger one-shot replications at a fixed frequency. We would like to know whether that approach is going to cost us more money than continuous replications, since continuous replications use longpoll and don't need to query the server as often. In other words, do one-shot pull replications with Cloudant as the remote cost us a GET request?
Thank you,
Paul
I think the issue you refer to is [1].
Cloudant's replication is 100% compatible with CouchDB. In this
instance, TouchDB's logs indicate the iOS network stack passed
on incomplete JSON to TouchDB. It's not clear who was to blame
in this case for the replication failure.
[1] https://github.com/couchbaselabs/TouchDB-iOS/issues/241
For the cost question, a one-shot pull replication will result in a GET to the _changes
feed each time it happens, plus the other requests required to
replicate. This _changes request will be counted as a light
HTTP request against your Cloudant account.
However, whether this works out as more or fewer requests overall
depends on the number of changes coming down from the remote server.
It's also important to remember that the number of _changes calls is very small
relative to the number of other calls involved (e.g., getting the
content of the changes themselves, particularly if there are many
attachments).
While this question is specific to TouchDB, and I mention specific
behaviours of that codebase, this answer deals with the requests involved
in replication between any two systems speaking the CouchDB replication
protocol[2].
[2] http://www.dataprotocols.org/en/latest/couchdb_replication.html
Let's take a contrived example: 1 update per 10 second window to
the source database for the replication, where a TouchDB database
is the target. Let's take a 5 minute poll vs. a continuous replication.
For simplicity of call-counting, let's also take attachments out of the
picture. We'll also assume the device has a constant network connection.
For the continuous case, every 10s TouchDB will receive an update in
the _changes feed. This causes the longpoll connection to close.
TouchDB then runs through the changes, requesting the updates from the
source database; one or more GET requests on the remote server. While
this is happening, TouchDB has to open up another longpoll request
to _changes. So in a five minute period, you'd end up with perhaps
30 calls to _changes, plus all the calls to get documents and record
checkpoints.
Compare this with a one-shot replication every five minutes. You'd
receive notification of the 30 updates in one _changes feed call.
TouchDB implements an optimisation[3] whereby it will call _all_docs
to get updated documents for 1- revs, so you might end up with a single
call to get all 30 documents (not possible in the continuous case, as
there you receive a single change at a time). Then you have the
checkpoint documents to record. At best that's fewer than 5 HTTP calls;
at most it's about a third of the continuous case, as you've avoided
the extra _changes requests.
[3] https://github.com/couchbaselabs/TouchDB-iOS/wiki/Replication-Algorithm#performance
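To make the call-counting concrete, here is a quick tally of that contrived example (1 update every 10 s over a 5-minute window, attachments and checkpoint writes set aside). The exact numbers depend on TouchDB's batching, so treat this as illustrative only:

# 1 update every 10 seconds over a 5-minute window.
window_s, update_interval_s = 5 * 60, 10
updates = window_s // update_interval_s            # 30 changes

# Continuous: each change closes the longpoll and a new one is opened, so
# roughly one _changes call per change plus the GETs that fetch each
# changed document (checkpoint writes come on top in both cases).
continuous = {"_changes": updates, "document GETs": updates}

# One-shot every 5 minutes, best case: one _changes call reports all 30
# updates and one _all_docs call fetches the 1- revisions in bulk.
one_shot = {"_changes": 1, "_all_docs": 1}

print(sum(continuous.values()), "vs", sum(one_shot.values()))   # ~60 vs ~2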
It comes down to the frequency of updates you expect to the source
database. One-shot replication is likely to provide a smoother price
curve as you're in better control of the number of requests you make.
A further question is how often connections will drop because of the
network disconnects which happen regularly with mobile devices.
TouchDB's continuous replications will fire back up each time the
user comes online (if added via the _replicator database). This is a
further source of unpredictable costs.
However, the benefits of more immediate visibility of changes may
well be worth the uncertainty.
