Minimum Acceptable Code Coverage Numbers in the real world [duplicate] - code-coverage

Possible Duplicate:
What is a reasonable code coverage % for unit tests (and why)?
I am in the middle of putting together some guidelines around unit test code coverage and I want to specify a number that really makes sense. It's easy to repeat the 100% mantra that I see all over the internet without considering the cost-benefit analysis and when diminishing returns actually set in.
I'm soliciting comments from people who have actually reported code coverage on real-life, medium/large-sized projects. What percentages were you seeing? How much is too much? I really want some balance (in figures) that will help developers produce high-quality code. Is 65% coverage too low to expect? Is 80% too high?

When you mix code coverage with cyclomatic complexity, you can use the CRAP metric.
From artima.com:
Individual Method Interpretation:
Bob Evans and I have looked at a lot of examples (using our code and many open source projects) and listened to a LOT of opinions. After much debate, we decided to INITIALLY use a CRAP score of 30 as the threshold for crappiness. Below is a table that shows the amount of test coverage required to stay below the CRAP threshold based on the complexity of a method:
Method’s Cyclomatic Complexity    % of coverage required to be below CRAPpy threshold
------------------------------    ---------------------------------------------------
0 – 5                             0%
10                                42%
15                                57%
20                                71%
25                                80%
30                                100%
31+                               No amount of testing will keep methods this complex out of CRAP territory.
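For reference, here is a small Python sketch of the published CRAP formula behind that table, with coverage expressed as a fraction between 0 and 1:

def crap(complexity, coverage):
    # CRAP(m) = comp(m)^2 * (1 - cov(m))^3 + comp(m), coverage in [0, 1]
    return complexity ** 2 * (1 - coverage) ** 3 + complexity

print(crap(25, 0.80))   # ~30.0, right at the suggested threshold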
No amount of code coverage is going to guarantee "high quality code" by itself.
From the comments...
It's definitely too lax to give simple methods a pass on coverage. What you will likely find when implementing this on existing code is that coverage rises as you refactor those ugly methods (coverage should rise; otherwise you're refactoring dangerously).
The 0-5's are essentially low-hanging fruit and the ROI isn't all that great. That being said, those methods are wonderful for learning TDD because they're often very easy to test.

Personally I would go for 80% coverage, but of course this is only relative... I personally haven't achieved this yet either.
Currently we have very high coverage (99%) on our utility classes, which is good because bugs there will haunt you throughout your whole application.
Coverage is mediocre for most GUIs, because writing tests for them is hard and time-consuming, so we often settle for opening the GUI in the unit tests and closing it automatically if no error occurs.

I don't think you can really have too much code coverage. I think you need to determine what code runs in the "regular course of business" of the application and have that completely covered. For the remaining code that isn't in the normal course of business, start whittling it down by covering the most critical parts first. Unusual paths that aren't terribly important offer little gain from thorough coverage.

The only correct answer is you test as much as you can afford. Obviously, this is an axiom across every engineering project.
Beyond that, it's all subjective and highly dependent upon the project at hand. For example, the flight control systems Lockheed puts out had better be tested to more than 80%, but 80% may suffice for my GUI front-end to an XML viewer.
Typically, you work out the cost of running tests with your team. In practice that comes down to answering, in man-hours, the question: how much testing can we afford?
After this, you examine your modules and determine which parts of the code have the most time spent in them. Each critical module should be covered at least once. From there, you assign a number of tests proportional to the amount of time specific modules are executed. So in the end, there's no hard "X% covered" number.
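As a rough illustration of that allocation idea (my own sketch, not a standard formula): take whatever testing budget you agreed on and split it across modules in proportion to the share of execution time each accounts for.

# Hypothetical sketch: divide an agreed testing budget (in hours) across
# modules in proportion to the share of runtime each accounts for.
def allocate_test_hours(budget_hours, runtime_share):
    return {module: budget_hours * share
            for module, share in runtime_share.items()}

print(allocate_test_hours(40, {"parser": 0.5, "reporting": 0.3, "ui": 0.2}))
# {'parser': 20.0, 'reporting': 12.0, 'ui': 8.0}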
John Musa has a really interesting book on the subject.

On the program that I'm on (~500k SLOC), we use 100%. That is a program requirement to proceed to the next phase of testing. Here are the reasons behind it:
1. The program is used in some safety-critical situations, and you don't want any off-nominal conditions to go untested.
2. If you aren't hitting 100%, then either you wrote code that isn't necessary, and are hence wasting money, or you aren't testing your off-nominal paths completely. See #1.
3. Your unit test scenarios should naturally get you close to 100%, regardless of the actual code coverage metric you're using. If someone is at 95% based solely on their off-nominal scenarios, requiring 100% isn't onerous (and, again, you should be asking why you aren't at 100% then. See #2.)
Your mileage will certainly vary. If you aren't working on a mission/safety-critical application, then you probably don't need to worry about your code coverage as much - however, I'd have to ask again: why are you writing code that you don't need?
[Edit]
Based on the comments I've received below, I should clarify. The program guideline is 100% code coverage for unit tests. That development process requirement can be waived if, for a technical reason, a branch of code cannot be reached (protected default constructor that is never called, etc.) Approval is usually granted from an external, independent portion of the organization (go go SQA).
From an integration / systems test, code coverage becomes moot, as you start looking at requirements coverage. That's a different ball of yarn altogether.
The original question was looking for real world situations: I agree that not all (most?) real world situations will warrant 100% code coverage at the unit test level, but there are certainly cases, and programs, that do. And it is a habit of some developers to write code that they don't need, which then ends up untested. This becomes a maintenance nightmare, as a later developer will call methods that were never "meant" to be used (or were included because someone thought they were a "good" idea). Shooting for 100% coverage forces you to answer the question "why did I write this?"

It really depends. I know a lot of software that ships with 0% coverage, and a lot that sits in the single digits. The main question is what is really needed and wanted, in financial terms.

Related

Understanding EclEmma results

I am trying to understand the concept of code coverage and am a complete novice to this topic.
I am using EclEmma to measure the code coverage of an open-source codebase. Can somebody help me understand what important insights I should draw from the snapshot below?
Code coverage is a metric, which expresses which portion of the (application) code gets executed when you run your test cases. However, it is just a measure of completeness; it provides no information about how thoroughly the executed code was tested by the test cases.
In your screenshot, the third line of the table (src/main/java) is the relevant one. It shows that the application code consists of 3,846 (bytecode) instructions; of these, roughly 67% were executed (presumably by the automated test cases residing in src/test/java). This means the test cases cannot reveal any fault in one third of the whole application, because they do not touch that code at all. The remaining two thirds are executed by at least one test case; test cases can reveal faults in this code, and how effectively they do so depends on their input data and oracles.
Note that it is often not possible or sensible to achieve 100% coverage.
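A tiny, hypothetical example of the difference (in Python, though the point is independent of language and coverage tool): the test below executes every line of the function, so the function shows up as fully covered, yet it asserts nothing and therefore cannot reveal the bug.

def apply_discount(price, rate):
    return price * rate          # bug: should be price * (1 - rate)

def test_apply_discount():
    apply_discount(100, 0.2)     # executes every line, but checks nothing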

Thoughts on minimize code and maximize data philosophy

I have heard of the concept of minimizing code and maximizing data, and was wondering what advice other people can give me on how/why I should do this when building my own systems?
Typically data-driven code is easier to read and maintain. I know I've seen cases where data-driven has been taken to the extreme and winds up very unusable (I'm thinking of some SAP deployments I've used), but coding your own "Domain Specific Languages" to help you build your software is typically a huge time saver.
The Pragmatic Programmers remain, in my mind, the most vivid advocates of writing little languages that I have read. Little state machines that run little input languages can get a lot accomplished in very little space, and make it easy to make modifications.
A specific example: consider a progressive income tax system, with tax brackets at $1,000, $10,000, and $100,000 USD. Income below $1,000 is untaxed. Income between $1,000 and $9,999 is taxed at 10%. Income between $10,000 and $99,999 is taxed at 20%. And income above $100,000 is taxed at 30%. If you were to write this all out in code, it'd look about as you suspect:
/* Brackets and rates are hard-coded into the function. */
double total_tax_burden(double income) {
    if (income < 1000)
        return 0;
    if (income < 10000)
        return 0.1 * (income - 1000);
    if (income < 100000)
        return 999.9 + 0.2 * (income - 10000);
    return 18999.7 + 0.3 * (income - 100000);
}
Adding new tax brackets, changing the existing brackets, or changing the tax burden in the brackets, would all require modifying the code and recompiling.
But if it were data-driven, you could store this table in a configuration file:
1000:0
10000:10
100000:20
inf:30
Write a little tool to parse this table and do the lookups (not very difficult, right?) and now anyone can easily maintain the tax rate tables. If Congress decides that 1,000 brackets would be better, anyone could make the tables line up with the IRS tables and be done with it, no recompiling necessary. The same generic code could be used for one bracket or hundreds of brackets.
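For illustration, here is a minimal sketch of such a lookup tool (my own, in Python rather than the C of the example above, and assuming each line of the table is "upper bound:rate in percent", listed in ascending order with "inf" for the top bracket):

def load_brackets(path):
    # Parse lines of the form "<upper bound or inf>:<rate in percent>".
    brackets = []
    with open(path) as f:
        for line in f:
            bound, rate = line.strip().split(":")
            upper = float("inf") if bound == "inf" else float(bound)
            brackets.append((upper, float(rate) / 100.0))
    return brackets

def total_tax_burden(income, brackets):
    # Apply each bracket's rate to the slice of income that falls inside it.
    tax, lower = 0.0, 0.0
    for upper, rate in brackets:
        if income <= lower:
            break
        tax += (min(income, upper) - lower) * rate
        lower = upper
    return tax

Adding Congress's extra brackets is now an edit to the data file, not to the code.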
And now for something that is a little less obvious: testing. The AppArmor project has hundreds of tests for what system calls should do when various profiles are loaded. One sample test looks like this:
#! /bin/bash
# $Id$
# Copyright (C) 2002-2007 Novell/SUSE
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License as
# published by the Free Software Foundation, version 2 of the
# License.
#=NAME open
#=DESCRIPTION
# Verify that the open syscall is correctly managed for confined profiles.
#=END
pwd=`dirname $0`
pwd=`cd $pwd ; /bin/pwd`
bin=$pwd
. $bin/prologue.inc
file=$tmpdir/file
okperm=rw
badperm1=r
badperm2=w
# PASS UNCONFINED
runchecktest "OPEN unconfined RW (create) " pass $file
# PASS TEST (the file shouldn't exist, so open should create it)
rm -f ${file}
genprofile $file:$okperm
runchecktest "OPEN RW (create) " pass $file
# PASS TEST
genprofile $file:$okperm
runchecktest "OPEN RW" pass $file
# FAILURE TEST (1)
genprofile $file:$badperm1
runchecktest "OPEN R" fail $file
# FAILURE TEST (2)
genprofile $file:$badperm2
runchecktest "OPEN W" fail $file
# FAILURE TEST (3)
genprofile $file:$badperm1 cap:dac_override
runchecktest "OPEN R+dac_override" fail $file
# FAILURE TEST (4)
# This is testing for bug: https://bugs.wirex.com/show_bug.cgi?id=2885
# When we open O_CREAT|O_RDWR, we are (were?) allowing only write access
# to be required.
rm -f ${file}
genprofile $file:$badperm2
runchecktest "OPEN W (create)" fail $file
It relies on some helper functions to generate and load profiles, test the results of the functions, and report back to users. It is far easier to extend these little test scripts than it is to write this sort of functionality without a little language. Yes, these are shell scripts, but they are so far removed from actual shell scripts ;) that they are practically data.
I hope this helps motivate data-driven programming; I'm afraid I'm not as eloquent as others who have written about it, and I certainly haven't gotten good at it, but I try.
In modern software the line between code and data can become awfully thin and blurry, and it is not always easy to tell the two apart. After all, as far as the computer is concerned, everything is data, unless it is determined by existing code - normally the OS - to be otherwise. Even programs have to be loaded into memory as data, before the CPU can execute them.
For example, imagine an algorithm that computes the cost of an order, where larger orders get lower prices per item. It is part of a larger software system in a store, written in C.
This algorithm is written in C and reads a file that contains an input table provided by the management with the various per-item prices and the corresponding order size thresholds. Most people would argue that a file with a simple input table is, of course, data.
Now, imagine that the store changes its policy to some sort of asymptotic function, rather than pre-selected thresholds, so that it can accommodate insanely large orders. They might also want to factor in exchange rates and inflation - or whatever else the management people come up with.
The store hires a competent programmer and she embeds a nice mathematical expression parser in the original C code. The input file now contains an expression with global variables, functions such as log() and tan(), as well as some simple stuff like the Planck constant and the rate of carbon-14 degradation.
cost = (base * ordered * exchange * ... + ... / ...)^13
Most people would still argue that the expression, even if not as simple as a table, is in fact data. After all it is probably provided as-is by the management.
The store receives a large number of complaints from clients who became brain-dead trying to estimate their expenses, and from the accounting people about the large amount of loose change. The store decides to go back to the table for small orders and use a Fibonacci sequence for larger orders.
The programmer gets tired of modifying and recompiling the C code, so she embeds a Python interpreter instead. The input file now contains a Python function that polls a roomful of Fib(n) monkeys for the cost of large orders.
Question: Is this input file data?
From a strictly technical point of view, there is nothing different. Both the table and the expression needed to be parsed before use. The mathematical expression parser probably supported branching and functions - it might not have been Turing-complete, but it still used a language of its own (e.g. MathML).
Yet now many people would argue that the input file just became code.
So what is the distinguishing feature that turns the input format from data into code?
Modifiability: Having to recompile the whole system to effect a change is a very good indication of a code-centric system. Yet I can easily imagine (well, more like I have actually seen) software that has been designed incompetently enough to have e.g. an input table built-in at compile time. And let's not forget that many applications still have icons - that most people would deem data - built in their executables.
Input format: This is - naively, in my opinion - the most common factor people consider: "If it is in a programming language then it is code." Fine, C is code - you have to compile it after all. I would also agree that Python is code - it is a full-blown language. So why isn't XML/XSL code? XSL is a quite complex language in its own right - hence the L in its name.
In my opinion, neither of these two criteria is the actual distinguishing feature. I think that people should consider something else:
Maintainability: In short, if the user of the system has to hire a third party to obtain the expertise needed to modify the behaviour of the system, then the system should be considered code-centric to a degree.
This, of course, means that whether a system is data-driven or not should be considered at least in relation to the target audience - if not in relation to the client on a case-by-case basis.
It also means that the distinction can be impacted by the available toolset. The UML specification is a nightmare to go through, but these days we have all those graphical UML editors to help us. If there was some kind of third-party high-level AI tool that parses natural language and produces XML/Python/whatever, then the system becomes data-driven even for far more complex input.
A small store probably does not have the expertise or the resources to hire a third party. So, something that allows the workers to modify its behaviour with the knowledge that one would get in an average management course - mathematics, charts etc - could be considered sufficiently data-driven for this audience.
On the other hand, a multi-billion international corporation usually has on its payroll a bunch of IT specialists and Web designers. Therefore, XML/XSL, Javascript, or even Python and PHP are probably easy enough for it to handle. It also has complex enough requirements that something simpler might just not cut it.
I believe that when designing a software system, one should strive to achieve that fine balance in the used input formats where the target audience can do what they need to, without having to frequently call on third parties.
It should be noted that outsourcing blurs the lines even more. There are quite a few issues for which the current technology simply does not allow the solution to be approachable by the layman. In that case, the target audience of the solution should probably be considered to be the third party to which the operation would be outsourced.
That third party can be expected to employ a fair number of experts.
One of five maxims under the Unix Philosophy, as presented by Rob Pike, is this:
Data dominates. If you have chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
It is often shortened to, "write stupid code that uses smart data."
Other answers have already dug into how you can often code complex behavior with simple code that just reacts to the pattern of its particular input. You can think of the data as a domain-specific language, and of your code as an interpreter (maybe a trivial one).
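As a deliberately trivial sketch of that framing (my own example, not from the sources cited here), the "program" below is plain data that could just as well live in a file, and the code is a few lines of interpreter:

# The data: a tiny "program".
program = [
    ("greet", "world"),
    ("repeat", "ab"),
]

# The code: a trivial interpreter that dispatches on the data.
handlers = {
    "greet": lambda arg: print(f"hello, {arg}"),
    "repeat": lambda arg: print(arg * 2),
}

for op, arg in program:
    handlers[op](arg)

New behaviour is added by extending the data (and, occasionally, the handler table), not by restructuring the control flow.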
Given lots of data you can go further: the statistics can power decisions. Peter Norvig wrote a great chapter illustrating this theme in Beautiful Data, with text, code, and data all available online. (Disclosure: I'm thanked in the acknowledgements.) On pp. 238-239:
How does the data-driven approach compare to a more traditional software development process wherein the programmer codes explicit rules? ... Clearly, the handwritten rules are difficult to develop and maintain. The big advantage of the data-driven method is that so much knowledge is encoded in the data, and new knowledge can be added just by collecting more data. But another advantage is that, while the data can be massive, the code is succinct—about 50 lines for correct, compared to over 1,500 for ht://Dig’s spelling code. ...
Another issue is portability. If we wanted a Latvian spelling-corrector, the English metaphone rules would be of little use. To port the data-driven correct algorithm to another language, all we need is a large corpus of Latvian; the code remains unchanged.
He shows this concretely with code in Python using a dataset collected at Google. Besides spelling correction, there's code to segment words and to decipher cryptograms -- in just a couple pages, again, where Grady Booch's book spent dozens without even finishing it.
"The Unreasonable Effectiveness of Data" develops the same theme more broadly, without all the nuts and bolts.
I've taken this approach in my work for another search company and I think it's still underexploited compared to table-driven/DSL programming, because most of us weren't swimming in data so much until the last decade or two.
In languages in which code can be treated as data it is a non-issue. You use what's clear, brief, and maintainable, leaning towards data, code, functional, OO, or procedural, as the solution requires.
In procedural code, the distinction is marked, and we tend to think about data as something stored in a specific way, but even in procedural code it is best to hide the data behind an API, or behind an object in OO.
A lookup(avalue) can be reimplemented in many different ways during its lifetime, as long as it starts as a function.
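For example, a minimal sketch of that idea: as long as callers go through lookup(), the representation behind it is free to change.

PRICES = {"apple": 1.0, "pear": 1.5}

def lookup(item):
    # Today a dict in memory; later this could read a file or query a
    # database without any caller changing.
    return PRICES[item]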
...All the time I design programs for nonexisting machines and add: 'if we now had a machine comprising the primitives here assumed, then the job is done.'
... In actual practice, of course, this ideal machine will turn out not to exist, so our next task --structurally similar to the original one-- is to program the simulation of the "upper" machine... But this bunch of programs is written for a machine that in all probability will not exist, so our next job will be to simulate it in terms of programs for a next lower level machine, etc., until finally we have a program that can be executed by our hardware...
E. W. Dijkstra in Notes on Structured Programming, 1969, as quoted by John Allen, in Anatomy of Lisp, 1978.
When I think of this philosophy which I agree with quite a bit, the first thing that comes to mind is code efficiency.
When I'm writing code I know for sure it isn't anything close to perfect, or even fully informed. Knowing enough to get close to maximum efficiency out of a machine when it is needed, and good efficiency the rest of the time (perhaps trading off for better workflow), has allowed me to produce high quality finished products.
Coding in a data driven way, you end up using code for what code is for. To go and 'outsource' every variable to files would be foolishly extreme, the functionality of a program needs to be in the program and the content, settings and other factors can be managed by the program.
This also allows for much more dynamic applications and new features.
If you have even a simple form of database, you are able to apply the same functionality to many states. You may also do all manner of creative things like changing the context of what your program is doing based on file header data or perhaps directory, file name or extension, though not all data is necessarily stored on a filesystem.
Finally keeping your code in a state where it is simply handling data puts you in a state of mind where you are closer to envisioning what is actually going on. This also keeps the bulk out of your code, greatly reducing bloatware.
I believe it makes code more maintainable, more flexible and more efficient aaaand I like it.
Thank you to the others for your input on this as well! I found it very encouraging.

Proving the ROI of a technology?

How does one prove the ROI of a technology to their manager?
The closest thing I have found to a document on how to do this is:
http://www.agilejournal.com/pdf/Finding-ROI-in-Build-Automation.pdf
There are formulas in this document, but I can't really tell if they are just a lot of marketing or if they are accurate formulas for calculating ROI.
I'm not really trying to calculate the ROI of the build tool in the above paper; I was just trying to calculate the ROI of a simple build tool like Ant.
They don't cut to the meat of the question: the intangible benefits - though they at least try to walk through an example. The formulas are just to get ROI as a nice percentage - if "using build tools" was a stock, how much return would I get on my investment?
Which already shows that the question itself is flawed: An automated build is mainly an instrument to improve quality; improving productivity is usually a secondary concern.
However, this doesn't help when talking to the guys sitting on the money.
Metrics I would use to analyse the effect of a build tool:
Turnaround time from checkin to final media
Number of builds (for testing, for release, ..)
Number of builds requested (with faster builds, you can expect an increase in demand)
Number of errors introduced during manual build (assuming you track those)
Number of developers able to publish a release
Estimated resources (time, licences, build server, ..) for implementation and maintenance
Analysis of low-probability, high risk scenarios
Often, an automated build tool pays for itself just by removing a bottleneck: every developer can publish the software, not just John the Builder.
The last point is important (but hardest to give numbers for), as the total cost of bugs doesn't have a normal distribution, but is highly "pareto": a single bug can give you some nasty press, or make key customers switch to competition.
The core argument for maintaining an automated build is that bugs introduced during publishing are mostly avoidable.
I can't imagine that there would be any sensible way to accurately measure ROI on developer tools and practices. The only place I can think of where that might be possible would be in factory environments, where you might be able to measure productivity and average quality.
I'd suggest doing what everyone else does, which is pick some formulas that will support what you want and then tweak them until the ROI is high enough to justify the investment.
Put the estimate in hours:
Estimate how much time you currently spend, and how much time you'd spend
Put the estimate in customer complaints:
Estimate your current number of bugs. Estimate how many of those bugs the new system would have caught. Find out the percentage of bugs reported by users, and document how many fewer bugs will be user visible.
Add to the hours:
Figure out how long it would take to fix the bugs that would be caught, and tack that onto the hourly estimate.
Add a non-quantifiable "salability".
With the extra time, we build extra features. With fewer bugs, that's fewer demos where sales guys shoot themselves in the foot. How many extra copies of software can we sell if we do this?
The last bit won't be successful; it's there primarily to focus attention on the first two metrics; hours and customer-visible defects.
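If you do need to turn those hours into the percentage managers expect, the arithmetic is just ROI = (benefit - cost) / cost. A back-of-the-envelope sketch, with entirely made-up placeholder numbers:

# All figures are illustrative placeholders, not real data.
hourly_rate = 75.0                 # loaded cost per developer hour
build_hours_saved_per_month = 40   # manual build/deploy effort removed
bugfix_hours_saved_per_month = 10  # rework avoided by repeatable builds
setup_hours = 80                   # one-off cost to script the build
maintenance_hours_per_month = 4
months = 12

benefit = (build_hours_saved_per_month + bugfix_hours_saved_per_month) * months * hourly_rate
cost = (setup_hours + maintenance_hours_per_month * months) * hourly_rate
roi = (benefit - cost) / cost
print(f"ROI over {months} months: {roi:.0%}")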
My advice is to discredit what's there now then offer the alternative.
You could try putting the emphasis on the problems of how you do builds and deployment now, and win that battle first. Managers don't want to change something that isn't causing them grief, so you need to prove it is going to be a problem if nothing's done.
You should consider: how much time and credibility is lost with bad builds; how many mistakes are made currently; how much manual repetition takes place; etc., and try to attach metrics or examples to these things.
If you can win the support of your developers then you can also add their approval to the strength of your argument. Another point to make is that good developers like to work with good tools so progressive Management equals motivated developers.
If you win the hearts-and-minds of the developers and your Manager that may mean more than a piece of paper with some numbers on it.

Software development metrics and reporting [closed]

I've had some interesting conversations recently about software development metrics, in particular how they can be used in a reasonably large organisation to help development teams work better. I know there have been Stack Overflow questions about which metrics are good to use - like this one, but my question is more about which metrics are useful to which stakeholders, and at what level of aggregation.
As an example, my view is that code coverage is a useful metric in the following ways (and maybe others):
For a team's own internal use when combined with other measurements.
For facilitating/enabling/mentoring teams, where it might be instructive when considered on a team-by-team basis as a trend (e.g. if team A and B have coverage this month of 75 and 50, I'd be more concerned with team A than B if the previous month they'd had 80 and 40).
For senior management when presented as an aggregated statistic across a number of teams or a whole department.
But I don't think it's useful for senior management to see this on a team-by-team basis, as this encourages artificial attempts to bolster coverage with tests that merely exercise, rather than test, code.
I'm in an organisation with a couple of levels in its management hierarchy, but where the vast majority of managers are technically minded and able (with many still getting their hands dirty). Some of the development teams are leading the way in driving towards agile development practices, but others lag, and there is now a serious mandate from the top for this to be the way the organisation works. A couple of us are starting a programme to encourage this. In this sort of an organisation, what sort of metrics do you think are useful, to whom, why, and at what level of aggregation?
I don't want people to feel their performance is being assessed based on a metric that they can artificially influence; at the same time, the senior management are going to want some sort of evidence that progress is being made. What advice or caveats can you provide based on experience in your own organisations?
EDIT
We are definitely wanting to use metrics as a tool for organisational improvement not as a tool for individual performance measurement.
A tale from personal experience. Apologies for the length.
A few years ago our development group tried setting "proper" measurable objectives for individuals and team leaders. The experiment lasted for just one year, because hard metrics didn't really work very well for individual objectives (see my question on the subject for some links and further discussion).
Note that I was a team leader, and involved in planning it all with my technical boss and the other team leaders, so the objectives weren't something dictated from on high by clueless upper management -- at the time we really wanted them to work. It is also worth noting that the bonus structure inadvertently encouraged competition between developers. Here are my observations on the things we tried.
Customer-visible issues
In our case, we counted outages on the service we provided to customers. In a shrink-wrapped product it might be the number of bugs reported by customers.
Advantages: This was the only real measure that was visible to upper management. It was also the most objective, being measured outside the development group.
Disadvantages: There weren't that many outages -- just around one per developer for the whole year -- which meant that failing or exceeding the objective was a matter of "pinning blame" for the few outages that did occur in each team. This led to bad feeling and loss of morale.
Amount of work completed
Advantages: This was the only positive measure. Everything else was "we notice when bad things happen," which was demoralising. Its inclusion was also necessary because, without it, a developer who did nothing all year would exceed all the other objectives, which clearly wouldn't be in the interests of the company. Measuring the amount of work completed checked the natural optimism of developers when estimating task size, which was useful.
Disadvantages: The measure of "work completed" was based on estimates provided by the developers themselves (usually a good thing), but making it part of their objectives encouraged gaming of the system to inflate estimates. We had no other viable measure of work completed: I think the only possible valuable way of measuring productivity is "impact on the company bottom line," but most developers are so far removed from direct sales that this is rarely practical at an individual level.
Defects found in new production code
We measured defects introduced into new production code during the year, as it was felt that bugs from previous years should not count against any individual in this year's objectives. Defects spotted by internal quality teams were included in the count even if they didn't impact customers.
Advantages: Surprisingly few. The time lag between the introduction of a defect and its discovery meant that there was really no immediate feedback mechanism to improve code quality. Macro trends at a team level were more useful.
Disadvantages: There was a heavy focus on the negative, since this objective was only invoked when a defect was found and we needed someone to blame for it. Developers were reluctant to record defects they found themselves, and a simple count meant that minor bugs were as bad as severe problems. Since the number of defects per individual was still quite low, the number of minor and severe defects didn't even out as it might with a larger sample. Old defects were not included, so the group's reputation for code quality (based on all bugs found) did not always match the measurable introduced-this-year count.
Timeliness of project delivery
We measured timeliness as the percentage of work delivered to internal QA teams by the stated deadline.
Advantages: Unlike counting defects, this was a measure that was under immediate, direct control of the developers, as they effectively decided when the work was complete. The presence of the objective focused the mind on completing tasks. This helped the team commit to realistic amounts of work, and improved the perception by internal customers of the development group's ability to deliver on promises.
Disadvantages: As the only objective directly under the developers' control, it was maximised at the expense of code quality: on the day of a deadline, given the choice between saying a task is complete or doing further testing to improve confidence in its quality, the developer would choose to mark it complete and hope any resulting bugs never come to the surface.
Complaints from internal customers
To gauge how well developers communicated with internal customers during development and subsequent support of their software, we decided that the number of complaints received about each individual would be recorded. The complaints would be validated by the manager, to avoid any possible vindictiveness.
Advantages: Really nothing I can recall. Measured at a sufficiently large group level it becomes a more useful "customer satisfaction" score.
Disadvantages: Not only highly negative, but also a subjective measure. As with other objectives, the numbers for each individual were around the zero mark, which meant that a single comment about someone could mean the difference between "infinitely exceeded" and "did not meet".
General comments
Bureaucracy: While our task management tools held much of the data for these metrics, there was still quite a lot of manual effort involved to collate it all. The time spent obtaining all the numbers was not enjoyable, generally focused on negative aspects of our work and may not even have been reclaimed by increased productivity.
Morale: For the measures where individuals were blamed for problems, not only did those with "bad" scores feel demotivated, but so did those with "good" scores, as they didn't like the loss in team morale and sometimes felt they were ranked higher not because they were better but because they were luckier.
Summary
So what did we learn from the episode? In later years we tried to re-use some of the ideas but in a "softer" way, where there was less emphasis on individual blame and more on team improvement.
It is impossible to define objectives for individual developers that are objectively measurable, add value to the company and cannot be gamed, so don't bother to try.
Customer issues and defects can be counted at a wider team level, if the location of the defect is unequivocally the responsibility of that team -- that is, you don't ever have to play the "blame game".
Once you measure defects only at the level of responsibility for a code module, you can (and should) measure old bugs as well as new ones, since it is in that group's interest to eliminate all defects.
Measuring defect counts at a group level increases the sample size per group, and so anomalies between minor and severe defects are smoothed out and a simple "number of bugs" measure can mean something, such as to see if you are improving month-on-month.
Include something that upper management care about, because keeping them happy is your primary purpose as a development group. In our case it was customer-visible outages, so even if the measure is sometimes arbitrary or seemingly unfair, if it's what the bosses are measuring then you need to take notice too.
Upper management don't need to see metrics they don't have in their own objectives; that way you avoid the temptation to blame individuals for errors.
Measuring timeliness of project delivery did change developer behaviour and put a focus on completing tasks. It improved estimation and allowed the group to make realistic promises. If it were easy to collect the timeliness information then I would consider using it again at a team level to measure improvement over time.
All of this doesn't help when you are required to set measurable objectives for individual developers, but hopefully the ideas will be more useful for team improvement.
The key thing about metrics is knowing what you are using them for. Are you using them as a tool for improvement, a tool for reward, a tool for punishment, etc. It sounds like you're planning to use them as a tool for improvement.
The number one principle when setting metrics is to keep the information relevant so that the person receiving it can use it to make a decision. Most likely a senior manager cannot dictate the micro level of whether you need more tests, less complexity, etc. But a team leader can do that.
Therefore, I don't believe a measure of code coverage is going to be useful to management beyond the individual team. At the macro level, the organisation is probably interested in:
Cost of delivery
Timeliness of delivery
Scope of delivery & external quality
Internal quality won't be high on their list of things to cover off. It's a development team's mission to make it clear that internal quality (maintainability, test coverage, self-documenting code, etc) is a key factor in achieving the other three.
Therefore you should target metrics to more senior managers which cover off those three such as:
Overall Velocity (note that comparing velocity between teams is often artificial)
Expected vs Actual scope delivered to agreed timelines
Number of production defects (possibly per capita)
And measure things like code coverage, code complexity, cut 'n' paste score (code repetition using flay or similar), method length, etc at a team level where the recipients of the information can really make a difference.
A metric is a way of answering a question about a project, team or company. Before you start looking for the answers, you need to decide what questions you want to ask.
Typical questions include:
what is the quality of our code?
is the quality improving or degrading over time?
how productive is the team? Is it improving or degrading?
how effective is our testing?
...and so on.
Each question will require a different set of metrics to answer. Collecting metrics without knowing what questions you want answered is at best a waste of time and at worst counterproductive.
You also need to be aware that there is an 'uncertainty principle' at work - unless you are very careful the act of collecting metrics will change people's behaviour, often in unexpected and sometimes detrimental ways. This is especially so if people believe they are being evaluated on the metrics, or worse still have the metrics tied to some reward or punishment scheme.
I recommend reading Gerald Weinberg's Quality Software Management Vol 2: First Order Measurement. He goes into a lot of detail on software metrics, but says the most important are often what he calls "Zero Order Measurement" - asking people their opinion on how a project is going. All four volumes in the series are expensive and hard to get hold of, but well worth it.
Software writing
What must be optimised?
CPU(s) use, memory(s) use, memory cache(s) use, user time use, code size at run-time, data size at run-time, graphics performance, file access performance, network access performance, bandwidth use, code conciseness and readability, electricity use, (count of) distinct API calls used, (count of) distinct methods and algorithms used, maybe more.
How much must it be optimised?
It must be optimised the minimum reasonable amount (except in areas where surpassing acceptance test criteria is desirable) required to pass acceptance tests, facilitate maintenance, facilitate audit and meet user requirements.
("... for legal/illegal input test data and legal/illegal test events in all test states at all required test data volumes and test request volumes for all current and future test integration scenarios.")
Why the minimum reasonable amount?
Because optimised code is harder to write and so costs more.
What leadership is required?
Coding standards, basic structure, acceptance criteria and guidance on levels of optimisation required.
How can success of software writing be measured?
Cost
Time
Acceptance test passes
Extent to which acceptance tests it is desirable to surpass are surpassed
User approval
Ease of maintenance
Ease of audit
Degree of absence of over-optimisation
What cost/time should be ignored in assessing aggregate performance of programmers?
Wasted cost/time incurred because of requirements (inc architecture) changes
Extra cost/time incurred because of deficiencies in platforms/tools
But this cost/time should be included in assessing aggregate performance of teams (inc architects, managers).
How can success of architects be measured?
Other measures plus:
Instances of "avoiding early" being affected by deficiencies in platforms/tools
Degree of absence of changes in architecture
As I said in What is the fascination with code metrics?, metrics include:
different populations, meaning the scope of interest is not the same for a developer as for a manager
trends, meaning any metric is meaningless in itself without its associated trend, which is what lets you decide whether to act on it or ignore it.
We are using a tool able to provide:
lots of micro-level metrics (interesting for developers), with trends.
lots of rules with multi-level (UI, Data, Code) static analysis capabilities
lots of aggregation rules (those vast numbers of metrics are condensed into several domains of interest, appropriate for higher-level audiences)
The result is an analysis which can be drilled-down, from high level aggregation domains (security, architecture, practices, documentation, ...) all the way down to some line of code.
The current feedback is:
project managers can get defensive very quickly when some rules are not respected and make their global score significantly lower.
Each study has to be re-tailored to respect each project quirks.
The benefit is the definition of a contract where exceptions are acknowledged but rules to be respected are defined.
higher levels (IT department, stakeholders) use the global scores just as one element of their evaluation of the progress made.
They will actually look more closely at other elements based on delivery cycles: how often are we able to iterate and put an application into production? how many errors did we have to solve before that release (in terms of merges, or of a pre-production environment not correctly set up)? what immediate feedback is generated by a new release of an application?
So:
which metrics are useful to which stakeholders, and at what level of aggregation
At high level:
the (static analysis) metrics are actually the result of low-level metric aggregations, and organized by domains.
Other metrics (more "operational-oriented", based on the release cycle of the application, and not just on the static analysis of the code) are taken into account
The actual ROI is achieved through other actions (like six-sigma studies)
At lower level:
the static analysis is enough (but has to encompass multi-tier applications, sometimes with multi-language development)
the actions are driven by the trends and their importance
the study has to be approved/supported by all levels of hierarchy to be accepted/acted upon (in particular, budget for the ensuing refactoring has to be validated)
If you have some Lean background/knowledge, then I would suggest the system that Mary Poppendieck recommends (that I've already mentioned in this previous answer). This system is based on three holistic measurements that must be taken as a package:
Cycle time:
  from product concept to first release, or
  from feature request to feature deployment, or
  from bug detection to resolution
Business Case Realization (without this, everything else is irrelevant):
  P&L, or
  ROI, or
  goal of investment
Customer Satisfaction:
  e.g. Net Promoter Score
The aggregation level is product/project level and I believe that these metrics are helpful for everybody (developers should never forget that they don't write code for fun, they write code to create value and should always keep that in mind).
Teams may (and actually do) use technical metrics to measure conformance to quality standards, which are integrated into the Definition of Done (as "no increase of the technical debt"). But high quality is not an end in itself; it's just a means to achieve a short cycle time (to be a fast company), which is the real target (along with Business Case Realization and Customer Satisfaction).
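As a small, hypothetical illustration of the first of those measurements, cycle time is just the elapsed time between two recorded events per work item:

from datetime import date

# Hypothetical work items: (feature requested, feature deployed)
features = [
    (date(2024, 1, 3), date(2024, 1, 17)),
    (date(2024, 1, 10), date(2024, 2, 2)),
    (date(2024, 2, 1), date(2024, 2, 12)),
]

cycle_times = [(deployed - requested).days for requested, deployed in features]
print(f"average cycle time: {sum(cycle_times) / len(cycle_times):.1f} days")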
This is a bit of a side note to the main question, but I had a very similar experience to Paul Stephenson's answer above. One thing I would add to that is about collection of data and visibility of metrics.
In our case, the development director was meant to collate a bunch of data from various disparate systems and distribute individual metric results once a month. This often didn't happen, as it was a time-consuming job and he was a busy man.
The results of this were:
Unhappy developers, as performance bonuses were based on metrics and people didn't know how they were getting on.
Some time-consuming multiple entry of data into various different systems.
If you are going down this route, you need to be sure that all metric data can be collated automatically and is easily visible to those it affects.
One of the interesting approaches that's currently getting some hype is Kanban. It's fairly Agile. What's particularly interesting is that it permits a metric of "work done" to be applied. I haven't used/encountered this in actual practice yet, but I'd like to work towards getting a kanban-ish flow going at my job.
Interestingly, I just finished reading Peopleware, and the authors strongly discourage making individual metrics visible to superiors (even direct managers), but say that aggregate metrics should be very visible.
As far as code specific metrics I think it's good for a team to know the state of the code at the current time, and to know the trends affecting the code as it matures and grows.
The question is obviously not focussed on .NET, but I think the .NET product NDepend has done a lot of work to define and document common metrics that are useful.
The documentation section on metrics is educational reading, even if you're not doing .NET.
Software metrics have been with us for a long time and, as best I can tell, nothing to date has emerged, individually or in aggregate, that is capable of guiding projects during development. The nut of the problem is that we want to use objective measures, and these can only measure what has happened, not what is happening or about to happen.
By the time we have measured, analyzed and interpreted some series of metrics, we are reacting to things that have already gone wrong, or very occasionally, gone right. I don't want to underplay the importance of learning from objective metrics, but I do want to point out that this is a reactive, not a proactive, response.
Developing a "confidence index" may be a better way of monitoring whether a project is on-track or headed for trouble. Try developing a voting system where a reasonable number of representatives from each project area of interest are asked to anonymously vote their confidence from time to time. Confidence is voted in two areas:
1) Things are on-track.
2) Things will continue to be on-track or get back on-track.
These are purely subjective measurements from people closest to the "action".
Feed the results into a Kanban-type chart where the columns represent the voting areas, and you should have a pretty good idea where to focus your attention. Use question 1 to evaluate whether management reacted to the previous voting cycle appropriately. Use question 2 to identify where management should focus next.
This idea is based on each of us having a comfort level within our own area of responsibility. Our confidence level is a product of experience, knowledge within our domain of expertise, the number and severity of problems we are facing, the amount of time we have to accomplish our tasks, the quality of the information we are working with, and a whole bunch of other factors.
MBWA (Management By Walking Around) is often touted as one of the most effective tools we have - this is a variation of it.
This technique is not much use at the level of individual teams because it only reflects the general mood of the team. Kind of like using someone’s watch to tell them the time. However, at higher levels of management it should be quite informative.

What is the most critical piece of code you have written and how did you approach it?

Put another way: what code have you written that cannot fail? I'm interested in hearing from those who have worked on projects dealing with heart monitors, water testing, economic fundamentals, missile trajectories, or the O2 concentration on the space shuttle.
How did you prepare for writing this sort of code: methodologically, intellectually, and emotionally?
Edit
I've marked this wiki in case the rep issue is keeping people from replying. I thought there would be a good deal more perspective on this issue than there has been.
While I am not personally involved in what is described there, this article will hopefully contribute to the spirit of your question: They Write the Right Stuff.
I wrote a driver for a blood pressure measuring device for hospital use. If it "fails", the patient will not have his blood pressure checked at the scheduled time; if his blood pressure is abnormal, no alarm (in the larger system) will be triggered. Such an event could be clinically significant.
My approach was to thoroughly read the spec/documentation in a non-work environment (to avoid the temptation to start coding right away), then read it again at work. After that, I summarized the possible states and actions on paper and "flowcharted" an algorithm, and annotated all the potential real-world "bad events" (cables getting unplugged, batteries dying, etc). Finally, I wrote and rewrote the driver three times, each with different mechanisms (e.g. FSM), and compared their results. Each iteration helped me identify weaknesses I hadn't yet discovered. The third rewrite was the "official" result. I reviewed each iteration with my co-worker.
Emotional preparation consisted of convincing myself that should the unthinkable happen, at least I wasn't willfully negligent -- just incompetent (the old "I'm only human" excuse). ;-)
I have written a computer interface to an MRI machine. It had no chance of hurting the end user, as it was just record management, but it could potentially have given an incorrect diagnosis or omitted important information.
Tests, lots and lots of tests.
Unit tests, mid and high level tests. Simulate all possible input combinations. Also a great deal of testing with the hardware itself. Testing must be done in a complete and methodical way. It should take a great deal more time to test than to write.
Error Reporting
All errors must be reported and be obvious. If it won't hurt the patient to do so, fail fast.
For something that is actively keeping a person alive things are even worse. It must never stop working. If it fails it needs to restart and keep trying. Redundant internals are also a must in case the hardware fails.
At the wrong company it can really be a difficult kind of situation to work in. However, if things are going well, you are well funded and release pressure is not high, it can be a very rewarding space to work in.
Not really an answer, but:
I've got a friend who writes embedded control software for laser eye surgery machines. When he had laser eye surgery himself, he made sure to go to an ophthalmologist who used his company's system. I have great admiration for this guy. I can't think of a piece of software I've ever written whose level of quality was high enough that I'd trust my own eyesight to it.
Right now I'm working on some base code for a system that retrieves medical patient information from clinics and hospitals for a medical billing office. We're starting out with a smaller client and a long break-in period to ensure quality, but eventually this code needs to securely handle a large variety of report formats from a number of clients at different facilities.
It's not quite in the same scale as your examples, but a bad mistake could result in the wrong people being billed or the right person billed to a defunct address (screwing up credit reports) or open people up to identity theft, so it's still pretty critical. Oh yeah, and it could mean doctors don't get paid quite as quick. That's important, too, especially from a business perspective, but not in the same class as data protection and integrity.
I've heard crazy stories of the processes used to write code at NASA for the space shuttles. Every line of code has about 10-20 lines of documentation, along with tests, full revision history, etc. Every time a bug is found, not only is the code evaluated and repaired, but the entire procedure of writing code, the entire command chain, etc., is reviewed to answer the question: "What went wrong in our process that allowed this bug to get included in the first place?"
While nothing quite so important as an MRI machine or a blood pressure monitor, I did get tapped to do a rewrite of Blackjack when I worked for an online gambling provider. Blackjack is by far the most popular online game, and millions of dollars was going to go through this software (and did).
I wrote the game engine separate from the server and the client, and used Test Driven Development to ensure that what I was assuming was coming through in the results. I also had a wrapper "server" that had console output that would allow me to play. This was actually only useful in that it mimicked the real server interface, since playing a text version of blackjack isn't very fun or easy ("You draw a 10. You now have a 10 and a 6, while the dealer has a 6 showing. [bsd] >")
The game is still being run on some sites out there, and to my knowledge, has never had any financial bugs after years of play.
My first "real" software job was writing a GUI app for planning stereotactic brain surgery. Testing, testing, testing... absolutely no formal methods, engineering-style thoughts, just younger programmers cranking it out. When they started talking about using the software to control a robotic arm with a laser, without any serious engineering methods in place, i got a bit worried, left for more officey lands.
I've created an information system application for the local government culture and tourism department on the island of Bali, which was installed at several tourism destinations, providing extensive information about the culture, maps, accommodations, etc.
If it failed, tourists probably couldn't get the information they need most, could be cheated by brokers, or get lost somewhere :)
