
I work in an organisation where we think some domains are doing OK: they practise CD and YBIYRI ("you build it, you run it"), with product teams delivering to production, in some instances, multiple times a day.

In other domains, they have not really even started on that journey. They run projects, not products, and follow a more siloed, phased, document-heavy way of working.

We have a mandate to help these other domains on their journey to CD.

We do, however, want to be sure we are safe to scale. We have traditionally only used trailing metrics (e.g. DORA), which tell us how we DID. We have a gut feel that as we scale we may need metrics that are more leading (e.g. measuring engineering signals like code complexity or, dare I say it, code coverage), which tell us how we ARE DOING.

We think we are aware of the usual gotchas in measurement and so our measurement principles include things like: don't measure individuals but teams, don't set arbitrary targets, don't use metrics to judge across teams but to help each team have better evidence-based discussions, etc.

Bit of a huge / unfair question, but what is your view of initiatives that try to introduce metrics? What pitfalls have you seen (other than the above), and have you seen these initiatives be successful and valuable?

Ultimately, one of our motivations is to be able to spot correlation patterns in our organisation between great outcomes (trailing metrics) and great engineering practices (leading metrics). Is this a fool's errand?

As you say, measurement is a complex thing. It is my view that most measurement looks at the wrong things: lines of code, individual productivity, test coverage, etc. are all bad measures.

Despite this, I think that if our aim is to be closer to “engineering” rather than craft, then a lot of the difference is in measurement, but we have to take a different view of what we mean by ‘measurement’.

The DORA metrics are lagging indicators, but they can form a useful ‘speedometer’ of where you are headed. The huge advantage of the DORA metrics is that they measure generic things: “Stability” measures the quality of our output, and “Throughput” measures the efficiency (speed) with which we can deliver software of that quality. They are good, general measures, but you can’t trade them off against each other; you can’t optimise only for “Throughput” and ignore “Stability”, which is a trap that lots of orgs fall into.

The other measures that matter, and that you need, are not generic; they are not even org- or team-specific, they are feature-specific. Think of these as analogous to a carpenter measuring a shelf: the measurements are only meaningful for that specific shelf.

I think that these take two forms...

  1. Automated tests that specify what the software is meant to do. Create these “measures” (tests) as specifications before you build the code (TDD) and run them all the time (CI) to see the state of the system (see the sketch after this list).
  2. Product-level measures of the success of the idea (things like “did this feature make more money?”, “did it recruit more users?”, “did it improve system uptime?”). Define these measures for each feature.
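As a small, hypothetical sketch of point 1, this is roughly what a test-as-specification might look like; the feature, names and shipping rule here are illustrative, not taken from the discussion above:

    # Hypothetical example: an executable specification for a "free shipping
    # over £50" rule, written first (TDD) and run on every commit (CI).
    # The tiny implementation is included only so the sketch is self-contained;
    # a failing test immediately tells you the code no longer meets its spec.

    def shipping_cost(order_total: float) -> float:
        return 0.0 if order_total > 50.00 else 4.99


    def test_orders_over_fifty_pounds_ship_free():
        assert shipping_cost(order_total=50.01) == 0.0


    def test_orders_at_or_below_fifty_pounds_pay_standard_shipping():
        assert shipping_cost(order_total=50.00) == 4.99

Run under something like pytest on every commit, tests like these act as a continuous "measurement" of whether the code still matches its specification.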

I quite like the SRE model of identifying a Service Level Indicator (SLI) that tells us what measure indicates success for each feature, as part of the specification of each feature.
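As a rough illustration of that idea (the names and thresholds are made up, not part of any real SRE tooling), a feature's success measure could be recorded alongside its specification like this:

    # Hypothetical sketch: recording a feature's success measure (an SLI plus
    # its target objective) as part of the feature's specification.
    from dataclasses import dataclass


    @dataclass(frozen=True)
    class FeatureSuccessMeasure:
        sli: str            # the indicator we will observe
        objective: float    # the target we aim to meet for that indicator
        window_days: int    # period over which the target is evaluated


    one_click_checkout = FeatureSuccessMeasure(
        sli="p99 checkout latency (ms)",
        objective=300.0,    # e.g. p99 under 300 ms...
        window_days=28,     # ...over a rolling 28-day window
    )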

Most people don’t think of tests as measures, but as part of a TDD process they certainly are: they are specifications of what the software is meant to do, and as soon as one of these tests is failing you know that your code doesn’t meet its specification - “your shelves don’t fit!”

