
Intro

In 1999, Martin Fowler introduced the idea of “code smells”, which he defined as “a surface indication that usually corresponds to a deeper problem in the system.”
Code smells are not bugs, but rather code structures that suggest poor design or shortcomings in readability and maintainability.

Examples of such code smells include duplicated code, dead code, or data clumps.

Later, in 2001, van Deursen et al. further expanded on the concept of code smells and suggested that unit-test code should have its own set of smells.

For example: Resource optimism, Fire and forget, and Assertion roulette.

In this article, I’ll extend the idea of code smells and test smells to the realm of performance testing and load testing.

Smells

Smells allow us to give names to known bad structures.
This way, we can identify and discuss these problems at any stage of the development process.
Naming smells eases communication between developers and provides a framework in which these problems can be fixed. In recent years, with the advancements of artificial intelligence and machine learning, it has even become possible to create “smell detectors” that locate such anti-patterns automatically.

With performance testing, I think we can recognize three areas of concern:

Design

These smells take place in the design phase of the test, meaning the scripting or the creation of the testing environment.

A failure in the design phase could lead to unrealistic test scenarios that do not reflect real-life situations and are therefore unreliable.

1 – Throughput trap

It’s very tempting to think that throughput is the only contributing factor in a test script, for example, that 1000 RPS generated by 50 virtual users has the same effect as 1000 RPS generated by 1000 virtual users.

However, this is not necessarily the case.

Modern servers may use caching to optimize performance.
They might also use queuing mechanisms such as Kafka or Amazon SQS for asynchronous events, and the names of those queues might be unique to each user identity.

This can influence performance.

Make sure that the throughput is not only correct in total but also distributed correctly among the virtual users, and make sure each virtual user has a unique identity.
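
To make this concrete, here is a minimal sketch using the open-source Locust library as a stand-in for whatever load testing tool you use. The /orders path, the X-User-Id header, and the host used below are invented for the example; the point is simply that each virtual user carries its own identity.

```python
# A minimal Locust (https://locust.io) sketch, used only as a stand-in for your own tool.
# Each virtual user gets its own identity, so per-user caches and per-user queues are
# exercised the way real traffic would exercise them.
# The /orders path and the X-User-Id header are invented for this example.
import uuid

from locust import HttpUser, task, between


class ShopUser(HttpUser):
    # Pacing: together with the number of users, this determines the total throughput.
    wait_time = between(1, 2)

    def on_start(self):
        # A unique identity per virtual user instead of one shared test account.
        self.user_id = str(uuid.uuid4())

    @task
    def browse_orders(self):
        self.client.get("/orders", headers={"X-User-Id": self.user_id})
```

The target throughput is then reached by tuning the number of virtual users and their pacing (for example, locust -f throughput_sketch.py --host https://your-app.example.test --users 1000 --spawn-rate 50), rather than by pushing all the load through a handful of shared accounts.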

2 – Utopian data storage

This smell appears when the data in our load testing environment is “too neatly organized”.
In the real world, our data storage typically goes through migrations, restructuring, maintenance work, restores from backups, and so on.

To have a realistic load testing scenario, our data storage must resemble production as closely as possible.

By taking sampled data from your current production environment, you can extrapolate what future data storage might look like.

Remember that, just like your workload models, your data storage models are under constant revision.
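
As a rough sketch of what such an extrapolation can look like, the snippet below scales a sampled production extract up to a projected future volume while preserving its real-world distribution. The file name and the scale factor are assumptions for the example.

```python
# A hedged sketch: scale a sampled production extract up to a projected future volume
# while preserving its messy, real-world distribution, instead of seeding the test
# environment with hand-crafted, perfectly tidy data.
# "production_sample.csv" and the scale factor are assumptions for this example.
import csv
import random

SCALE_FACTOR = 20 * 1.3  # e.g. a 5% production sample, grown by an assumed 30% for next year

with open("production_sample.csv", newline="") as f:
    sample_rows = list(csv.DictReader(f))

target_size = int(len(sample_rows) * SCALE_FACTOR)
# Draw additional rows from the sample itself so skew, duplicates and odd values survive.
extra_rows = [random.choice(sample_rows) for _ in range(target_size - len(sample_rows))]
seed_rows = sample_rows + extra_rows

print(f"seeding the test data storage with {len(seed_rows)} rows")
```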

3 – Perfect timing

Your tests are short and usually executed at the same time of day.

Some systems have scheduled tasks or other time-of-day influences; if you work from 09:00 to 18:00 and run your tests only during those hours, you may miss a big portion of what happens at other times of the day.

Make sure to run longitudinal tests and vary when they start, as sketched below.
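
A small sketch of what that can look like in practice: instead of always launching tests during office hours, scatter the start times across the whole day. The snippet only prints a schedule, and the six-runs-per-day choice is an assumption; wiring the schedule into cron, a CI pipeline, or your tool's scheduler is up to you.

```python
# A rough sketch: scatter load test start times across the whole day instead of only
# running during office hours. It only prints a schedule; feeding it into cron, a CI
# pipeline, or your tool's scheduler is up to you. Six runs per day is an assumption.
import random
from datetime import datetime, timedelta

midnight = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)

for hour in sorted(random.sample(range(24), k=6)):
    start = midnight + timedelta(days=1, hours=hour, minutes=random.randint(0, 59))
    print(f"scheduled load test run at {start:%Y-%m-%d %H:%M}")
```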

Analysis

Typically, after we execute a performance test, we analyze its results.
As opposed to our regular “functional” tests, wherein we rely on assertions to validate software behavior, in performance tests we rely on metrics: measurable outcomes expressed either as aggregated data (for example, a histogram plot) or as a time series.

Analyzing metrics requires some knowledge of statistics, as well as caution and attention to detail.

1 – Means trap

If I sit in a room together with Bill Gates, on average we are both billionaires…

Averages are important, but they are insufficient for analyzing data.

You must also look at other measures of central tendency, such as the median or the mode.
It is also important to evaluate the dispersion of the data by looking at the 90th, 95th, or even the 99th percentile.
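
Here is a toy illustration of the gap between the mean and the tail; the response times are invented for the example.

```python
# Toy illustration of the gap between the mean and the tail.
# The response times are invented: 96 fast requests and 4 very slow ones.
import statistics

response_times_ms = [100] * 96 + [3000] * 4

mean = statistics.mean(response_times_ms)
median = statistics.median(response_times_ms)
cuts = statistics.quantiles(response_times_ms, n=100)  # 99 percentile cut points
p90, p95, p99 = cuts[89], cuts[94], cuts[98]

print(f"mean={mean:.0f}ms median={median:.0f}ms p90={p90:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms")
```

With this invented data, the mean of roughly 216 milliseconds hides the fact that 4% of the requests took three seconds; the 99th percentile makes that visible.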

2 – Aggregation sucker

This refers to analyzing results based on aggregated data while ignoring time series analysis.

Aggregated values might look good overall; for example, you might find that 90% of the response times were under 1000 milliseconds.

However, when looking at the time series, you might find large clusters of higher latency in some parts of the test.
This can provide valuable information, and sometimes it follows a pattern (for example, higher latency on every round hour).
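
One simple way to spot such clusters, sketched below, is to bucket the raw samples per minute and look at each bucket's 95th percentile instead of only the test-wide aggregate. The sample values and their layout are invented for the example.

```python
# A hedged sketch: bucket raw (timestamp, latency) samples per minute and inspect each
# bucket's 95th percentile, so clusters of slow responses are not hidden by the
# test-wide aggregate. The samples list stands in for your tool's raw results export.
from collections import defaultdict
from statistics import quantiles

samples = [
    # (seconds since test start, response time in ms), invented values
    (12, 180), (75, 210), (140, 190), (3621, 2400), (3650, 2600), (3700, 250),
]

buckets = defaultdict(list)
for elapsed_s, latency_ms in samples:
    buckets[elapsed_s // 60].append(latency_ms)  # one bucket per minute of the test

for minute in sorted(buckets):
    values = buckets[minute]
    p95 = quantiles(values, n=100)[94] if len(values) > 1 else values[0]
    print(f"minute {minute:>4}: samples={len(values):>3} p95={p95:.0f}ms")
```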

Interpretation

After the test has been executed and the results have been analyzed, it’s time to make decisions based on those outcomes. If we give the analysis the wrong interpretation, it could lead to adverse effects.

1 – Absolutely fine

A single test execution might look good on its own; however, it should be compared with past executions.
If there’s a significant regression, or even an improvement, in performance that cannot be explained by code or architectural changes, it has to be accounted for in the analysis.
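
A lightweight way to enforce that comparison is to keep the key metrics of the previous accepted run as a baseline and flag any deviation beyond a tolerance. The sketch below assumes a baseline.json file, the metric names, and a 15% threshold purely for illustration.

```python
# A hedged sketch of comparing a run against its predecessor instead of judging it in
# isolation. "baseline.json", the metric names and the 15% tolerance are assumptions.
import json

TOLERANCE = 0.15  # flag changes larger than 15% in either direction

# Key figures from the current run; in practice they would come from your results.
current = {"p95_ms": 820, "error_rate": 0.004, "throughput_rps": 960}

with open("baseline.json") as f:
    baseline = json.load(f)  # the same metrics from the previous accepted run

for metric, value in current.items():
    previous = baseline[metric]
    change = (value - previous) / previous
    if abs(change) > TOLERANCE:
        print(f"{metric}: {previous} -> {value} ({change:+.0%}), explain this before signing off")
```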

2 – Negatively positive

A common misconception among testers is that a false positive (i.e., a bug is falsely detected while the system behavior is adequate) is preferable to a false negative (i.e., a bug is missed in the test execution).

The logic here is that false positives are nothing but a nuisance.
They are unpleasant to handle, but at least they are noticed.
On the other hand, false negatives are not noticed, since the tests tend to pass.

However, in the context of performance testing, false positives can lead to excessive and unnecessary use of resources, for example, over-provisioning infrastructure to fix a problem that does not actually exist.

So, as part of the final decision-making, we should consider the possibility of false positives.
Remember, your job is not only to keep your product performant, but also to keep it profitable!

3 – One-stop show

This one refers to relying on a single source of data for your analysis, for example, the results of your load testing tool alone.
These results should be cross-confirmed with other sources, such as your observability tools and even the application logs.
This provides greater confidence in your tests and lets you see the story from different angles.
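
As a rough sketch of such cross-confirmation, the snippet below compares the 95th percentile reported by the load testing tool with latencies parsed from application logs. The log format, the file name, and the reported value are assumptions for the example.

```python
# A hedged sketch of cross-confirming the load tool's report against the application
# logs. The "duration_ms=<value>" log format, the file name, and the reported p95 are
# all assumptions for this example; adapt them to your own logs and tooling.
import re
from statistics import quantiles

tool_p95_ms = 870  # the value your load testing tool reported for the same window

durations_ms = []
with open("app.log") as f:
    for line in f:
        match = re.search(r"duration_ms=(\d+)", line)
        if match:
            durations_ms.append(int(match.group(1)))

log_p95_ms = quantiles(durations_ms, n=100)[94]
gap = abs(log_p95_ms - tool_p95_ms) / tool_p95_ms
print(f"tool p95={tool_p95_ms}ms, log p95={log_p95_ms:.0f}ms, gap={gap:.0%}")
```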

Summary

Smells are a great practice for coding, and they are a great practice for testing as well, especially for performance testing, where careful examination of evidence and attention to detail are required.

I believe they can ease communication between testers and other IT personnel, such as backend developers, DBAs, DevOps engineers, and product managers.

Giving these common problems a name allows them to be addressed and provides a framework for improving performance testing efforts.


Do you want to get started with performance testing?

Try out our latest release of NeoLoad to quickly and easily start designing, executing, and analyzing your first protocol and browser-based tests at https://www.tricentis.com/software-testing-tool-trial-demo/neoload-trial


References

  1. Fowler M (1999) Refactoring: improving the design of existing code. Addison-Wesley Professional
  2. Van Deursen A, Moonen L, Van den Bergh A, Kok G (2001) Refactoring test code. Technical Report, CWI (Centre for Mathematics and Computer Science), NLD.

[This article was originally written by Eldad Uzman for TestProject]
