Easy: A key requirement for continuous performance testing

11 months ago
11 August 2023

paulsbruce

Things that are easy to do are easier to expect people to take up and make part of their process. Obviously. . . . But think about that for a minute in the context of all the DevOps tools and technologies. If you make it easy to do something right (e.g., provide templates, good-practice instructions, and informed process guardrails, as so many SREs do for product teams) and hard to do the wrong thing, you’re likely to improve software quality far more than by expecting someone who’s already too busy to become an expert in performance overnight.

Particularly for a topic like performance testing, “easier is likely better” goes not only for test scripting but also for automating processes and infrastructure requirements, and for consuming the results of testing. Many Tricentis NeoLoad customers are now providing their product teams “self-service” resources and training in order to scale the performance mindset beyond a small band of subject matter experts (SMEs). Those SMEs are then able to apply their expertise to PI planning, process automation, and DevOps teams who need more help than others.

The modern mantra: early, often, and easy

Feedback loops for performance are a critical part of modern continuous delivery practices. In a nutshell, late-stage performance fixes are too costly to do anyone good, so performance feedback has to arrive early in development cycles, and a few elements need to be in place for that to happen. Automation is key, but so is prioritizing which areas of the architecture to “shift left” performance testing into.

Prioritization is at the heart of what’s often missing from the statement “early and often.” That’s why I tend to add “easy” to the list, to remind us that if we have picked a path that includes high resistance, we’re likely to fail. By “easy,” I don’t necessarily mean the path that is perceived as least complex; I mean easy in the project management sense of low-hanging fruit, or obvious to business owners: in other words, “easy to agree upon, easy to see the importance of.” It’s a lot easier to make a case for performance testing systems in the critical path than services that are far removed from what others think is important.

Prioritize, then systematize

One book that every performance-minded engineer should read is Systems Performance: Enterprise and the Cloud by Brendan Gregg, in which he provides useful mental models such as Utilization-Saturation-Errors (USE) for system analysis. If we adapt this method to help us prioritize which systems under test (SuT) should get automated performance feedback loops, it would look something like this:

Utilization: Which core APIs or microservices are at the nexus of many other business processes, thereby ballooning their usage as a shared enterprise architecture component?
Saturation: Which services have experienced “overflow” (the need for additional VMs/costs to horizontally scale) or are constantly causing SEV-n incidents that require “all hands on deck” to get back into reasonable operational status?
Errors: From a perspective of “process error,” how many times has a specific service update or patch been held up by long-cycle performance reviews? For which systems do we need fast time-to-deploy, and where do slow feedback cycles cause the product teams to slip their delivery deadlines (immediate or planned)?

Often, examples that rank high on one or more of these vectors include user/consumer authentication, cart checkout, claims processing, etc.
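To make the adapted USE questions a little more concrete, here is a minimal Python sketch of how a team might rank candidate systems. Everything in it is an assumption for illustration: the `ServiceSignals` fields, the equal weights, and the sample numbers are hypothetical, and a real ranking would draw on your own incident, scaling, and release data.

```python
from dataclasses import dataclass


@dataclass
class ServiceSignals:
    """Hypothetical per-service counts gathered over some review period."""
    name: str
    dependent_processes: int  # Utilization: business processes that rely on it
    scaling_incidents: int    # Saturation: overflow events or SEV-n incidents
    delayed_releases: int     # Errors: releases held up by slow performance reviews


def use_score(signals, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three adapted USE signals (weights are an assumption)."""
    w_util, w_sat, w_err = weights
    return (w_util * signals.dependent_processes
            + w_sat * signals.scaling_incidents
            + w_err * signals.delayed_releases)


def prioritize(services, weights=(1.0, 1.0, 1.0)):
    """Return services ordered from highest to lowest score."""
    return sorted(services, key=lambda s: use_score(s, weights), reverse=True)


services = [
    ServiceSignals("newsletter", 1, 0, 0),
    ServiceSignals("checkout", 7, 5, 4),
    ServiceSignals("auth", 12, 3, 2),
]
ranked = [s.name for s in prioritize(services)]
print(ranked)
```

The point is less the arithmetic than the habit: writing the signals down makes the prioritization discussion explicit and repeatable.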

Other ways to view system performance, such as rate-errors-duration (RED) signals and business metrics in analytics platforms, help ensure that critical-path and revenue-generating user experiences are working as expected.
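As a rough illustration of what RED signals boil down to, the Python sketch below summarizes a batch of request samples into a rate, an error ratio, and a duration percentile. The `Sample` shape, the nearest-rank percentile method, and the choice of p95 are assumptions made for the example, not a prescribed format.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    """One observed request (hypothetical shape)."""
    duration_ms: float
    ok: bool


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def red_summary(samples, window_s):
    """Rate, error ratio, and p95 duration for samples seen in window_s seconds."""
    total = len(samples)
    if total == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": 0.0}
    errors = sum(1 for s in samples if not s.ok)
    return {
        "rate_per_s": total / window_s,
        "error_ratio": errors / total,
        "p95_ms": percentile([s.duration_ms for s in samples], 95),
    }


# Ten requests over five seconds, the slowest one failing.
samples = [Sample(duration_ms=100.0 * i, ok=(i != 10)) for i in range(1, 11)]
summary = red_summary(samples, window_s=5)
print(summary)
```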

Organizations that efficiently prioritize performance work look to these and other signals from both pre-production and in-production systems to see where to apply more of their efforts. Everyone now has some form of microservice architecture in play along with traditionally monolithic and shared systems, but not everyone realizes the benefits of independently testing and monitoring subsystems and distributed components in a prioritized and systematic fashion.

Easy scripting: why are APIs “easier”… and easier than what?

Modern APIs typically use standards such as HTTP and JSON to send and receive data. API teams often also have a manifest of the API in the form of a specification document, such as OpenAPI (formerly Swagger) or, for SOAP services, WSDL. Additionally, test data is usually more straightforward to inject since the payloads tend to be self-descriptive (i.e., field names and data match formats in examples). Finally, organizations with APIs frequently have functional test assets, such as Postman or REST Assured test suites, which provide a reference point for constructing performance test scripts.

Dealing with API endpoints and payloads is often far less complicated than dealing with complete traffic captures of end-to-end web applications, which often include dynamic scripts, static content, front-end-only API calls, and other client-side tokenization semantics. API endpoints described in specification docs make scripting and playing tests back a far simpler proposition than ever before, rendering them “easy” targets for early testing. Service descriptors also make it far easier to mock out APIs than entire web servers for end-to-end app tests . . . just look to the Mockiato project for examples of how it doesn’t take a genius or deep pockets to do service virtualization.

In NeoLoad, both OpenAPI and WSDL descriptors can be imported to quickly create load test paths, which can be further customized with dynamic data, text validations, and SLAs. NeoLoad also supports YAML-defined test details, which further simplifies “easy” load test suite generation. Once tests are defined, they can be executed as part of a CI orchestration or pipeline, producing test results and SLA indicators as “feedback” information in early development cycles.
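To show why a specification document is such a convenient starting point, here is a small Python sketch that walks the `paths` section of an OpenAPI document and lists the operations a load test could target. This is only an illustration of the general idea, not how NeoLoad's importer works; the inline spec and the `list_operations` helper are made up for the example.

```python
import json

# HTTP methods that may appear as operation keys under an OpenAPI path item.
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}

# A tiny, hypothetical OpenAPI document used only for this example.
SPEC_JSON = """
{
  "openapi": "3.0.0",
  "info": {"title": "Shop API", "version": "1.0.0"},
  "paths": {
    "/login": {
      "post": {"operationId": "login", "responses": {"200": {"description": "OK"}}}
    },
    "/cart": {
      "parameters": [],
      "get": {"operationId": "getCart", "responses": {"200": {"description": "OK"}}},
      "post": {"operationId": "addItem", "responses": {"201": {"description": "Created"}}}
    }
  }
}
"""


def list_operations(spec):
    """Return (METHOD, path, operationId) tuples, skipping non-operation keys."""
    operations = []
    for path, item in spec.get("paths", {}).items():
        for key, operation in item.items():
            if key.lower() in HTTP_METHODS:
                operations.append((key.upper(), path, operation.get("operationId", "")))
    return sorted(operations, key=lambda op: (op[1], op[0]))


operations = list_operations(json.loads(SPEC_JSON))
for method, path, operation_id in operations:
    print(method, path, operation_id)
```

Each tuple is a candidate request for a test path; dynamic data, validations, and SLAs still have to be layered on top.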

Easy pipelines: beyond the test script and into “codified process”

Everything is code now. Apps and services are code. Testing is code. Infrastructure is code (IaC). Networking is code. Security is code. Even compliance is code. And, yes, most if not all of these processes are also representable as code. Why has this happened?

Code is executable by a machine. It’s a representation of what you want to happen, whether you’re there to press the button to do it or not. It can be scheduled, triggered, and replayed. Code in a repository represents transparency, whether open source or closed source. Those that can read the code can understand what it’s doing and, when it’s not working, potentially see why. Code can be versioned, rolled back and forward, audited, and revised.

Test assets, testing infrastructure, and testing process, when represented as code, inherit all of these properties. Most of the coding world has unified around Git (in one flavor or another) as the default version control system for all code assets. This lets performance testing tools like NeoLoad, Keptn, and a million other technologies keep their artifacts in an approved and well-understood system of control.

Many Tricentis customers not only store their test suites in Git but also store their “performance pipeline” scripts in the same repos as their load test projects. Why? Because changes to test scripts, testing semantics, and test data sources usually require matching changes to the semantics of their process as code (a.k.a. the pipeline scripts). These changes can be made in a branch separate from the primary/master branch already approved and used by teams; that branch can then be run as the pipeline and proven to work before promotion back to the master/trunk. Just like test case promotion, testing process promotion is critical to keeping the delivery process running smoothly and on time.

“Performance pipelines” are also often separate, triggerable processes from broader delivery pipelines. Why? When you run a performance test and it fails, you don’t necessarily need to rerun the preceding build, package, and deploy steps; rather, operational configuration and deployment tweaks are often good enough to achieve optimal performance. After these changes are made, simply re-running the performance testing pipeline is a trivial task.

Figure: Easily import Swagger and any OpenAPI file to quickly get a NeoLoad performance test scenario that matches the API definitions.

Easy infrastructure

There is this “big rock” that load testing puts in the way of people who need reliable, statistically significant feedback on system or component performance. It is that the infrastructure you’re testing, the system under test (SuT), must be separate from the infrastructure used to put pressure on that SuT. With short, tiny local performance checks, using a single compute resource is fine.

With larger and longer tests, you have to break the work up across multiple compute resources. This is what we call “load generators.” Where you have more than one compute resource, you need a test orchestrator, which is what we call a “controller.” This is the same concept as CI systems and their build nodes, just for the purposes of providing performance data that is not biased by the SuT compute resources when under pressure.
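One small piece of what a controller does can be sketched in a few lines of Python: dividing the requested virtual users across the available load generators. The function name and the even-split policy are assumptions for illustration; real controllers also account for generator capacity, zones, and ramp-up.

```python
def split_users(total_users, generators):
    """Spread virtual users as evenly as possible across load generators."""
    if not generators:
        raise ValueError("at least one load generator is required")
    base, extra = divmod(total_users, len(generators))
    # The first `extra` generators each take one additional user.
    return {
        name: base + (1 if index < extra else 0)
        for index, name in enumerate(generators)
    }


plan = split_users(1000, ["lg-1", "lg-2", "lg-3"])
print(plan)
```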

In a pipeline, it’s a pretty bad idea to conflate the role of the build node with the load-making software, particularly because if that resource is stressed by making the load, it stops reliably reporting build status back to the CI master. Most build nodes are simply for tools such as Maven, Python, Sonar, and various CLIs. A flaky agent is a bad actor, particularly at scale when running many of these tests independent of one another. No one likes flaky or failed builds.

An alternative to this is to leave the pipeline’s worker node to just execution semantics, but requisition load infrastructure dynamically from a separate system. Early industry attempts to provide “sidecar” build containers to pipeline code ultimately collapsed under the complexities of container networking orchestration and resource contention. Still, some NeoLoad customers who adopted performance testing into pipelines as early as Jenkins 2.x would use cloud CLI tools, such as those for AWS or GCP, to make provisioning calls, wait for acquisition, run load tests using the resources, then spin them down. That’s a lot of work and ultimately bulks out your pipeline code.
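The provision, wait, run, spin-down sequence described above can be sketched in Python as a context manager, which at least guarantees the spin-down step happens even when a test fails. The `provision`, `wait_until_ready`, and `release` callables are hypothetical placeholders for whatever cloud CLI or API calls a team actually uses; the fake versions below only record what happened.

```python
from contextlib import contextmanager


@contextmanager
def load_infrastructure(provision, wait_until_ready, release, count):
    """Acquire load generators, hand them to the caller, and always release them."""
    resources = provision(count)
    try:
        wait_until_ready(resources)
        yield resources
    finally:
        release(resources)


# Hypothetical stand-ins for real provisioning calls.
events = []


def fake_provision(count):
    events.append(f"provision {count}")
    return [f"lg-{i}" for i in range(1, count + 1)]


def fake_wait(resources):
    events.append("ready")


def fake_release(resources):
    events.append(f"release {len(resources)}")


def run_load_test(generators):
    events.append("run on " + ",".join(generators))
    return "passed"


with load_infrastructure(fake_provision, fake_wait, fake_release, count=2) as generators:
    outcome = run_load_test(generators)

print(outcome, events)
```

Even in this tidy form, the pipeline still owns the infrastructure lifecycle, which is the bulk that the next paragraphs argue for moving out of pipeline code.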

Particularly as many organizations now use cloud providers for their CI processes, NeoLoad Web Runtime supports auto-provisioning these resources via OpenShift and Kubernetes providers. Traditional static infrastructure is still supported by attaching these always-on resources to NeoLoad Web Resource “zones,” but if at all possible, most customers prefer provisioning as-needed testing infrastructure via elastic compute like AWS EKS Fargate and other Kubernetes-compliant container provisioning platforms. And if you don’t want to think about infrastructure, there’s always the NeoLoad Cloud Platform for SaaS-based load generators.

Abstracting load infrastructure from performance pipeline semantics has dramatically improved our customers’ ability to provide a self-service performance feedback model to many product teams across large organizations. Plus, it makes your pipelines far more concise, easy to understand, and quick to adjust as necessary.

Easy go/no-gos

We make all these things easy . . . so what? What’s the point of testing if it doesn’t quickly provide developers and ops engineers useful feedback that can actually help them do something about the performance of their systems?

Aside from the usual (but still necessary) raw load testing results, teams often set up thresholds or SLAs to warn when performance is outside an acceptable range. SREs tend to establish these thresholds using service level objectives (SLOs) and measure them with service level indicators (SLIs) so that it’s easy to know when performance takes a dive.
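As a minimal sketch of threshold-based go/no-go, the Python below evaluates observed metrics against warn and fail levels that differ per environment. The metric names, the threshold values, and the `THRESHOLDS` table are all invented for illustration; they are not NeoLoad's SLA format.

```python
from dataclasses import dataclass
from enum import IntEnum


class Verdict(IntEnum):
    PASS = 0
    WARN = 1
    FAIL = 2


@dataclass(frozen=True)
class Threshold:
    metric: str
    warn_at: float
    fail_at: float

    def evaluate(self, observed):
        """Higher observed values are worse for every metric in this sketch."""
        if observed >= self.fail_at:
            return Verdict.FAIL
        if observed >= self.warn_at:
            return Verdict.WARN
        return Verdict.PASS


# Hypothetical, environment-specific thresholds.
THRESHOLDS = {
    "dev": [Threshold("p95_ms", 800, 1500), Threshold("error_ratio", 0.02, 0.05)],
    "staging": [Threshold("p95_ms", 400, 800), Threshold("error_ratio", 0.01, 0.02)],
}


def go_no_go(environment, observed):
    """The overall verdict is the worst verdict across all thresholds."""
    return max(t.evaluate(observed[t.metric]) for t in THRESHOLDS[environment])


observed = {"p95_ms": 650.0, "error_ratio": 0.004}
print(go_no_go("dev", observed).name)
print(go_no_go("staging", observed).name)
```

The same measurements pass in a forgiving development environment and raise a warning in staging, which is the reason to keep thresholds environment-appropriate.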

You shouldn’t have to wait a long time during your performance pipeline to learn that your systems aren’t ready and are failing performance thresholds. Going back to the notions of USE and RED, these service “impact” metrics should also be a critical part of your threshold and performance test failure strategy. The combination of RED and USE measurements during a load test allows performance pipelines to “fail fast” across both the pressure/load and impact/service observations. Without both these types of metrics, you simply do not have a sufficient view to know if performance is acceptable or not.
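The “fail fast” idea can be illustrated with a short Python sketch that watches a stream of request outcomes and stops as soon as the error ratio over a sliding window exceeds a limit. This is a generic illustration under assumed parameters (window size, error limit), not the implementation behind any particular tool's fast-fail option.

```python
from collections import deque


def run_with_fast_fail(outcomes, window=10, max_error_ratio=0.3):
    """Consume request outcomes (True = ok) and abort once a full window is too bad.

    Returns ("aborted", n) or ("completed", n), where n is the number of
    outcomes consumed before stopping.
    """
    recent = deque(maxlen=window)
    consumed = 0
    for ok in outcomes:
        consumed += 1
        recent.append(ok)
        if len(recent) == window:
            error_ratio = recent.count(False) / window
            if error_ratio > max_error_ratio:
                return ("aborted", consumed)
    return ("completed", consumed)


# Thirty healthy requests, then the system under test starts failing.
degrading = [True] * 30 + [False] * 100
print(run_with_fast_fail(degrading))
print(run_with_fast_fail([True] * 50))
```

In the degrading case the run stops after a handful of failures rather than grinding through the remaining requests, which is the time saved in the pipeline.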

For this very purpose, the NeoLoad CLI used in our customers’ performance pipelines supports easy-to-implement “fastfail” syntax. Similarly, our YAML-based “as-code” specification supports not only SLA thresholds (which are easy to define differently per environment), but also direct Dynatrace support to capture and emit the full spectrum of RED and USE metrics in NeoLoad Web results and dashboards.

Finally, comparing current tests to a baseline and accessing trend-based data should be easy within the context of pipelines. Using this data in real time via REST APIs is a key goal of the NeoLoad Web API, such that whatever your particular design and analysis requirements, you have a reliable and future-proof way of building these into your performance pipeline process.
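A baseline comparison inside a pipeline can be as simple as the Python sketch below, which flags metrics that have worsened by more than a tolerance. The metric dictionaries are hypothetical stand-ins for whatever your results API returns, and the sketch assumes that lower values are better for every metric listed.

```python
def compare_to_baseline(baseline, current, tolerance=0.10):
    """Return metrics whose relative increase over the baseline exceeds tolerance."""
    regressions = {}
    for metric, base_value in baseline.items():
        if metric not in current or base_value == 0:
            continue  # nothing to compare against
        change = (current[metric] - base_value) / base_value
        if change > tolerance:
            regressions[metric] = round(change, 3)
    return regressions


# Hypothetical summaries of a baseline run and the current run.
baseline = {"p95_ms": 400.0, "avg_ms": 200.0, "error_ratio": 0.01}
current = {"p95_ms": 480.0, "avg_ms": 205.0, "error_ratio": 0.01}

regressions = compare_to_baseline(baseline, current)
print(regressions)
```

A non-empty result is a natural signal to fail or flag the pipeline step, while trend data across many runs helps decide whether the tolerance itself is reasonable.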

Putting it all together

We at Tricentis help customers do these things on a daily basis, and there are so many configurations and technology choices already in play. Reference examples are available on GitHub, as are our load testing containerized components, but the broader performance testing learnings are:

Start codifying your performance testing process over APIs.
Use those patterns and learnings to extend out to other types of app testing.
Don’t expect good results from heavy tooling baked into CI build nodes.
Make sure that there are environment-appropriate thresholds/SLAs in place.
Expose both RED and USE metrics into your pipelines to simplify go/no-go outcomes.
Standardize how results are reported into a central place where data can be aggregated.

For an in-depth discussion of these performance testing topics, check out our white paper A practical guide to continuous performance testing.

2 replies

Hi Paul,
thank you for that great Blog. If I think about “shift left” and NeoLoad, then the question arises in my mind:
How far left can we go with NeoLoad as part of "shift left" strategy?

if mocks are available
if the developers design the NeoLoad test scripts
if containerization is possible

Would it be possible with NeoLoad to go further left than service endpoints (in the direction of unit/component testing)?

For instance:

Check the performance of the individual Java components through unit tests. This could include measuring the execution time of key methods and functions, as well as checking memory usage.
Simulate load at the individual component level to ensure each component can handle the expected load. This can be done by testing parallel calls, large amounts of data, or intensive computing operations.
Monitor memory usage of Java components to ensure no memory leaks or unexpected memory issues occur.

What do you think?
Regards, Niko.

paulsbruce
Author
4 months ago
27 February 2024

Niko, all good points. A few thoughts:

On “shift left” as a “strategy”:

To me, the word “strategy” is used ambiguously by folks, so I’m interested to know your definition...and also your definition of “shift left” (equally ambiguous across the industry). My definition of strategy comes from Rumelt: “...it’s the craft of figuring out which purposes are both worth pursuing and capable of being accomplished”...which IMO clearly puts the phrase “shift left” in a bad light because it is A) insufficient as a description of what/how/when/why to “shift” and B) unspecific and thus non-actionable. That is, unless there is a very clear and complete strategy for how to do it.

https://www.goodreads.com/en/book/show/11721966

Most leaders and practitioners I’ve talked to and supported over the years expect that load testing should be treated the same as any other testing. It should not, and not simply because “it takes too long”. Systems are rarely ready for load testing (disambiguating from Performance Testing here). Teams also often lack a strong culture of being clear and specific about the requirements that drive the development, testing, and ultimate performance of the services they build. It takes cross-functional systems thinking (and dedicated, skilled practitioners) both to perform proper load testing and to then find useful outcomes that can be automated in pipelines (what people typically think of when they talk about ‘shifting testing left’). In my experience, devs are paid to make features and test-ish them, and as such are often conflicted when made responsible for (what is traditionally labeled) ‘non-functional’ testing and requirements. The short version is, if people can’t articulate specifically what kind of feedback they need and why, and if it isn’t critical to the predictable and safe delivery of software to production, then maybe it’s not something that needs to be ‘shifted’ earlier in cycles or shoved up into a pipeline just because ‘everything must be automated’. Maybe progressive feedback loops that run at different times and produce component, service, and end-to-end evidence are exactly the kind of nuanced strategy that leaders and developers alike just don’t have the time to appreciate the need for.

I’ve shared a lot of my personal views on the matters above on my YouTube channel over the past 5 years: https://www.youtube.com/@PaulBruceIO/videos

On the numbered points you made:

To your first point, this sounds a lot like the need for ‘instrumentation’ to me. Using unit tests to drive execution while measuring small samples of timings, while not a load test, taken consistently over a period of time is exactly the kind of “single user performance monitoring” that James Pulley is known for encouraging. Traditional ‘instrumentation’ injected all sorts of often disruptive external library callbacks into binaries, which (sadly) remains the best way to do ‘performance monitoring’ of desktop apps. Modern web apps aren’t “binaries” (except if they’re Java :) ) at all, and in fact are mostly UI with a bunch of API calls to and through subsystems/services, but the same notion of instrumentation applies. All the APM vendors inject themselves into the runtimes for this. The modern way of thinking about unit-level performance is to take samples over time from telemetry sources that are passive to the actual running software...less “inject”, more “emit”, which means that service and app teams need to intentionally instrument their apps to do so (see OpenTelemetry and https://o11yfest.org videos for more). But much of this is more for service/operational telemetry and not purely about re-using unit tests (known good inputs producing known good outputs) as you mentioned.

To your second point about ‘individual components’…unless those components and the functions/methods comprising them are manifest as some load-testable interface (via HTTP, gRPC, etc.), there is no such thing as ‘component’ load testing. There is service (or in the case of web apps, surface) load testing, whose main goal is to provide evidence, hopefully statistically relevant, of system performance when components are taken together.

The third point rolls back into the answer to the first point...monitoring this stuff may be useful to detect when current patterns diverge from prior ones (an uptick, for example), and the unit tests should be granular enough to at least point devs in the direction of where to start looking. Aggregating these with some kind of hierarchical intel coming from the components and dependencies would also be potentially useful to show “biggest contributing factors” (not “root cause” because there’s usually not just one problem).

So, to sum up...NeoLoad is a load testing platform built to simplify making lots of outbound calls over standard protocols to accessible surface areas such as APIs, web resources/apps, even some non-standard protocols. I’ve also seen it (mis?)used for bulk data generation and long-running monitoring tasks, and some folks have even turned it into a highly parallel API functional testing framework. But to run (what are traditionally called) “unit tests” on components...I don’t think it’s fruitful. There are too many yikes moments in my head about intra-process parallelization and other process-vs-service hurdles to overcome, ultimately resulting in the need for ‘instrumentation’ and/or telemetry emission anyway.

But these are just my personal thoughts. I’m totally open to further discussion about what additional options are out there to help you get the feedback about the components your team needs to make good decisions.
