Designed for Production Usage
Application performance monitoring products like Retrace APM+ are amazingly powerful for understanding the performance and behavior of your web applications. The downside is that the overhead of an APM solution can slow down your applications. From day one, we have designed Retrace APM+ to be very lightweight and safe for production servers.
Three key reasons why Retrace is designed for production usage:
- Code profiling is minimized to key application framework methods
- Implemented in highly optimized C++ code
- Data processing is done in a separate process, outside of your application code
The last point above is particularly important. Some APM solutions collect, aggregate, and upload processed data all within the same process, and that process is your IIS worker process. This has the potential to cause erratic and major performance problems in your applications. Stackify avoids this by using a separate Windows service to process the profiler output, which minimizes the performance impact on your application.
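To make the design choice concrete, here is a minimal sketch of the general "record in process, aggregate elsewhere" pattern. It is not Retrace's actual implementation: the file-based spool, the JSON line format, and the function names `record_call` and `aggregate` are all hypothetical, chosen only to show the split between the hot path and the processing step.

```python
import json
from collections import defaultdict


def record_call(spool_path, method, elapsed_ms):
    """Hot path (hypothetical): runs inside the web application.

    It only appends one raw event to a spool file and returns.
    No aggregation or upload happens here.
    """
    with open(spool_path, "a", encoding="utf-8") as spool:
        spool.write(json.dumps({"method": method, "ms": elapsed_ms}) + "\n")


def aggregate(spool_path):
    """Processing step (hypothetical): intended to run in a separate
    service process, so its CPU and I/O cost stays out of the web app.

    Returns per-method call counts and total elapsed milliseconds.
    """
    totals = defaultdict(lambda: {"count": 0, "total_ms": 0.0})
    with open(spool_path, encoding="utf-8") as spool:
        for line in spool:
            event = json.loads(line)
            entry = totals[event["method"]]
            entry["count"] += 1
            entry["total_ms"] += event["ms"]
    return dict(totals)
```

In this sketch the web application would only ever call `record_call`, while a separate service would periodically call `aggregate` on the same file. A real profiler would use a far cheaper hand-off than reopening a file per call; the point is only that the expensive work happens outside the request-serving process.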
Impact on Real-World Applications
Our experience and industry research show that most real-world applications receive fewer than 30 requests per second. In applications under this level of load, as you will see below, Retrace’s impact on performance is negligible. We also see a significant number of apps that perform synchronous database calls and become IO bound quickly. In these settings, where most of the wait time is spent on database calls, our testing has again shown Retrace’s overhead to be minimal.
Note: Each web request causes our profiler to inspect and track roughly 50-60 method calls. That number, of course, varies widely depending on what the request does, since we automatically inspect all DB calls, cache calls, etc.
Key Metrics to Watch When Testing APM Profiler Overhead
When testing profiler overhead, there are a few metrics you want to track. One of the most important, and most easily overlooked, is throughput itself. With APM turned on, page load times may barely change while the actual number of requests being handled drops. Here are the four most important metrics to measure:
- Request response times
- Requests per second
- Total requests during test (total throughput)
- Application & server CPU usage
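As a rough illustration of how these four metrics could be compared between a run with APM off and a run with APM on, here is a minimal sketch. The `RunSummary` type and the `summarize` and `compare` functions are hypothetical names, not part of Retrace or loader.io, and the sketch assumes you have already collected per-request response times, the test duration, and CPU samples from your own load test.

```python
import math
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunSummary:
    total_requests: int         # total throughput during the test
    requests_per_second: float
    avg_response_ms: float
    p95_response_ms: float
    avg_cpu_percent: float


def summarize(response_times_ms, duration_seconds, cpu_samples_percent):
    """Reduce the raw measurements of one load-test run to the key metrics."""
    ordered = sorted(response_times_ms)
    # Nearest-rank 95th percentile: smallest value with at least 95% of samples at or below it.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return RunSummary(
        total_requests=len(ordered),
        requests_per_second=len(ordered) / duration_seconds,
        avg_response_ms=mean(ordered),
        p95_response_ms=ordered[rank - 1],
        avg_cpu_percent=mean(cpu_samples_percent),
    )


def percent_change(baseline, measured):
    """Relative change of measured versus baseline, in percent."""
    return (measured - baseline) / baseline * 100.0


def compare(apm_off, apm_on):
    """Compare an APM-on run against the APM-off baseline."""
    return {
        "throughput_change_pct": percent_change(
            apm_off.total_requests, apm_on.total_requests
        ),
        "avg_response_change_pct": percent_change(
            apm_off.avg_response_ms, apm_on.avg_response_ms
        ),
        # CPU is reported as a difference in percentage points, not a ratio.
        "cpu_change_points": apm_on.avg_cpu_percent - apm_off.avg_cpu_percent,
    }
```

Reporting throughput alongside response times is deliberate: a run can show nearly unchanged average response times while `throughput_change_pct` is clearly negative, which is exactly the case that is easy to miss.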
Comparison of Various Sites With Retrace APM+ On vs. Off
Large blog site
This application is an old codebase written with ASP.NET WebForms. The app itself makes a ton of database calls. The site runs on two Azure servers and has over a million monthly blog visitors. Each server receives about 7-10 requests per second and the workload varies wildly since it is a public site.
Turning Retrace APM+ on caused no noticeable overhead.
Retrace Web Services
This application is a mix of web services written in WCF, MVC, and Web API. It handles all the communication from our agents deployed on our clients’ servers. It handles a lot of database calls as well as writing to Azure table storage and queues. The site runs on multiple Azure servers and each server receives 10-12 requests per second.
Again, turning Retrace APM+ on caused no noticeable overhead.
MVC load test with very high request volume
We also performed a load test that represents a basic MVC site receiving 100 requests per second in traffic, which is up to 10x what most basic web applications receive. You can read more about it below.
Load Testing Retrace APM+ Under Heavy Load
For this test, we used loader.io against a single dual-core server hosted on Windows Azure, hitting a single URL. The URL maps to an MVC controller that returns a simple Razor view and does not perform any other operations. Most APM overhead is tied to the volume of methods that are inspected and instrumented. A high request volume on a simple page is the best way to run a controlled test of APM overhead, as it removes wait time and variability from anything that might be IO bound or crossing boundaries to other servers or resources.
Retrace APM+ Results
Over a 10 minute window you can see that the application provided very consistent throughput and response times with little variance. The blue line below represents the response times.
Note: These numbers are virtually identical to having APM disabled. Response times of ~80ms include the network latency seen by loader.io. Server-side times were 0-1ms.
Results From a Competitive APM Provider
As a comparison, here is the same test with a different leading APM vendor’s solution enabled. This APM product caused random page load time spikes every minute or so. This is because their product is engineered to aggregate and upload the APM data it collects inside the IIS worker process itself, which can cause random thread blocking and performance issues in your app.
We point this out because we want you to understand that Stackify was truly designed for speed, stability, and production usage. You can’t say that about every APM solution.
Load Test Results
Retrace APM+ is engineered specifically to cause very little impact on the response times and throughput of your application while keeping CPU overhead as low as possible.
For a server doing a very high number of requests per second (100), we consider this additional CPU overhead while maintaining excellent response times to be very good and production safe.
Retrace APM+ has minimal to no impact on most web applications, making it safe to run at all times on production servers. Naturally, your results may not exactly mirror our test results, since all apps are different. Still, we hope this in-depth look at what you can expect with Retrace APM+ enabled gives you confidence that you can trust Stackify to provide deep visibility into your application’s performance hot spots without contributing new ones!