APM best practices: Dos and don’ts guide for practitioners


Application performance management (APM) is the practice of regularly tracking, measuring, and analyzing the performance and availability of software applications. APM helps you get visibility into complex microservices environments, which can otherwise overwhelm site reliability engineering (SRE) teams. The resulting insights help teams deliver an optimal user experience and achieve desired business outcomes. It's a complex process, but the goal is straightforward: ensuring that an application runs smoothly and meets the expectations of users and the business.

A clear understanding of an application's operation and a proactive APM practice are crucial for maintaining high-performing software applications. APM shouldn't be an afterthought; it should be considered from the beginning. When implemented proactively, monitoring components can be embedded directly into how the application runs.

What is application performance management?

Application performance management incorporates continuous monitoring, analysis, and management of an application's backend and frontend performance. Application monitoring is expanding and evolving, but an APM strategy shouldn't be created in silos: it's essential to bring in multiple stakeholders, including business experts, application developers, and operations teams. A successful APM strategy goes beyond uptime or server health and focuses on application service level objectives (SLOs), so emerging issues are caught before they become a problem for users.

Modern APM implementation involves instrumenting your applications to collect three types of telemetry data: traces (request flows), metrics (aggregated measurements), and logs (discrete events). The challenge isn't just collecting data — it's collecting the right data without impacting performance.

Learn more about observability metrics.

There are numerous instrumentation approaches, but the most effective strategy combines auto-instrumentation (for frameworks and libraries) with manual instrumentation (for business logic). Auto-instrumentation using OpenTelemetry agents can capture 80% of your observability needs with minimal code changes:

# Auto-instrumentation creates the route span automatically (Flask + OpenTelemetry API)
from flask import Flask, request, jsonify
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

app = Flask(__name__)
tracer = trace.get_tracer(__name__)

@app.route('/api/orders', methods=['POST'])
def create_order():
    order_data = request.get_json()
    # Add a manual span only for critical business logic
    with tracer.start_as_current_span("order.validation") as span:
        span.set_attribute("order.value", order_data.get("total", 0))
        if not validate_order(order_data):
            span.set_status(Status(StatusCode.ERROR))
            return jsonify({"error": "invalid order"}), 400
    return jsonify({"status": "created"}), 201

  • Do: Start with auto-instrumentation, then add manual spans for business-critical operations.

  • Don't: Manually instrument every function call — you'll create performance overhead and noise.

  • Pitfall: Over-instrumentation can add 15%–20% latency. Monitor your monitoring with baseline performance comparisons.

A few components for an organization or business to consider when developing an APM strategy are:

  • Performance monitoring, including evaluating latency, service level objectives, response time, throughput, and request volumes

  • Error tracking, including exceptions, crashes, and failed API calls (see the sketch after this list)

  • Infrastructure monitoring, including health and resource usage of servers, containers, and cloud environments that support the application

  • User experience metrics, including load times, session performance, click paths, and browser or device details (It’s important to keep in mind that even if system metrics look fine, users may still encounter performance issues.)
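
For the error-tracking component above, most APM agents capture unhandled exceptions automatically; for handled failures, recording the exception on the active span keeps failed API calls visible in your traces. Here's a minimal sketch using the OpenTelemetry Python API (charge_customer and payment_client are hypothetical):

# Error-tracking sketch (OpenTelemetry Python API); payment_client is hypothetical
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

def charge_customer(order):
    with tracer.start_as_current_span("payment.charge") as span:
        try:
            return payment_client.charge(order.total)  # hypothetical downstream call
        except Exception as exc:
            span.record_exception(exc)                 # attach the exception to the trace
            span.set_status(Status(StatusCode.ERROR))  # mark the span as failed
            raise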

Key principles of effective APM

The core principles of effective application performance management are end-to-end visibility (from the user's browser to the database), real-time monitoring, and contextual insights, all with a focus on user and business objectives. APM can also improve application scalability by enabling continuous improvement and increasing performance over time.

  • Do: Implement real-time dashboards with SLO-based alerts rather than arbitrary thresholds (see the burn-rate sketch after this list).

  • Don't: Rely only on periodic performance reviews or CPU/memory alerts — instrument user experience metrics.

  • Pitfall: Alert fatigue from low-level system metrics. Focus on user-facing SLOs that indicate real problems.
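
One way to make SLO-based alerting concrete is to alert on error-budget burn rate rather than raw thresholds. Here's a rough, illustrative sketch of the arithmetic (the numbers and the page_on_call hook are hypothetical, not any specific tool's API):

# Error-budget burn-rate sketch (illustrative numbers; page_on_call is hypothetical)
slo_target = 0.999                     # 99.9% availability over a 7-day window
error_budget = 1 - slo_target          # 0.1% of requests are allowed to fail

total_requests = 1_200_000             # observed over the last hour
failed_requests = 3_600
error_rate = failed_requests / total_requests  # 0.003

# Burn rate: how fast the budget is being consumed relative to the window
burn_rate = error_rate / error_budget  # 3.0 - at this pace the budget is gone in a third of the window
if burn_rate >= 2:
    page_on_call()                     # page a human only when the budget burns too fast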

When creating an APM strategy, here are a few key principles to consider:

1. Proactive monitoring: Prevent issues before they impact users by setting up alerts and responding quickly to any anomalies. But try to avoid alert fatigue. Balance automated alerts with human oversight so important issues don’t get missed, focusing on outcomes rather than system metrics. 

2. Real-time insights: Move beyond logging issues and enable fast decision-making based on live data and real-time dashboards that prioritize the most critical business transactions. Use telemetry data (logs, metrics, and traces) to parse your performance insights.

3. End-to-end visibility: Monitor the application across the entire environment, the entire user flow, and all layers, from frontend to backend.

4. User-centric approach: Prioritize performance and experience from an end-user perspective, while considering key business objectives.

5. Real user monitoring: The work doesn’t stop when it’s in your user’s hands. By monitoring their experience, you can iterate and improve based on their feedback.

6. Continuous improvement: Use insights to optimize over time and regularly uncover and tackle unreported issues. Issues should be addressed dynamically rather than when discovered in periodic performance reviews. 

7. Context propagation: Ensure trace context flows through your entire request path, especially across service boundaries:

# Outgoing request - inject context
headers = {}
propagate.inject(headers)
response = requests.post('http://service-b/process', headers=headers)

8. Sampling strategy: Use intelligent sampling to balance visibility with performance (a configuration sketch follows this list):

  • 1%–10% head-based sampling for high-traffic services

  • 100% sampling for errors and slow requests using tail-based sampling

  • Monitor instrumentation overhead — aim for <5% performance impact
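
As a concrete starting point for the sampling item above, head-based sampling can be configured directly in the OpenTelemetry SDK. The 5% ratio below is illustrative, and tail-based sampling of errors and slow requests is typically handled in the OpenTelemetry Collector rather than in application code:

# Head-based sampling sketch with the OpenTelemetry Python SDK (5% ratio is illustrative)
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 5% of new traces; child spans follow their parent's sampling decision
sampler = ParentBased(root=TraceIdRatioBased(0.05))
trace.set_tracer_provider(TracerProvider(sampler=sampler))

# Tail-based sampling (keeping 100% of errors and slow requests) is usually done
# in the OpenTelemetry Collector's tail_sampling processor, not in-process.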

Best practices for APM implementation

The right APM solution should support your technology stack with minimal instrumentation effort. OpenTelemetry has become the industry standard, providing vendor-neutral instrumentation that works across languages:

@RestController
public class OrderController {
    
    @PostMapping("/orders")
    public ResponseEntity<Order> createOrder(@RequestBody OrderRequest request) {
        // Auto-instrumentation captures this endpoint automatically
        // Add custom business context
        Span.current().setAttributes(Attributes.of(
            stringKey("order.value"), String.valueOf(request.getTotal()),
            stringKey("user.tier"), request.getUserTier()
        ));
        
        return ResponseEntity.ok(processOrder(request));
    }
}

  • Do: Implement sampling strategies and monitor instrumentation overhead in production.

  • Don't: Use 100% sampling for high-traffic services — you'll impact performance and explode storage costs.

  • Pitfall: Head-based sampling can miss critical error traces. Use tail-based sampling to capture all errors while reducing volume.

Here’s how to get it right:

  • Select the right APM solution: The right APM tool should align with an application's architecture and the organization's needs, giving teams the capabilities they need to monitor, track, measure, and analyze their software applications. A business may use OpenTelemetry, an open source observability framework, to instrument and collect telemetry data (traces, metrics, and logs) from applications; a minimal SDK setup sketch follows this list.

  • Manage cardinality to control costs: High-cardinality attributes can make metrics unusable and expensive:
# Good - bounded cardinality
span.set_attribute("user.tier", user.subscription_tier)  # 3-5 values
span.set_attribute("http.status_code", response.status_code)  # ~10 values

# Bad - unbounded cardinality  
span.set_attribute("user.id", user.id)  # Millions of values
span.set_attribute("request.timestamp", now())  # Infinite values
  • Set up intelligent alerting based on SLOs rather than arbitrary thresholds. Use error budgets to determine when to page someone:
slos:
  - name: checkout_availability
    target: 99.9%
    window: 7d
  - name: checkout_latency  
    target: 95%  # 95% of requests under 500ms
    window: 7d

  • Train teams and promote collaboration. An APM strategy impacts a wide range of stakeholders, not just developers. Involve IT teams and other business stakeholders in cross-departmental collaboration, embed APM into your organizational workflows, and establish clear goals and KPIs that align with business needs and user experience.

  • Review and evaluate. An APM strategy continues to evolve and change alongside application and business needs.
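
To make the OpenTelemetry option above concrete, here is a minimal bootstrap sketch for the Python SDK exporting traces over OTLP; the endpoint and service name are placeholders for your own environment:

# Minimal OpenTelemetry SDK bootstrap sketch (endpoint and service name are placeholders)
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

resource = Resource.create({"service.name": "checkout-service"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)  # use this tracer for manual spans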

Monitoring strategies in APM

A key aspect of a successful application performance management strategy is deciding how and when to use different monitoring approaches. Combining monitoring strategies is vital because different components of an application, like user experience or infrastructure, require tailored approaches to detect and resolve issues effectively. A diverse strategy ensures comprehensive coverage, faster analysis, fewer interruptions to application performance, and happier end users.


There are various monitoring approaches to consider: 
  • Real-time monitoring: Continuously tracks live system performance with sub-second granularity. Implement custom metrics for business logic alongside technical metrics:
from prometheus_client import Histogram  # assumes the Prometheus Python client

order_processing_duration = Histogram(
    "order_processing_seconds",
    "Time to process orders",
    ["payment_method", "order_size"]
)

with order_processing_duration.labels(
    payment_method=payment.method,
    order_size=get_size_bucket(order.total)
).time():
    process_order(order)
  • Synthetic monitoring: Simulates user interactions to detect issues before real users are affected. Critical for external dependencies:
// Synthetic check for a critical user flow (assumes the OpenTelemetry JS API)
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('synthetic-checks');

const syntheticCheck = async () => {
    const span = tracer.startSpan('synthetic.checkout_flow');
    try {
        await loginUser();
        await addItemToCart();
        await completePurchase();
        span.setStatus({code: SpanStatusCode.OK});
    } catch (error) {
        span.recordException(error);
        span.setStatus({code: SpanStatusCode.ERROR});
        throw error;
    } finally {
        span.end();
    }
};

  • Deep-dive diagnostics and profiling: Helps troubleshoot complex performance bottlenecks, which could include third-party plugins or tools. Through application profiling, you can dig into how your code performs function by function (a minimal profiling sketch follows this list).

  • Distributed tracing: Essential for microservices architectures. Handle context propagation carefully across async boundaries:
# Event-driven systems - propagate context through messages
def publish_order_event(order_data):
    headers = {}
    propagate.inject(headers)
    
    message = {
        'data': order_data,
        'trace_headers': headers  # Preserve trace context
    }
    kafka_producer.send('order-events', message)
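
For the deep-dive diagnostics and profiling approach above, a quick way to dig into a suspect code path is Python's built-in cProfile; checkout_flow below is a hypothetical stand-in for the slow transaction your APM traces flagged:

# Profiling sketch with Python's built-in cProfile; checkout_flow is hypothetical
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
checkout_flow()    # the code path your APM traces flagged as slow
profiler.disable()

# Print the ten most expensive calls by cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)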

APM data analysis and insights

Monitoring and gathering data is just the beginning. Businesses need to understand how to interpret application performance management data for tuning and decision-making.

Identifying trends and patterns helps teams proactively detect issues. Use correlation analysis to link user complaints with backend performance. See an example here using ES|QL (Elastic’s query language):

FROM traces-apm*
| WHERE user.id == "user_12345" 
  AND @timestamp >= "2024-06-06T09:00:00" 
  AND @timestamp <= "2024-06-06T10:00:00"
| EVAL duration_ms = transaction.duration.us / 1000
| KEEP trace.id, duration_ms, transaction.name, service.name, transaction.result
| WHERE duration_ms > 2000
| SORT duration_ms DESC
| LIMIT 10

Detecting bottlenecks: APM reveals common performance anti-patterns, such as the N+1 query problem shown in the code below. Use APM data to guide the optimization:

# N+1 query problem detected by APM
def get_user_orders_slow(user_id):
    user = User.query.get(user_id)
    orders = []
    for order_id in user.order_ids:  # Each iteration = 1 DB query
        orders.append(Order.query.get(order_id))
    return orders

# Optimized after APM analysis
def get_user_orders_fast(user_id):
    return Order.query.filter(Order.user_id == user_id).all()  # Single query

Correlating metrics and linking user complaints with backend performance data, including historical data, reveals how different parts of the system interact. This can help teams accurately diagnose root causes and understand the full impact of performance issues.

Automating root cause analysis and using AI/machine learning-based tools such as AIOps helps to accelerate diagnostics and resolution by pinpointing the source of problems, reducing downtime, and freeing up resources.

It’s important to use a holistic picture of your data to inform future decisions. The more data you have, the more you can leverage.

  • Do: Use distributed traces to identify the specific service and operation causing slowdowns.

  • Don't: Assume correlation means causation — verify with code-level profiling data.

  • Pitfall: Legacy systems often appear as black boxes in traces. Use log correlation and synthetic spans to maintain visibility.

Advanced implementation patterns

Complex production environments present unique challenges that require advanced implementation strategies. This section covers practical approaches for handling polyglot architectures, legacy system integration, and sophisticated correlation analysis.

Context propagation in polyglot environments: Maintaining trace context across different languages and frameworks requires explicit attention to propagation mechanisms:

// Java - Auto-propagation with Spring Cloud
@PostMapping("/orders")
public ResponseEntity<Order> createOrder(@RequestBody OrderRequest request) {
    Span.current().setAttributes(Attributes.of(
        stringKey("order.type"), request.getOrderType(),
        longKey("order.value"), request.getTotalValue()));

    // OpenFeign automatically propagates context to downstream services
    return paymentClient.processPayment(request.getPaymentData());
}

// Go - Manual context extraction and propagation
func processHandler(w http.ResponseWriter, r *http.Request) {
    ctx := otel.GetTextMapPropagator().Extract(r.Context(),
        propagation.HeaderCarrier(r.Header))
    ctx, span := tracer.Start(ctx, "process_payment")
    defer span.End()
    // Continue with trace context maintained
}

Legacy system integration: Create observability bridges for systems that can't be directly instrumented:

# Synthetic spans with correlation IDs for mainframe calls
with tracer.start_as_current_span("mainframe.account_lookup") as span:
    correlation_id = format(span.get_span_context().trace_id, '032x')
    
    logger.info("CICS call started", extra={
        "correlation_id": correlation_id,
        "trace_id": span.get_span_context().trace_id
    })
    
    result = call_mainframe_service(account_data, correlation_id)
    span.set_attribute("account.status", result.status)

Advanced trace analysis with ES|QL: Link user complaints to backend performance using Elastic's query language:

// Find slow requests during the complaint timeframe
FROM traces-apm*
| WHERE user.id == "user_12345" AND @timestamp >= "2024-06-06T09:00:00"
| EVAL duration_ms = transaction.duration.us / 1000
| WHERE duration_ms > 2000
| STATS avg_duration = AVG(duration_ms) BY service.name, transaction.name
| SORT avg_duration DESC

// Correlate errors across service boundaries
FROM traces-apm*
| WHERE trace.id == "44b3c2c06e15d444a770b87daab45c0a"
| EVAL is_error = CASE(transaction.result == "error", 1, 0)
| STATS error_rate = SUM(is_error) / COUNT(*) * 100 BY service.name
| WHERE error_rate > 0

Event-driven architecture patterns: Explicitly propagate context through message headers for async processing:

# Producer - inject context into message
headers = {}
propagate.inject(headers)
message = {
    'data': order_data,
    'trace_headers': headers  # Preserve trace context
}
await kafka_producer.send('order-events', message)

# Consumer - extract and continue trace
trace_headers = message.get('trace_headers', {})
context = propagate.extract(trace_headers)
with tracer.start_as_current_span("order.process", context=context):
    await process_order(message['data'])

  • Do: Use ES|QL for complex trace analysis that traditional dashboards can't handle.

  • Don't: Try to instrument legacy systems directly — use correlation IDs and synthetic spans.

  • Pitfall: Message queues and async processing break trace context unless explicitly propagated through headers.

  • Key insight: Perfect instrumentation isn't always possible. Strategic use of correlation IDs, synthetic spans, and intelligent querying provides comprehensive observability even in complex, hybrid environments.

APM for performance optimization with Elastic Observability

Elastic Observability makes implementing an application performance management strategy seamless by offering unified observability, combining application performance data with logs, metrics, and traces on a single powerful platform. Elastic Distributions of OpenTelemetry (EDOT) make it quick and easy to start collecting APM data.

Developers can use Elastic to set up alerts for anomalies, apply distributed tracing to optimize specific services or transactions, reduce latency, and improve performance stability through techniques such as load balancing and caching.

Through code profiling, teams can identify performance hotspots, inefficient code paths, memory leaks, or resource-heavy operations that slow down applications. Businesses can create custom dashboards to track KPIs, ultimately supporting better business outcomes.

Explore Elastic Observability Labs for more technical observability content.

Additional APM resources 

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.