Unlock the Secrets of Istio: Master Monitoring with These Expert Tips!
After discussing how to implement rate limiting with Istio, let’s explore how to monitor these rate limits and what additional metrics we can track using Istio.
But before we dive in, Istio didn’t necessarily make our lives easier (or did it? You decide). Each Istio metric includes a `reporter` label with two options: `source` or `destination`.
Let’s use `istio_requests_total{app="app-a"}` as a reference metric to better understand the meaning of the `reporter` label. The `source` reporter will show the outgoing traffic from app-a, while the `destination` reporter will display the incoming traffic to app-a.
Now, what happens when the label is `source_app="app-a"` or `destination_app="app-a"`?
- When `source_app="app-a"` and the reporter is `source`, app-a reports how many `istio_requests_total` there were.
- On the other hand, if the reporter is `destination` and the label is `destination_app="app-b"` (the destination app-a sends requests to), then app-b reports `istio_requests_total`.
- Similarly, when `destination_app="app-a"` and the reporter is `source`, the app sending requests to app-a (Proxy-to-all) will report `istio_requests_total`.
- If the reporter is `destination`, app-a will be the one reporting `istio_requests_total`.
I know this is a lot to digest, so to simplify things, we’ll focus on using only `reporter=source`. This approach also helps reduce cardinality, but that’s a topic for another time.
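To make the difference concrete, here is a minimal sketch of the two views of the same traffic (app-a is just a placeholder app label; adjust it to your own workloads):
# Outgoing requests from app-a, reported by app-a's own sidecar
istio_requests_total{reporter="source", source_app="app-a"}
# The same traffic, reported by the sidecars of the services that receive it
istio_requests_total{reporter="destination", source_app="app-a"}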
Monitor rate limits
Concurrent connections
Let’s begin by monitoring the concurrent connections to a service. As you may recall, `tcp.maxConnections` creates a connection pool with a specified size, which we can easily monitor using the `envoy_server_total_connections` metric.
It’s easy to observe how the `envoy_server_total_connections` metric changes when I apply the destination rule with `tcp.maxConnections=10`. I used this metric in my k6 stress test to determine the appropriate value for `tcp.maxConnections`, as discussed in my previous blog.
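For a quick panel, a query along these lines should work. This is a sketch: `app-a` is a placeholder, label names such as `app` and `pod` depend on your Prometheus scrape config, and it assumes the `envoy_server_*` stats are exposed by your sidecars:
# Current open connections per sidecar pod of app-a
max by (pod) (envoy_server_total_connections{app="app-a"})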
Rate limiting
Another useful panel we can create shows how much of an app’s traffic is being rate-limited and identifies the source of the rate-limiting. When an app is rate-limited, Envoy sets the `UO` (upstream overflow, i.e. circuit breaking) or `URX` (retry/connection limit reached) response flag. We can use a simple equation to calculate the percentage of traffic that is rate-limited, as follows:
(requests with UO or URX flags) / (all requests)
sum(rate(istio_requests_total{response_flags=~"UO|URX",reporter="source"}[$interval])) by (source_app,destination_service_name)
/
sum(rate(istio_requests_total{reporter="source"}[$interval])) by (source_app,destination_service_name)
I sent 20 requests with this `DestinationRule` config:
trafficPolicy:
  connectionPool:
    tcp:
      maxConnections: 10
    http:
      http1MaxPendingRequests: 5
which means that 5 of the requests should get the UO/URX flags: 5/20 = 0.25, i.e. 25%.
Bonus R.E.D
The RED Method is a monitoring philosophy specifically designed for microservices, focusing on three key metrics: Rate, Errors, and Duration. It helps to consistently track the number of requests a service handles per second (Rate), the proportion of those requests that fail (Errors), and the time taken to process each request (Duration). This method provides a clear and actionable view of service performance, enabling teams to monitor user experience and ensure that services are meeting expected SLAs. By applying the RED Method across all services, teams can achieve a unified understanding of system behavior and enhance operational scalability.
Istio service mesh enhances the RED Method by automatically collecting metrics for Rate, Errors, and Duration across all microservices. Istio’s sidecar proxies gather this data without modifying application code, making it easier to monitor and analyze service performance. With Istio, you can effortlessly track RED metrics, set up alerts, and ensure consistent service behavior, all of which help in maintaining a high-quality user experience.
Thanks, Yossi Nachum for bringing this great philosophy to our team!
Let's create outgoing traffic RED panels:
- Rate metric: `istio_requests_total`.
- Error metric: `istio_requests_total` (no 5xx errors) / `istio_requests_total`, i.e. the share of requests that did not fail.
- Duration metric: `istio_request_duration_milliseconds_bucket`.
I will provide the full PromQL at the end of the blog.
Try creating incoming RED panels.
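If you want a starting point for that exercise, here is a minimal sketch of the incoming Rate panel (assuming the same `${App}` and `$interval` Grafana variables used in the queries below; Error and Duration follow the same pattern with `reporter="destination"`):
# Incoming request rate, as reported by the destination's own sidecar (sketch, not the final panel)
sum by (destination_app) (
  rate(istio_requests_total{
    reporter="destination",
    destination_app="${App}"
  }[$interval])
)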
Bonus deployment/restart annotations
A faulty deployment may take the success rate down with it. You could go to Git and check whether someone deployed a new version of your app, or you can create an annotation that tells you.
Thanks to Amirl for the idea and for creating the annotation in our team.
In conclusion
Mastering Istio’s monitoring capabilities can significantly elevate your service performance and reliability. By understanding how to effectively use metrics like rate limits, concurrent connections, and the RED method, you can gain invaluable insights into your microservices’ behavior and ensure they meet the highest standards. Whether you’re optimizing your Istio setup or looking to enhance your monitoring strategy, the techniques we’ve explored will empower you to make data-driven decisions that keep your applications running smoothly. Don’t just deploy Istio — unlock its full potential and watch your systems thrive.
# RATE
sum by (source_app)(
rate(istio_requests_total{
source_app="${App}",
reporter="source"
}[${interval}]
)
)
# ERROR
sum by (source_app) (
rate(
istio_requests_total{
reporter="source",
response_code!~"5.*",
source_app="${App}"
}[$interval]
)
)
/
sum by (source_app) (
rate(
istio_requests_total{
reporter="source",
source_app="${App}"
}[$interval]
)
)
# DURATION for 0.50, 0.90 and 0.99
histogram_quantile(
  0.50, # change this
  sum by (le, source_app) (
    rate(
      istio_request_duration_milliseconds_bucket{
        reporter="source",
        source_app="${App}"
      }[$interval]
    )
  )
)
# Deployment annotation
changes(max(kube_replicaset_created{namespace="$namespace",replicaset=~"$App.*"})[$__interval:])