Unlock the Secrets of Istio: Master Monitoring with These Expert Tips!
After discussing how to implement rate limiting with Istio, let’s explore how to monitor these rate limits and what additional metrics we can track using Istio.
But before we dive in, Istio didn’t necessarily make our lives easier (or did it? You decide). Each Istio metric includes a `reporter` label with two options: `source` or `destination`.
Let’s use `istio_requests_total{app="app-a"}` as a reference metric to better understand the meaning of the `reporter` label. The `source` reporter will show the outgoing traffic from app-a, while the `destination` reporter will display the incoming traffic to app-a.
Now, what happens when the label is `source_app="app-a"` or `destination_app="app-a"`?
- When `source_app="app-a"` and the reporter is `source`, app-a reports how many `istio_requests_total` there were.
- On the other hand, if the reporter is `destination` and the label is `destination_app="app-b"` (the destination app-a sends requests to), then app-b reports `istio_requests_total`.
- Similarly, when `destination_app="app-a"` and the reporter is `source`, the app sending requests to app-a (Proxy-to-all) will report `istio_requests_total`.
- If the reporter is `destination`, app-a will be the one reporting `istio_requests_total`.
I know this is a lot to digest, so to simplify things, we’ll focus on using only `reporter=source`. This approach also helps reduce cardinality, but that’s a topic for another time.
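To make the difference concrete, here is a minimal sketch of the two views of the same traffic (app-a is just a placeholder app label; adjust it to your own workloads):
# Outgoing requests from app-a, reported by app-a's own sidecar
istio_requests_total{reporter="source", source_app="app-a"}
# The same traffic, reported by the sidecars of the services that receive it
istio_requests_total{reporter="destination", source_app="app-a"}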
Monitor rate limits
Concurrent connections
Let’s begin by monitoring the concurrent connections to a service. As you may recall, `tcp.maxConnections` creates a connection pool with a specified size, which we can easily monitor using the `envoy_server_total_connections` metric.
It’s easy to observe how the `envoy_server_total_connections` metric changes when I apply the destination rule with `tcp.maxConnections=10`. I used this metric in my k6 stress test to determine the appropriate value for `tcp.maxConnections`, as discussed in my previous blog.
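For a quick panel, a query along these lines should work. This is a sketch: `app-a` is a placeholder, label names such as `app` and `pod` depend on your Prometheus scrape config, and it assumes the `envoy_server_*` stats are exposed by your sidecars:
# Current open connections per sidecar pod of app-a
max by (pod) (envoy_server_total_connections{app="app-a"})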
Rate limiting
Another useful panel we can create shows how much of an app’s traffic is being rate-limited and identifies the source of the rate-limiting. When an app is rate-limited, Envoy sets the `UO` (upstream overflow, i.e. circuit breaking) or `URX` (retry/connection limit reached) response flag. We can use a simple equation to calculate the percentage of traffic that is rate-limited, as follows:
(requests with UO or URX flags) / (all requests)
sum(rate(istio_requests_total{response_flags=~"UO|URX",reporter="source"}[$interval])) by (source_app,destination_service_name)
/
sum(rate(istio_requests_total{reporter="source"}[$interval])) by (source_app,destination_service_name)
I sent 20 requests with this `DestinationRule` config:
trafficPolicy:
  connectionPool:
    tcp:
      maxConnections: 10
    http:
      http1MaxPendingRequests: 5
which means that 5 of the requests should get the UO/URX flags: 5/20 = 0.25, i.e. 25%.
Bonus R.E.D
The RED Method is a monitoring philosophy specifically designed for microservices, focusing on three key metrics: Rate, Errors, and Duration. It helps to consistently track the number of requests a service handles per second (Rate), the proportion of those requests that fail (Errors), and the time taken to process each request (Duration). This method provides a clear and actionable view of service performance, enabling teams to monitor user experience and ensure that services are meeting expected SLAs. By applying the RED Method across all services, teams can achieve a unified understanding of system behavior and enhance operational scalability.
Istio service mesh enhances the RED Method by automatically collecting metrics for Rate, Errors, and Duration across all microservices. Istio’s sidecar proxies gather this data without modifying application code, making it easier to monitor and analyze service performance. With Istio, you can effortlessly track RED metrics, set up alerts, and ensure consistent service behavior, all of which help in maintaining a high-quality user experience.
Thanks, Yossi Nachum for bringing this great philosophy to our team!
Let's create outgoing traffic RED panels:
- Rate metric: `istio_requests_total`.
- Error metric: `istio_requests_total` (no 5xx errors) / `istio_requests_total`, i.e. the share of requests that did not fail.
- Duration metric: `istio_request_duration_milliseconds_bucket`.
I will provide the full PromQL at the end of the blog.
Try creating incoming RED panels.
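If you want a starting point for that exercise, here is a minimal sketch of the incoming Rate panel (assuming the same `${App}` and `$interval` Grafana variables used in the queries below; Error and Duration follow the same pattern with `reporter="destination"`):
# Incoming request rate, as reported by the destination's own sidecar (sketch, not the final panel)
sum by (destination_app) (
  rate(istio_requests_total{
    reporter="destination",
    destination_app="${App}"
  }[$interval])
)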
Bonus deployment/restart annotations
A faulty deployment may take the success rate down with it. You could go to Git and check whether someone deployed a new version of your app, or you can create an annotation that tells you.
Thanks to Amirl for the idea and for creating the annotation in our team.
In conclusion
Mastering Istio’s monitoring capabilities can significantly elevate your service performance and reliability. By understanding how to effectively use metrics like rate limits, concurrent connections, and the RED method, you can gain invaluable insights into your microservices’ behavior and ensure they meet the highest standards. Whether you’re optimizing your Istio setup or looking to enhance your monitoring strategy, the techniques we’ve explored will empower you to make data-driven decisions that keep your applications running smoothly. Don’t just deploy Istio — unlock its full potential and watch your systems thrive.
# RATE
sum by (source_app)(
rate(istio_requests_total{
source_app="${App}",
reporter="source"
}[${interval}]
)
)
# ERROR
sum by (source_app) (
rate(
istio_requests_total{
reporter="source",
response_code!~"5.*",
source_app="${App}"
}[$interval]
)
)
/
sum by (source_app) (
rate(
istio_requests_total{
reporter="source",
source_app="${App}"
}[$interval]
)
)
# DURATION for 0.50, 0.90 and 0.99
histogram_quantile(
  0.50, # change this
  sum by (le, source_app) (
    rate(
      istio_request_duration_milliseconds_bucket{
        reporter="source",
        source_app="${App}"
      }[$interval]
    )
  )
)
# Deployment annotation
changes(max(kube_replicaset_created{namespace="$namespace",replicaset=~"$App.*"})[$__interval:])