Composite metrics
What are composite metrics?
Composite metrics enable you to define new, higher-level metrics by applying a set of mathematical transformations and operations to existing native metrics. This allows for the composition of complex queries and the creation of metrics that provide more insightful and actionable data.
Composite metrics leverage the flexibility and expressiveness of PromQL to perform these transformations. PromQL is a widely-adopted query language in the monitoring and observability community. This approach facilitates a more standardized and powerful way to analyze and visualize your metric data.
With composite metrics, you can:
- Aggregate data across different dimensions and time intervals.
- Perform arithmetic operations to combine metrics.
- Apply functions such as averages, sums, rates, and more.
- Filter and group metrics based on specific labels and criteria.
The name of a composite metric must be prefixed with composite.
Create a composite metric
- Press Create Metric in the Metrics Explorer to create a composite metric.
- Enter a name for the new metric and fill in additional details, such as Display Name, Description, and metric Unit as desired.
- Click the Formula tab and enter a PromQL definition in the field provided. To insert an existing metric into the PromQL definition, click Insert Metric and select the metric name. See SolarWinds composite metric PromQL syntax.
- Click Preview Definition to preview the data for the current definition of the composite metric and verify that the expected data appears.
- Click Create.
SolarWinds composite metric PromQL syntax
Composite metrics are created using the PromQL syntax, with further customizations and rules unique to SolarWinds Observability. To understand the basic syntax and structure of a PromQL query, review the PromQL Querying Prometheus documentation.
The SolarWinds implementation of PromQL includes several customizations to better align with our system's requirements. These customizations include support for additional functions, a different naming convention, and variable substitution. Below are some key differences and extensions.
SolarWinds PromQL naming convention
Our system follows the OpenTelemetry (OTel) naming conventions for both metric names and attributes, instead of the Prometheus restrictions. This section explains the differences and rules applied in our implementation.
SolarWinds PromQL metric names
Metric names in our system are backward compatible with PromQL metric naming, and they can contain additional characters. Metric names must match the following regex: [a-zA-Z_:][a-zA-Z0-9_:.]* and may contain escaped characters using a backslash (\).
Example metric names: service.requests.total, k8s.container_cpu_usage_seconds_total, or system/network/throughput
SolarWinds PromQL attribute names
Attributes (also known as labels in Prometheus) follow OTel naming conventions, which provide a consistent and descriptive approach. Attribute names can include alphanumeric characters, underscores (_), and dots (.), and may contain escaped characters using a backslash (\). Attribute names must match the following regex: [a-zA-Z_.][a-zA-Z0-9_:.]*
Example attributes: service.name or container.id
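The two patterns above can be checked mechanically. The following Python sketch is only an illustration: the helper names are hypothetical, and it tests unescaped names only (backslash escapes are not handled).

```python
import re

# Patterns copied from the naming rules above.
METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:.]*")
ATTRIBUTE_NAME = re.compile(r"[a-zA-Z_.][a-zA-Z0-9_:.]*")


def is_valid_metric_name(name: str) -> bool:
    """True if the whole (unescaped) name matches the metric-name pattern."""
    return METRIC_NAME.fullmatch(name) is not None


def is_valid_attribute_name(name: str) -> bool:
    """True if the whole (unescaped) name matches the attribute-name pattern."""
    return ATTRIBUTE_NAME.fullmatch(name) is not None
```

For example, is_valid_metric_name("service.requests.total") is true, while a name that starts with a digit is rejected.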
Basics
Expression language data types
In the SolarWinds Observability SaaS PromQL expression language, an expression or sub-expression can evaluate to one of three types:
- Instant vector: A set of time series containing a single sample for each time series, all sharing the same timestamp
- Range vector: A set of time series containing a range of data points over time for each time series
- Scalar: A simple numeric floating point value
Literals
String literals
String literals are designated by single quotes or double quotes.
Float literals
Scalar float values can be written as literal integers or floating-point numbers.
Time series selectors
Time series selectors are responsible for selecting the time series and the raw or inferred sample timestamps and values.
Do not confuse time series selectors with the higher level concept of instant and range queries that can execute the time series selectors. A higher level instant query would evaluate the given selector at one point in time. However, the range query would evaluate the selector at multiple different times in between a minimum and maximum timestamp at regular steps.
Instant vector selectors
Instant vector selectors allow the selection of a set of time series and a single sample value for each at a given timestamp (point in time). In the simplest form, only a metric name is specified, which results in an instant vector containing elements for all time series that have this metric name.
This example selects all time series that have the http_requests_total metric name:
http_requests_total
You can filter these time series further by appending a comma-separated list of label matchers in curly braces ({}).
This example selects only those time series with the http_requests_total metric name that also have the job label set to prometheus and their group label set to canary:
http_requests_total{job="prometheus",group="canary"}
You can also negatively match a label value, or match label values against regular expressions. The following label matching operators are available:
- =: Select labels that are exactly equal to the provided string.
- !=: Select labels that are not equal to the provided string.
- =~: Select labels that regex-match the provided string.
- !~: Select labels that do not regex-match the provided string.
For example, this selects all http_requests_total time series for staging, testing, and development environments and HTTP methods other than GET.
http_requests_total{environment=~"staging|testing|development",method!="GET"}
Label matchers that match empty label values also select all time series that do not have the specific label set at all.
For example, given the dataset:
http_requests_total
http_requests_total{replica="rep-a"}
http_requests_total{replica="rep-b"}
http_requests_total{environment="development"}
Multiple matchers can be used for the same label name. They all must pass for a result to be returned.
The query:
http_requests_total{replica!="rep-a",replica=~"rep.*"}
Would then match:
http_requests_total{replica="rep-b"}
All regular expressions use RE2 syntax.
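The matcher semantics described above can be modeled in a few lines of Python. This is a sketch under stated assumptions, not the product's implementation: the function name is hypothetical, regular expressions are assumed to be fully anchored (as in Prometheus), and Python's re module stands in for RE2, which is close enough for simple patterns.

```python
import re


def matches(labels, matchers):
    """Return True if a series' labels satisfy every (name, op, value) matcher."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # a missing label behaves like an empty value
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True


# The label sets of the four http_requests_total series from the dataset above.
dataset = [
    {},
    {"replica": "rep-a"},
    {"replica": "rep-b"},
    {"environment": "development"},
]

# http_requests_total{replica!="rep-a",replica=~"rep.*"}
query = [("replica", "!=", "rep-a"), ("replica", "=~", "rep.*")]
selected = [series for series in dataset if matches(series, query)]
```

Only the series with replica="rep-b" passes both matchers. The series without a replica label pass the != matcher but fail the regex matcher, because an empty value does not match rep.*.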
Range vector selectors
Range vector literals work like instant vector literals, except that they select a range of samples back from the current instant. Syntactically, a time duration is appended in square brackets ([]) at the end of a vector selector to specify how far back in time values should be fetched for each resulting range vector element. The range is a closed interval, meaning that samples with timestamps coinciding with either boundary of the range are still included in the selection.
In this example, we select all the values we have recorded within the last five minutes for all time series that have the metric name http_requests_total and a method label set to get:
http_requests_total{method="get"}[5m]
Time durations
Time durations are specified as a number followed immediately by one of these units:
- ms: Milliseconds
- s: Seconds
- m: Minutes
- h: Hours
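As an illustration of the duration format, the following Python sketch converts a duration string to seconds. The function name is hypothetical, and the sketch assumes a whole number followed by exactly one of the four units listed above.

```python
import re

# Milliseconds per unit, for the four units listed above.
UNIT_MILLISECONDS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000}


def duration_seconds(text: str) -> float:
    """Convert a duration such as '5m' or '500ms' to seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", text)
    if match is None:
        raise ValueError(f"not a valid duration: {text!r}")
    number, unit = match.groups()
    return int(number) * UNIT_MILLISECONDS[unit] / 1000
```

With this helper, "5m" becomes 300.0 seconds and "24h" becomes 86400.0 seconds.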
Offset modifier
The offset modifier allows changing the time offset for individual instant and range vectors in a query.
For example, the following expression returns the value of http_requests_total 5 minutes in the past relative to the current query evaluation time:
http_requests_total offset 5m
Note that the offset modifier always needs to follow the selector immediately. Therefore, the following would be correct:
sum(http_requests_total{method="GET"} offset 5m) // GOOD.
But the following would be incorrect:
sum(http_requests_total{method="GET"}) offset 5m // INVALID.
The same works for range vectors. This returns the 5-minute rate that http_requests_total had 24 hours ago:
rate(http_requests_total[5m] offset 24h)
Supported operators
The SolarWinds Observability SaaS custom PromQL supports a subset of standard PromQL functions, along with some additional functions unique to our implementation.
Aggregation operators
SolarWinds Observability SaaS PromQL supports the following built-in aggregation operators that can be used to aggregate the elements of a single instant vector, resulting in a new vector of fewer elements with aggregated values:
- sum (calculate sum over dimensions)
- min (select minimum over dimensions)
- max (select maximum over dimensions)
- avg (calculate the average over dimensions)
- count (count number of elements in the vector)
- topk (largest k elements by sample value)
- bottomk (smallest k elements by sample value)
You can use these operators to aggregate over all label dimensions, or you can preserve distinct dimensions by including a without or by clause. These clauses can be used before or after the expression.
<aggr-op> [without|by (<label list>)] ([parameter,] <vector expression>)
or
<aggr-op>([parameter,] <vector expression>) [without|by (<label list>)]
- label list is a list of unquoted attributes that can include a trailing comma, meaning that both (label1, label2) and (label1, label2,) are valid syntax.
- without removes the listed labels from the result vector, while all other labels are preserved in the output.
- by does the opposite and drops labels that are not listed in the by clause, even if their label values are identical between all elements of the vector.
- parameter is required only for topk or bottomk.
- topk and bottomk are different from other aggregators in that a subset of the input samples, including the original labels, are returned in the result vector. by and without are used only to bucket the input vector.
Example:
If the metric http_requests_total had time series that fanned out by application, instance, and group labels, we could calculate the total number of seen HTTP requests per application and group over all instances using:
sum without (instance) (http_requests_total)
Which is equivalent to:
sum by (application, group) (http_requests_total)
If we are only interested in the total number of HTTP requests we have seen in all applications, we could write:
sum(http_requests_total)
To get the five largest HTTP request counts across all instances, we could write:
topk(5, http_requests_total)
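The equivalence of the without and by forms, and the behavior of topk, can be illustrated with a small Python model. The helper names and the sample series are hypothetical; this is a sketch of the semantics described above, not of the query engine.

```python
from collections import defaultdict


def sum_by(series, keep):
    """sum by (...): group on the listed labels and add up the sample values."""
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((n, v) for n, v in labels.items() if n in keep))
        totals[key] += value
    return dict(totals)


def sum_without(series, drop):
    """sum without (...): group on every label except the listed ones."""
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((n, v) for n, v in labels.items() if n not in drop))
        totals[key] += value
    return dict(totals)


def topk(series, k):
    """topk: return the k input samples with the largest values, labels intact."""
    return sorted(series, key=lambda item: item[1], reverse=True)[:k]


# Made-up http_requests_total samples fanned out by application, group, and instance.
series = [
    ({"application": "web", "group": "canary", "instance": "0"}, 10.0),
    ({"application": "web", "group": "canary", "instance": "1"}, 5.0),
    ({"application": "api", "group": "production", "instance": "0"}, 7.0),
]

per_app_and_group = sum_by(series, {"application", "group"})
same_result = sum_without(series, {"instance"})
```

Both calls produce the same two groups, because instance is the only label that is not in the by list. Unlike the sums, topk returns original samples with all of their labels.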
Arithmetic binary operators
The following binary arithmetic operators are available:
- + (addition)
- - (subtraction)
- * (multiplication)
- / (division)
- % (modulo)
- ^ (power/exponentiation)
Binary arithmetic operators are defined between scalar/scalar, vector/scalar, and vector/vector value pairs.
- Between two scalars, they evaluate to another scalar that is the result of the operator applied to both scalar operands.
- Between an instant vector and a scalar, the operator is applied to the value of every data sample in the vector. For example, if a time series instant vector is multiplied by 2, the result is another vector in which every sample value of the original vector is multiplied by 2. The metric name is dropped.
- Between two instant vectors, a binary arithmetic operator is applied to each entry in the left-hand side vector and its matching element in the right-hand vector. The result is propagated into the result vector with the grouping labels becoming the output label set. Entries for which no matching entry in the right-hand vector can be found are not part of the result.
Comparison binary operators
The following binary comparison operators are available:
- == (equal)
- != (not-equal)
- > (greater-than)
- < (less-than)
- >= (greater-or-equal)
- <= (less-or-equal)
Comparison operators are defined between vector/scalar value pairs.
Between an instant vector and a scalar, these operators are applied to the value of every data sample in the vector, and vector elements between which the comparison result is false get dropped from the result vector.
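The vector/scalar behavior of both arithmetic and comparison operators can be sketched in Python as follows. The instant vector is modeled as a dictionary keyed by label set; the helper names, labels, and values are made up for illustration.

```python
# An instant vector modeled as {label_set: value}; the metric name is already dropped.
vector = {
    (("method", "get"),): 120.0,
    (("method", "post"),): 30.0,
    (("method", "put"),): 75.0,
}


def multiply_by_scalar(vec, factor):
    """vector * scalar: apply the operator to every sample value."""
    return {labels: value * factor for labels, value in vec.items()}


def keep_greater_than(vec, threshold):
    """vector > scalar: drop every element for which the comparison is false."""
    return {labels: value for labels, value in vec.items() if value > threshold}


doubled = multiply_by_scalar(vector, 2)
busy = keep_greater_than(vector, 50)
```

The arithmetic operator keeps every element and changes its value, whereas the comparison operator keeps the values and acts as a filter on the elements.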
Binary operator precedence
The following list shows the precedence of binary operators, from highest to lowest.
- ^
- *, /, %
- +, -
- ==, !=, <=, <, >=, >
Operators on the same precedence level are left-associative. For example, 2 * 3 % 2 is equivalent to (2 * 3) % 2. However, ^ is right-associative, so 2 ^ 3 ^ 2 is equivalent to 2 ^ (3 ^ 2).
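Python happens to follow the same two associativity rules, with ** in place of ^, so the two examples can be verified directly. This is only an arithmetic illustration, not PromQL.

```python
# * and % share one precedence level and are evaluated left to right.
left_assoc = 2 * 3 % 2       # (2 * 3) % 2
alternative = 2 * (3 % 2)    # what a right-to-left reading would give

# ** (written ^ in PromQL) is right-associative.
right_assoc = 2 ** 3 ** 2    # 2 ** (3 ** 2)
other_grouping = (2 ** 3) ** 2
```

The groupings give different results (0 versus 2, and 512 versus 64), which is why associativity matters when parentheses are omitted.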
Vector matching
Operations between vectors attempt to find a matching element in the right-hand side vector for each entry in the left-hand side. There are two basic types of matching behavior: one-to-one and many-to-one/one-to-many.
Vector matching keywords
The following vector matching keyword allows matching between series that have different label sets:
on
Label lists provided to matching keywords determine how vectors are combined.
Group modifiers
The following group modifier enables many-to-one/one-to-many vector matching:
group_left
Label lists can be provided to the group modifier that contain labels from the "one"-side to be included in the result metrics.
One-to-one vector matches
One-to-one finds a unique pair of entries from each side of the operation. In the default case, that is an operation following the format vector1 <operator> vector2. Two entries match if they have the exact same set of labels and corresponding values.
Many-to-one and one-to-many vector matches
Many-to-one and one-to-many matchings refer to the case where each vector element on the "one"-side can match with multiple elements on the "many"-side. This must be explicitly requested using the group_left modifier, where the left vector has the higher cardinality.
<vector expr> <bin-op> on(<label list>) group_left <vector expr>
Example query:
method_code:http_errors:rate5m / on(method) group_left method:http_requests:rate5m
In this case the left vector contains more than one entry per method label value. Therefore, we indicate this using group_left. The elements from the right side are now matched with multiple elements with the same method label on the left:
{method="get", code="500"} 0.04 // 24 / 600
{method="get", code="404"} 0.05 // 30 / 600
{method="post", code="500"} 0.05 // 6 / 120
{method="post", code="404"} 0.175 // 21 / 120
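The many-to-one division above can be reproduced with a short Python model. The input values are the ones implied by the comments in the example output (24 / 600 and so on); the function name is hypothetical, and the sketch covers only this on(method) group_left case.

```python
# Left ("many") side: one error-rate entry per (method, code) pair.
errors = {
    ("get", "500"): 24,
    ("get", "404"): 30,
    ("post", "500"): 6,
    ("post", "404"): 21,
}

# Right ("one") side: one request-rate entry per method.
requests = {"get": 600, "post": 120}


def divide_on_method_group_left(many, one):
    """Divide each left-hand entry by the right-hand entry with the same method."""
    result = {}
    for (method, code), value in many.items():
        if method in one:  # entries without a match are not part of the result
            result[(method, code)] = value / one[method]
    return result


ratios = divide_on_method_group_left(errors, requests)
```

Each right-hand entry is reused for every left-hand entry that has the same method, and the result keeps the label set of the left side.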
Supported functions
SolarWinds Observability SaaS PromQL supports the following functions.
abs()
Function
abs(v instant-vector)
Definition
Returns the input vector with all sample values converted to their absolute values.
ceil()
Function
ceil(v instant-vector)
Definition
Rounds the sample values of all elements in v up to the nearest integer value greater than or equal to v.
Examples
ceil(1.49) = 2.0
ceil(1.78) = 2.0
fill()
This function is unique to the SolarWinds Observability SaaS implementation.
Function
fill(value scalar, v instant-vector)
Definition
Given a sparse metric stream, fill() fills in any missing data points with a value that you provide.
Parameters
value: (Required) Must be a number.
Example
Use the fill() function to ensure a 0 value is rendered if no data is received.
fill(0, json.errors.count)
All streams now display a 0 value if no measurements are received.
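The effect of fill() on a sparse stream can be sketched in Python. This is an illustration of the idea only: the function signature, the dictionary representation of a stream, and the sample timestamps are assumptions, not the product's implementation.

```python
def fill(value, points, timestamps):
    """Return one sample per timestamp, using `value` where no point was recorded."""
    return [(ts, points.get(ts, value)) for ts in timestamps]


# A sparse stream: measurements exist only at 0 s and 120 s.
recorded = {0: 3, 120: 1}
expected_timestamps = [0, 60, 120, 180]

filled = fill(0, recorded, expected_timestamps)
```

Here the intervals at 60 s and 180 s, which received no measurement, are rendered as 0, while the recorded values are left unchanged.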
floor()
Function
floor(v instant-vector)
Definition
Rounds the sample values of all elements in v down to the nearest integer value smaller than or equal to v.
Examples
floor(1.49) = 1.0
floor(1.78) = 1.0
increase()
Function
increase(v range-vector)
Definition
Calculates the increase in the time series in the range vector. Breaks in monotonicity (such as counter resets due to target restarts) are automatically adjusted for.
increase should only be used with counters. It is syntactic sugar for rate(v) multiplied by the number of seconds under the specified time range window, and should be used primarily for human readability.
Example
The following example expression returns the number of HTTP requests as measured over the last five minutes, per time series in the range vector.
increase(http_requests_total{job="api-server"}[5m])
integrate()
This function is unique to the SolarWinds Observability SaaS implementation.
Function
integrate(v instant-vector)
Definition
Performs a numerical integration on each series to return a set of equal length. This is equivalent to computing the cumulative sum over the series, where each point in the returned series is the sum of the current point and the accumulated sum of all previous points in the series.
Example
The following example converts a gauge metric to a cumulative sum for the displayed interval.
integrate(AWS.EC2.CPUCreditUsage)
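The cumulative sum described in the definition can be expressed in Python with itertools.accumulate. The sample values are made up, and the sketch operates on a plain list of values rather than on a real series.

```python
from itertools import accumulate


def integrate(values):
    """Return the running total: each point is the sum of itself and all earlier points."""
    return list(accumulate(values))


cumulative = integrate([1, 2, 0, 4])
```

The result has the same length as the input, and its last point equals the sum of the whole series.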
last_fill()
This function is unique to the SolarWinds Observability SaaS implementation.
Definition
Given a sparse metric stream, last_fill() fills in any missing data points with the value of the last data point in the given time interval.
Example
Use the last_fill() function to ensure that the last recorded value within the viewable graph is repeated for the remaining intervals.
last_fill(json.errors.count)
rate()
Function
rate(v range-vector)
Definition
Calculates the per-second average rate of increase of the time series in the range vector. Breaks in monotonicity (such as counter resets due to target restarts) are automatically adjusted for.
rate should only be used with counters. It is best suited for alerting, and for graphing of slow-moving counters.
When you combine rate() with an aggregation operator (for example, sum()), always take a rate() first, then aggregate. Otherwise rate() cannot detect counter resets when your target restarts.
Example
The following example expression returns the per-second rate of HTTP requests as measured over the last five minutes, per time series in the range vector.
rate(http_requests_total{job="api-server"}[5m])
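The counter-reset handling shared by rate() and increase() can be illustrated with a simplified Python model. It is a sketch under assumptions: the function names and samples are hypothetical, any drop in value is treated as a reset to zero, and the boundary extrapolation that Prometheus-style engines apply is not modeled.

```python
def counter_increase(samples):
    """Total increase over (timestamp_seconds, value) samples, adjusting for resets."""
    total = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            total += current  # counter restarted from zero
    return total


def simple_rate(samples):
    """Per-second average rate of increase between the first and last sample."""
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed


# A counter scraped every 60 s for 5 minutes, with one reset after 120 s.
samples = [(0, 100), (60, 160), (120, 20), (180, 80), (240, 140), (300, 200)]

increase_5m = counter_increase(samples)
rate_5m = simple_rate(samples)
```

In this model the raw difference between the last and first sample would be 100, but the reset-adjusted increase is 260, and multiplying the rate by the 300-second window gives back the increase.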
round()
Function
round(v instant-vector)
Definition
Rounds the sample values of all elements in v to the nearest integer. Ties are resolved by rounding up.
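Rounding ties up differs from the default in some languages. Python's built-in round(), for example, rounds ties to the nearest even number. The following sketch, with a hypothetical helper name, shows the tie-breaking rule described above.

```python
import math


def round_half_up(value: float) -> float:
    """Round to the nearest integer; a tie (x.5) goes to the larger integer."""
    return float(math.floor(value + 0.5))


tie_up = round_half_up(2.5)
negative_tie = round_half_up(-2.5)
python_default = round(2.5)  # banker's rounding, for comparison
```

With this rule 2.5 becomes 3.0 and -2.5 becomes -2.0, whereas Python's round(2.5) returns 2.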
aggregation_over_time()
The following functions allow aggregating each series of a given range vector over time and return an instant vector with per-series aggregation results:
- avg_over_time(range-vector): The average value of all points in the specified interval.
- min_over_time(range-vector): The minimum value of all points in the specified interval.
- max_over_time(range-vector): The maximum value of all points in the specified interval.
- sum_over_time(range-vector): The sum of all values in the specified interval.
- count_over_time(range-vector): The count of all values in the specified interval.
- last_over_time(range-vector): The most recent point value in the specified interval.
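The six aggregations can be illustrated for a single series in Python. The helper name and the sample points are made up; each point is a (timestamp, value) pair taken from one range-vector element.

```python
def aggregate_over_time(points):
    """Compute the *_over_time results for one series of (timestamp, value) points."""
    values = [value for _, value in points]
    latest = max(points, key=lambda point: point[0])  # point with the newest timestamp
    return {
        "avg_over_time": sum(values) / len(values),
        "min_over_time": min(values),
        "max_over_time": max(values),
        "sum_over_time": sum(values),
        "count_over_time": len(values),
        "last_over_time": latest[1],
    }


points = [(0, 2.0), (60, 4.0), (120, 1.0), (180, 5.0)]
result = aggregate_over_time(points)
```

Each function reduces the range of points for a series to a single value, which is why the result of these functions is an instant vector.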
Subquery
Subquery allows you to run an instant query for a given range and resolution. The result of a subquery is a range vector.
Syntax
<instant_query> '[' <range> ':' [<resolution>] ']'
<resolution> is optional. Default is the bucket size.
Example
min_over_time(rate(http_requests_total[5m])[30m:1m])
Breakdown:
- rate(http_requests_total[5m])[30m:1m] is the subquery, where rate(http_requests_total[5m]) is the query to be executed.
- rate(http_requests_total[5m]) is executed from start=<now>-30m to end=<now>, at a resolution of 1m.
- Finally, the results of all the evaluations above are passed to min_over_time().
Variable substitution
The SolarWinds implementation of PromQL supports built-in template variables that can be used in queries to simplify and enhance their flexibility. These variables include:
- $timerange: The displayed interval (for example, 1h).
- $timerange_s: The same as $timerange, but in seconds.
- $bucket_interval: The interval used for bucketing in the query (for example, when the time range is 1 hour, the bucket interval might be 30 seconds; when the time range is 1 week, the bucket interval might be 1 hour).
- $bucket_interval_s: The same as $bucket_interval, but in seconds.
- $rate_bucket_interval: Used in rate/interval functions and is four times larger than the standard scrape interval (ensuring the correct rate calculation). For example: rate(http_requests_total[$rate_bucket_interval])
These variables help in writing more dynamic and adaptable queries. Below are some example usages:
sum_over_time(trace.service.requests[$bucket_interval]) / $bucket_interval_s
rate(k8s.container_cpu_usage_seconds_total[$rate_bucket_interval])
When a query with such parameters is used in dashboards, users can adjust time ranges of the whole dashboard without the need to adjust metric definitions themselves.
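To make the substitution concrete, the following Python sketch expands the variables with string.Template, which uses the same $name syntax. The variable values are hypothetical examples for a 1-hour time range, and the actual substitution is performed by the product, not by code like this.

```python
from string import Template

# Hypothetical values for a 1-hour time range; the real values are chosen by the system.
variables = {
    "timerange": "1h",
    "timerange_s": "3600",
    "bucket_interval": "30s",
    "bucket_interval_s": "30",
    "rate_bucket_interval": "2m",
}


def expand(query: str, values: dict) -> str:
    """Replace every $variable in the query with its current value."""
    return Template(query).substitute(values)


per_second = expand(
    "sum_over_time(trace.service.requests[$bucket_interval]) / $bucket_interval_s",
    variables,
)
cpu_rate = expand(
    "rate(k8s.container_cpu_usage_seconds_total[$rate_bucket_interval])",
    variables,
)
```

With these values the first query expands to sum_over_time(trace.service.requests[30s]) / 30. Changing the time range only changes the values in the dictionary, not the query text.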
Entity Association
In PromQL, there is a special attribute entity.ids that you can use to filter or group time series by specific entity IDs. entity.ids is a comma-separated, ordered list of entity IDs that are associated with individual time series. This feature is crucial for associating metrics with specific entities in your system. Below are the key points to understand and use this feature:
- Filtering by entity ID: You can filter time series to include only those associated with a specific entity ID using entity.ids.
Example: metric_name{entity.ids=~"e-123456"} filters all time series to include only those associated with the entity having ID e-123456.
- Aggregating by entity ID: When you need to aggregate time series data by entity, use entity.ids in your aggregation functions.
Example: sum by (entity.ids) (metric_name) aggregates the time series by the associated entity IDs.
- Important note on aggregation: When you are performing an aggregation on time series data (using functions like sum, avg, max, and so on), and you intend to use the composite metric in alerts on entities, it is mandatory to include entity.ids in the aggregation. This ensures that the individual time series can still be correctly associated with their respective entities.
Example: If you use sum by (attr1, attr2, entity.ids) (metric_name) in a composite metric intended for alerts, the entities can be accurately linked to the individual time series.
Failure to include entity.ids in such aggregations can result in a loss of association between the time series and their corresponding entities, which can lead to incorrect or incomplete alerting behavior.