Documentation for SolarWinds Observability SaaS

Composite metrics

What are composite metrics?

Composite metrics enable you to define new, higher-level metrics by applying a set of mathematical transformations and operations to existing native metrics. This allows for the composition of complex queries and the creation of metrics that provide more insightful and actionable data.

Composite metrics use the flexibility and expressiveness of PromQL to perform these transformations. PromQL is a widely adopted query language in the monitoring and observability community. This approach provides a more standardized and powerful way to analyze and visualize your metric data.

With composite metrics, you can:

  • Aggregate data across different dimensions and time intervals.

  • Perform arithmetic operations to combine metrics.

  • Apply functions such as averages, sums, rates, and more.

  • Filter and group metrics based on specific labels and criteria.

The name of a composite metric must be prefixed by composite.

Create a composite metric

  1. Press Create Metric in the Metrics Explorer to create a composite metric.

  2. Enter a name for the new metric and fill in additional details, such as Display Name, Description, and metric Unit as desired.

  3. Click the Formula tab and enter a PromQL definition in the field provided. To insert an existing metric into the PromQL definition, click Insert Metric and select the metric name. See SolarWinds composite metric PromQL syntax.

  4. Click Preview Definition to preview the data for the current definition of the composite metric and verify that the expected data appears.

  5. Click Create.

SolarWinds composite metric PromQL syntax

Composite metrics are created using the PromQL syntax, with further customizations and rules unique to SolarWinds Observability. To understand the basic syntax and structure of a PromQL query, review the PromQL Querying Prometheus documentation.

The SolarWinds implementation of PromQL includes several customizations to better align with our system's requirements. These customizations include support for additional functions, a different naming convention, and variable substitution. Below are some key differences and extensions.

SolarWinds PromQL naming convention

Our system follows the OpenTelemetry (OTel) naming conventions for both metric names and attributes, instead of the Prometheus restrictions. This section explains the differences and rules applied in our implementation.

SolarWinds PromQL metric names

Metric names in our system are backward compatible with PromQL metric naming, and can also contain additional characters. Metric names must match the following regex: [a-zA-Z_:][a-zA-Z0-9_:.]* and may contain escaped characters using a backslash (\).

Example metric names: service.requests.total, k8s.container_cpu_usage_seconds_total, or system/network/throughput
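
As a rough illustration, the following Python sketch checks candidate names against the regex quoted above. The helper name is_plain_metric_name is hypothetical, and the backslash-escaping rule is deliberately not modeled, so a name containing a slash, such as system/network/throughput, is reported as not matching the bare pattern.

```python
import re

# Pattern copied from the text above. Backslash escapes are NOT modeled here.
METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:.]*")


def is_plain_metric_name(name: str) -> bool:
    """Return True if the whole name matches the documented regex as written."""
    return METRIC_NAME.fullmatch(name) is not None


for candidate in (
    "service.requests.total",
    "k8s.container_cpu_usage_seconds_total",
    "system/network/throughput",
):
    print(candidate, is_plain_metric_name(candidate))
```

A name that fails this check may still be usable if its extra characters are escaped with a backslash, as described above.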

SolarWinds PromQL compatibility with Prometheus-style metric suffixes

SolarWinds PromQL introduces a compatibility feature to support Prometheus-style metric naming conventions often used with summary and histogram metrics. These metrics are commonly suffixed with _sum and _count to represent the sum and count of observations. With this addition:

  • When a query references a metric ending in _sum or _count, the system first checks whether the specified metric actually exists within the given time interval.
  • If it does exist, the system queries and returns data for that metric as usual.
  • If it does not exist, the system treats the provided metric name (minus the _sum or _count suffix) as the base metric. It then applies the appropriate aggregations to generate the requested sums or counts from the underlying metric data.

This ensures improved compatibility for PromQL queries imported from Prometheus. Users who rely on Prometheus naming conventions for summary or histogram data can continue using familiar suffixes, while the system gracefully handles the underlying data to produce equivalent results.

Example: If http_request_duration_seconds_sum and http_request_duration_seconds_count are queried, and these exact metrics don't exist, SolarWinds PromQL will check for http_request_duration_seconds as the base metric and aggregate the data accordingly.
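
The lookup order described above can be sketched in Python as follows. This is only an illustration of the decision logic, not the platform's code: resolve_suffixed_metric and existing_metrics are hypothetical names, and the set stands in for whatever check the system performs over the query's time interval.

```python
def resolve_suffixed_metric(name, existing_metrics):
    """Return (metric_to_query, derived_aggregation).

    derived_aggregation is None when the name is queried as-is.
    """
    # Step 1: an exact match always wins.
    if name in existing_metrics:
        return name, None
    # Step 2: otherwise strip a known suffix and derive the value
    # from the base metric with the matching aggregation.
    for suffix, aggregation in (("_sum", "sum"), ("_count", "count")):
        if name.endswith(suffix):
            return name[: -len(suffix)], aggregation
    # Names without a recognized suffix are left untouched.
    return name, None


# Hypothetical catalog: only the base metric exists.
catalog = {"http_request_duration_seconds"}
print(resolve_suffixed_metric("http_request_duration_seconds_sum", catalog))
print(resolve_suffixed_metric("http_request_duration_seconds_count", catalog))
```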

SolarWinds PromQL attribute names

Attributes (also known as labels in Prometheus) follow OTel naming conventions, which provide a consistent and descriptive approach. Attribute names can include alphanumeric characters, underscores (_), and dots (.), and may contain escaped characters using a backslash (\). Attribute names must match the following regex: [a-zA-Z_.][a-zA-Z0-9_:.]*.

Example attributes: service.name or container.id

Basics

Expression language data types

In the SolarWinds Observability SaaS PromQL expression language, an expression or sub-expression can evaluate to one of three types:

  • Instant vector: A set of time series containing a single sample for each time series, all sharing the same timestamp

  • Range vector: A set of time series containing a range of data points over time for each time series

  • Scalar: A simple numeric floating point value

Literals

String literals

String literals are designated by single quotes or double quotes.

Float literals

Scalar float values can be written as literal integers or floating-point numbers.

Time series selectors

Time series selectors are responsible for selecting the time series and their raw or inferred sample timestamps and values.

Do not confuse time series selectors with the higher level concept of instant and range queries that can execute the time series selectors. A higher level instant query would evaluate the given selector at one point in time. However, the range query would evaluate the selector at multiple different times in between a minimum and maximum timestamp at regular steps.

Instant vector selectors

Instant vector selectors allow the selection of a set of time series and a single sample value for each at a given timestamp (point in time). In the simplest form, only a metric name is specified, which results in an instant vector containing elements for all time series that have this metric name.

This example selects all time series that have the http_requests_total metric name:

http_requests_total

You can filter these time series further by appending a comma-separated list of label matchers in curly braces ({}).

This example selects only those time series with the http_requests_total metric name that also have the job label set to prometheus and their group label set to canary:

http_requests_total{job="prometheus",group="canary"}

You can also negatively match a label value, or match label values against regular expressions. The following label matching operators are available:

  • =: Select labels that are exactly equal to the provided string.

  • !=: Select labels that are not equal to the provided string.

  • =~: Select labels that regex-match the provided string.

  • !~: Select labels that do not regex-match the provided string.

For example, this selects all http_requests_total time series for staging, testing, and development environments and HTTP methods other than GET.

http_requests_total{environment=~"staging|testing|development",method!="GET"}

Label matchers that match empty label values also select all time series that do not have the specific label set at all.

For example, given the dataset:

http_requests_total
http_requests_total{replica="rep-a"}
http_requests_total{replica="rep-b"}
http_requests_total{environment="development"}

Multiple matchers can be used for the same label name. They all must pass for a result to be returned.

The query:

http_requests_total{replica!="rep-a",replica=~"rep.*"}

Would then match:

http_requests_total{replica="rep-b"}

All regular expressions use RE2 syntax.
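
To make the matcher rules concrete, here is a small Python sketch that evaluates the four operators against the dataset above. It rests on two assumptions that the text does not state: regex matchers are treated as fully anchored (as they are in Prometheus), and Python's re module stands in for RE2, which is adequate for simple patterns like this one. The function name series_matches is hypothetical.

```python
import re


def series_matches(labels, matchers):
    """Return True only if every (name, op, pattern) matcher passes."""
    for name, op, pattern in matchers:
        value = labels.get(name, "")  # a missing label behaves like an empty value
        if op == "=":
            passed = value == pattern
        elif op == "!=":
            passed = value != pattern
        elif op == "=~":
            passed = re.fullmatch(pattern, value) is not None
        elif op == "!~":
            passed = re.fullmatch(pattern, value) is None
        else:
            raise ValueError(f"unknown matcher operator: {op}")
        if not passed:
            return False
    return True


# The dataset from the text, as label sets for http_requests_total.
DATASET = [
    {},
    {"replica": "rep-a"},
    {"replica": "rep-b"},
    {"environment": "development"},
]

# http_requests_total{replica!="rep-a",replica=~"rep.*"}
QUERY = [("replica", "!=", "rep-a"), ("replica", "=~", "rep.*")]
selected = [labels for labels in DATASET if series_matches(labels, QUERY)]
print(selected)
```

Only the rep-b series passes both matchers. The series without a replica label pass the != matcher but fail the regex, because an empty value does not match rep.*.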

Range vector selectors

Range vector selectors work like instant vector selectors, except that they select a range of samples back from the current instant. Syntactically, a time duration is appended in square brackets ([]) at the end of a vector selector to specify how far back in time values should be fetched for each resulting range vector element. The range is a closed interval, meaning that samples with timestamps coinciding with either boundary of the range are still included in the selection.

In this example, we select all the values we have recorded within the last five minutes for all time series that have the metric name http_requests_total and a method label set to get:

http_requests_total{method="get"}[5m]

Time durations

Time durations are specified as a number followed immediately by one of these units:

  • ms: Milliseconds
  • s: Seconds
  • m: Minutes
  • h: Hours
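
As an illustration of the duration format, the following Python sketch converts a duration string into milliseconds. It assumes whole-number values and a single unit per duration, which is the only form shown in this document. The name duration_ms is hypothetical.

```python
import re

# Milliseconds per unit, for the four units listed above.
_UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000}
_DURATION = re.compile(r"(\d+)(ms|s|m|h)")


def duration_ms(text: str) -> int:
    """Convert a duration such as '5m' or '500ms' to milliseconds."""
    match = _DURATION.fullmatch(text)
    if match is None:
        raise ValueError(f"not a supported duration: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_MS[unit]


print(duration_ms("5m"), duration_ms("24h"))
```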

Offset modifier

The offset modifier allows changing the time offset for individual instant and range vectors in a query.

For example, the following expression returns the value of http_requests_total 5 minutes in the past relative to the current query evaluation time:

http_requests_total offset 5m

Note that the offset modifier always needs to follow the selector immediately. Therefore, the following would be correct:

sum(http_requests_total{method="GET"} offset 5m) // GOOD.

But the following would be incorrect:

sum(http_requests_total{method="GET"}) offset 5m // INVALID.

The same works for range vectors. This returns the 5-minute rate that http_requests_total had 24 hours ago:

rate(http_requests_total[5m] offset 24h)

Supported operators

The SolarWinds Observability SaaS custom PromQL supports a subset of standard PromQL functions, along with some additional functions unique to our implementation.

Aggregation operators

SolarWinds Observability SaaS PromQL supports the following built-in aggregation operators that can be used to aggregate the elements of a single instant vector, resulting in a new vector of fewer elements with aggregated values:

  • sum (calculate sum over dimensions)
  • min (select minimum over dimensions)
  • max (select maximum over dimensions)
  • avg (calculate the average over dimensions)
  • count (count number of elements in the vector)
  • topk (largest k elements by sample value)
  • bottomk (smallest k elements by sample value)

You can use these operators to aggregate over all label dimensions, or you can preserve distinct dimensions by including a without or by clause. These clauses can be used before or after the expression.

<aggr-op> [without|by (<label list>)] ([parameter,] <vector expression>)

or

<aggr-op>([parameter,] <vector expression>) [without|by (<label list>)]

  • label list is a list of unquoted attributes. A trailing comma is allowed, so both (label1, label2) and (label1, label2,) are valid syntax.

  • without removes the listed labels from the result vector, while all other labels are preserved in the output.

  • by does the opposite and drops labels that are not listed in the by clause, even if their label values are identical between all elements of the vector.

  • parameter is required only for topk or bottomk.

  • topk and bottomk differ from the other aggregators in that a subset of the input samples, including the original labels, is returned in the result vector. by and without are used only to bucket the input vector.

Example:

If the metric http_requests_total had time series that fanned out by application, instance, and group labels, we could calculate the total number of seen HTTP requests per application and group over all instances using:

sum without (instance) (http_requests_total)

Which is equivalent to:

sum by (application, group) (http_requests_total)

If we are only interested in the total number of HTTP requests we have seen in all applications, we could write:

sum(http_requests_total)

To get the five largest HTTP request counts across all instances, we could write:

topk(5, http_requests_total)
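
The relationship between by and without can be sketched in Python over a handful of made-up samples. The series, label values, and helper names below are hypothetical. The point is only that dropping instance and keeping application and group produce the same groups when those three are the only labels present.

```python
from collections import defaultdict


def _group_key(labels, keep):
    """Build a hashable, order-independent key from the labels to keep."""
    return tuple(sorted((name, labels[name]) for name in keep if name in labels))


def sum_by(samples, by):
    """Sum values, keeping only the labels listed in `by`."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[_group_key(labels, by)] += value
    return dict(totals)


def sum_without(samples, without):
    """Sum values, dropping the labels listed in `without`."""
    totals = defaultdict(float)
    for labels, value in samples:
        keep = [name for name in labels if name not in without]
        totals[_group_key(labels, keep)] += value
    return dict(totals)


def topk(k, samples):
    """Return the k samples with the largest values, labels intact."""
    return sorted(samples, key=lambda sample: sample[1], reverse=True)[:k]


# Hypothetical series for http_requests_total.
SAMPLES = [
    ({"application": "web", "group": "canary", "instance": "0"}, 10),
    ({"application": "web", "group": "canary", "instance": "1"}, 20),
    ({"application": "api", "group": "prod", "instance": "0"}, 5),
]

print(sum_without(SAMPLES, ["instance"]))
print(sum_by(SAMPLES, ["application", "group"]))
print(topk(2, SAMPLES))
```

Note that topk returns the original samples with all labels, whereas the sum helpers return one value per group.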

Arithmetic binary operators

The following binary arithmetic operators are available:

  • + (addition)
  • - (subtraction)
  • * (multiplication)
  • / (division)
  • % (modulo)
  • ^ (power/exponentiation)

Binary arithmetic operators are defined between scalar/scalar, vector/scalar, and vector/vector value pairs.

  • Between two scalars, they evaluate to another scalar that is the result of the operator applied to both scalar operands.

  • Between an instant vector and a scalar, the operator is applied to the value of every data sample in the vector. For example, if a time series instant vector is multiplied by 2, the result is another vector in which every sample value of the original vector is multiplied by 2. The metric name is dropped.

  • Between two instant vectors, a binary arithmetic operator is applied to each entry in the left-hand side vector and its matching element in the right-hand vector. The result is propagated into the result vector with the grouping labels becoming the output label set. Entries for which no matching entry in the right-hand vector can be found are not part of the result.

Comparison binary operators

The following binary comparison operators are available:

  • == (equal)
  • != (not-equal)
  • > (greater-than)
  • < (less-than)
  • >= (greater-or-equal)
  • <= (less-or-equal)

Comparison operators are defined between vector/scalar value pairs.

Between an instant vector and a scalar, these operators are applied to the value of every data sample in the vector, and vector elements for which the comparison result is false are dropped from the result vector.
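
The filtering behavior can be pictured with a few lines of Python. The vector contents and the helper name keep_greater_than are hypothetical. The sketch mirrors an expression of the form http_requests_total > 100.

```python
def keep_greater_than(vector, threshold):
    """Keep only the (labels, value) samples whose value exceeds the scalar."""
    return [(labels, value) for labels, value in vector if value > threshold]


# Hypothetical instant vector for http_requests_total.
vector = [
    ({"instance": "a"}, 250),
    ({"instance": "b"}, 40),
    ({"instance": "c"}, 100),
]

print(keep_greater_than(vector, 100))
```

Samples that fail the comparison are removed rather than replaced by a false value, so the result can contain fewer series than the input.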

Binary operator precedence

The following list shows the precedence of binary operators, from highest to lowest.

  1. ^

  2. *, /, %

  3. +, -

  4. ==, !=, <=, <, >=, >

Operators on the same precedence level are left-associative. For example, 2 * 3 % 2 is equivalent to (2 * 3) % 2. However, ^ is right-associative, so 2 ^ 3 ^ 2 is equivalent to 2 ^ (3 ^ 2).
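
For these two examples, Python happens to group its own operators the same way, which makes the rule easy to check interactively. Python writes exponentiation as ** where PromQL uses ^, and only non-negative operands are used here because modulo on negative numbers can differ between languages.

```python
# Left-associative: evaluated as (2 * 3) % 2.
left_assoc = 2 * 3 % 2

# Right-associative: evaluated as 2 ** (3 ** 2), not (2 ** 3) ** 2.
right_assoc = 2 ** 3 ** 2
other_grouping = (2 ** 3) ** 2

print(left_assoc, right_assoc, other_grouping)
```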

Vector matching

Operations between vectors attempt to find a matching element in the right-hand side vector for each entry in the left-hand side. There are two basic types of matching behavior: one-to-one and many-to-one/one-to-many.

Vector matching keywords

The following vector matching keyword allows matching between series that have different label sets:

  • on

Label lists provided to matching keywords determine how vectors are combined.

Group modifiers

The following group modifier enables many-to-one/one-to-many vector matching:

  • group_left

A label list can be provided to the group modifier. It names labels from the "one" side that should be included in the result metrics.

One-to-one vector matches

One-to-one finds a unique pair of entries from each side of the operation. In the default case, that is an operation following the format vector1 <operator> vector2. Two entries match if they have the exact same set of labels and corresponding values.

Many-to-one and one-to-many vector matches

Many-to-one and one-to-many matchings refer to the case where each vector element on the "one" side can match with multiple elements on the "many" side. This must be explicitly requested using the group_left modifier, where the left vector has the higher cardinality.

<vector expr> <bin-op> on(<label list>) group_left <vector expr>

Example query:

method_code:http_errors:rate5m / on(method) group_left method:http_requests:rate5m

In this case the left vector contains more than one entry per method label value. Therefore, we indicate this using group_left. The elements from the right side are now matched with multiple elements with the same method label on the left:

{method="get", code="500"}  0.04            //  24 / 600
{method="get", code="404"}  0.05            //  30 / 600
{method="post", code="500"} 0.05            //   6 / 120
{method="post", code="404"} 0.175           //  21 / 120
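
The following Python sketch reproduces the result above. The input values (24, 30, 6, and 21 on the left, 600 and 120 on the right) are taken from the arithmetic in the comments. The function divide_group_left and the data layout are hypothetical and only illustrate how each left-hand sample finds its single right-hand partner through the on(method) label.

```python
def divide_group_left(many, one, on):
    """Divide each 'many'-side sample by its matching 'one'-side sample.

    many, one: lists of (labels, value); on: label names used for matching.
    """
    lookup = {}
    for labels, value in one:
        key = tuple(labels.get(name) for name in on)
        if key in lookup:
            raise ValueError("the 'one' side must be unique per matching key")
        lookup[key] = value

    result = []
    for labels, value in many:
        key = tuple(labels.get(name) for name in on)
        if key in lookup:  # samples without a partner are dropped
            result.append((labels, value / lookup[key]))
    return result


# method_code:http_errors:rate5m (left side, several entries per method)
errors = [
    ({"method": "get", "code": "500"}, 24),
    ({"method": "get", "code": "404"}, 30),
    ({"method": "post", "code": "500"}, 6),
    ({"method": "post", "code": "404"}, 21),
]
# method:http_requests:rate5m (right side, one entry per method)
requests = [({"method": "get"}, 600), ({"method": "post"}, 120)]

for labels, ratio in divide_group_left(errors, requests, on=["method"]):
    print(labels, ratio)
```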

Supported functions

SolarWinds Observability SaaS PromQL supports the following functions.

abs()

Function

abs(v instant-vector)

Definition

Returns the input vector with all sample values converted to their absolute values.

ceil()

Function

ceil(v instant-vector)

Definition

Rounds the sample values of all elements in v up to the nearest integer value greater than or equal to v.

Examples

  • ceil(1.49) = 2.0
  • ceil(1.78) = 2.0

fill()

This function is unique to the SolarWinds Observability SaaS implementation.

Function

fill(value scalar, v instant-vector)

Definition

Given a sparse metric stream, fill() will "fill in" any missing data points with a value that you provide.

Parameters

value (Required) Must be a number.

Example

Use the fill() function to ensure a 0 value is rendered if no data is received.

fill(0, json.errors.count)

All streams now display a 0 value if no measurements are received.
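
Conceptually, fill() behaves like the following Python sketch. The list representation, with None marking an interval in which no measurement arrived, is an assumption made for illustration only.

```python
def fill(value, series):
    """Replace missing points (None) with the supplied value."""
    return [value if point is None else point for point in series]


# Hypothetical sparse stream for json.errors.count.
sparse = [3, None, None, 7, None]
print(fill(0, sparse))
```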

floor()

Function

floor(v instant-vector)

Definition

Rounds the sample values of all elements in v down to the nearest integer value smaller than or equal to v.

Examples

  • floor(1.49) = 1.0
  • floor(1.78) = 1.0

increase()

Function

increase(v range-vector)

Definition

Calculates the increase in the time series in the range vector. Breaks in monotonicity (such as counter resets due to target restarts) are automatically adjusted for.

increase should only be used with counters. It is syntactic sugar for rate(v) multiplied by the number of seconds under the specified time range window, and should be used primarily for human readability.

Example

The following example expression returns the number of HTTP requests as measured over the last five minutes, per time series in the range vector.

increase(http_requests_total{job="api-server"}[5m])
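
The relationship between increase and rate, and the handling of counter resets, can be sketched in Python. This is a simplified model rather than the platform's algorithm: it assumes a reset restarts the counter from zero, and it ignores any extrapolation to the window boundaries that a real implementation may perform. The function names and sample values are hypothetical.

```python
def increase(samples):
    """Total increase of a counter over the window, adjusting for resets."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # Counter reset: assume the counter restarted from zero.
            total += current
    return total


def rate(samples, window_seconds):
    """Per-second average rate of increase over the window."""
    return increase(samples) / window_seconds


# Hypothetical counter readings over a 5-minute window, with one reset.
counter = [100, 130, 160, 10, 40]
window_seconds = 5 * 60

print(increase(counter))
print(rate(counter, window_seconds) * window_seconds)
```

Multiplying the rate by the window length in seconds gives back the increase, which is the sense in which increase is syntactic sugar for rate.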

integrate()

This function is unique to the SolarWinds Observability SaaS implementation.

Function

integrate(v instant-vector)

Definition

Performs a numerical integration on each series to return a set of equal length. This is equivalent to computing the cumulative sum over the series, where each point in the returned series is the sum of the current point and the accumulated sum of all previous points in the series.

Example

The following example converts a gauge metric to a cumulative sum for the displayed interval.

integrate(AWS.EC2.CPUCreditUsage)
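
The cumulative-sum behavior described in the definition can be illustrated in Python with itertools.accumulate. The sample values are made up, and the sketch assumes a series with no missing points.

```python
from itertools import accumulate


def integrate(series):
    """Return the running total of the series, one output point per input point."""
    return list(accumulate(series))


# Hypothetical gauge readings.
usage = [1, 2, 3, 4]
print(integrate(usage))
```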

last_fill()

This function is unique to the SolarWinds Observability SaaS implementation.

Definition

Given a sparse metric stream, last_fill() will "fill in" any missing data points with the value of the last data point in the given time interval.

Example

Use the last_fill() function to ensure that the last recorded value within the viewable graph is repeated for the remaining intervals.

last_fill(json.errors.count)

rate()

Function

rate(v range-vector)

Definition

Calculates the per-second average rate of increase of the time series in the range vector. Breaks in monotonicity (such as counter resets due to target restarts) are automatically adjusted for.

rate should only be used with counters. It is best suited for alerting, and for graphing of slow-moving counters. When you combine rate() with an aggregation operator (for example, sum()), always take a rate() first, then aggregate. Otherwise rate() cannot detect counter resets when your target restarts.

Example

The following example expression returns the per-second rate of HTTP requests as measured over the last five minutes, per time series in the range vector.

rate(http_requests_total{job="api-server"}[5m])

round()

Function

round(v instant-vector)

Definition

Rounds the sample values of all elements in v to the nearest integer. Ties are resolved by rounding up.
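
The tie-breaking rule matters when reproducing results elsewhere. Python's built-in round(), for instance, resolves ties to the nearest even integer instead. A minimal Python sketch of round-half-up follows. It assumes that "up" means toward positive infinity, which the text does not spell out for negative values, and the name round_half_up is hypothetical.

```python
import math


def round_half_up(value):
    """Round to the nearest integer, resolving ties toward positive infinity."""
    return math.floor(value + 0.5)


for sample in (1.49, 1.5, 2.5, -2.5):
    print(sample, round_half_up(sample), round(sample))
```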

<aggregation>_over_time()

The following functions allow aggregating each series of a given range vector over time and return an instant vector with per-series aggregation results:

  • avg_over_time(range-vector): The average value of all points in the specified interval.
  • min_over_time(range-vector): The minimum value of all points in the specified interval.
  • max_over_time(range-vector): The maximum value of all points in the specified interval.
  • sum_over_time(range-vector): The sum of all values in the specified interval.
  • count_over_time(range-vector): The count of all values in the specified interval.
  • last_over_time(range-vector): The most recent point value in the specified interval.
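
Each of these functions reduces the points of one series within the interval to a single value. The Python sketch below maps each function name to an equivalent reduction over a plain list of points. The list and its values are hypothetical.

```python
from statistics import fmean

# One reduction per function, applied to the points of a single series.
OVER_TIME = {
    "avg_over_time": fmean,
    "min_over_time": min,
    "max_over_time": max,
    "sum_over_time": sum,
    "count_over_time": len,
    "last_over_time": lambda points: points[-1],
}

# Hypothetical points of one series, oldest first.
points = [3.0, 1.0, 4.0, 2.0]

for name, reduce_points in OVER_TIME.items():
    print(name, reduce_points(points))
```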

Subquery

Subquery allows you to run an instant query for a given range and resolution. The result of a subquery is a range vector.

Syntax

<instant_query> '[' <range> ':' [<resolution>] ']'

<resolution> is optional. Default is the bucket size.

Example

min_over_time(rate(http_requests_total[5m])[30m:1m])

Breakdown:

  • rate(http_requests_total[5m])[30m:1m] is the subquery, where rate(http_requests_total[5m]) is the query to be executed.

  • rate(http_requests_total[5m]) is executed from start=<now>-30m to end=<now>, at a resolution of 1m.

  • Finally, the results of all the evaluations above are passed to min_over_time().

Variable substitution

The SolarWinds implementation of PromQL supports built-in template variables that can be used in queries to simplify and enhance their flexibility. These variables include:

  • $timerange: The displayed interval (for example, 1h).

  • $timerange_s: The same as $timerange, but in seconds.

  • $bucket_interval: The interval used for bucketing in the query (for example, when the time range is 1 hour, the bucket interval might be 30 seconds; when the time range is 1 week, the bucket interval might be 1 hour).

  • $bucket_interval_s: The same as $bucket_interval, but in seconds.

  • $rate_bucket_interval: Used in rate/interval functions and is four times larger than the standard scrape interval (ensuring the correct rate calculation). For example: rate(http_requests_total[$rate_bucket_interval])

  • $group_attributes: Used in aggregation operators. It is replaced by all grouping attributes selected in the UI during metric values calculation. For example: sum by ($group_attributes) (http_requests_total)

These variables help in writing more dynamic and adaptable queries. Below are some example usages:

sum_over_time(trace.service.requests[$bucket_interval]) / $bucket_interval_s
rate(k8s.container_cpu_usage_seconds_total[$rate_bucket_interval])

When a query with such parameters is used in dashboards, users can adjust time ranges of the whole dashboard without the need to adjust metric definitions themselves.
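
To picture what substitution does to a query, the following Python sketch uses string.Template, whose $name placeholder syntax happens to resemble these variables. The substitution itself is performed by SolarWinds Observability SaaS, not by your code. The values used here are hypothetical and correspond to the 1-hour example above, where the bucket interval might be 30 seconds.

```python
from string import Template

query = Template(
    "sum_over_time(trace.service.requests[$bucket_interval]) / $bucket_interval_s"
)

# Hypothetical values for a 1-hour time range.
variables = {"bucket_interval": "30s", "bucket_interval_s": "30"}

rendered = query.substitute(variables)
print(rendered)
```

With a longer time range the same definition would be rendered with a larger bucket interval, which is why the metric definition itself does not need to change.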

Entity Association

In SolarWinds PromQL, there is a special attribute, entity.ids, that you can use to filter or group time series by specific entity IDs. entity.ids is a comma-separated, ordered list of the entity IDs that are associated with an individual time series. This feature is crucial for associating metrics with specific entities in your system. Below are the key points to understand and use this feature:

  • Filtering by entity ID: You can filter time series to include only those associated with a specific entity ID using entity.ids.

    Example: metric_name{entity.ids=~"e-123456"} filters all time series to include only those associated with the entity having ID e-123456.

  • Aggregating by entity ID: When you need to aggregate time series data by entity, use entity.ids in your aggregation functions.

    Example: sum by (entity.ids) (metric_name) aggregates the time series by the associated entity IDs.

  • Important note on aggregation: When you are performing an aggregation on time series data (using functions like sum, avg, max, and so on), and you intend to use the composite metric in alerts on entities, it is mandatory to include entity.ids in the aggregation. This ensures that the individual time series can still be correctly associated with their respective entities.

    Example: If you use sum by (attr1, attr2, entity.ids) (metric_name) in a composite metric intended for alerts, the entities can be accurately linked to the individual time series.

    Failure to include entity.ids in such aggregations can result in a loss of association between the time series and their corresponding entities, which can lead to incorrect or incomplete alerting behavior.