Do I understand Prometheus's rate vs increase functions correctly?

Prometheus Problem Overview


I have read the Prometheus documentation carefully, but it's still a bit unclear to me, so I am here to confirm my understanding.

(Please note that, to keep the examples as simple as possible, I have used one second for both the scrape interval and the time range, even if that's not possible in practice.)

Suppose we scrape a counter every second and the counter's value is 30 right now. We have the following time series for it:

second   counter_value    increase calculated by hand (call it ICH from now on)
1             1                    1
2             3                    2
3             6                    3
4             7                    1
5            10                    3
6            14                    4
7            17                    3
8            21                    4
9            25                    4
10           30                    5
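
The ICH column is just the difference between consecutive counter values. A minimal Python sketch of that arithmetic follows; the function name `increase_by_hand` is made up for illustration, and it assumes the counter was 0 before the first scrape, which the data itself does not tell us:

```python
# Counter values scraped at seconds 1..10 (from the table above).
counter_values = [1, 3, 6, 7, 10, 14, 17, 21, 25, 30]


def increase_by_hand(values, initial=0):
    """Return the difference between each value and the one before it.

    `initial` is the ASSUMED counter value before the first scrape.
    """
    previous = [initial] + values[:-1]
    return [v - p for v, p in zip(values, previous)]


ich = increase_by_hand(counter_values)
print(ich)
```

Under that assumption the per-second increments add up to the final counter value.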

We want to run some query on this dataset.

1. rate()
Official document states:
"rate(v range-vector) : calculates the per-second average rate of increase of the time series in the range vector."

In layman's terms, does this mean that we get the increase for every second, and that the value for a given second is the average increment over the given range?

Here is what I mean:
rate(counter[1s]): will match ICH, because the average is calculated from one value only.
rate(counter[2s]): will take the average of the increment over 2 seconds and distribute it among those seconds.
So in the first 2 seconds we got a total increment of 3, which means the average is 1.5/sec. Final result:

second result
1       1.5
2       1.5
3        2
4        2
5       3.5
6       3.5
7       3.5
8       3.5
9       4.5
10      4.5

rate(counter[5s]): will take the average of the increment over 5 seconds and distribute it among those seconds.
The same as for [2s], but the average is calculated from the total increment over 5 seconds. Final result:

second result
1        2
2        2
3        2
4        2
5        2
6        4
7        4
8        4
9        4
10       4

So the larger the time range, the smoother the result. And the sum of these increases will match the actual counter.

2. increase()
Official document states:
"increase(v range-vector) : calculates the increase in the time series in the range vector."

To me this means it won't distribute the average among the seconds, but will instead show the single increment for the given range (with extrapolation).
increase(counter[1s]): As I understand it, this will match both the ICH and the rate for 1s, simply because the total range and rate's base granularity are the same.
increase(counter[2s]): The first 2 seconds gave us a total increment of 3, so second 2 will get the value 3, and so on...

  second result   
    1        3*  
    2        3
    3        4*
    4        4
    5        7*
    6        7
    7        7*
    8        7
    9        9*
    10       9

*As I understand it, these values are extrapolated to cover every second.

Do I understand this correctly, or am I far off?

Prometheus Solutions


Solution 1 - Prometheus

In an ideal world (where your samples' timestamps are exactly on the second and your rule evaluation happens exactly on the second) rate(counter[1s]) would return exactly your ICH value and rate(counter[5s]) would return the average of that ICH and the previous 4. Except the ICH at second 1 is 0, not 1, because no one knows when your counter was zero: maybe it incremented right there, maybe it got incremented yesterday, and stayed at 1 since then. (This is the reason why you won't see an increase the first time a counter appears with a value of 1 -- because your code just created and incremented it.)

increase(counter[5s]) is exactly rate(counter[5s]) * 5 (and increase(counter[2s]) is exactly rate(counter[2s]) * 2).

Now what happens in the real world is that your samples are not collected exactly every second on the second and rule evaluation doesn't happen exactly on the second either. So if you have a bunch of samples that are (more or less) 1 second apart and you use Prometheus' rate(counter[1s]), you'll get no output. That's because what Prometheus does is it takes all the samples in the 1 second range [now() - 1s, now()] (which would be a single sample in the vast majority of cases), tries to compute a rate and fails.

If you query rate(counter[5s]), on the other hand, Prometheus will pick all the samples in the range [now() - 5s, now()] (5 samples, covering approximately 4 seconds on average, say [t1, v1], [t2, v2], [t3, v3], [t4, v4], [t5, v5]) and (assuming your counter doesn't reset within the interval) will return (v5 - v1) / (t5 - t1). That is, it actually computes the rate of increase over ~4s rather than 5s.

increase(counter[5s]) will return (v5 - v1) / (t5 - t1) * 5, so the rate of increase over ~4 seconds, extrapolated to 5 seconds.
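
The two formulas above can be tried out with a small Python sketch. It only models the simplified description in this answer, not Prometheus's actual implementation: the function names, the sample timestamps and the value of `now` are all made up for illustration, and counter resets are ignored.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)

# Hypothetical samples, roughly 1s apart but not aligned to whole seconds.
samples = [(5.5, 10), (6.5, 14), (7.5, 17), (8.5, 21), (9.5, 25)]


def simple_rate(samples: Sequence[Sample], now: float, window: float) -> Optional[float]:
    """(v_last - v_first) / (t_last - t_first) over samples in [now - window, now].

    Assumes `samples` is sorted by timestamp and the counter never resets.
    """
    in_range = [(t, v) for t, v in samples if now - window <= t <= now]
    if len(in_range) < 2:
        return None  # a single sample is not enough to compute a rate
    (t_first, v_first), (t_last, v_last) = in_range[0], in_range[-1]
    return (v_last - v_first) / (t_last - t_first)


def simple_increase(samples: Sequence[Sample], now: float, window: float) -> Optional[float]:
    """The rate over the covered span, scaled up to the full window."""
    rate = simple_rate(samples, now, window)
    return None if rate is None else rate * window


print(simple_rate(samples, now=10.0, window=5.0))      # rate over the ~4s actually covered
print(simple_increase(samples, now=10.0, window=5.0))  # same rate extrapolated to 5s
print(simple_rate(samples, now=10.0, window=1.0))      # only one sample in range
```

With these made-up samples the counter grew by 15 over the 4 seconds covered, so the sketch yields a rate of 3.75 and an increase of 18.75, a fractional increase for an integer counter. The 1s window contains a single sample and yields nothing.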

Due to the samples not being exactly spaced, both rate and increase will often return floating point values for integer counters (which makes obvious sense for rate, but not so much for increase).

Solution 2 - Prometheus

An explanation that analyses the issue from the opposite direction:

Let's assume that we have

rate(some_metric_name_count [3m]) = 2

This means that over the 3 minutes prior to that point in time the counter increased by 2 per second on average, so over those 3 minutes the counter increased by 2*180 (sec) = 360.

This also means that in this case:

increase(some_metric_name_count [3m]) ~ 360

There are slight approximations under the hood, mainly for the first point in time, so there can be an absolute error of 2, meaning that:

increase(some_metric_name_count [3m]) = 360 +/- 2

which covers the interval [358, 362], endpoints included.
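
The arithmetic behind the 360 figure, as a tiny Python sketch. The helper name `approx_increase` is made up for illustration, and it deliberately ignores the approximation error mentioned above:

```python
def approx_increase(rate_per_second: float, window_seconds: float) -> float:
    """Approximate increase over a window, given the average per-second rate."""
    return rate_per_second * window_seconds


window_seconds = 3 * 60                       # the [3m] range, in seconds
expected = approx_increase(2, window_seconds)
print(expected)                               # 2 per second over 180 seconds
```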

Solution 3 - Prometheus

Prometheus calculates rate(counter[d]) at timestamp t in the following way:

  1. It selects raw samples for the counter time series on the time range (t-d ... t]. Note that the t-d timestamp isn't included in the time range, while t timestamp is included in the time range. If the selected time range contains less than two raw samples, then Prometheus returns an empty value (a gap) at the timestamp t.
  2. Then it calculates the increase of the selected raw samples. Usually it is calculated as the difference between the last selected sample and the first selected sample. Calculations become slightly complicated if the counter was reset to zero during the selected time range. Let's skip this for the sake of clarity.
  3. Then the resulting increase can be extrapolated if timestamps for the first and/or the last raw samples are located too far from the bounds of the selected time range.
  4. Then the rate is calculated by dividing the extrapolated increase by d.

Prometheus calculates increase(counter[d]) in the same way, except for the last step: the extrapolated increase is returned without being divided by d.
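
A rough Python sketch of steps 1, 2 and 4 may make this concrete. It is not Prometheus's code: the function names are made up for illustration, and it deliberately leaves out counter resets and the extrapolation of step 3. The data is the ten samples from the question.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)

# The question's data: one sample per second at seconds 1..10.
data: List[Sample] = [(1, 1), (2, 3), (3, 6), (4, 7), (5, 10),
                      (6, 14), (7, 17), (8, 21), (9, 25), (10, 30)]


def select_samples(samples: List[Sample], t: float, d: float) -> List[Sample]:
    """Step 1: keep samples on (t-d ... t], excluding t-d and including t."""
    return [(ts, v) for ts, v in samples if t - d < ts <= t]


def increase_no_extrapolation(samples: List[Sample], t: float, d: float) -> Optional[float]:
    """Step 2: last selected value minus first selected value (no resets, no extrapolation)."""
    selected = select_samples(samples, t, d)
    if len(selected) < 2:
        return None  # fewer than two samples: a gap at timestamp t
    return selected[-1][1] - selected[0][1]


def rate_no_extrapolation(samples: List[Sample], t: float, d: float) -> Optional[float]:
    """Step 4: divide the increase by the window length d."""
    inc = increase_no_extrapolation(samples, t, d)
    return None if inc is None else inc / d


print(increase_no_extrapolation(data, t=3, d=2))  # uses the samples at seconds 2 and 3
print(rate_no_extrapolation(data, t=3, d=2))
print(rate_no_extrapolation(data, t=3, d=1))      # only one sample in (2 ... 3]
```

For t=3 and d=2 this selects the samples at seconds 2 and 3, giving an increase of 3 and a rate of 1.5. With d=1 only one sample falls into the window, so nothing is returned.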

Let's look at a few examples applied to the original data:

second   counter_value    increase calculated by hand(call it ICH from now)
1             1                    1
2             3                    2
3             6                    3
4             7                    1
5            10                    3
6            14                    4
7            17                    3
8            21                    4
9            25                    4
10           30                    5
  • The rate(counter[1s]) will return nothing at any timestamp t, since any time range (t-1s ... t] contains only a single raw sample, while Prometheus requires at least two samples for calculating both rate() and increase().

  • The rate(counter[2s]) and increase(counter[2s]) would return the following values for each timestamp t when extrapolation isn't applied:

t       counter_value    rate(counter[2s])        increase(counter[2s])
1             1                    -                       -
2             3               (3-1)/2=1.0                3-1=2
3             6               (6-3)/2=1.5                6-3=3
4             7               (7-6)/2=0.5                7-6=1
5            10              (10-7)/2=1.5               10-7=3
6            14             (14-10)/2=2                14-10=4
7            17             (17-14)/2=1.5              17-14=3
8            21             (21-17)/2=2                21-17=4
9            25             (25-21)/2=2                25-21=4
10           30             (30-25)/2=2.5              30-25=5

In reality Prometheus results for rate(counter[2s]) and increase(counter[2s]) may be slightly bigger because of extrapolation, since the first sample on the selected time range is located comparatively far from the start of the time range.

Such calculations have the following issues:

  • Prometheus can return fractional results from increase() over a time series that contains only integer values. This is because of extrapolation. For example, Prometheus may return fractional results from increase(http_requests_total[5m]).

  • Prometheus returns empty results (aka gaps) from increase(counter[d]) and rate(counter[d]) when the lookbehind window d doesn't cover at least two samples - see the rate(counter[1s]) and increase(counter[1s]) example above.

  • Prometheus completely misses the increase between the raw sample just before the (t-d ... t] interval and the first raw sample on this interval. This may result in inaccurate calculations. For example, increase(counter[1h]) doesn't equal sum_over_time(increase(counter[1m])[1h:1m]).
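
The last issue can be illustrated on the question's data with a short Python sketch. It uses the same simplification as the table above (no extrapolation, no counter resets), so real Prometheus output would differ; the helper name `window_increase` is made up for illustration.

```python
# The question's data: (second, counter_value).
samples = [(1, 1), (2, 3), (3, 6), (4, 7), (5, 10),
           (6, 14), (7, 17), (8, 21), (9, 25), (10, 30)]


def window_increase(samples, t, d):
    """Last minus first value among samples on (t-d ... t], without extrapolation."""
    selected = [v for ts, v in samples if t - d < ts <= t]
    return selected[-1] - selected[0] if len(selected) >= 2 else None


# One 10s window ending at t=10.
whole = window_increase(samples, t=10, d=10)

# Five adjacent 2s windows ending at t=2, 4, 6, 8, 10.
parts = [window_increase(samples, t, d=2) for t in range(2, 11, 2)]

print(whole)       # increase seen by the single long window
print(parts)       # increase seen by each short window
print(sum(parts))  # smaller than `whole`
```

The single 10s window sees an increase of 29, while the five 2s windows add up to only 16. The missing 13 is exactly the growth between the last sample of one window and the first sample of the next (3→6, 7→10, 14→17, 21→25), which no short window observes.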

Prometheus developers are aware of these issues - see this link. These issues are addressed by VictoriaMetrics in its MetricsQL query language - see this comment and this article for technical details.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type               Original Author    Original Content on Stackoverflow
Question                   beatrice           View Question on Stackoverflow
Solution 1 - Prometheus    Alin Sînpălean     View Answer on Stackoverflow
Solution 2 - Prometheus    trinity            View Answer on Stackoverflow
Solution 3 - Prometheus    valyala            View Answer on Stackoverflow