Analyze benchmark results

perf commands

To analyze benchmark results, write the output into a JSON file using the --output option (-o):

$ python3 -m perf timeit '[1,2]*1000' -o bench.json
.....................
Mean +- std dev: 4.22 us +- 0.08 us
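The JSON file can also be loaded from Python to inspect results programmatically. A minimal sketch, assuming that your installed perf version exposes Benchmark.load(), get_values() and mean() (check the API documentation of your version):

import perf

# Load the result written by: python3 -m perf timeit ... -o bench.json
bench = perf.Benchmark.load('bench.json')

values = bench.get_values()            # all values, in seconds
print("number of values:", len(values))
print("mean: %.2f us" % (bench.mean() * 1e6))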

perf provides the following commands to analyze benchmark results:

  • perf show: single line summary, mean and standard deviation
  • perf check: check whether benchmark results seem stable
  • perf metadata: display metadata collected during the benchmark
  • perf dump: see all values per run, including warmup values and the calibration run
  • perf stats: compute various statistics (min/max, mean, median, percentiles, etc.).
  • perf hist: render a histogram to see the shape of the distribution.
  • perf slowest: display the top 5 benchmarks which took the longest to run.

Statistics

Outliers

If you run a benchmark without tuning the system, it’s likely that you will get outliers: a few values much slower than the average.

Example:

$ python3 -m perf timeit '[1,2]*1000' -o outliers.json
.....................
WARNING: the benchmark result may be unstable
* the maximum (6.02 us) is 39% greater than the mean (4.34 us)

Try to rerun the benchmark with more runs, values and/or loops.
Run 'python3 -m perf system tune' command to reduce the system jitter.
Use perf stats, perf dump and perf hist to analyze results.
Use --quiet option to hide these warnings.

Mean +- std dev: 4.34 us +- 0.31 us

Use the perf stats command to count the number of outliers (9 in this example):

$ python3 -m perf stats outliers.json -q
Total duration: 11.6 sec
Start date: 2017-03-16 16:30:01
End date: 2017-03-16 16:30:16
Raw value minimum: 135 ms
Raw value maximum: 197 ms

Number of calibration run: 1
Number of run with values: 20
Total number of run: 21

Number of warmup per run: 1
Number of value per run: 3
Loop iterations per value: 2^15
Total number of values: 60

Minimum:         4.12 us
Median +- MAD:   4.25 us +- 0.05 us
Mean +- std dev: 4.34 us +- 0.31 us
Maximum:         6.02 us

  0th percentile: 4.12 us (-5% of the mean) -- minimum
  5th percentile: 4.15 us (-4% of the mean)
 25th percentile: 4.21 us (-3% of the mean) -- Q1
 50th percentile: 4.25 us (-2% of the mean) -- median
 75th percentile: 4.30 us (-1% of the mean) -- Q3
 95th percentile: 4.84 us (+12% of the mean)
100th percentile: 6.02 us (+39% of the mean) -- maximum

Number of outlier (out of 4.07 us..4.44 us): 9
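The outlier range reported above (4.07 us..4.44 us) matches the usual Tukey fences, Q1 - 1.5*IQR .. Q3 + 1.5*IQR, computed from the quartiles of the percentile table (Q1 = 4.21 us, Q3 = 4.30 us). A sketch of that outlier count, assuming this definition and using only the standard library:

import statistics

def count_outliers(values):
    # Tukey fences: anything outside Q1 - 1.5*IQR .. Q3 + 1.5*IQR is an outlier
    q1, _, q3 = statistics.quantiles(values, n=4)   # quartiles, Python 3.8+
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low or v > high]
    return len(outliers), (low, high)

# Synthetic timings in microseconds, one obvious outlier
print(count_outliers([4.2, 4.21, 4.25, 4.3, 4.22, 4.19, 6.0]))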

Histogram:

$ python3 -m perf hist outliers.json -q
4.10 us: 15 ##############################
4.20 us: 29 ##########################################################
4.30 us:  6 ############
4.40 us:  3 ######
4.50 us:  2 ####
4.60 us:  1 ##
4.70 us:  0 |
4.80 us:  1 ##
4.90 us:  0 |
5.00 us:  0 |
5.10 us:  0 |
5.20 us:  2 ####
5.30 us:  0 |
5.40 us:  0 |
5.50 us:  0 |
5.60 us:  0 |
5.70 us:  0 |
5.80 us:  0 |
5.90 us:  0 |
6.00 us:  1 ##

Using a histogram, it is easy to see that most values (57 values) fall in the range [4.12 us; 4.84 us], but 3 values fall in the range [5.17 us; 6.02 us]: the maximum is 39% slower than the mean.
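Each line of the perf hist output is a fixed-width bin together with the number of values that fall into it. The grouping can be sketched with the standard library (a 0.10 us bin width mirrors the output above; perf may pick bins differently):

import collections

def ascii_hist(values, width=0.10e-6):
    # Group each value into the nearest 0.10 us bin and count values per bin
    counts = collections.Counter(round(v / width) * width for v in values)
    for edge in sorted(counts):
        print("%.2f us: %2d %s" % (edge * 1e6, counts[edge], "#" * counts[edge]))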

See How to get reproducible benchmark results to avoid outliers.

If you cannot get stable benchmark results, another option is to use median and median absolute deviation (MAD) instead of mean and standard deviation. Median and MAD are robust statistics which ignore outliers.
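Both can be computed with the standard statistics module. The sketch below uses the common definition MAD = median(|v - median|), without any scaling factor, which may differ slightly from what perf reports:

import statistics

def median_and_mad(values):
    med = statistics.median(values)
    # Median absolute deviation: median of absolute deviations from the median
    mad = statistics.median([abs(v - med) for v in values])
    return med, mad

# One large outlier barely moves the median and MAD,
# while it would inflate the mean and standard deviation
print(median_and_mad([4.21, 4.22, 4.25, 4.25, 4.30, 6.02]))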

Minimum VS average


Median and median absolute deviation VS mean and standard deviation

Median and median absolute deviation (MAD) are robust statistics which ignore outliers.

Probability distribution

The perf hist command renders a histogram of the distribution of all values.


Why is perf so slow?

The --fast and --rigorous options indirectly impact the total duration of benchmarks. The perf module is not optimized for total duration, but to produce reliable benchmarks.

The --fast option is designed to be fast, but to remain reliable enough to be sensitive. Using fewer worker processes and fewer values per worker would produce unstable results.

Compare benchmark results

Let’s use Python 2 and Python 3 to generate two different benchmark results:

$ python2 -m perf timeit '[1,2]*1000' -o py2.json
.....................
Mean +- std dev: 4.70 us +- 0.18 us

$ python3 -m perf timeit '[1,2]*1000' -o py3.json
.....................
Mean +- std dev: 4.22 us +- 0.08 us

The perf compare_to command compares the second benchmark to the first benchmark:

$ python3 -m perf compare_to py2.json py3.json
Mean +- std dev: [py2] 4.70 us +- 0.18 us -> [py3] 4.22 us +- 0.08 us: 1.11x faster (-10%)

Python 3 is faster than Python 2 on this benchmark.
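The factor and percentage follow directly from the two means; a quick check of the arithmetic (perf's own rounding may differ slightly):

py2_mean = 4.70e-6   # seconds
py3_mean = 4.22e-6

print("%.2fx faster" % (py2_mean / py3_mean))                 # ~1.11x
print("%+.0f%%" % ((py3_mean - py2_mean) / py2_mean * 100))   # ~-10%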

perf determines whether two samples differ significantly using a Student’s two-sample, two-tailed t-test with alpha equal to 0.95.
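As an illustration only (perf's internal implementation may differ), the pooled two-sample Student's t statistic can be computed by hand; with SciPy installed, scipy.stats.ttest_ind(a, b) returns the same statistic plus a p-value:

import statistics

def t_statistic(a, b):
    # Student's two-sample t with pooled variance (assumes similar variances)
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / (pooled * (1/na + 1/nb)) ** 0.5

# Compare |t| against the critical value for na + nb - 2 degrees of freedom
# to decide whether the difference is significant.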

Render a table using the --table option:

$ python3 -m perf compare_to py2.json py3.json --table
+-----------+---------+------------------------------+
| Benchmark | py2     | py3                          |
+===========+=========+==============================+
| timeit    | 4.70 us | 4.22 us: 1.11x faster (-10%) |
+-----------+---------+------------------------------+