Over Provisioning NVMe Performance

All solid state drives have a bit of extra space on them, called the over provisioning (OP) space. This extra space is used to increase endurance and performance. M.2 drives typically don’t have much OP, U.2 drives have a fair amount, and storage class memory devices need a lot of it.

In older solid state drives, the OP space was essentially fixed. There may have been tools that could change it, but they were generally back doors known primarily to the manufacturer. With the advent of multi-namespace devices, many drive manufacturers set their OP space to effectively be the inverse of the space allocated to namespaces.

What this means is that on certain drives you can obtain higher performance by reducing the amount of capacity allocated to namespaces. Your mileage will vary here, and it is not guaranteed to work; not all vendors do this.
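
If you want to see how this shows up on a given drive, compare the controller’s total NVM capacity against its unallocated capacity and look at the namespace sizing. A minimal check with nvme-cli (device paths are from this system; yours will differ):

# Total vs. unallocated NVM capacity, in bytes. Space not allocated to any
# namespace is left to the controller, and on some drives it effectively
# behaves as extra OP space.
nvme id-ctrl /dev/nvme0 | grep -E "tnvmcap|unvmcap"

# Namespace size and capacity (in logical blocks) for namespace 1
nvme id-ns /dev/nvme0n1 | grep -E "^nsze|^ncap"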

Baseline Performance

To measure the performance impact, the first thing we need to do is get a baseline. The most stressful thing we can generally do to a drive is 4k random reads and writes, so our baseline will consist of some 4k random reads and writes.

[root@smc-server thorst]# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     S4YNNE0N801309       SAMSUNG MZWLJ1T9HBJR-00007               1           1.92  TB /   1.92  TB    512   B +  0 B   EPK98B5Q
/dev/nvme1n1     S5H7NS1NA02815E      Samsung SSD 970 EVO 500GB                1           2.71  GB / 500.11  GB    512   B +  0 B   2B2QEXE7

[root@smc-server thorst]# fio --name=4krandread --iodepth=1 --rw=randread --bs=4k --runtime=60 --ramp=5 --group_reporting --numjobs=64 --sync=1 --direct=1 --size=100% --filename=/dev/nvme0n1
4krandread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.7
Starting 64 processes
Jobs: 63 (f=63): [r(7),E(1),r(56)][100.0%][r=1565MiB/s,w=0KiB/s][r=401k,w=0 IOPS][eta 00m:00s]
4krandread: (groupid=0, jobs=64): err= 0: pid=4147: Wed Dec  9 21:58:35 2020
   read: IOPS=401k, BW=1566MiB/s (1642MB/s)(91.7GiB/60001msec)
    clat (usec): min=15, max=2171, avg=159.08, stdev=94.21
     lat (usec): min=15, max=2171, avg=159.14, stdev=94.21
    clat percentiles (usec):
     |  1.00th=[   28],  5.00th=[   40], 10.00th=[   51], 20.00th=[   75],
     | 30.00th=[   98], 40.00th=[  122], 50.00th=[  147], 60.00th=[  176],
     | 70.00th=[  204], 80.00th=[  235], 90.00th=[  269], 95.00th=[  302],
     | 99.00th=[  449], 99.50th=[  619], 99.90th=[  709], 99.95th=[  734],
     | 99.99th=[  783]
   bw (  KiB/s): min=24128, max=26208, per=1.56%, avg=25015.35, stdev=255.61, samples=7616
   iops        : min= 6032, max= 6552, avg=6253.82, stdev=63.91, samples=7616
  lat (usec)   : 20=0.02%, 50=9.58%, 100=21.48%, 250=53.42%, 500=14.69%
  lat (usec)   : 750=0.78%, 1000=0.03%
  lat (msec)   : 2=0.01%, 4=0.01%
  cpu          : usr=0.81%, sys=1.10%, ctx=24047148, majf=0, minf=198
  IO depths    : 1=108.2%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=24047118,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=1566MiB/s (1642MB/s), 1566MiB/s-1566MiB/s (1642MB/s-1642MB/s), io=91.7GiB (98.5GB), run=60001-60001msec

Disk stats (read/write):
  nvme0n1: ios=25970761/0, merge=0/0, ticks=4049288/0, in_queue=0, util=99.74%

[root@smc-server thorst]# fio --name=4krandwrite --iodepth=1 --rw=randwrite --bs=4k --runtime=60 --ramp=5 --group_reporting --numjobs=64 --sync=1 --direct=1 --size=100% --filename=/dev/nvme0n1
4krandwrite: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.7
Starting 64 processes
Jobs: 63 (f=63): [w(3),E(1),w(60)][100.0%][r=0KiB/s,w=2461MiB/s][r=0,w=630k IOPS][eta 00m:00s]
4krandwrite: (groupid=0, jobs=64): err= 0: pid=4217: Wed Dec  9 22:00:28 2020
  write: IOPS=625k, BW=2441MiB/s (2560MB/s)(143GiB/60012msec)
    clat (usec): min=12, max=21547, avg=101.70, stdev=78.34
     lat (usec): min=12, max=21547, avg=101.79, stdev=78.34
    clat percentiles (usec):
     |  1.00th=[   57],  5.00th=[   71], 10.00th=[   76], 20.00th=[   83],
     | 30.00th=[   87], 40.00th=[   91], 50.00th=[   95], 60.00th=[   99],
     | 70.00th=[  105], 80.00th=[  115], 90.00th=[  137], 95.00th=[  155],
     | 99.00th=[  204], 99.50th=[  233], 99.90th=[  338], 99.95th=[  437],
     | 99.99th=[ 1958]
   bw (  KiB/s): min=35336, max=40648, per=1.56%, avg=39080.30, stdev=622.58, samples=7625
   iops        : min= 8834, max=10162, avg=9770.06, stdev=155.65, samples=7625
  lat (usec)   : 20=0.01%, 50=0.52%, 100=61.68%, 250=37.45%, 500=0.30%
  lat (usec)   : 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.02%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=1.35%, sys=2.19%, ctx=37506504, majf=0, minf=287
  IO depths    : 1=108.2%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,37506084,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=2441MiB/s (2560MB/s), 2441MiB/s-2441MiB/s (2560MB/s-2560MB/s), io=143GiB (154GB), run=60012-60012msec

Disk stats (read/write):
  nvme0n1: ios=104/40566026, merge=0/0, ticks=4/2860295, in_queue=25019, util=99.20%

The key data here:

  • Starting with 1.92 TB of disk
  • 400k random 4k read IOPs
  • 625k random 4k write IOPs

Not too shabby, but this is a U.2 enterprise drive.

Going down to 1.6 TB

You often see drives offered at 1.92 TB with a 1 DWPD rating, and a similar variant at 1.6 TB with a 3 DWPD rating. The 3 DWPD variant often has better performance, at least for writes. Let’s see if that holds true.

I’ve first flipped my drive to a single 1.6 TB namespace with a 512-byte format. See the NVMe Namespace article for instructions on how to do this.
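
For reference, the rough shape of the nvme-cli sequence is sketched below. The block counts, LBA format index, and controller ID are placeholders for this setup; check your drive’s identify data first, and note that deleting a namespace destroys the data on it.

# Remove the existing full-size namespace (destructive!)
nvme delete-ns /dev/nvme0 -n 1

# Create a 1.6 TB namespace; sizes are in 512-byte blocks here
# (1.6 TB / 512 B ≈ 3,125,000,000), and --flbas=0 picks the 512-byte LBA format
nvme create-ns /dev/nvme0 --nsze=3125000000 --ncap=3125000000 --flbas=0

# Attach the new namespace to the controller (controller ID is drive-specific,
# see 'nvme list-ctrl /dev/nvme0') and rescan so the block device reappears
nvme attach-ns /dev/nvme0 -n 1 -c 0
nvme ns-rescan /dev/nvme0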

[root@smc-server thorst]# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     S4YNNE0N801309       SAMSUNG MZWLJ1T9HBJR-00007               1           1.60  TB /   1.60  TB    512   B +  0 B   EPK98B5Q
/dev/nvme1n1     S5H7NS1NA02815E      Samsung SSD 970 EVO 500GB                1           2.71  GB / 500.11  GB    512   B +  0 B   2B2QEXE7

[root@smc-server thorst]# fio --name=4krandread --iodepth=1 --rw=randread --bs=4k --runtime=60 --ramp=5 --group_reporting --numjobs=64 --sync=1 --direct=1 --size=100% --filename=/dev/nvme0n1
4krandread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.7
Starting 64 processes
Jobs: 64 (f=64): [r(64)][100.0%][r=1539MiB/s,w=0KiB/s][r=394k,w=0 IOPS][eta 00m:00s]
4krandread: (groupid=0, jobs=64): err= 0: pid=4324: Wed Dec  9 22:04:42 2020
   read: IOPS=394k, BW=1540MiB/s (1614MB/s)(90.2GiB/60002msec)
    clat (usec): min=16, max=2112, avg=161.70, stdev=84.16
     lat (usec): min=16, max=2112, avg=161.79, stdev=84.16
    clat percentiles (usec):
     |  1.00th=[   38],  5.00th=[   56], 10.00th=[   69], 20.00th=[   89],
     | 30.00th=[  109], 40.00th=[  128], 50.00th=[  149], 60.00th=[  174],
     | 70.00th=[  200], 80.00th=[  229], 90.00th=[  265], 95.00th=[  293],
     | 99.00th=[  400], 99.50th=[  578], 99.90th=[  701], 99.95th=[  725],
     | 99.99th=[  766]
   bw (  KiB/s): min=23776, max=25512, per=1.56%, avg=24606.86, stdev=225.85, samples=7621
   iops        : min= 5944, max= 6378, avg=6151.70, stdev=56.47, samples=7621
  lat (usec)   : 20=0.01%, 50=3.50%, 100=22.37%, 250=60.43%, 500=13.05%
  lat (usec)   : 750=0.63%, 1000=0.02%
  lat (msec)   : 2=0.01%, 4=0.01%
  cpu          : usr=0.84%, sys=1.10%, ctx=23648205, majf=0, minf=194
  IO depths    : 1=108.1%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=23648187,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=1540MiB/s (1614MB/s), 1540MiB/s-1540MiB/s (1614MB/s-1614MB/s), io=90.2GiB (96.9GB), run=60002-60002msec

Disk stats (read/write):
  nvme0n1: ios=25498394/0, merge=0/0, ticks=4042208/0, in_queue=39, util=99.66%


[root@smc-server thorst]# fio --name=4krandwrite --iodepth=1 --rw=randwrite --bs=4k --runtime=60 --ramp=5 --group_reporting --numjobs=64 --sync=1 --direct=1 --size=100% --filename=/dev/nvme0n1
4krandwrite: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.7
Starting 64 processes
Jobs: 63 (f=62): [w(1),f(1),E(1),w(61)][100.0%][r=0KiB/s,w=2455MiB/s][r=0,w=629k IOPS][eta 00m:00s]
4krandwrite: (groupid=0, jobs=64): err= 0: pid=4394: Wed Dec  9 22:06:13 2020
  write: IOPS=625k, BW=2442MiB/s (2560MB/s)(143GiB/60019msec)
    clat (usec): min=13, max=23421, avg=101.57, stdev=65.57
     lat (usec): min=13, max=23421, avg=101.70, stdev=65.58
    clat percentiles (usec):
     |  1.00th=[   62],  5.00th=[   73], 10.00th=[   77], 20.00th=[   83],
     | 30.00th=[   88], 40.00th=[   92], 50.00th=[   95], 60.00th=[  100],
     | 70.00th=[  106], 80.00th=[  116], 90.00th=[  133], 95.00th=[  151],
     | 99.00th=[  190], 99.50th=[  212], 99.90th=[  318], 99.95th=[  433],
     | 99.99th=[ 1811]
   bw (  KiB/s): min=34415, max=40496, per=1.56%, avg=39088.42, stdev=570.21, samples=7629
   iops        : min= 8603, max=10124, avg=9772.08, stdev=142.56, samples=7629
  lat (usec)   : 20=0.01%, 50=0.25%, 100=60.55%, 250=38.97%, 500=0.18%
  lat (usec)   : 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.02%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=1.43%, sys=2.27%, ctx=37518842, majf=0, minf=219
  IO depths    : 1=108.1%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,37518333,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=2442MiB/s (2560MB/s), 2442MiB/s-2442MiB/s (2560MB/s-2560MB/s), io=143GiB (154GB), run=60019-60019msec

Disk stats (read/write):
  nvme0n1: ios=102/40536332, merge=0/0, ticks=3/2549648, in_queue=18511, util=99.41%

What gives? This is a bust. Next up: trying a 4k block (sector) size.
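
The rough approach for that, sketched here: check which LBA formats the namespace supports, then reformat it with the 4k one. The format index shown is an assumption for this drive, and the format command wipes the namespace.

# List supported LBA formats; lbads:9 is 512-byte sectors, lbads:12 is 4096-byte
nvme id-ns /dev/nvme0n1 | grep lbaf

# Reformat the namespace with the 4k LBA format (index 1 assumed here)
nvme format /dev/nvme0n1 --lbaf=1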