Over Provisioning NVMe Performance
All solid state drives have a bit of extra space on them, called the over provisioning (OP) space. This extra space is used to increase endurance and performance. M.2 drives typically don't have much OP, U.2 drives have a fair amount, and storage class memory has lots.
In older solid state drives, the OP space was essentially fixed. There may have been some tools that could change it, but they were generally back doors known primarily to the manufacturer. With the advent of multi-namespace devices, many drive manufacturers set their OP space to effectively be the inverse of the space allocated to namespaces.
What this means is that, on certain drives, you can obtain higher performance by reducing the amount of the drive allocated to namespaces. Your mileage will vary here, and it is not guaranteed to work; not all vendors do this.
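As a quick check before and after resizing, nvme-cli can show how much NVM capacity is sitting unallocated. This is a minimal sketch; tnvmcap and unvmcap are fields from the identify-controller output, and your device path will differ:

# Total NVM capacity vs. unallocated NVM capacity, in bytes.
# On drives that tie OP to namespace allocation, unvmcap is
# effectively the extra over provisioning space.
nvme id-ctrl /dev/nvme0 | grep -E "tnvmcap|unvmcap"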
Baseline Performance
To measure the performance, the first thing we need to do is get a baseline. The most stressful thing we can generally do to drives is 4k random reads and writes. Therefore, our baseline will consist of some 4k random reads and writes.
[root@smc-server thorst]# nvme list
Node SN Model Namespace Usage Format FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 S4YNNE0N801309 SAMSUNG MZWLJ1T9HBJR-00007 1 1.92 TB / 1.92 TB 512 B + 0 B EPK98B5Q
/dev/nvme1n1 S5H7NS1NA02815E Samsung SSD 970 EVO 500GB 1 2.71 GB / 500.11 GB 512 B + 0 B 2B2QEXE7
[root@smc-server thorst]# fio --name=4krandread --iodepth=1 --rw=randread --bs=4k --runtime=60 --ramp=5 --group_reporting --numjobs=64 --sync=1 --direct=1 --size=100% --filename=/dev/nvme0n1
4krandread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.7
Starting 64 processes
Jobs: 63 (f=63): [r(7),E(1),r(56)][100.0%][r=1565MiB/s,w=0KiB/s][r=401k,w=0 IOPS][eta 00m:00s]
4krandread: (groupid=0, jobs=64): err= 0: pid=4147: Wed Dec 9 21:58:35 2020
read: IOPS=401k, BW=1566MiB/s (1642MB/s)(91.7GiB/60001msec)
clat (usec): min=15, max=2171, avg=159.08, stdev=94.21
lat (usec): min=15, max=2171, avg=159.14, stdev=94.21
clat percentiles (usec):
| 1.00th=[ 28], 5.00th=[ 40], 10.00th=[ 51], 20.00th=[ 75],
| 30.00th=[ 98], 40.00th=[ 122], 50.00th=[ 147], 60.00th=[ 176],
| 70.00th=[ 204], 80.00th=[ 235], 90.00th=[ 269], 95.00th=[ 302],
| 99.00th=[ 449], 99.50th=[ 619], 99.90th=[ 709], 99.95th=[ 734],
| 99.99th=[ 783]
bw ( KiB/s): min=24128, max=26208, per=1.56%, avg=25015.35, stdev=255.61, samples=7616
iops : min= 6032, max= 6552, avg=6253.82, stdev=63.91, samples=7616
lat (usec) : 20=0.02%, 50=9.58%, 100=21.48%, 250=53.42%, 500=14.69%
lat (usec) : 750=0.78%, 1000=0.03%
lat (msec) : 2=0.01%, 4=0.01%
cpu : usr=0.81%, sys=1.10%, ctx=24047148, majf=0, minf=198
IO depths : 1=108.2%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=24047118,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=1566MiB/s (1642MB/s), 1566MiB/s-1566MiB/s (1642MB/s-1642MB/s), io=91.7GiB (98.5GB), run=60001-60001msec
Disk stats (read/write):
nvme0n1: ios=25970761/0, merge=0/0, ticks=4049288/0, in_queue=0, util=99.74%
[root@smc-server thorst]# fio --name=4krandwrite --iodepth=1 --rw=randwrite --bs=4k --runtime=60 --ramp=5 --group_reporting --numjobs=64 --sync=1 --direct=1 --size=100% --filename=/dev/nvme0n1
4krandwrite: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.7
Starting 64 processes
Jobs: 63 (f=63): [w(3),E(1),w(60)][100.0%][r=0KiB/s,w=2461MiB/s][r=0,w=630k IOPS][eta 00m:00s]
4krandwrite: (groupid=0, jobs=64): err= 0: pid=4217: Wed Dec 9 22:00:28 2020
write: IOPS=625k, BW=2441MiB/s (2560MB/s)(143GiB/60012msec)
clat (usec): min=12, max=21547, avg=101.70, stdev=78.34
lat (usec): min=12, max=21547, avg=101.79, stdev=78.34
clat percentiles (usec):
| 1.00th=[ 57], 5.00th=[ 71], 10.00th=[ 76], 20.00th=[ 83],
| 30.00th=[ 87], 40.00th=[ 91], 50.00th=[ 95], 60.00th=[ 99],
| 70.00th=[ 105], 80.00th=[ 115], 90.00th=[ 137], 95.00th=[ 155],
| 99.00th=[ 204], 99.50th=[ 233], 99.90th=[ 338], 99.95th=[ 437],
| 99.99th=[ 1958]
bw ( KiB/s): min=35336, max=40648, per=1.56%, avg=39080.30, stdev=622.58, samples=7625
iops : min= 8834, max=10162, avg=9770.06, stdev=155.65, samples=7625
lat (usec) : 20=0.01%, 50=0.52%, 100=61.68%, 250=37.45%, 500=0.30%
lat (usec) : 750=0.01%, 1000=0.01%
lat (msec) : 2=0.02%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
cpu : usr=1.35%, sys=2.19%, ctx=37506504, majf=0, minf=287
IO depths : 1=108.2%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,37506084,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=2441MiB/s (2560MB/s), 2441MiB/s-2441MiB/s (2560MB/s-2560MB/s), io=143GiB (154GB), run=60012-60012msec
Disk stats (read/write):
nvme0n1: ios=104/40566026, merge=0/0, ticks=4/2860295, in_queue=25019, util=99.20%
The key data here:
- Starting with 1.92 TB of disk
- 401k random 4k read IOPS
- 625k random 4k write IOPS
Not too shabby, but this is a U.2 enterprise drive.
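As an aside, if you would rather compare runs like this programmatically than eyeball the text output, fio can emit JSON. A sketch, assuming jq is installed; the flags mirror the baseline read run above:

# Same baseline read test, but emitting JSON and pulling out the IOPS number
fio --name=4krandread --iodepth=1 --rw=randread --bs=4k --runtime=60 \
    --ramp_time=5 --group_reporting --numjobs=64 --sync=1 --direct=1 \
    --size=100% --filename=/dev/nvme0n1 --output-format=json \
  | jq '.jobs[0].read.iops'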
Going down to 1.6 TB
You often see drives with 1.92 TB at 1 DWPD, and a similar variant at 1.6 TB at 3 DWPD. The 3 DWPD variant often has better performance, at least for the writes. Let’s see if that holds true.
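Some rough math behind those two variants, assuming both are built on roughly 2,048 GB of raw NAND (an assumption on my part, not a published spec):

# OP fraction = (raw - user) / user
echo "scale=3; (2048 - 1920) / 1920" | bc   # .066, i.e. ~7% OP for the 1.92 TB variant
echo "scale=3; (2048 - 1600) / 1600" | bc   # .280, i.e. ~28% OP for the 1.6 TB variant
# Endurance in TBW = capacity (TB) x DWPD x warranty days (5 years assumed)
echo "1.92 * 1 * 365 * 5" | bc              # ~3504 TBW
echo "1.60 * 3 * 365 * 5" | bc              # ~8760 TBW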
I've first flipped my drive to a single 1.6 TB namespace with 512-byte blocks. See the NVMe Namespace article for instructions on how to do this.
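For reference, the reconfiguration looks roughly like this; the block count and controller ID below are placeholders, so pull the real values from nvme id-ctrl and nvme id-ns first:

# Delete the full-size namespace, then create and attach a 1.6 TB one.
# 1.6 TB / 512-byte blocks = 3,125,000,000 blocks; --flbas=0 selects the 512 B format.
nvme delete-ns /dev/nvme0 -n 1
nvme create-ns /dev/nvme0 --nsze=3125000000 --ncap=3125000000 --flbas=0
nvme attach-ns /dev/nvme0 -n 1 -c 0x41   # 0x41 is a placeholder cntlid
nvme reset /dev/nvme0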
[root@smc-server thorst]# nvme list
Node SN Model Namespace Usage Format FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 S4YNNE0N801309 SAMSUNG MZWLJ1T9HBJR-00007 1 1.60 TB / 1.60 TB 512 B + 0 B EPK98B5Q
/dev/nvme1n1 S5H7NS1NA02815E Samsung SSD 970 EVO 500GB 1 2.71 GB / 500.11 GB 512 B + 0 B 2B2QEXE7
[root@smc-server thorst]# fio --name=4krandread --iodepth=1 --rw=randread --bs=4k --runtime=60 --ramp=5 --group_reporting --numjobs=64 --sync=1 --direct=1 --size=100% --filename=/dev/nvme0n1
4krandread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.7
Starting 64 processes
Jobs: 64 (f=64): [r(64)][100.0%][r=1539MiB/s,w=0KiB/s][r=394k,w=0 IOPS][eta 00m:00s]
4krandread: (groupid=0, jobs=64): err= 0: pid=4324: Wed Dec 9 22:04:42 2020
read: IOPS=394k, BW=1540MiB/s (1614MB/s)(90.2GiB/60002msec)
clat (usec): min=16, max=2112, avg=161.70, stdev=84.16
lat (usec): min=16, max=2112, avg=161.79, stdev=84.16
clat percentiles (usec):
| 1.00th=[ 38], 5.00th=[ 56], 10.00th=[ 69], 20.00th=[ 89],
| 30.00th=[ 109], 40.00th=[ 128], 50.00th=[ 149], 60.00th=[ 174],
| 70.00th=[ 200], 80.00th=[ 229], 90.00th=[ 265], 95.00th=[ 293],
| 99.00th=[ 400], 99.50th=[ 578], 99.90th=[ 701], 99.95th=[ 725],
| 99.99th=[ 766]
bw ( KiB/s): min=23776, max=25512, per=1.56%, avg=24606.86, stdev=225.85, samples=7621
iops : min= 5944, max= 6378, avg=6151.70, stdev=56.47, samples=7621
lat (usec) : 20=0.01%, 50=3.50%, 100=22.37%, 250=60.43%, 500=13.05%
lat (usec) : 750=0.63%, 1000=0.02%
lat (msec) : 2=0.01%, 4=0.01%
cpu : usr=0.84%, sys=1.10%, ctx=23648205, majf=0, minf=194
IO depths : 1=108.1%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=23648187,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=1540MiB/s (1614MB/s), 1540MiB/s-1540MiB/s (1614MB/s-1614MB/s), io=90.2GiB (96.9GB), run=60002-60002msec
Disk stats (read/write):
nvme0n1: ios=25498394/0, merge=0/0, ticks=4042208/0, in_queue=39, util=99.66%
[root@smc-server thorst]# fio --name=4krandwrite --iodepth=1 --rw=randwrite --bs=4k --runtime=60 --ramp=5 --group_reporting --numjobs=64 --sync=1 --direct=1 --size=100% --filename=/dev/nvme0n1
4krandwrite: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.7
Starting 64 processes
Jobs: 63 (f=62): [w(1),f(1),E(1),w(61)][100.0%][r=0KiB/s,w=2455MiB/s][r=0,w=629k IOPS][eta 00m:00s]
4krandwrite: (groupid=0, jobs=64): err= 0: pid=4394: Wed Dec 9 22:06:13 2020
write: IOPS=625k, BW=2442MiB/s (2560MB/s)(143GiB/60019msec)
clat (usec): min=13, max=23421, avg=101.57, stdev=65.57
lat (usec): min=13, max=23421, avg=101.70, stdev=65.58
clat percentiles (usec):
| 1.00th=[ 62], 5.00th=[ 73], 10.00th=[ 77], 20.00th=[ 83],
| 30.00th=[ 88], 40.00th=[ 92], 50.00th=[ 95], 60.00th=[ 100],
| 70.00th=[ 106], 80.00th=[ 116], 90.00th=[ 133], 95.00th=[ 151],
| 99.00th=[ 190], 99.50th=[ 212], 99.90th=[ 318], 99.95th=[ 433],
| 99.99th=[ 1811]
bw ( KiB/s): min=34415, max=40496, per=1.56%, avg=39088.42, stdev=570.21, samples=7629
iops : min= 8603, max=10124, avg=9772.08, stdev=142.56, samples=7629
lat (usec) : 20=0.01%, 50=0.25%, 100=60.55%, 250=38.97%, 500=0.18%
lat (usec) : 750=0.01%, 1000=0.01%
lat (msec) : 2=0.02%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
cpu : usr=1.43%, sys=2.27%, ctx=37518842, majf=0, minf=219
IO depths : 1=108.1%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,37518333,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=2442MiB/s (2560MB/s), 2442MiB/s-2442MiB/s (2560MB/s-2560MB/s), io=143GiB (154GB), run=60019-60019msec
Disk stats (read/write):
nvme0n1: ios=102/40536332, merge=0/0, ticks=3/2549648, in_queue=18511, util=99.41%
What gives? This is a bust: read and write IOPS are essentially unchanged from the 1.92 TB baseline. Next, let's try formatting the namespace with 4k blocks.
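Switching the namespace to a 4k LBA format looks roughly like this; the --lbaf index is an assumption, so check what your drive actually reports first:

# List the LBA formats the namespace supports; look for 'Data Size: 4096'
nvme id-ns /dev/nvme0n1 -H | grep "LBA Format"
# Format with the 4k entry (index 1 here is a guess -- confirm against the list).
# WARNING: this destroys all data on the namespace.
nvme format /dev/nvme0n1 --lbaf=1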