Tuesday Tech Tip - Accelerating ZFS Workloads with NVMe Storage

  • Published 19 Jun 2024
  • Every second Tuesday, we will be releasing a tech tip video that will give users information on various topics relating to our Storinator storage servers.
    Previously, we released our plan for the newly designed Stornado 2U All-Flash storage server. Brett talked about the SATA version and an NVMe version coming in early 2023, and he also explained how you can incorporate NVMe storage into your workflow now (if you don't want to wait until 2023). You can check out those videos linked below.
    Brett is back this week to talk more about NVMe storage. Specifically, we are talking about accelerating a ZFS storage pool using NVMe devices. Brett talks about ZFS, special vdevs, and shows a comparison between NVMe and HDD for metadata performance.
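    For anyone who wants to try a similar comparison at home, here is a rough sketch of that kind of test. The pool name, device names, and file count below are placeholders, not the exact setup from the video:
      # create a test pool whose metadata lands on an NVMe special vdev (placeholder devices)
      zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd \
          special mirror /dev/nvme0n1 /dev/nvme1n1
      # populate a directory with many small files, then time a metadata-heavy listing
      mkdir -p /tank/testdir
      for i in $(seq 1 100000); do touch /tank/testdir/file_$i; done
      time ls -l /tank/testdir > /dev/null
      # export/import between runs so results are not served from ARC, and watch per-vdev activity
      zpool export tank && zpool import tank
      zpool iostat -v tank 1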
    Chapters:
    00:00 - Introduction: Accelerating Your ZFS Workload with NVMe Storage
    01:20 - Comparing Two ZFS Storage Pools
    02:00 - Running Listing Tests in the Terminals with Performance Stats
    03:50 - Understanding Metadata and Metadata Performance
    05:55 - Comparing HDD vs NVMe Time Performance Listing 1K Files
    08:20 - Comparing HDD vs NVMe Time Performance Listing 10K Files
    08:55 - Comparing HDD vs NVMe Time Performance Listing 50K Files
    09:04 - Comparing HDD vs NVMe Time Performance Listing 100K Files
    09:33 - Comparing HDD vs NVMe Time Performance Listing 500K Files
    10:30 - Comparing HDD vs NVMe Time Performance Listing 1000K Files
    11:27 - Outro
    --
    Check out our videos about the Stornado and NVMe storage:
    Next Generation 2U Stornado: • Tuesday Tech Tip - Nex...
    NVMe Carrier Cards: • Tuesday Tech Tip - NVM...
    Visit our website: www.45drives.com/
    Check out our GitHub: github.com/45drives
    Read our Knowledgebase for technical articles: knowledgebase.45drives.com/
    Check out our blog: www.45drives.com/blog
    Single Server Buying Guide: www.45drives.com/products/net...
    Ceph Clustering Buying Guide: www.45drives.com/solutions/di...
    To enroll in our 2-day Clustered Storage Bootcamp: www.45drives.com/support/clus...
    Have a discussion on our subreddit: / 45drives
    #45drives #storinator #stornado #storageserver #serverstorage #storagenas #nasstorage #networkattachedstorage #proxmox #virtualization #cephstorage #storageclustering #virtualmachines #cephcluster #storagecluster #ansible #prometheus #samba #cephfs #allflash #ssdstorage #ssdserver #allflashserver #allflashstorage
  • Science & Technology

Comments • 45

  • @n8c
    @n8c 1 year ago

    Do you usually run some performance metrics on your customers' machines once they have been built out?
    It feels like you could easily let the same tools run in the background to generate some exemplary "load at 10 am might look like this" data, which should easily show the differences.
    For StarWind vSAN I used DiskSPD, which seems to have a Linux-port Git repo (YT doesn't like links; it's the first result on Google).

  • @midnightwatchman1
    @midnightwatchman1 1 year ago +6

    Actually, document management servers frequently have over 100K files in one directory. Massive workloads do exist. A human may not create them directly, but applications frequently do.

    • @TheChadXperience909
      @TheChadXperience909 1 year ago +1

      I'm sure it would benefit email servers, as well.

    • @n8c
      @n8c 1 year ago

      Temperature-tracking software (food transport industry) does this as well.
      Devs do crap like this all the time, where the data clearly belongs in a DB or something 😅

  • @---tr9qg
    @---tr9qg 11 months ago

    🔥🔥🔥

  • @pivot3india
    @pivot3india 1 year ago

    Is it good to have a metadata disk even if we use ZFS primarily as a virtualisation target?

  • @89tsupra
    @89tsupra 1 year ago +1

    Thank you for the explanation. You mentioned that the metadata is stored on the disks and that having an NVMe will help speed that up. Would you recommend adding one for an all-flash storage pool?

    • @steveo6023
      @steveo6023 1 year ago

      As he said, it will keep that load off the storage disks (or flash). Depending on the workload, it could also improve performance for an all-flash pool.

    • @TheChadXperience909
      @TheChadXperience909 1 year ago

      M.2 NVMe drives have lower latencies than drives connected via SATA, and often have faster read/write throughput. It would accelerate such an array, but to a lesser extent. When comparing, you should look at their IOPS.

    • @shittubes
      @shittubes 1 year ago

      It can create higher fairness between multiple applications with different access patterns, so that a high-throughput sequential write load won't affect another workload that does mostly very small I/O, whether just metadata or small block sizes (handled by the special device).

    • @ati4280
      @ati4280 8 months ago

      It depends on the SSD types. If you add an NVMe drive to an all-flash SATA pool, the benefits will not be as noticeable as when accelerating an HDD-only pool. The IOPS difference between NVMe drives and SATA drives is not that significant. The 4K performance of an SSD is not only related to its interface; the NVMe controller model, NAND type, and cache speed and size also play a big role in the final performance of an SSD.

  • @chrisparkin4989
    @chrisparkin4989 1 year ago +2

    Great vid, but won't 'hot' metadata live in your ARC (RAM) anyway, and isn't that surely the fastest place to have it?

    • @TheExard3k
      @TheExard3k 1 year ago

      It would. But ARC evicts stuff all the time, so your desired metadata may not stay there. Tuning parameters can help with this. But having metadata on SSD/NVMe guarantees fast access. And the vdev increases pool capacity, so it's not "wasted" space. Worth considering if you have spare capacity for 2xSSD/NVMe. And you really need it on very large pools or when handing out a lot of zvols (block storage).
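      A minimal sketch of the kind of tuning knobs meant above (pool/dataset names are placeholders and the values are illustrative, not recommendations):
        # keep ARC from caching file data on a dataset so that metadata is what stays resident
        zfs set primarycache=metadata tank/archive
        # or give ARC more room overall (bytes; here ~32 GiB) via the OpenZFS module parameter
        echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_max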

  • @zparihar
    @zparihar 5 months ago

    Great demo. What are the risks? Do we need to mirror the special device? What happens if it dies?

    • @meaga
      @meaga 3 months ago

      You would lose the data in your pool, so yes, I'd recommend mirroring your metadata device. Also make sure that it is sized correctly relative to your data vdevs' size.

  • @teagancollyer
    @teagancollyer 1 year ago

    Hi, what capacity HDDs and NVMe were used for the video? I'm terrible at reading Linux's storage capacity counters. I'm trying to work out a good NVMe capacity for my 32TB (raw) pool; is 500GB a good amount?

    • @45Drives
      @45Drives  1 year ago +3

      We used 16TB HDDs and a 1.8TB NVMe.
      How much metadata is stored will vary with how many files are in the pool, not only how big it is. 32TB of tiny files will use more metadata space than 32TB of larger files, so it's not always straightforward to pick the size of the special vdev needed.
      Okay, so where to go from here?
      The rule of thumb seems to be about 0.3% of the pool size for a typical workload. This is from Wendell at Level1Techs - a very trusted ZFS guru. See this as a reference: forum.level1techs.com/t/zfs-metadata-special-device-z/159954
      So, in your case, 0.3% of 32TB would be 96GB; therefore, a 512GB NVMe will work. Remember, you will want to at least 2x mirror this drive and buy enterprise NVMe, as you will want power loss protection.
      If you already have data on the pool, you can get the total amount of metadata currently in use with a tool called 'zdb'. Check out this thread as a reference: old.reddit.com/r/zfs/comments/sbxnkw/how_does_one_query_the_metadata_size_of_an/
      You can follow the steps in the above thread, or use a script we put together inspired by it: scripts.45drives.com/get_zpool_metadata.sh
      Usage: "bash get_zpool_metadata.sh poolname"
      Thanks for the question, hope this helps!
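      The same rule of thumb as a quick shell calculation (the 0.3% figure and 32TB pool size are simply the numbers from this reply):
        # rule-of-thumb special vdev sizing: ~0.3% of raw pool capacity
        POOL_TB=32
        awk -v tb="$POOL_TB" 'BEGIN { printf "~%.0f GB of metadata space expected\n", tb * 1000 * 0.003 }'
        # -> ~96 GB, so a mirrored pair of 512GB enterprise NVMe drives leaves comfortable headroom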

    • @teagancollyer
      @teagancollyer 1 year ago

      @@45Drives Thanks for the reply. All of that info will be very useful and I'll be reading those threads you linked in a minute.

  • @Solkre82
    @Solkre82 2 months ago

    If you add a metadata vdev to a pool, is it safe to remove later? Is this a cache, or does metadata stop going to the data disks entirely?

    • @45Drives
      @45Drives  2 months ago

      The metadata vdev houses the data about the data: things like properties, indexes, etc. - essentially pointers to where the data is in the pool/dataset.
      If you remove that, the data has nothing tying it to specific blocks in the pool, rendering all of the data inaccessible.
      So no, it is not safe to remove.

  • @cyberpunk9487
    @cyberpunk9487 1 year ago +1

    I'm curious whether this benefits iSCSI LUNs and VM disks. Say I want to use TrueNAS as an iSCSI storage target for Windows VMs, and I would also like to use an SR (storage repo) for VM disks to live on.

    • @shittubes
      @shittubes 1 year ago

      It's only useful for datasets, not usable for zvols.

    • @TheChadXperience909
      @TheChadXperience909 1 year ago

      Metadata doesn't (only) mean file metadata, in this case. Zvols also consist of metadata nodes and data nodes, and the metadata nodes do get stored on the special vdev, as well. However, you'll likely see acceleration to a lesser degree than with regular datasets. Though, I read somewhere that you may be able to use file based extents for iSCSI, which means dataset rules would apply.
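      To make the distinction concrete, here is a minimal sketch of the two kinds of backing store being discussed (names and sizes are placeholders; the iSCSI target configuration itself is left out):
        # zvol-backed LUN: a sparse block device carved out of the pool
        zfs create -s -V 100G tank/lun0
        # file-backed ("fileIO") extent: a plain file inside a regular dataset
        zfs create tank/iscsi-files
        truncate -s 100G /tank/iscsi-files/lun0.img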

    • @cyberpunk9487
      @cyberpunk9487 1 year ago

      @@TheChadXperience909 From what I remember you can use file extents for iSCSI on TrueNAS, but I vaguely recall hearing that some of the iSCSI benefits are lost when not using zvols.

    • @TheChadXperience909
      @TheChadXperience909 1 year ago

      @@cyberpunk9487 Makes sense.

    • @mitcHELLOworld
      @mitcHELLOworld 1 year ago

      @@shittubes We actually don't use zvols for iSCSI LUNs, for a few reasons. We have found much better success deploying fileIO-based LUNs that we create within the ZFS dataset. I believe one of my videos here goes over this, but perhaps it's time for a good refresher on ZFS iSCSI.

  • @StephenCunningham1
    @StephenCunningham1 1 year ago

    Stinks that you lose the whole pool if the mirror dies. I'd want to put the special vdev in RAIDZ2 as well.

  • @TheChadXperience909
    @TheChadXperience909 1 year ago +2

    In my experience, it really accelerates file transfers. Especially, when doing large backups of entire drives and file systems.

    • @steveo6023
      @steveo6023 1 year ago

      How can this improve transfer speed when only the metadata is on the NVMe?

    • @TheChadXperience909
      @TheChadXperience909 1 year ago +2

      @@steveo6023 It speeds up, because flash storage is faster at small random IOPS than HDDs. Even though they are small reads/writes, they add up over time. Also, it prevents the read/write head inside the HDD from thrashing around as much, which reduces seek latency, and can also benefit drive longevity.

    • @steveo6023
      @steveo6023 1 year ago

      @@TheChadXperience909 But metadata is cached in the ARC anyway

    • @TheChadXperience909
      @TheChadXperience909 1 year ago +1

      @@steveo6023 That applies only to reads, and always depends.

    • @shittubes
      @shittubes 1 year ago

      @@steveo6023 If spinning drives can spend 99% of their time on sequential writes, they will be very fast. If, e.g., 50% of the time is spent on random writes for metadata, the transfer speed will be halved. If the NVMe metadata handling doesn't add other unexpected delays (which I do not know; I am only wondering if that's the case), this could be completely predictable in this linear way.

  • @steveo6023
    @steveo6023 1 year ago

    Unfortunately, it will add a single point of failure when using only one NVMe device, as all data will be gone when the metadata SSD dies.

    • @TheChadXperience909
      @TheChadXperience909 1 year ago +1

      That's why you should always add it in mirrors, which also has the effect of nearly doubling read speeds, since it can read from both mirrors. The presenter is using mirrors in his example.
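      For reference, a mirrored special vdev can be added to an existing pool in one step (pool and device names here are placeholders):
        # add a mirrored NVMe special vdev to an existing pool called "tank"
        zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1
        # it should then show up as its own top-level "special" mirror
        zpool status tank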

    • @n8c
      @n8c 1 year ago +2

      This is a lab env, as stated.
      You wouldn't run this exact setup in prod, for various reasons 😅

  • @shittubes
    @shittubes 1 year ago +1

    I'm honestly quite disappointed that the speedup from the NVMe special device is quite a lot smaller in the larger folders:
    500k: 18/11 ≈ 1.64
    1k: 119/21 ≈ 5.67
    The first examples were nice - a 6x speedup, why not.
    But a 2x speedup is not so impressive any more, considering that NVMe should normally be 10x faster even at the largest block sizes.
    In the iostat output I also see the NVMe often being read at just 5MB/s - why is it so low?!

    • @TheChadXperience909
      @TheChadXperience909 1 year ago

      The law of diminishing returns.

    • @mitcHELLOworld
      @mitcHELLOworld 1 year ago

      5MB/s isn't what matters here. The rated IOPS of a drive is what tells you how fast the storage media will run. For example, if your storage I/O pipeline is using a 1KB block size (it isn't, but just as an example), then your storage media needs to be able to do 5,000 IOPS to even hit 5MB/s, whereas if your block size was 1MB, 5,000 IOPS would be 5GB/s. An HDD is capable of in the neighborhood of 400 IOPS total (that's being generous), making an HDD unable to even hit 450KB/s if you were to use a 1KB block size.
      As for the special vdev: it is what we consider a "support vdev" and is best used in conjunction with the ARC. It isn't meant to serve ALL metadata requests. However, to easily show the difference between no NVMe and NVMe for the special vdev, he had to considerably handicap the ZFS ARC, because during this test there are no real-world workloads happening, and if he had kept the ARC fully sized you wouldn't have seen a difference between the two anyway, because the ARC would have held everything.
      In a production setting, there will be a large subset of the pool's metadata stored within the RAM on the ARC, and the special vdev will be there for any cache misses on metadata lookups that aren't in the ARC. When ZFS has a cache miss and there isn't a dedicated special vdev, this can cause quite a bit of latency and slowdown. By adding the special vdev in, metadata lookups are accelerated by a huge factor.
      The special vdev can also be put to use for small block allocation, which is really cool and can really improve the performance of the overall pool. Perhaps we will cover this in more detail in a follow-up video. But in the meantime, Brett and I did discuss this in our "ZFS Architecture and build" video from a few months ago!
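      For the small block allocation mentioned at the end, a minimal sketch (dataset name and threshold are placeholders; keep the threshold below the dataset's recordsize, otherwise every data block would land on the special vdev):
        # send data blocks up to 32K to the special vdev as well, not just metadata
        zfs set special_small_blocks=32K tank/vmstore
        zfs get special_small_blocks,recordsize tank/vmstore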

    • @shittubes
      @shittubes 1 year ago

      @@mitcHELLOworld Do I understand correctly that you didn't start with an empty ARC? I can confirm that I often see something around 60-95% ARC hits for metadata here in production, even with a small ARC. That would indeed seem a good enough explanation for why the ratios between the HDD and NVMe times aren't higher.

    • @shittubes
      @shittubes 1 year ago

      @@mitcHELLOworld What was your recordsize?
      I'm not sure what block size is best for metadata-only work in such an edge case. It would be fun to dig deeper here and check the size of the actual read()s returned.
      I agree it's better to look at IOPS; old habits :D
      So, revisiting the 1000k scenario and concentrating just on the IOPS:
      the special device IOPS seem to peak somewhere at the beginning at around 19K, but average ~7K IOPS.
      Meanwhile, the HDDs (all together) don't do much worse, averaging 5-7K IOPS.
      I feel like something else must be the bottleneck, not the actual IOPS capacity of the NVMe drives. Or do you consider 7K good? :P
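      One way to dig into where the time is going is zpool iostat's latency and request-size views rather than plain throughput (pool name is a placeholder; interpreting the histograms is left to the reader):
        # per-vdev latency breakdown, refreshed every second
        zpool iostat -vl tank 1
        # request-size histograms, to see how large the individual I/Os actually are
        zpool iostat -r tank 1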