All about POOLS | Proxmox + Ceph Hyperconverged Cluster fäncy Configurations for RBD

  • Published 25 Jul 2024
  • In this video, I expand on the last video of my hyper-converged Proxmox + Ceph cluster to create more custom pool layouts than Proxmox's GUI allows. This includes setting the storage class (HDD / SSD / NVMe), failure domain, and even erasure coding of pools. All of this is then set up as a storage location in Proxmox for RBD (RADOS Block Device), so we can store VM disks on it.
    After all of this, I now have the flexibility to assign VM disks to HDDs or SSDs, and use erasure coding to get 66% storage efficiency instead of 33% (doubling my usable capacity for the same disks!). With more nodes and disks, I could improve both the storage efficiency and failure resilience of my cluster, but with only the small number of disks I have, I opted to go for a basic 2+1 erasure code. (A rough sketch of the commands involved follows at the end of this description.)
    Blog post for this video (tbh not a whole lot there):
    www.apalrd.net/posts/2022/clu...
    My Discord Server, where you can chat about any of this:
    / discord
    If you find my content useful and would like to support me, feel free to here: ko-fi.com/apalrd
    This video is part of my Hyperconverged Cluster Megaproject:
    • Hyper-Converged Cluste...
    Find an HP Microserver Gen8 like mine on eBay (maybe you'll get lucky idk):
    ebay.us/1l1HI1
    Timestamps:
    00:00 - Introduction
    01:11 - Cluster Overview
    02:16 - Ceph Web Manager
    03:10 - Failure Domains
    04:24 - Custom CRUSH Ruleset
    06:18 - Storage Efficiency & Erasure Codes
    08:00 - Creating Erasure Coded Pools
    12:35 - Results
    Some links to products may be affiliate links, which may earn a commission for me.
  • Science & Technology
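
    For reference, a rough, hedged sketch of the kind of commands the video walks through (a device-class CRUSH rule, a 2+1 erasure-code profile, the two pools, and the Proxmox storage entry). Pool and rule names here are placeholders, not necessarily the ones used in the video:
      # Replicated CRUSH rule restricted to a device class (SSD here), failure domain = host
      ceph osd crush rule create-replicated rep-ssd default host ssd
      # 2+1 erasure-code profile on HDDs, failure domain = host
      ceph osd erasure-code-profile set ec-21 k=2 m=1 crush-failure-domain=host crush-device-class=hdd
      # Erasure-coded data pool (overwrites must be enabled for RBD) and replicated metadata pool
      ceph osd pool create ec-data 32 32 erasure ec-21
      ceph osd pool set ec-data allow_ec_overwrites true
      ceph osd pool create ec-meta 32 32 replicated rep-ssd
      ceph osd pool application enable ec-data rbd
      ceph osd pool application enable ec-meta rbd
      # /etc/pve/storage.cfg entry: RBD metadata lives in ec-meta, data objects in ec-data
      rbd: ceph-ec
          content images,rootdir
          krbd 0
          pool ec-meta
          data-pool ec-data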

Comments • 73

  • @lawrencerubanka7087
    @lawrencerubanka7087 1 month ago +1

    Great video, Thanks very much!
    Editorial: The Proxmox UI needs a LOT of work. Having to use the Ceph dashboard to define pools and rule sets manually and having to adjust storage.cfg to detail the Data vs Metadata is just miserable, especially because it doesn't leave any evidence in the Proxmox UI.
    I'm looking forward to checking out your CephFS video. That's the reason I'm here.
    Thanks again!

  • @BrianPuccio
    @BrianPuccio 1 year ago +3

    Every few months, the thought of ceph for my home servers crosses my mind. I never sat down to truly understand it and now I don’t have to. Your video explained it all for me so I can understand the pros and cons it offers.
    I’m glad to see you posting again but don’t forget to keep taking care of yourself.

    • @apalrdsadventures
      @apalrdsadventures 1 year ago +2

      Thanks! I'm keeping to a more reasonable schedule now.
      I'm still debating on if I want to migrate to Ceph vs my current two-box system (one TrueNAS + one Proxmox).

  • @junialter
    @junialter 8 months ago +1

    Man you're the best IT youtube maker I've seen in a long while. Thank you so much.

  • @Mikesco3
    @Mikesco3 1 year ago +6

    I really appreciate the work you put into this video. I looked into Ceph a while ago but I don't think it had the web console that you showed.
    I really appreciate how clearly you explain it.

    • @apalrdsadventures
      @apalrdsadventures 1 year ago

      The Ceph dashboard is pretty great honestly. I'm just using a few modules so far, but I'm sure I'll get to more of it eventually in the series.

  • @NineKeysDown
    @NineKeysDown 1 year ago +3

    Thank you, that was really helpful and filled in some of the gaps I was missing!

  • @remusvictuelles1669
    @remusvictuelles1669 1 year ago

    very informative and easy to understand explanation... you've got a sub!

  • @tomaszbyczek7611
    @tomaszbyczek7611 1 year ago

    You are my MASTER !! Thank you from Poland :) Keep doing your great work !

  • @berniemeowmeow
    @berniemeowmeow 1 year ago +1

    Really enjoy your channel and explanations. Thank you!

  • @seccentral
    @seccentral 10 months ago

    loved this, keep em coming :D

  • @spoonikle
    @spoonikle 1 year ago +4

    Hyper Convergence has too cool of a name - but it's no joke for storage pools, it just makes sense to keep costs down and get super redundancy and great performance.

    • @apalrdsadventures
      @apalrdsadventures 1 year ago

      Oh yeah, it makes way more sense than loading up on a ton of disk shelves on a single server and pushing that one server harder.

  • @MarkConstable
    @MarkConstable 1 year ago +3

    As usual, this one was excellent and clarified erasure coding. Many thanks. Setting up cephfs is super easy. For anyone curious... NODE -> Ceph -> CephFS, create Metadata Server(s) then Create CephFS, create a Pool, and add that pool to Datacenter -> Storage as type CephFS. Job done. Just don't back up to your ceph-fs pool from VMs that are hosted on your RBD ceph-vm pool if both are on the same OSDs. The read/write contention is massive :-)
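
    A rough CLI equivalent of those GUI steps, assuming pveceph is available on the node (the filesystem name is just an example):
      pveceph mds create                             # create a metadata server on this node
      pveceph fs create --name cephfs --add-storage  # create the CephFS pools + filesystem and register it as Proxmox storage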

    • @apalrdsadventures
      @apalrdsadventures 1 year ago +1

      Glad you liked it!

    • @MarkConstable
      @MarkConstable 1 year ago

      @@apalrdsadventures FWIW when comparing Ceph/FS vs GlusterFS, the RAM usage was a massive difference. There were 4 related Gluster daemons and they used 80 MB of RAM. The mds, mgr, mon and osd Ceph daemons took up 3.4 GB of RAM! Also, since Gluster works best with XFS, there were no massive ZFS memory requirements either. However, Gluster lacks an RBD-like block storage system similar to ZFS zvols, so considering that Ceph provides block-level storage and can almost be completely managed from the Proxmox GUI... I'll just make sure I have a minimum of 32 GB of RAM for nodes and just go with Ceph/FS.

    • @apalrdsadventures
      @apalrdsadventures 1 year ago +1

      That's partially because the Ceph OSD is doing its own caching (since it operates on the block device directly and skips the kernel filesystem page cache), whereas Gluster is relying on XFS and the kernel page cache (which is still there even without ZFS, it just doesn't show up as consumed memory like ZFS does). The default limit is 4G of cache per OSD and it will scale back with system memory pressure.
      The monitor also does use a decent amount of RAM, although on my cluster it seems to be around 500M which is reasonable I think.
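
      The OSD cache ceiling mentioned above is the osd_memory_target option; a quick way to check or lower it (the 2 GiB value is just an example):
        ceph config get osd osd_memory_target              # defaults to 4294967296 (4 GiB) per OSD
        ceph config set osd osd_memory_target 2147483648   # e.g. cap OSD memory at ~2 GiB instead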

    • @MarkConstable
      @MarkConstable 1 year ago

      @@apalrdsadventures Right, got it. At some point I want to do a tiny 3 node nanolab project, a bit like your $250 project, and I'd probably go with Proxmox on LVM (specifically not ZFS) and Gluster and see how few resources could be used to still end up with HA. Gluster can store qcow2 and raw VM images. My current 3 node microlab has been stable for 2 weeks now so if it doesn't require any attention for another month I might try a Gluster based nanolab project.

  • @martinhryniewiecki
    @martinhryniewiecki 1 year ago

    Fantastic explanation

  • @BigBadDodge4x4
    @BigBadDodge4x4 8 months ago

    Thanks for explaining how to INSTALL and set up the CEPH Dashboard! Your instructions on where and how to actually install said dashboard in THIS video are awesome! Can you please mark where in the video these instructions are? Thanks!

  • @drdaddydanger1546
    @drdaddydanger1546 1 year ago

    Thank you. I wonder how I can back up a Ceph pool in a good way.

  • @shephusted2714
    @shephusted2714 1 year ago

    great stuff - you need another NAS spinning-rust node to make the cluster fully redundant - do some home serving with WireGuard and an nginx reverse proxy on a cheap VPS - or two - think about OPNsense HA as independent hw nodes - it can do load balancing to the nodes and it may get better perf than you think - this is a great setup for a small/med biz since it is so easily and cheaply scalable for capacity #live migration

    • @apalrdsadventures
      @apalrdsadventures 1 year ago +1

      Answering all your comments in one place:
      - I'd need 3 nodes minimum with the 2+1 code. 1 drive each on 4 hosts would be better than 4 drives all on one host.
      - NVMe disks need relatively new hardware, something I don't have. But I'm working on a video with new hardware, just not Ceph yet.
      - I'll get to CephFS and distributed filesystems, it's a big topic and it deserves at least a few videos of its own

  • @peronik349
    @peronik349 1 year ago

    Good video as usual.
    Quick question about "data_pool" and "metadata_pool" in erasure-coding mode.
    Is there a "rule" or a "good practice" allowing us to define a good ratio between the capacity stored in the "data_pool" and the size of the SSD(s) that will host the "metadata_pool"

    • @apalrdsadventures
      @apalrdsadventures 1 year ago

      In my experience, the metadata pool is extremely small. It's less than a MB in my testing, although I don't have a ton of data on my test setup (the total capacity is only ~300GB).
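
      To check the actual ratio on a running cluster, the per-pool usage is visible with:
        ceph df detail   # STORED / USED per pool, so the metadata pool can be compared to the data pool
        rados df         # per-pool object counts and space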

  • @MultiMunding
    @MultiMunding 6 months ago

    How fast is this?
    How would a database like postgres perform inside the cluster on one node compared to running natively on the same hardware node but without ceph? I can't find much information about this, just that people have very diverse opinions on running kubernetes/databases/ceph

  • @mikebakkeyt
    @mikebakkeyt 11 months ago +1

    Great video but left me with a headache 🙂
    I'm just starting with PVE so I will likely leave HA and Ceph for later but just settle for replicating between two nodes and manual failover in case of node loss. Not the best but I need to run work on PVE now and don't want to be tearing down and re-installing constantly as I learn.
    I guess next step is a virtual PVE lab - I am *assuming* PVE can be installed on PVE 🙂

  • @AamraNetworksAWS
    @AamraNetworksAWS 1 year ago

    Hi, can you please show how to install the Ceph dashboard? I'm getting lots of errors.

  • @brunosalhuana7431
    @brunosalhuana7431 1 year ago +1

    Can you do a video about using Ceph RBD with samba?

  • @hpsfresh
    @hpsfresh 10 months ago

    Can I change the priority for an OSD so a VM works with the OSD on the same node (as far as I can tell it would be faster via the internal virtual 100 Gbit network)?

  • @thestreamreader
    @thestreamreader 1 year ago

    I have 3 nodes. Right now I have 1 built out with 32 GB RAM, a 2 TB HDD, and a 512 GB drive with Proxmox installed. How should I build the 1st node knowing I want to add 2 more later when I get the money? What filesystem should I put on the 2 TB HDD where the VMs/containers will be stored? I don't want to have to do this over when I get the other 2 ready, so I kinda want it to be ready to add them. I am going to build the other 2 out at the same time.

    • @apalrdsadventures
      @apalrdsadventures 1 year ago

      In general I do ZFS unless I have a good reason not to. This goes for both the OS drive and other drives. You can create separate ZFS pools now to plan on replacing one of them with Ceph later.
      If you want to add Ceph later you'd need to reformat the 2TB HDD, but it's possible to migrate in place (OS disk remains ZFS, 2 new OSDs get added on 2 new nodes, create pool in degraded state, move all VMs/CTs to new pool, then reformat 2TB HDD and add it as third OSD, let it replicate to third OSD).
      Depending on whether you care about the cluster-wide data guarantees you get from Ceph, you could just keep ZFS on the 2TB drive; it's not a bad solution either.
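
      A very rough sketch of that in-place migration, with made-up names (the exact flags are worth checking against the pveceph and qm man pages):
        pveceph osd create /dev/sdb                 # on each new node, turn the blank disk into an OSD
        pveceph pool create ceph-vm --add_storages  # create the (initially degraded) replicated pool and register it as storage
        qm move_disk 100 scsi0 ceph-vm --delete     # move a VM disk onto the Ceph storage, deleting the old copy
        pveceph osd create /dev/sdX                 # finally, after wiping the old 2TB drive, add it as the third OSD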

  • @hoaxbuster78
    @hoaxbuster78 1 month ago

    I tried to install the Ceph dashboard, do you have a tutorial? Thanks!

  • @ap5672
    @ap5672 1 year ago

    Thank you for the guide. I created an SSD cache pool for my HDD pool with your guide. Which pool do I add in proxmox as the storage for my VMs, the HDD pool?

    • @apalrdsadventures
      @apalrdsadventures 1 year ago

      If you're using metadata + data pools, you'll add the SSD pool in the GUI and add the HDD pool in the config file ("data-pool" is HDD)

    • @ap5672
      @ap5672 1 year ago

      @@apalrdsadventures I am not using separate metadata + data pools. The metadata is in the OSD itself.
      In this case, which pool do I add into the PVE storage? The SSD cache or the HDD pool?
      Thank you for the excellent Proxmox guides. You have a new sub.

    • @apalrdsadventures
      @apalrdsadventures 1 year ago +1

      So you're using separate DB/WAL disks? In that case, there is only one pool and you add that to the GUI.
      If you're using the 'tiered storage' commands from the Ceph documentation, be aware that they specifically recommend not using it for RBD. But you'd select the HDD pool in that case.

    • @ap5672
      @ap5672 1 year ago

      @@apalrdsadventures I am using the tiered storage from ceph documentation. Thanks for the warning. I wonder why it isn't recommended.

    • @apalrdsadventures
      @apalrdsadventures 1 year ago

      Filesystems tend to do a lot of disk IO across the whole device in the background (i.e. ZFS scrubs and other filesystem cleanup) that would pull blocks into cache unnecessarily, and copy-on-write filesystems also tend to pull the entire disk into cache as blocks are allocated and discarded when a file is modified. With CephFS and RGW, on the other hand, Ceph is directly aware of the IO operation and knows which files should be kept in cache.

  • @pavlovsky0
    @pavlovsky0 1 year ago

    I run TrueNAS Core on my Proxmox cluster as a VM. I pass through some spinning disks to the VM and use those for ZFS. Could a CephFS implementation replace this? The TrueNAS is rock solid btw; I've very happily used that ZFS pool for years, and when I've had disk issues I replaced the disks easily, resilvering and recovering with no data loss. I also use Proxmox Backup Server with an LTO4 tape drive and I back up my Proxmox cluster. It gets a bit hokey when I back up my TrueNAS data using this.

    • @pavlovsky0
      @pavlovsky0 1 year ago

      Also thank you for your content. You and electronicswizardry do some excellent proxmox videos.

    • @apalrdsadventures
      @apalrdsadventures 1 year ago

      CephFS *could* replace it, or you could mount a dataset from the Proxmox host ZFS pool into an LXC container (and manage sharing there), or install samba and manage sharing on the host.
      Ceph (including CephFS) doesn't really scale down to single servers well though, although it's possible. ZFS is certainly much more optimized for in-memory transactions.
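
      A minimal example of the bind-mount route (container ID, dataset path, and mount point are placeholders; unprivileged containers may also need UID mapping):
        pct set 101 -mp0 /tank/media,mp=/mnt/media   # bind-mount a host ZFS dataset into CT 101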

  • @datenkralle1406
    @datenkralle1406 1 year ago

    Perfect, I'm in the process of deploying a new Proxmox/Ceph cluster at my company.
    I created an SSD pool I want to use for I/O-hungry machines, but creating a machine on the SSD pool makes no difference in read and write I/O (tested with dd)
    vs. putting it on the default manager pool (maybe because that pool uses these SSDs too, since they are also available in it?).
    If you could point me in a direction it would be highly appreciated.

    • @apalrdsadventures
      @apalrdsadventures 1 year ago

      On a pool of mixed drive types, with the failure domain set to Host, Ceph will end up selecting 3 hosts and then from the host group selecting an OSD from that host based on the OSD's weight (by default the weight is the capacity in TB). Some of these PGs will end up with one or more SSDs in the mix of course.
      All IO goes to the 'first' OSD holding the PG, which will then do IO on the replicas if necessary. So a write op requires at least two of the replicas to complete and a read op can be satisfied directly by the 'first' OSD. If that first OSD is an SSD, then the whole PG will get the read performance of the SSD and the write performance of the second fastest disk.
      The disk image will end up spread across many PGs and normal performance will average out, but in a purely sequential, single queue depth workload you'll end up hitting the same PG for a while and probably running into sequential network bandwidth limits, especially when writing (as data needs to pass over the network 3 times).
      FIO is better for measuring IO bandwidth, especially if you know the block size and queue depth your application supports (i.e. for databases).
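
      As a hedged example of the kind of fio run that is more representative than dd (pool, image name, and sizes are placeholders):
        # inside the VM, against its virtual disk:
        fio --name=randrw --ioengine=libaio --direct=1 --rw=randrw --bs=4k --iodepth=32 --numjobs=1 --size=4G --runtime=60 --time_based
        # or directly against an RBD image from a Ceph client node:
        rbd create ceph-vm/fio-test --size 4G
        fio --name=rbdtest --ioengine=rbd --clientname=admin --pool=ceph-vm --rbdname=fio-test --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 --runtime=60 --time_based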

  • @ewenchan1239
    @ewenchan1239 6 months ago

    Does this mean that you would need separate block devices (drives) so that you would be able to create separate replicated and erasure-coded pools?
    Or are you able to use the same drives for both?
    I just bought three mini PCs and it only has a single 2242 M.2 NVMe slot per mini PC.
    So right now, I've managed to partition the drive so that I was able to install Proxmox on it and then use the remainder of the drive as the Ceph OSD, but it's only the replicated pool.
    For me to do this with only one 2242 M.2 NVMe SSD, does that mean that I would have to repartition the SSD so that I would actually split the remaining space into two partitions - one for the Ceph pool replicated rule and one for the Ceph erasure-coded pool?
    Thoughts?
    Your help is greatly appreciated.
    Thank you.

    • @apalrdsadventures
      @apalrdsadventures 6 months ago

      You can have multiple pools on a single set of drives. The rules for each pool will take effect for that pool only, so there need to be enough drives in the system for it to meet the rules without overfilling one drive.

    • @ewenchan1239
      @ewenchan1239 6 months ago

      @@apalrdsadventures
      Agreed.
      So I have a 512 GB 2242 M.2 NVMe SSD in each of the 3 nodes.
      100 GB has been allocated for the Proxmox install + 8 GB for swap (per node).
      That leaves about ~404 GB (~376 GiB) available for Ceph.
      Right now, in its current state, I created the first Ceph OSD (per node) which is used in a ceph replicate pool.
      But after watching your video, maybe I can repartition that such that the fourth partition (/dev/nvme0n1p4) in each of the drives is maybe 50-100 GB which will be the OSD for the Ceph replicate pool and then the remainder (~304 GB/~296 GiB) can be used for the Ceph erasure pool (/dev/nvme0n1p5), right?
      I just want to double check with you whether I have understood the material that you have presented in your video correctly and properly.
      And that should still allow for HA failover in case one of my nodes dies -- that the VMs and/or CTs should be able to failover onto the other remaining nodes, correct?
      Your help is greatly appreciated.
      Thank you.

    • @apalrdsadventures
      @apalrdsadventures 6 months ago

      A given pool isn't tied to specific OSDs (disks / partitions). All pools can use all free space on all disks, according to their rules.
      So no need to partition into replicated / erasure. Replicated and erasure-coded pools can share the same disks.
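
      A quick way to see this on a test cluster (pool names are examples, and the erasure profile is assumed to exist already):
        ceph osd pool create rep-pool 32 32 replicated
        ceph osd pool create ec-pool 32 32 erasure ec-21
        ceph osd pool ls detail   # both pools listed with different rules but backed by the same OSDs
        ceph df                   # both pools draw from the same raw capacity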

    • @ewenchan1239
      @ewenchan1239 6 months ago

      @@apalrdsadventures
      Thank you.
      I will have to play around with setting that up to see if I would be able to put both the erasure coded pool and the replicated pool on the same OSD.
      Your help is greatly appreciated.
      *edit*
      Just tried it. It works! Yay! Thank you!
      Happy New Year!

  • @voldllc9621
    @voldllc9621 1 year ago

    Good stuff, but I did not see how you tailor the number of placement groups for the pools.
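
    For what it's worth, placement group counts can be adjusted after the fact or left to the autoscaler; a short sketch with a placeholder pool name:
      ceph osd pool autoscale-status                  # see what the PG autoscaler recommends
      ceph osd pool set ceph-vm pg_autoscale_mode on  # let Ceph manage pg_num for this pool
      ceph osd pool set ceph-vm pg_num 128            # or set it manually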

  • @nevermetme
    @nevermetme 1 year ago

    Have you checked out the `pveceph pool create` command with the --erasure-coding parameter? That makes it quite a bit easier to use EC pools in Proxmox. :)

    • @apalrdsadventures
      @apalrdsadventures 1 year ago

      It's about the same amount of work to do it from the command line vs GUI, since you need to do a ceph osd crush rule create for the metadata pool also.

    • @nevermetme
      @nevermetme 1 year ago

      True, for a cluster that requires quite specialized settings, like failure domain on the OSDs instead of hosts and device class specifics, the metadata pool needs a different CRUSH rule if it should match the EC pool in these things :)
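
      Roughly what that combination looks like; the erasure-coding property string and the generated pool names are written from memory here and should be checked against the pveceph man page:
        # EC pool via Proxmox's wrapper (also creates a replicated metadata pool alongside it)
        pveceph pool create ec-ssd --erasure-coding k=2,m=1,device-class=ssd,failure-domain=osd --add_storages
        # matching CRUSH rule so the metadata pool also lands on SSDs with failure domain = osd
        ceph osd crush rule create-replicated ec-ssd-meta-rule default osd ssd
        ceph osd pool set ec-ssd-metadata crush_rule ec-ssd-meta-rule   # metadata pool name is an assumption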

  • @krzycieslik6650
    @krzycieslik6650 1 year ago

    CrystalMark shows me 2.26 (read) and 1.53 (write) on random 4 KiB transfers.
    Where can I find instructions on how to make it faster?

    • @GrishTech
      @GrishTech 1 year ago +1

      Ceph has very nice redundancy and scalability, but not random I/O performance, especially at queue depth 1.

  • @minecrafter9099
    @minecrafter9099 5 months ago

    The latest Proxmox and Ceph don't seem to get along very well with the dashboard, so to create erasure-coded pools it seems like the CLI is the only option.

    • @apalrdsadventures
      @apalrdsadventures 5 months ago

      Yeah, unfortunately it's a bug in Python that's impacting the dashboard, so the dashboard doesn't really work for anyone (Proxmox or not).

  • @shephusted2714
    @shephusted2714 1 year ago

    running it on 2.5 would be nice - making everything nvme would be nice since there is real price parity now - don't leave perf on the table - run some simple benchmarks after you get it all built out - consider spinning up vms on ws to join the cluster and as a stopgap until you can get more nodes

  • @shephusted2714
    @shephusted2714 1 year ago

    pls bench the netfs and also compare to other options like ocfs2/gluster over zfs/sshfs/smb/nfs - having a couple of NASes may be a good way to have more than 1 netfs and do backups? this is all semi-advanced, great to learn and experiment with - most people and SMBs will want the most basic setup but with the best price/perf and an easy UI - ymmv - i think the toughest part is just the setup - maybe have a website with the howtos? #simple machines

  • @copper4eva
    @copper4eva 1 year ago

    This may be a weird/bad idea, but is there any way to make an SSD pool that puts its replicated data on HDDs? The idea is to get 100% efficient use of your SSDs, and have all the replication overhead on the HDDs. Obviously if you were ever to lose an SSD you would then be stuck on HDD speeds for the time being.
    You could definitely do this manually by simply making two pools, and using some program to just copy all data on the SSD pool to the HDD pool. And set the SSD pool to have no replication. I was just curious if Ceph has anything built in to do something like this.

    • @apalrdsadventures
      @apalrdsadventures 1 year ago

      No, all copies share the same storage rule (which is either any device or a specific device class). But you wouldn't want to do this either: since Ceph acknowledges writes when 2 out of 3 replicas have completed, the fastest two of the three drives will end up determining the write speed of the whole operation. The write will stall until at least 2 replicas are in place, and the pool will be degraded if the third replica is substantially behind the other two.

    • @copper4eva
      @copper4eva 1 year ago

      @@apalrdsadventures
      I just found out there are features that mix SSDs and HDDs in the same pool. They are called hybrid pools, and there is also primary affinity. With 3 replicas, for example, you can have it write the 1st copy to an SSD OSD, and then the other 2 copies to HDDs. As you point out, this will not speed up write speeds, as you will still have to write to the slow HDDs. But it will speed up read speeds substantially.
      I only just now read about this. But I would be curious if this is viable with erasure coding too, rather than just replication.

    • @apalrdsadventures
      @apalrdsadventures 1 year ago

      You're right, it's possible by writing completely custom CRUSH rules - docs.ceph.com/en/latest/rados/operations/crush-map/#custom-crush-rules
      Depending on your workload it might be easier / better to use tiering to keep recent data in SSDs. That's my plan, with CephFS and video editing data.
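
      For the curious, the hybrid approach from the docs boils down to a custom CRUSH rule along these lines (untested sketch; the rule id and class names would need adapting), optionally combined with lowering primary affinity on the HDD OSDs so reads prefer the SSD copy:
        rule hybrid-ssd-primary {
            id 10
            type replicated
            step take default class ssd
            step chooseleaf firstn 1 type host
            step emit
            step take default class hdd
            step chooseleaf firstn -1 type host
            step emit
        }
        ceph osd primary-affinity osd.7 0   # example: discourage an HDD OSD from being chosen as primary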

  • @yevhenbryukhov
    @yevhenbryukhov 1 month ago

    White theme is a big misconfiguration 😜😄

  • @allards
    @allards 11 months ago

    I noticed that with:
    rbd: ceph-ssd
        content rootdir,images
        krbd 0
        pool ceph-nvme
        data-pool ceph-ssd
    both storages display the same disk images. Must be the metadata, but it's not so elegant if it's shared with a regular pool.
    I ended up creating a metadata pool:
    rbd: ceph-ssd
        content images,rootdir
        krbd 0
        pool ceph-ssd_metadata
        data-pool ceph-ssd
    rbd: ceph-ssd_metadata
        content images,rootdir
        krbd 0
        pool ceph-ssd_metadata
    Now it makes sense again!

    • @apalrdsadventures
      @apalrdsadventures 11 months ago +1

      Yes, the metadata is shared so Proxmox sees it as two sets of metadata. I don’t really mind but your solution does fix that.