$250 Proxmox Cluster gets HYPER-CONVERGED with Ceph! Basic Ceph, RADOS, and RBD for Proxmox VMs

  • Added Jul 24, 2024
  • I previously set up a Proxmox high availability cluster on my $35 Dell Wyse 5060 thin clients. Now I'm improving this cluster to make it *hyperconverged*. It's a huge buzzword in the industry right now, and it basically means combining storage and compute in the same nodes: each node has some compute and some storage, and both the storage and the compute are clustered. In traditional clustering you have a storage system (SAN) and a compute system (virtualization cluster / Kubernetes / ...), so merging the SAN into the compute nodes means all of the nodes are identical and network traffic, in aggregate, flows from all nodes to all nodes without a bottleneck between the compute half and the SAN half.
    Today I am limiting this tutorial to only the features provided through the Proxmox GUI for Ceph, and only to RBD (RADOS Block Device) storage (not CephFS). Ceph is a BIG topic for BIG data, but I'm planning on covering erasure-coded RBD pools followed by CephFS in the future. Be sure to let me know if there's anything specific you'd like to see.
    Merging your storage and compute can make sense, even in the homelab, if we are concerned with single point failures. I currently rely on TrueNAS for my storage needs, but any maintenance to the TrueNAS server will kick the Proxmox cluster offline. The Proxmox cluster can handle a failed node (or a node down for maintenance), but with shared storage on TrueNAS, we don't get that same level of failure tolerance on the storage side, so we are still a single point of failure away from losing functionality. I could add storage to every Proxmox node and use ZFS replication to keep the VM disks in sync, but then I either need to give in to having a copy of all VMs on all nodes, or individually pick two nodes for each VM and replicate the disk to those two (and create all of the corresponding HA groups so they don't get migrated somewhere else).
    With Ceph, I can let Ceph deal with storage balancing on the back end, and know that VM disks are truly stored without a single point of failure. Any node in the cluster can access any VM disk, and as the cluster expands beyond 3 nodes I am only storing the VM 3 times. With erasure coding, I can get this down to 1.5 times or less, but that's a topic for a future video.
    As a bonus, I can use CephFS to store files used by the VMs, and the VMs can mount the cluster filesystem themselves if they need to, getting the same level of redundancy while sharing the data with multiple servers, or gateways to NFS/SMB. Of course, that's also a topic for a future video.
    Link to the blog post:
    www.apalrd.net/posts/2022/clu...
    Cost accounting:
    I actually spent $99 on the three thin clients (as shown in a previous video). I spent another $25 each for 8G DDR3L SODIMMs to upgrade the thin clients to 12G each (1 8G stick + the 4G stick they came with). And I spent $16 each on the flash drives. Total is $222, so round up to $250 to cover shipping and taxes.
    My Discord server:
    / discord
    If you find my content useful and would like to support me, feel free to here: ko-fi.com/apalrd
    Chapters:
    00:00 - Introduction
    00:35 - Hardware
    02:13 - Ceph Installation
    06:15 - Ceph Dashboard
    08:06 - Object Storage Daemons
    16:02 - Basic Pool Creation
    18:04 - Basic VM Storage
    19:34 - Degraded Operation
    21:50 - Conclusions
    #Ceph
    #Proxmox
    #BigData
    #Homelab
    #Linux
    #Virtualization
    #Cluster
    Proxmox is a trademark of Proxmox Server Solutions GmbH
    Ceph is a trademark of Red Hat Inc
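For reference, a minimal CLI sketch of the install -> OSD -> pool workflow the video walks through in the GUI. The pool name, device path, and subnet below are assumptions; the GUI steps listed in the chapters above are equivalent.

```sh
# on each node: install the Ceph packages (same as the GUI wizard)
pveceph install

# once, on the first node: write the initial ceph.conf
# (10.0.0.0/24 is an assumed Ceph network - substitute your own)
pveceph init --network 10.0.0.0/24

# on each node: one monitor per node for quorum
pveceph mon create

# on each node, once per data disk (assumed to be /dev/sdb here)
pveceph osd create /dev/sdb

# once: a 3/2 replicated RBD pool, registered as Proxmox VM storage
pveceph pool create vmdata --add_storages 1
```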
  • Science & Technology

Comments • 209

  • @robertopontone
    @robertopontone Před 2 lety +42

    Your channel has great potential, it already has its own style. I hope you keep the momentum, I will be watching: meaning I find your video useful and interesting 😁Thanks.

    • @apalrdsadventures
      @apalrdsadventures  Před 2 lety +7

      Thanks! I'm really enjoying these projects, and sharing them with you all

  • @gustersongusterson4120
    @gustersongusterson4120 Před 2 lety +12

    Thanks for the video, I really appreciate the practical step by step explanation. Looking forward to more ceph videos!

    • @apalrdsadventures
      @apalrdsadventures  Před 2 lety +2

      Thanks! I have a video on custom rules (forcing HDD/SSD/NVMe for a pool) and Erasure Coding next, which is the first step beyond what the Proxmox GUI can provide on its own. Proxmox *just* added Erasure Coding support on their end in February, so AFAIK it's not even in the subscription branch yet and not in the GUI either

  • @ewenchan1239
    @ewenchan1239 Před 6 měsíci

    LOVE this video!
    Thank you!
    I'm just getting around to setting up my 3-node HA Proxmox Cluster with ceph and this video is TREMENDOUSLY helpful.

  • @mikebakkeyt
    @mikebakkeyt Před 11 měsíci +5

    Excellent content, thank you. Ceph scares me but I will get there at some point, hopefully. Really like your editing, which removes all the whitespace that too many others leave in.

  • @mistakek
    @mistakek Před 2 lety

    Great video. I'm going to have to watch this a few times. You really go into great detail on Proxmox, which is exactly what I've been looking for.

    • @apalrdsadventures
      @apalrdsadventures  Před 2 lety

      Glad you like it! I have another episode in the works on erasure coded pools, but it's just too long for one video

  • @JosephJohnson-sq4bu
    @JosephJohnson-sq4bu Před rokem +1

    Just stumbled into this absolute gem. Thank you for the incredible content

  • @twincitiespcmd
    @twincitiespcmd Před 2 lety +1

    Really appreciate the step by step detail on CEPH as it relates to Proxmox and an inexpensive home lab setup. Looking forward to future videos!

    • @apalrdsadventures
      @apalrdsadventures  Před 2 lety

      Thanks!

    • @twincitiespcmd
      @twincitiespcmd Před 2 lety

      @@apalrdsadventures I have started to look at your other videos. They're great! Technical step by step how tos for the home lab without breaking the budget. I really like that you have content on Proxmox. One minor suggestion. If you could enlarge your screen when you are typing commands so we can follow along with what you are doing would be great. Thanks for all you doing!

    • @apalrdsadventures
      @apalrdsadventures  Před 2 lety +1

      Some programs are definitely harder to capture than others, it's something I'm trying to improve as I get better at the production side

  • @rimonmikhael
    @rimonmikhael Před rokem +2

    You got it man .. I read hyperconverged and I was like damnnnn lol 😆 .. $250 for what we pay millions for between Azure, AWS and 3 data centers .. that's gotta be awesome 👌 👏 but it's still fun to watch, thank you

    • @apalrdsadventures
      @apalrdsadventures  Před rokem +2

      It's awesome that it works at all, but definitely way too many corners cut in storage bandwidth to make it usable for anything real

  • @kelownatechkid
    @kelownatechkid Před 19 dny

    Fun video! It's cool that Proxmox makes Ceph available to new users in a stripped-down way. I've found it to be excellent for home use; since it has none of the limitations of traditional NAS solutions, it allows the kind of random cobbled-together setups that are common outside of enterprise. CephFS in particular is an absolute godsend, and as a whole Ceph is among the most reliable software projects I've ever used. Issues have always been possible to work out and I've never lost any data despite hardware failures. I've changed and added parts, altered CRUSH configs, and upgraded across major versions without any downtime too (from 15->16->17 over the years). It's real FOSS too (I've had some small PRs merged) and despite the IBM happenings, it actually feels like the community aspect is still growing

  • @dn4419
    @dn4419 Před rokem

    This was really helpful. Great explanation and just the right amount of detail for me. Thank you very much!

  • @bluesquadron593
    @bluesquadron593 Před 2 lety +5

    Man, super enjoyed this presentation of CEPH

    • @apalrdsadventures
      @apalrdsadventures  Před 2 lety +3

      Thanks! I've really enjoyed working with Ceph, but it's just too much content for a single video. This should at least be enough to get started.

    • @bluesquadron593
      @bluesquadron593 Před 2 lety +2

      @@apalrdsadventures Yeah, I didn't get into it much at all, just set it up from some YouTube tutorial (not much to it) and I'm enjoying it in my three node cluster. My drives are NVMe, so I get decent speed moving VMs around.

    • @apalrdsadventures
      @apalrdsadventures  Před 2 lety +3

      That would certainly speed things up! I filmed a segment on how to set a crush rule to force pools to a specific device type, and also on creating and adding erasure coded pools to Proxmox, so those will make the next episode.

  • @Chris_Cable
    @Chris_Cable Před rokem +10

    If anyone is getting the message "Error ENOENT: all mgr daemons do not support module 'dashboard', pass --force to force enablement" when trying to enable the dashboard, running apt install ceph-mgr-dashboard on all the nodes in the cluster fixed it for me.

    • @marconwps
      @marconwps Před 2 měsíci

      Proxmox 8.2.2: I tried it and it doesn't work. I hope it gets integrated in the next release of the software.
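For anyone following along, the fix @Chris_Cable describes would look roughly like the following. The commands come from the upstream Ceph dashboard docs, the admin user and password file are hypothetical examples, and (as the reply above notes) dashboard support varies with the Ceph release Proxmox ships.

```sh
# on every node that runs a ceph-mgr daemon
apt install ceph-mgr-dashboard

# then, once, from any node
ceph mgr module enable dashboard
ceph dashboard create-self-signed-cert

# hypothetical admin user; the password is read from a file
echo -n 'changeme' > /root/dashboard-pass
ceph dashboard ac-user-create admin -i /root/dashboard-pass administrator
```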

  • @funkijote
    @funkijote Před 2 měsíci +1

    So youtubers are just out here hyperconverging their hardware for views now? Disgusting! (Thanks, this was very helpful)

  • @enkaskal
    @enkaskal Před rokem +1

    outstanding experiment! thanks for sharing 😀👍🏆

  • @curtisjones8795
    @curtisjones8795 Před rokem

    Love your channel. Thanks for the great video!

  • @goodcitizen4587
    @goodcitizen4587 Před 10 měsíci +1

    Thanks, very good demo and presentation.

  • @PCMagikHomeLab
    @PCMagikHomeLab Před 2 lety

    great vid! Nice to see You again in new project :)

  • @blckhwk8024
    @blckhwk8024 Před rokem

    Nice explanation, thanks!

  • @araujobsdport
    @araujobsdport Před rokem

    Really nice example! Well done :)

  • @MarkConstable
    @MarkConstable Před rokem

    I'm about to jump into Ceph so I watched this one again and really appreciate your Ceph coverage. We anxiously await the next PROMISED instalment. Heh, no pressure 🙂

    • @apalrdsadventures
      @apalrdsadventures  Před rokem +1

      I actually just reinstalled everything with the latest versions (PVE 7.3 / Ceph 17) two days ago, for an episode on more diverse pools (erasure coding, SSD/HDD mix, different failure domains, tiered caching ...). Actually started filming already!
      CephFS is still on the horizon though.

    • @MarkConstable
      @MarkConstable Před rokem

      @@apalrdsadventures I am really looking forward to this next one. I should have 4 nodes ready to go tomorrow, if mr bezos is on time. Three of them with a pair of 2TB SSDs... a Minisforum HM90, a Terramaster F2-423 and a QNAP TS-453D, all with a pair of 2.5 GbE nics. A bit of a mongrel cluster but it's at least all-flash based and should be good enough to get me through 2023.

    • @apalrdsadventures
      @apalrdsadventures  Před rokem +1

      I'm actually planning on talking about mixing SSDs and HDDs too, so no need for all flash. It's a super well supported use case with Ceph.

    • @MarkConstable
      @MarkConstable Před rokem

      @@apalrdsadventures The other problem I am trying to solve is that my 10 year old HP Microserver running PBS barely gets to 30 MB/s, so if I reboot a VM with 2+ TB of storage it can take 10+ hours to rebuild its dirty-bitmap.

    • @apalrdsadventures
      @apalrdsadventures  Před rokem +1

      I actually have an HP Microserver too, and it's faster than any of my thin client based nodes.... so my setup will be a lot slower than yours.
      Maybe the VM doesn't need such large disks, and could mount CephFS (or RGW or RBD) on its own? In general I do separate network mounts within the VM for 'bulk' data and then deal with data replication of those separately from Proxmox

  • @zparihar
    @zparihar Před rokem

    Great video bud!

    • @apalrdsadventures
      @apalrdsadventures  Před rokem

      Glad you liked it! Plenty of Ceph projects in the works, eventually

  • @vinaduro
    @vinaduro Před 2 lety +2

    I was eagerly awaiting this video. 🙂

    • @apalrdsadventures
      @apalrdsadventures  Před 2 lety +3

      Hopefully it lived up to your expectations

    • @vinaduro
      @vinaduro Před 2 lety +1

      @@apalrdsadventures it might actually have made my life more complicated, because now I'm considering changing my Proxmox cluster to Ceph, instead of ZFS on top of Truenas. Both have pros and cons, I guess I need to figure out what's best for my situation. Although, considering it's a home lab, every situation is my situation. lol

    • @apalrdsadventures
      @apalrdsadventures  Před 2 lety +1

      The real benefit to Ceph is that you get redundancy at the host level in storage. With TrueNAS and ZFS you get drive failure redundancy, but not host failure. Proxmox clustering already has host level redundancy in compute, but if the storage isn't redundant then it becomes a single point failure (and also a traffic bottleneck over the network, potentially).
      Realistically, host failure is actually 'host down for maintenance' in the home lab, and is a real thing that does happen more frequently than we'd like.

    • @vinaduro
      @vinaduro Před 2 lety

      @@apalrdsadventures Yup, and the 'host down for maintenance' tends to cause a lot of complaints from the wife.
      I guess it's the same as trying to decide how much redundancy you can afford in an array. Cost vs. convenience.
      This is the reason why we have our home labs though, so we can play around, and change our minds whenever we want.

    • @apalrdsadventures
      @apalrdsadventures  Před 2 lety +1

      I still use TrueNAS + single node Proxmox for Home Assistant and keep as little as possible on that Proxmox box so experimentation doesn't break important things
      While filming the thin client series I had a bunch of problems with the Proxmox host running out of RAM and the OOM killer killing off the security camera VM, but now I can film that sequence on the cluster

  • @SB-qm5wg
    @SB-qm5wg Před rokem

    I didn't even know prox had a ceph wizard. Cool 👍

  • @eherlitz
    @eherlitz Před 2 lety

    Just what I was looking for, thank you!
    I mainly have the need for HA storage for containers like MinIO and logging from various machines (e.g. vm and docker), but where downtime of such storage is a no-go. I figured that ceph is a great match for this but I'd like to hear your opinion.

    • @apalrdsadventures
      @apalrdsadventures  Před rokem +1

      Ceph can be a great match depending on how big you need to scale. At the small scale, it's a big pain to setup and expand. As you scale up, the benefits to using Ceph over really anything else become huge.
      But, it can certainly keep storage highly available to go with other HA compute solutions.

  • @MikeDeVincentis
    @MikeDeVincentis Před 2 lety

    Awesome video. Just what I've been looking for to get started. I have 3 Dell R410s I'm currently building out. How does it work with expansion of storage across the cluster? If I put 1 physical drive in each server to use as an OSD (I have an SSD for boot and another SSD for cache, and can add 4 spinning drives), can I expand later by adding a single disk at a time to each server? I'd probably start with 3 x 10 TB drives for storage and expand as needed.

    • @apalrdsadventures
      @apalrdsadventures  Před rokem +2

      It depends on if you want OSD level or host level redundancy. By default the rules require host-level. So having 3x R410s with 1x 10TB each will get you 30TB in the pool and 10TB usable with replication (or 20TB with erasure coding, although not all pool types can be erasure coded). Adding 1x 10TB to a random server won't get you anything since it can't maintain host level redundancy, but adding 1x 10TB to each server will double your capacity. However, it will suddenly spend a bunch of time moving data around to rebalance the cluster when you do this, so performance might take a hit during the process.
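A rough sketch of what "adding one 10TB drive to each server" looks like in practice, plus commands to watch the rebalance. The device name is an assumption.

```sh
# assumed: the new drive shows up as /dev/sdc on each node; run per node
pveceph osd create /dev/sdc

# watch data rebalance onto the new OSDs; client IO keeps running meanwhile
ceph -s
ceph osd df tree
```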

  • @amyslivets
    @amyslivets Před rokem

    Cool. Keep going 👍🏻

  • @user-gw9el1ew2f
    @user-gw9el1ew2f Před 2 lety

    Great video! Can't wait for Ceph file clustering in the next YouTube video

    • @apalrdsadventures
      @apalrdsadventures  Před rokem +1

      I can't wait either, Ceph is a monster topic though so working through basic RBD first.

  • @CJ-vg4tg
    @CJ-vg4tg Před 3 měsíci

    Hi there. Thanks for the detailed vids. Is there any way of installing the mgr dashboard on Proxmox 8?

  • @JoeLerner-tu5oc
    @JoeLerner-tu5oc Před rokem +1

    Thanks!

  • @cberthe067
    @cberthe067 Před 10 měsíci

    Do you plan to continue your video series on Ceph? Covering CephFS, the Ceph balancer, Ceph healing, etc ...

  • @JuriCalleri
    @JuriCalleri Před 2 lety +1

    I subscribed (and liked) because these videos came at the perfect moment! I'm trying to build a Proxmox cluster, hyper-converged and HA, but I only have 2x identical computers (simple Ryzen 3 4c/8t 32GB rigs), an Intel NUC 8th gen and 1 Raspberry Pi 4.
    Your prev. video helped me understand the Qdevice and how to actually get a cluster with quorum to work, and I can use that on the hardware I have, but it is not clear to me if I can install Ceph (or GlusterFS) on the 2 identical nodes and leave the Intel NUC or the Raspberry Pi out of it, yet still replicate the VM disk onto them for that single VM that, no matter what, has to be HA.
    Or, maybe, do Ceph and Gluster only work when installed on 3 nodes?
    Like, literally, your videos showed up at the perfect time! Like the Room of Requirement in Harry Potter. That's wicked!
    Thanks!

    • @apalrdsadventures
      @apalrdsadventures  Před 2 lety +2

      Ceph doesn't handle 2 node clusters nearly as gracefully as Proxmox does. You do not need more than 1 monitor, but if you have more than 1, then you will have quorum requirements for the monitors which mean you really need 3. Additionally, you will have replication requirements at the host level that may not be able to be fulfilled with only 2 nodes - by default the 3/2 replication rule requires there be 3 OSDs to store a given placement group, on 3 different hosts. With only 2 hosts, it will be forever stuck at the min_size, which means failure of either host takes the pool offline (see the sketch after this thread for checking a pool's size/min_size).
      You should be able to run the Ceph monitor on the Pi 4 to get to the quorum requirement, although I'm not sure who builds the latest version of Ceph for aarch64. That doesn't fix the replication issue. You can reduce the replication rule from host to OSD (so then you need at least 3 disks instead of 3 hosts), but a single host failure can then bring you at or below min_size and again take the pool offline.
      45Drives recommends starting with extra Ceph nodes in virtual machines initially (i.e. 3x VMs on 1x host, migrate the 3x VMs to 3x hosts when you build more nodes) to deal with this issue without configuring your cluster in a way that allows less redundancy, if you plan on growing into a proper cluster in the future. This just saves you from having to recreate pools with new rules when you expand into a proper HA setup, but doesn't fix the HA issue for a 2 node Ceph cluster.
      In your setup I'd recommend using ZFS instead of Ceph and relying on Proxmox's ZFS replication for HA VMs. They will potentially be 'behind' (since it syncs the VM disks every 15 minutes instead of truly keeping the storage coherent across the network like Ceph does), but it works in 2 node clusters.

    • @JL-db2yc
      @JL-db2yc Před 2 lety

      @@apalrdsadventures thank you for this detailed answer! I have a similar setup to what Juri Calleri described and had the same question. Based on your recommendation I will keep to ZFS.
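For reference, a quick way to check the replication rules this thread talks about on your own pool (the pool name 'vmdata' is an assumption):

```sh
ceph osd pool get vmdata size       # replicas Ceph wants (3 by default)
ceph osd pool get vmdata min_size   # replicas required to keep IO going (2 by default)
ceph health detail                  # on a 2-host cluster, expect undersized/degraded PGs
```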

  • @subhobroto
    @subhobroto Před 2 lety +1

    Great ceph videos! It would be awesome to learn how to replicate data between 2 separate ceph clusters for geographical data redundancy

    • @apalrdsadventures
      @apalrdsadventures  Před 2 lety +2

      I'm currently working through a series of videos that cover one ceph cluster, erasure coding, CephFS, etc.

    • @subhobroto
      @subhobroto Před 2 lety +1

      ​@@apalrdsadventures got it. Yeah - if you showed how to expose RBDs to systems that are not in Proxmox, that would be nice.
      Imagine I have a Proxmox HA cluster and wanted to expose a reliable (due to Ceph) volume to another machine (external to Proxmox, say a PC/Laptop/Raspi) on the same network. The issue I have with Proxmox's Ceph is that they are behind Ceph releases, which is fine if Ceph just exists to support Proxmox storage but not so great if my objective is to use Ceph itself.

    • @apalrdsadventures
      @apalrdsadventures  Před 2 lety +2

      My plan was to expose CephFS to other systems, but RBD would be similar (using CephX to authorize new clients outside of Proxmox's automation would be the key bit).
      Proxmox is actually entirely up to date with Ceph's stable release (16.2.7), they don't use Debian's package repos for Ceph and have a deb repo just for up-to-date Ceph.

  • @fbifido2
    @fbifido2 Před rokem

    @6:23 - do you have to install the Ceph dashboard on each host?
    what if pve1 goes down, would you still have access to the Ceph-dashboard?

  • @thestreamreader
    @thestreamreader Před rokem

    Any experience with Harvester? It seems like it might be a great option and has a different clustering system.

  • @Catskeep
    @Catskeep Před 2 lety

    Thanks for the great video..!!
    I'm still a little confused because I have a language limitation..
    I want to ask: if I have a VM that is on host1, and then host1 has a problem, let's say it goes down, will Ceph automatically move the VM to host2 or host3? If the answer is yes, will the VM move in a powered-on or powered-off state?

    • @apalrdsadventures
      @apalrdsadventures  Před 2 lety

      By default, Proxmox will not move VMs.
      If you configure the VM as a high availability resource, then it will wait to be sure that the host has gone down (~3 minutes by default) before restarting it on another host. At that point, the VM will be booted from the disk image, so it won't transfer it live if the host goes down.
      You can configure it to transfer the VM live when the host is shut down (for maintenance).
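A minimal sketch of marking a VM as a high-availability resource, as described above. The VM ID 100 is hypothetical; the same settings are available under Datacenter -> HA in the GUI.

```sh
# register the VM with the HA manager and ask for it to be kept running
ha-manager add vm:100 --state started

# check what the HA stack is doing
ha-manager status
```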

  • @fbifido2
    @fbifido2 Před rokem

    @3:49 - what speed NIC should be used for public & private?
    Which one gets the most traffic?

  • @v0idgrim
    @v0idgrim Před 2 lety

    I have a question. In this setup, is the VM running on all the nodes (active-active) or is it running on one node (active-passive), and how long would recovery take for the VM to be usable / reachable again in the case of, say, a power failure on one of the nodes?

    • @apalrdsadventures
      @apalrdsadventures  Před 2 lety +1

      The Ceph side is all active, so data is always accessible from clients at any time. The Ceph monitor (web dashboard) is active-passive but you don't need it for data access
      Proxmox is active-passive, it will migrate a VM to a new node when a node goes down and restart it from disk.

  • @fbifido2
    @fbifido2 Před rokem

    How would you upgrade your hyper-converged setup?
    Hypervisor & Ceph???

  • @meroxdev
    @meroxdev Před rokem

    For a setup with 3 OptiPlexes, will Ceph storage work if each OptiPlex has only 1 SSD (so 1 SSD per node, where Proxmox will also be installed), or does Ceph need dedicated disks? Thank you!
    Amazing content 🤜🤛

    • @apalrdsadventures
      @apalrdsadventures  Před rokem +2

      It will work with only 1 disk per node. You'd probably want to install Proxmox on Debian instead of using the Proxmox installer, set up a custom partition layout with most of it going to Ceph, and then install Proxmox and Ceph on that.
      Capacity-wise, your options are only 3x replication (so the usable space is the size of the smallest host) or a k=2 m=1 erasure code (1 redundant shard, total space is double that of the smallest host).
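As a hedged sketch of that single-disk layout: the Proxmox tooling generally wants a whole unused disk for an OSD, but ceph-volume will accept a partition or LV left free by a custom install. The partition name below is an assumption.

```sh
# assumed: the custom install left /dev/nvme0n1p4 unused for Ceph
ceph-volume lvm create --data /dev/nvme0n1p4
```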

  • @johnwalshaw
    @johnwalshaw Před rokem

    What are your thoughts about using Optane on each Proxmox ceph cluster host for DB and WAL disk? e.g. each host with 2TB NVME OSD plus 118GB Optane for DB and WAL.

    • @manitoba-op4jx
      @manitoba-op4jx Před 10 měsíci +1

      i tried using an 8gb optane module as a boot drive for my all-HDD proxmox server so the mechanical drives could spin down when not in use. I found that it was too small.... a 16 or 32gb module would be adequate though.

  • @jaykavathe
    @jaykavathe Před 10 měsíci

    I used your guide to set up my 3 node Ceph cluster. But somehow during a Proxmox update my cluster seems to have become corrupted/unresponsive. Can I wipe out my nodes, reinstall Ceph and import my current OSDs/data? Any help or a video on reimporting Ceph would be hugely appreciated.

  • @zparihar
    @zparihar Před rokem

    Question: Can you add a WAL Disk and DB disk later? (after you've created your CEPH OSD's)?

    • @apalrdsadventures
      @apalrdsadventures  Před rokem

      Usual recommendation is to delete / re-create OSDs when changing things, although it may be possible to move the LVM VGs around

  • @patrickjoseph3412
    @patrickjoseph3412 Před 2 lety

    I have the Wyse 5010, it has a SATA port and will fit an SSD if you remove it from its case. The SSD I used was a Samsung 850 1TB.

  • @metafysikos13
    @metafysikos13 Před rokem +2

    Hello dear apalrd! Your awesome video guided me to set up a Ceph cluster inside my 3 node Proxmox cluster! Thank you very much for that!
    I have one question though. I'm using a separate 1TB NVMe disk on each Proxmox node just for Ceph. So my Ceph cluster is made of 3 OSDs, 3 monitors, 1 manager and 1 pool.
    I am also using a separate 10Gbit private LAN just for ceph's cluster/private network.
    Ceph's public network is using the 1Gbit uplink of each proxmox node.
    Everything works and I get no error messages whatsoever. But the "strange" thing is the read/write performance of Ceph. I was expecting something around 1 GByte per second of maximum performance, but instead I'm getting 160 MBytes per second of reads and writes when I benchmark Ceph. Is this normal?
    Also, when I use only the 1Gbit uplink for ceph's public AND private network, ceph's benchmark results are something like:
    - Reads: 150MBytes per second
    - Writes: 75Mbytes per second
    So, reads are about the same when using either the 1Gb uplink or the 10Gb LAN,
    and writes are doubled with 10Gb LAN.
    I feel that something is not right here.
    P.S. I also tested the 10Gbit LAN network performance from node to node using iperf and I get:
    - 9,4Gbit bandwidth
    - 1,1Gbyte of transfer per second.
    P.S.2 I am using a 10G switch which has a switching capacity of 320Gbit per second.
    P.S.3 Sorry for the long message! Have a good day and cheers!

    • @metafysikos13
      @metafysikos13 Před rokem +2

      With a little research, what I understand is that data destined for Ceph storage is transferred over the public network and then over the cluster network.
      So, if your public network is 1Gbit, you won't get the read/write speed you expect from the 10Gbit private network..
      Maybe I got it all wrong, I don't know.

    • @apalrdsadventures
      @apalrdsadventures  Před rokem +4

      Your research is correct. The 'public' network is what Ceph clients use to access Ceph data. The 'cluster' network is used by Ceph to transfer data among itself - for replication, erasure coding, and to rebalance PGs. The Ceph Client isn't the user end client, it's whatever software is accessing the Ceph cluster (often a gateway for user traffic). A normal write has the client connect to the 'first' OSD in a PG via the public network, and that OSD will then connect to the rest of the OSDs involved in replication / erasure coding via the cluster network (so client -> OSD X via public, OSD X -> OSD Y, OSD Z via cluster).
      In this case, Proxmox (qemu) is the 'client', so it will access Ceph via the public network. So going to 10G cluster will speed up writes since they normally involve 3 transfers and 2 of them will go across the cluster network, but not reads, since there is still the initial access via the public link either way.
      The reason you see >100MB/s (the expected limit with gigabit) is since one of the three OSDs is on the local system, so a random access has a 1/3 chance of going to the same system as the test and not going over the network at all.

    • @metafysikos13
      @metafysikos13 Před rokem +2

      @@apalrdsadventures I actually did it that way. I cleaned all ceph configuration from my 3 proxmox nodes and reconfigured it to use my 10G LAN as public and private network. I created my monitors and pools from scratch and now my benchmark results are way better. I get something like:
      - Reads: 1700MB/s
      - Writes: 650MB/s
      So now i have to simulate some tests so I can see if this performance is acceptable for my production environment between my web apps, desktop apps and databases.
      Dude thank you again so much! Keep up the good work!
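For anyone repeating what was done in this thread, the public/cluster split (or putting both on the 10G network, as above) comes down to two settings in ceph.conf; on a fresh Proxmox setup it can be set at init time. The subnet below is an assumption.

```sh
# fresh cluster: both the Ceph public and cluster networks on an assumed 10G subnet
pveceph init --network 10.10.10.0/24 --cluster-network 10.10.10.0/24

# existing cluster: edit public_network / cluster_network in /etc/pve/ceph.conf
# instead, then restart the mon and osd daemons
```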

  • @thanhlephuong7687
    @thanhlephuong7687 Před rokem

    Thanks so much! I've been in contact with Proxmox for a long time, but it only took me a few minutes to understand this.

  • @JohnSmith-yz7uh
    @JohnSmith-yz7uh Před 2 lety +2

    You can use ZFS as a backend for Ceph. This way you get the best of both, but speed is not a priority in that setup. Although that is true for ZFS in general.

    • @apalrdsadventures
      @apalrdsadventures  Před 2 lety +5

      You really can't use ZFS as a backend for Ceph.
      You can use ZFS as a backend for Gluster, since Gluster is a filesystem only and distributes files across the cluster to be stored on other filesystems. So ZFS underneath Gluster is a good idea.
      Ceph (with Bluestore) uses the raw disk, and you don't really gain any of the zfs benefits since Ceph already has all of those features on its own (data integrity checking, scrubs, data redundancy, snapshots) but get an extra layer of caching and read-modify-write.
      LVM to merge a few disks into a single OSD isn't a terrible idea, and LVM to split up an SSD into db disks is common, but LVM is way lighter than ZFS and isn't duplicating features that Ceph already has.

    • @JohnSmith-yz7uh
      @JohnSmith-yz7uh Před 2 lety

      @@apalrdsadventures Hmm, I could have sworn I've seen a tutorial on it.
      Could have been Gluster; all I remember was that the ZFS pools had to have the exact same name on each cluster member.

    • @apalrdsadventures
      @apalrdsadventures  Před 2 lety +2

      With Gluster it's the recommended setup to use ZFS.
      Older versions of Ceph stored data in chunk files on top of another filesystem, so back then ZFS may have been recommended. With the newer backend ('bluestore') it's not recommended to have much if anything between the OSD and the disk.
      With Proxmox, you can do ZFS replication with a separate ZFS pool on every single node (all having the same name) and Proxmox can sync data across the cluster using zfs replication. But then you have to keep a copy of the VM disk on any Proxmox node which could potentially run the VM if it were to be migrated due to HA rules.

  • @randomvideosoninternet7897
    @randomvideosoninternet7897 Před 10 měsíci

    Sir, how do I add the sdb storage on Proxmox? There is no sdb disk on my Proxmox.
    TQ

  • @ShahzadKhanSK
    @ShahzadKhanSK Před rokem

    Thanks for explaining the concept. I recently started tinkering with Proxmox. I have two physical and one virtual node. Each node has two 1TB SSDs, so 2TB per node. For HA, I am using a NAS (a single NVMe) and all of my HA VMs are stored there. Any idea how this storage could be configured to take advantage of Ceph?

    • @apalrdsadventures
      @apalrdsadventures  Před rokem

      Proxmox does work fine with 2 nodes, but Ceph really needs at least 3 nodes with storage to work. So, unless you want to migrate your NAS to Proxmox as well, it won't really be a good experience.

    • @ShahzadKhanSK
      @ShahzadKhanSK Před rokem

      @@apalrdsadventures Thanks for explaining this. I got all 3 nodes up and Ceph is working as expected. My situation: I have two SSDs, and the OS is running on a separate SSD in each node. Should I use both SSDs as OSDs, or one as an OSD and the second as a DB/WAL disk? What would be a good composition? Any idea?

    • @apalrdsadventures
      @apalrdsadventures  Před rokem +1

      DB/WAL disks are for when you have significantly faster storage available to store metadata. Since SSDs are (roughly) the same speed, they should be separate OSDs.
      If you have NVMe, sometimes it's recommended to partition a drive and run a few OSDs on it for better multithreaded performance of the OSD. For SATA this is not recommended.

    • @ShahzadKhanSK
      @ShahzadKhanSK Před rokem

      @@apalrdsadventures I have one 1TB NVMe and one 1TB SSD. I can create two 500G partitions on the NVMe and leave a single partition on the 1TB SSD. The OSD performance will improve on the NVMe because of two different threads, while the SSD still operates on a single thread. Did I picture it right?
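To illustrate the two layouts discussed above as commands, with the device names as assumptions:

```sh
# option A - SSDs of similar speed, each as its own OSD (what's suggested above)
pveceph osd create /dev/sdb
pveceph osd create /dev/sdc

# option B - OSD on the SATA SSD, with its DB/WAL on a faster NVMe partition
pveceph osd create /dev/sdb --db_dev /dev/nvme0n1p4
```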

  • @thinguyen937
    @thinguyen937 Před 10 měsíci

    I got this error, please tell me how to fix it : Error ENOENT: module 'dashboard' reports that it cannot run on the active manager daemon: PyO3 modules may only be initialized once per interpreter process (pass --force to force enablement)"

  • @karloa7194
    @karloa7194 Před 5 měsíci +1

    It has been a year now since you made this video. Are you still running Ceph?

    • @apalrdsadventures
      @apalrdsadventures  Před 5 měsíci +1

      Only in testing, I only have two 'real' nodes in the lab (+ the 3 thin clients) but the thin clients are too slow to do more than experiment.

  • @NirreFirre
    @NirreFirre Před rokem

    A bit too deep sys admin for me but ceph seems to be very similar to MongoDB clusters in a lot of areas. Cool but our ops have consolidated to use NetApp Ontap and Trident stuff. My dev teams just wants huge, robust and fast storage :)

  • @PonlayookMeemeskul
    @PonlayookMeemeskul Před rokem

    If there's a need to frequently migrate VMs across nodes (overcommitting the number of VMs on limited physical resources per node),
    would Ceph solve the problem of a newly migrated VM having its data available to start right away?
    And what is the "actual" usable storage once the setup has been completed following this tutorial?
    Thank you very much

    • @apalrdsadventures
      @apalrdsadventures  Před rokem +3

      Yes mostly.
      With Proxmox, migration across nodes will always require the VM to either shut down or the RAM to be migrated, which will take the VM offline for a short period. With any type of shared storage (Ceph, NFS, iSCSI, SMB), Proxmox will sync the VM disk to the shared storage right before it moves the VM, and rely on the shared storage to keep the VM disk changes up to date. With Ceph, you get shared storage that's also guaranteed to be consistent as entire hosts go down, whereas something like NFS/SMB you can cluster but it's harder to guarantee a file write is atomic across the whole cluster.
      The only case where you'll be waiting on VM data is when you use ZFS replication instead of shared storage.
      Actual usable storage is 1/3 of the total, since it's keeping 3 copies of the data, assuming all disks are equal sized. With some more manual setup you can use erasure coding for the data (still need 3x replication for metadata), which has math more like RAIDx (5,6,7,...) but usable capacity depends a lot more on how many nodes you have and how much storage is in each node. tl;dr it's usually better to have more nodes than large nodes in Ceph if you want host-level redundancy.

    • @PonlayookMeemeskul
      @PonlayookMeemeskul Před rokem

      @@apalrdsadventures Thanks a million for the very detailed answer, really appreciate it. My friend has a few GPUs laying around from his mining rig, so I'm building 2 gaming PCs for our daughters and us dads to play games remotely. So far, I'm juggling 4 VMs on these 2 hosts, where I'll "migrate" and boot up a VM on the host whose GPU isn't being used (kinda like the four of us doing a time-share on the GPUs lol).
      The only problem is the data. First I was gonna do simple game storage running on the NAS, but decided to look into Ceph, which seems like fun.
      Thanks again, looking forward to more of your vids. Cheers.

    • @apalrdsadventures
      @apalrdsadventures  Před rokem +2

      You'll likely have better performance in the VMs with a NAS, since Ceph tends to be better at parallel IOPS across many VMs and less good at throughput and latency for individual VMs. If you don't need to sustain the failure of a storage node, then a NAS will work fine.
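For the "actual usable storage" part of the question above, Ceph reports it directly; a quick check, with the pool name as an assumption:

```sh
ceph df                          # RAW capacity vs. per-pool MAX AVAIL / STORED
ceph osd pool get vmdata size    # the replica count behind the ~1/3 usable figure
```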

  • @niklasxl
    @niklasxl Před 2 lety +2

    So are there reasons not to use Proxmox as a NAS with this? Apart from parity/duplicate data needing to go over Ethernet instead of staying within a node? This seems like a flexible way for home servers to easily expand both compute and storage.

    • @apalrdsadventures
      @apalrdsadventures  Před 2 lety +7

      It's extremely flexible, but the minimum setup is 3 servers for a functional Ceph cluster. If you want an all-in-one solution, this is not it.
      If you want a highly available solution for both compute and data, this is definitely it. ZFS / TrueNAS is usually a single point of failure even with clustered Proxmox. You can keep backup copies synchronized so you don't lose data, but with Ceph the data is not only duplicated as it's written (so a sync write that completed is safe from a host failure immediately), but also keep the data online and available to clients during a host failure. It's like how RAID is not a backup but lets you continue operating when a drive fails, except for entire servers.
      The only downside is that you normally use an NFS and/or SMB gateway for filesystem users and that gateway server can become a single point failure for clients who are not native Ceph users. Proxmox is a native Ceph user, but your desktops/laptops probably are not and will go through a gateway server.

    • @niklasxl
      @niklasxl Před 2 lety +1

      @@apalrdsadventures oh this is really interesting might have to try it at some point :D

    • @apalrdsadventures
      @apalrdsadventures  Před 2 lety +3

      Basically, do you want to scale up or scale out? that's the zfs vs ceph question.

    • @niklasxl
      @niklasxl Před 2 lety

      @@apalrdsadventures i dont really know yet exactly what i want :D but flexibility and availability are always nice. but basically just a home server(s) / lab

  • @jefferytse
    @jefferytse Před rokem

    First, your videos have been awesome, thank you. I'm in the process of migrating some of the VMs from VMware to Proxmox. Initially, I was going to do HA for the VMs onto a second server, but after watching this video I wonder if I should use Ceph instead. Tried to join your Discord but it's not working. Would love to pick your brain.

    • @apalrdsadventures
      @apalrdsadventures  Před rokem +2

      HA can be used with Ceph or ZFS, are you talking about ZFS replication vs Ceph then?

    • @jefferytse
      @jefferytse Před rokem

      @@apalrdsadventures Let me clarify this. I was going to do ZFS for all the individual servers, but now I'm at a crossroads choosing between ZFS and Ceph. I also have ZFS over iSCSI set up as well. I want to make sure that I won't lose accessibility or data if any of the servers go down.

  • @camaycama7479
    @camaycama7479 Před 2 lety

    I have an 8 server cluster (mainly Dell R830 and R730). I've always been tempted to start using Ceph but... Watching your video I think I'll give it a try. Can Ceph restore a failed quorum? That would assume the OS drives are part of Ceph, which is scary.

    • @apalrdsadventures
      @apalrdsadventures  Před 2 lety +3

      I'm running with a ZFS OS drive for Proxmox, so no the OS drive is not part of Ceph. I don't believe there's a way to boot off the Ceph cluster, but you can use partitions or LVM to give Ceph some of the boot drive space (just don't put Ceph on a zfs zvol).
      Ceph will recover from a quorum failure (of the 3+ monitors) on its own, but during the quorum failure it will be inaccessible to clients entirely. The Managers (stats and dashboard) are active-passive and can all fail without affecting pool IO. If you fail out enough OSDs without monitor failures you can also get into a scenario where IO failures start to occur because the PGs can't meet the minimum replication rules with OSDs that remain. That's also possible if your replication rules are impossible to achieve with the number of OSDs and hosts in the system.

    • @camaycama7479
      @camaycama7479 Před 2 lety

      @@apalrdsadventures thank you so much! This clarifies even more the fact that I have to test it on the test-lab. Cheers!
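For reference, two commands that show the monitor quorum and overall health behaviour described above, useful before and during maintenance:

```sh
ceph quorum_status -f json-pretty   # which monitors currently form quorum
ceph -s                             # overall health, incl. degraded/undersized PGs
```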

  • @DawidKellerman
    @DawidKellerman Před 2 lety

    Please also discuss snapshots in Ceph!!!

    • @apalrdsadventures
      @apalrdsadventures  Před 2 lety

      There's only so much I can fit in one video! But I already have a follow up planned

  • @sjefen6
    @sjefen6 Před 2 lety

    Would a 2 node proxmox + nas with ceph and qdevice be a feasible option?

    • @apalrdsadventures
      @apalrdsadventures  Před 2 lety

      It depends on how you deploy Ceph, but a 2 node cluster is really not possible in Ceph while maintaining high availability.
      Ceph's monitor needs 3 for high availability. You could install a monitor on the qdevice. The manager is not required, so you can install 2 of them and not lose access to the pool when the managers are all down.
      But you still can't get high availability storage with only 2 storage nodes in Ceph. You need at least 3 nodes to meet the placement group requirements (size 3 / min size 2), allowing you to lose a node and continue operating. Otherwise, with 2 nodes, you will always be at min size and losing either node makes the pool inaccessible.

    • @sjefen6
      @sjefen6 Před 2 lety

      @@apalrdsadventures Yeah. Then wouldn't running Ceph and a qdevice on a NAS allow one Proxmox node to continue operating if the other Proxmox node fails?

  • @mithubopensourcelab482

    Are snapshots possible for rolling back under Ceph????

    • @apalrdsadventures
      @apalrdsadventures  Před 2 lety

      Ceph does have a snapshot feature and Proxmox will use it through the Snapshot menu for VMs
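Under the hood these are RBD snapshots; a hedged sketch of roughly what the Proxmox Snapshot menu corresponds to. The pool name and the vm-100-disk-0 image name are assumptions (following Proxmox's usual naming convention).

```sh
rbd snap create vmdata/vm-100-disk-0@before-upgrade    # take a snapshot
rbd snap ls vmdata/vm-100-disk-0                       # list snapshots
rbd snap rollback vmdata/vm-100-disk-0@before-upgrade  # roll back (with the VM stopped)
```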

  • @stuarttener6194
    @stuarttener6194 Před rokem +1

    I currently use a TrueNAS server with ZFS (an old IBM x3650 "M1" or 7979) well fortified with 48GB of RAM, 8 SAS drives, dual Ethernet ports, and dual Xeon 5500 series CPUs. It works rather well but uses a lot more watts than my 2 NUCs, which each have a 2TB SSD in them. The system is rather noisy as well (though I aim to put all my x3650 servers in a rack in my garage anyway).
    I have read a lot about the overhead that Ceph can place on any sizable server, not to mention small home lab style servers (especially given the lightweight "servers" you used along with USB sticks, though I run 2 NUCs with i7 CPUs, 32GB of RAM and SSDs). I would be interested to know what kind of overhead you observed with a VM (or more than one, if you tested that) running on each server, as juxtaposed against the overhead placed upon said "servers" by running Ceph as well.
    Thank you for your videos, some are quite interesting to me.
    Stuart

    • @apalrdsadventures
      @apalrdsadventures  Před rokem +2

      In general it's not as big of an overhead problem as a latency problem, in disk IO for the VMs. Every IOP has to go over the network to the 'first' OSD, and from there across the cluster network to the other OSDs involved in that PG. It's not so much that the work is significantly harder than say ZFS, but all of the network hops involved add latency. So random IO and synchronous IO performance tanks (vs ZFS), but high queue depth synchronous throughput is still fine until you run into disk or network bottlenecks.
      At least for sata/sas drives, for NVMe it's a bit of a different story, it's not particularly well optimized for NVMe even on high end hardware yet.
      With my USB drives I also have issues where Ceph faults OSDs for being too slow, because the USB drives are actually really slow.

    • @stuarttener6194
      @stuarttener6194 Před rokem

      @apalrdsadventures So it sounds like you are suggesting that if someone does not have a 10Gb or faster network to leverage for Ceph's cluster network, then Ceph is going to really kill network performance and have really bad latency? It seems like in my use case I am way better off keeping the TrueNAS SCALE ZFS NAS going and bagging Ceph.

    • @apalrdsadventures
      @apalrdsadventures  Před rokem +1

      It's not going to kill network performance, it's going to hurt random read/write disk performance of the VMs. Sequential and high queue depth IO is limited by network bandwidth.
      Spread out over a number of nodes with a number of OSDs and VMs the performance is quite good in aggregate, so the scalability is a lot better than a single node.

    • @stuarttener6194
      @stuarttener6194 Před rokem

      So if my use case is having a dozen VMs running, and most are sitting there doing little work each day (FreeIPA and pfSense do get used for routing and login authorization on my home lab LAN), and I have the VMs distributed across the 3 nodes, do you mean to suggest it will likely run okay? Or will it seem very slow and I'll end up moving back to TrueNAS for shared storage? I know it's a bit of a guess, just curious to know your thoughts.

    • @apalrdsadventures
      @apalrdsadventures  Před rokem +2

      I'd guess that neither of those do much disk activity and won't really care about somewhat slower disk IO. The VM will still do its own filesystem caching, so it would be more like running the VMs on a spinning drive (which tends to have poor IOPs but can have good sequential bandwidth). Using a shared network storage has the same effect, so it shouldn't be significantly worse until you start running out of network bandwidth (which Ceph will do sooner than NFS since it has to do network IO for replication).
      Try it and see how you like it.

  • @Megatog615
    @Megatog615 Před rokem

    What are your thoughts on MooseFS/LizardFS?

    • @apalrdsadventures
      @apalrdsadventures  Před rokem +1

      Ceph and Gluster are much more commonly deployed. Ceph also has the advantage of not requiring a metadata server (which can be a bottleneck) for non-filesystem workloads, and also natively supports workloads with more limited semantics (RBD for block devices and RGW for S3-compatibility, which has fewer features than POSIX compliance so it's lighter weight to implement).
      Proxmox also natively integrates Ceph, and Ceph is *extremely* flexible in dealing with mixed storage setups and mixed levels of redundancy

  • @BigBadDodge4x4
    @BigBadDodge4x4 Před 9 měsíci

    I have 4 homes, plus servers at a datacenter. All sites are connected via VPNs. If I put a Proxmox system at each site, can they be clustered? Or should I just put one cluster at the datacenter site (it has dual 10Gig internet lines)?

    • @apalrdsadventures
      @apalrdsadventures  Před 9 měsíci +2

      Proxmox is not happy about higher latency links. I'd keep the cluster to the datacenter only. You can share a backup server between them though, which makes migrating VMs not that difficult (backup -> restore).

  • @thebrotherhoodlc
    @thebrotherhoodlc Před 2 lety

    Do you need 10GbE or faster Ethernet for this to be used in production scenarios?

    • @apalrdsadventures
      @apalrdsadventures  Před 2 lety +4

      It depends on the bandwidth of your storage and if you actually need to saturate that bandwidth or just need the space.
      I have really slow USB3 flash drives with a 120MB/s read speed, which is roughly the same as gigabit Ethernet. In my specific case, gigabit is well matched to these really slow flash drives, assuming there isn't any additional traffic from the Proxmox VMs themselves (or you are using a separate NIC for that traffic)
      If you go to spinning rust, you should get roughly 150MB/s write speed per disk, which means you should look at 2.5Gbe minimum, but with a realistic number of spinning drives per node, 10G would be good.
      If you want to run NVMe or SSD based storage and want to saturate it bandwidth wise, you'll need above 10G for the cluster network at a minimum. 45Drives usually recommends 10G public and 40G cluster for their large spinning rust pools (~40 drives + SSD DB drives per host).
      If you are just doing Ceph for redundancy and scale out space (i.e. archiving data) and not for any scaling out of speeds, you can of course use gigabit and tolerate everything being slow. Rebalance and backfill will take a long time, so the pool will be degraded for a long time if you have a drive or host failure to recover from.

    • @thebrotherhoodlc
      @thebrotherhoodlc Před 2 lety

      @@apalrdsadventures Awesome thanks

  • @tomokitaniguchi7908
    @tomokitaniguchi7908 Před rokem

    I keep getting the following error when I try to use the Ceph pool: "modinfo: ERROR: Module rbd not found." Did I miss a step?

    • @apalrdsadventures
      @apalrdsadventures  Před rokem +1

      RBD should be installed by the Proxmox kernel package, which the Proxmox installer should have installed. Did you install on Debian or something?

  • @carsten612
    @carsten612 Před rokem

    Just hit like for the statement "just for the clickbait - it is hyper converged" :D

  • @marco114
    @marco114 Před rokem

    I got errors and am reluctant to start over.

  • @cypher2001
    @cypher2001 Před 2 lety

    For UNDER $20.00, you could have got a low-end 120GB SSD. Shuck it, and it's a direct replacement for the 16GB drive. It just pops right into the slot. The only modification I've had to make is bending the memory shield a little to accommodate it.

    • @apalrdsadventures
      @apalrdsadventures  Před 2 lety +1

      Sharing the OS drive with Ceph is a bit more painful than it should be on Proxmox, since the installer doesn't let you do custom partitions

    • @itsmedant
      @itsmedant Před rokem

      @@apalrdsadventures I was able to get custom partitions installed with Proxmox, but it’s still saying I don’t have a disk available for an OSD. Do you have any idea how to do the install into a different partition?

  • @Breeegz
    @Breeegz Před rokem +1

    In your small "hyperconverged" cluster, I remember you had a 2.5Gbit USB-Ethernet adapter. I built a slightly larger, more expensive version out of Lenovo Tiny's, and I'm feeling constrained. You mentioned that there's a performance benefit to having a "Private" Ceph network and a "Public" Ceph network, do you think that the trade-offs of adding a USB 3.1 Ethernet Adapter is worth that performance? I'm getting 50-70MB/s write, with some spikes to 150MB/s when I share the Public/Private. If I added a 1Gbit backend, what kind of performance gain could I see? What about adding a 2.5Gbit backend? For perspective, a 1Gbps backend would run me about $35, where the 2.5Gbit would cost me roughly $200.
    Basically, I should have bought full towers (optiplex's or equivalent) so I could add cheap NICs, bonded them, room for more drives/ect..

    • @apalrdsadventures
      @apalrdsadventures  Před rokem +1

      I haven't tested my own cluster with a separate private/public network, but in general the public network is used for client access (Proxmox/Qemu -> Ceph) and the private is used to replicate from one OSD to another (Ceph -> Ceph). A single write requires one public and two private transactions (Qemu -> first OSD, first OSD -> second and third replicas), so it should theoretically see twice as much bandwidth in the standard 3 way replica config. Some of those transactions go straight to the local system OSDs and bypass the network, and some will go between the two other nodes on the network, so that's how you end up with the 50-70 MB/s write speed on a network that should be able to do ~110MB/s.
      So, it depends on how badly you need to improve the 50-70MB/s and how much bandwidth also you need for VM traffic, which in my setup is sharing the same 1G link.
      I did buy USB 2.5G NICs and they will be in a future Ceph video, so I had the same idea as you. I think the costs I anticipated were lower, but I'm not buying a new switch. For me, with USB flashdrives, they are slower than the network so it's not a major issue yet.

    • @Breeegz
      @Breeegz Před rokem +1

      @@apalrdsadventures I would appreciate it if you ran a test on 1Gbit combined, 1+1Gbit and 1+2.5Gbit in that upcoming video. The write test I was doing was "dd if=/dev/random of=file1 bs=10M count=100" and then I would "rsync -ah --progress file1 file2", so I could easily run these two commands to take a second reading.
      As far as my 50-70 MB/s is concerned, I'm trying to squeeze out all I can from this cluster, and I can see how very very BIG Ceph is, so tuning it is difficult. If 70 MB/s is all that's expected, then I know I'm not missing a config or some sort of tuning. PGs, WAL, separate DB disk, etc. Everywhere I look, the consensus is: run it on dual 10Gbit links!! Dual 40Gbit links!! You will be sorry if you don't have at least a 100Gbit backend link!!!! ...then there's the one guy that says 1Gbit is enough. I just want to know what I can expect so I can make the $0 decision, the $35 decision, or the $200 decision. (Yes, I would need a new switch to handle 2.5Gbit.)

    • @apalrdsadventures
      @apalrdsadventures  Před rokem +1

      There's a big difference in what you need for a production network at scale (which assumes your scale is large enough to require Ceph) and what you need for Ceph to run at all. The recommendation for 10G/40G comes from a company selling boxes with up to a petabyte of storage each, so the smallest cluster they are considering is probably on the order of 100TB. Depending on the use of the data (archival vs 'hot' data), 10G to 100G would be prudent for a zfs+nfs array of that size anyway.
      Obviously with a slow network, you'll get much lower IO bandwidth than with a fast network. With Ceph and a single network, you'll also get lower IO bandwidth than you would with NFS/SMB over the same network due to the additional bandwidth of Ceph replication across nodes. Your numbers match some back of the hand math and seem in the right ballpark to me.
      You'll also be much more vulnerable to PG rebalancing (especially with a failed disk or when new disks are added), since a massive amount of data will need to move to its new location (or be replicated if a replica is lost). This happens in the background as the pool is active and more IO on the pool just delays the rebalancing.

    • @Breeegz
      @Breeegz Před rokem +1

      @@apalrdsadventures I really appreciate your time. None of the above is lost on me. I know I'm not building enterprise level file stuff, which is why it's hard to search for what to expect on the internet, because so many people on Reddit are trying to build enterprise level stuff. I want to play with the big toys at home, and I think I'm pretty close, but I may have designed myself into a corner. Before I make another ($200) step in that direction, I'd like to have a better understanding of what to expect. If it really bumps my performance, how much are we talking? That's the crux of my issue, and I'm not asking you to solve it, think of it as a hopeful suggestion for future content.

    • @apalrdsadventures
      @apalrdsadventures  Před rokem +1

      I'm working on getting a full 2.5G setup for my cluster, but I've been having issues with the 10Gbe to SFP+ transceivers I bought negotiating down to 2.5G. My Mikrotik transceiver works fine, but then I got a cheaper brand for the cluster project and they aren't working even though they claim to support 802.3bz. I might just get the 2.5G switch, eventually I'll need it anyway since I'm planning on bringing the microserver into the cluster on 2.5G + dual 1G and also working on an ingest station that I'd like to connect at 2.5G.
      Another option is to add more nodes to spread the bandwidth across more of them. The scale-out nature of Ceph works well for this.

  • @s.m.ehsanulamin7235
    @s.m.ehsanulamin7235 2 years ago

    While implementing this I could not find /dev/sdb, so I am not able to create the OSD. Could you propose some kind of solution?

    • @apalrdsadventures
      @apalrdsadventures  2 years ago

      If you run "ls -l /dev/disk/by-id" it should show you all of the disk names. If you don't see the disk there, it's a hardware issue. The -l flag shows what each symlink points to, so you can see the /dev path for each disk name.
      It's a bit unfortunate that Proxmox doesn't directly use the by-id paths, since those are more reliable across hardware changes.
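
      A quick sketch of what that looks like in practice (device names below are examples):

        # stable by-id names, each a symlink to the kernel's /dev/sdX node
        ls -l /dev/disk/by-id/
        # cross-check size, model and serial number for every block device
        lsblk -o NAME,SIZE,MODEL,SERIAL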

    • @s.m.ehsanulamin7235
      @s.m.ehsanulamin7235 2 years ago

      @@apalrdsadventures Actually I can see /dev/sda with its partitions, but I cannot see /dev/sdb. If I cannot see it, then how can I create the Ceph OSD? I'd be glad to have some sort of solution from your side. What can I do next?

  • @FuzzyScaredyCat
    @FuzzyScaredyCat 1 year ago

    Newer versions seem to require that ceph-mgr-dashboard is installed on all nodes; otherwise you get an error:
    *Error ENOENT: all mgr daemons do not support module 'dashboard', pass --force to force enablement*

    • @apalrdsadventures
      @apalrdsadventures  1 year ago +2

      It needs to be installed on all nodes which have a Manager installed; in my case I only installed the Manager on one node, since it isn't a critical service.
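
      If you hit the error above, a minimal sketch of the fix (run the install on every node that actually runs a ceph-mgr daemon):

        apt install ceph-mgr-dashboard
        # then enable the module once, from any node
        ceph mgr module enable dashboard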

  • @AdrianuX1985
    @AdrianuX1985 2 years ago

    I come across negative comments about Ceph quite often.
    What is your opinion?

    • @apalrdsadventures
      @apalrdsadventures  2 years ago

      It's not for people who aren't ready to scale OUT. But I've had no negative experiences running it on the cluster.

  • @posalab
    @posalab 1 year ago

    For an HA lab study, it's simpler to build GlusterFS storage on the same 3 external drives.
    Just my humble opinion...
    But Ceph is obviously a good choice.

  • @jwspock1690
    @jwspock1690 4 months ago

    Top

  • @thestreamreader
    @thestreamreader 1 year ago

    I am reading that Ceph doesn't run very well on slower hardware like that shown here. Can you do a video on using GlusterFS with ZFS as the underlying disk type, and explain the benefits of this vs Ceph and vice versa? I think the other option is ZFS in HA, but then your sync only happens at 1-minute intervals.

    • @apalrdsadventures
      @apalrdsadventures  1 year ago

      Gluster really just pushes resource limits outside of its control onto the OS (i.e. ZFS), whereas Ceph manages the full stack on its own. Ceph also deals natively with block devices, while Gluster only replicates file IO, so you end up with the qcow2 driver on top as well. Hence the suggestion to use Ceph for native block device VMs, plus its better integration with Proxmox.

    • @thestreamreader
      @thestreamreader 1 year ago

      @@apalrdsadventures Is it safe to run ceph on a 1gb network?

    • @apalrdsadventures
      @apalrdsadventures  1 year ago +2

      Safe? Definitely. Fast? Depends on your expectations, but it does certainly have some performance loss vs ZFS; in return you get strong guarantees that data is always safe from host-level failure cluster-wide (where ZFS can only tolerate disk failure). Latency is also rather important to Ceph, so an underutilized network will help.

  • @ernestoditerribile
    @ernestoditerribile 1 year ago

    Very different scale of computing. We use Lenovo TruScale with 64 maxed-out ThinkSystem SR670 V2 systems, running Proxmox and Ceph, to have a reliable low-latency datacenter. Even €2,500,000 is not enough to buy everything in that datacenter. We only use Cisco, IBM (only for the supercomputers), and Lenovo.

    • @apalrdsadventures
      @apalrdsadventures  1 year ago +1

      I'd love to work up to a larger scale, but working with Ceph at a small scale is still a ton of fun

    • @ernestoditerribile
      @ernestoditerribile 1 year ago

      @@apalrdsadventures I got your video in my recommendations and thought it was good. I was surprised that people even try to use Ceph in a home environment, with such a cheap solution. It's also a good way to get young kids into networking by playing around with it, or for IT students to get into Proxmox, VMware, Ceph, and all kinds of different Linux distributions.

    • @apalrdsadventures
      @apalrdsadventures  1 year ago +1

      I've certainly learned a lot about Ceph just by making this video, too! It's a great solution, even for small-medium sized data, and a lot of people probably overlook it due to the perceived complexity.

  • @norriemckinley2850
    @norriemckinley2850 2 years ago

    Great

  • @EzbonJacob
    @EzbonJacob 1 year ago

    Great video. I've learned a lot about Proxmox from this channel. I have one question: with the ceph-mgr dashboard I'm getting an "Access Denied" after logging in with the user we created, and I'm not seeing any helpful logs anywhere on why I'm getting an HTTP 403. Any suggestions on how to debug this?
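
    A 403 after a successful login often means the dashboard user exists but has no role assigned; a hedged sketch for checking and fixing that ("admin" is an example username):

      ceph dashboard ac-user-show admin                     # inspect the user and its roles
      ceph dashboard ac-user-set-roles admin administrator  # grant the built-in administrator role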

  • @peanut-sauce
    @peanut-sauce 2 years ago

    So is Ceph useful with only two nodes?

    • @apalrdsadventures
      @apalrdsadventures  2 years ago +1

      Not really, 3 nodes is a much better setup. Replication rules by default require 3 copies to be on 3 hosts. It's technically possible to go down to 2 monitors and change the rules to be per-OSD instead of per-host, and 3 drives is still the absolute minimum.
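
      For anyone experimenting anyway, a hedged sketch of switching a pool to per-OSD placement (this gives up host-level failure tolerance; the pool name is just an example):

        # replicated rule that spreads copies across OSDs instead of hosts
        ceph osd crush rule create-replicated replicated_osd default osd
        ceph osd pool set mypool crush_rule replicated_osd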

    • @peanut-sauce
      @peanut-sauce 2 years ago

      @@apalrdsadventures So if I want to set up a proxmox cluster with only 2 devices in total (no NAS) do I just forgo shared storage and not do live migration?

    • @apalrdsadventures
      @apalrdsadventures  2 years ago +2

      At 2 nodes, your best bet is local ZFS on each node and ZFS replication between the two. Non-live migration, but still high availability is possible as long as you have a Qdevice or a third node acting as quorum only.
      Ceph really doesn't work well below 3 nodes; that's its minimum.
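
      A minimal sketch of adding that QDevice (the IP address is an example; the third machine can be as small as a Pi):

        apt install corosync-qnetd         # on the external quorum machine
        apt install corosync-qdevice       # on both cluster nodes
        pvecm qdevice setup 192.168.1.50   # run once from one cluster node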

    • @peanut-sauce
      @peanut-sauce 2 years ago

      @@apalrdsadventures Oh, I see. Thanks for being so helpful! But is there any point in a proxmox cluster at all with two nodes and no quorum-keeper?

    • @apalrdsadventures
      @apalrdsadventures  2 years ago +2

      If you need more than one node for CPU/RAM reasons, being able to migrate is handy. You can also force the cluster to maintain quorum with a node out during maintenance (pvecm expected 1), so at least if you shut down a node intentionally to work on it you can keep the system running. You don't get HA without a third source of quorum, but you can live migrate manually.

  • @chriswiggins3896
    @chriswiggins3896 1 year ago

    Add a 2.5GbE USB dongle to increase network throughput.

  • @kimcosmos
    @kimcosmos 2 years ago

    The WAL reduces write latency, but what about a cache for the read latency of that spinning rust? The NVMe DB is only for metadata and lookup speed. VM disks on NVMe are nice but not deduplicated. LAN game levels need a cache.
    "If you ARE using OSD level redundancy, then don't use partitions for your DB disks." Lol. Node-level redundancy instead.
    No redundancy and with a partitioned NVMe DB? Can't it be in a replicated pool?
    With a separate backup system for data you care about, e.g. entire redundant clusters like HA storage pods? Or a local ML DB cache?
    P.S. I thought the special vdev in ZFS used cached writes to speed up repeat reads (e.g. gaming levels), so that the storage merely slows down if it fails.

    • @apalrdsadventures
      @apalrdsadventures  2 years ago +2

      When you lose the db disk of an osd, you lose the osd. If you can only tolerate the failure of some number of OSDs (instead of host-level redundancy, which can tolerate the failure of some number of hosts), you need to make sure the db disks can't cause more osds to fail than your redundancy level can recover from. In general, you should not be using any raid layers under ceph and should only use lvm for partitioning drives for db/wal disks, not to rely on it for RAID1 or something like that.
      I didn't mention it in the video, but yes you can add disks in a way that forces the system to rebalance to what its new layout will be before removing the old disks (it keeps the old 'layout' active but calculates what data will need to be on the new disks and fills them before switching clients to the new layout - this is called a 'backfill'). The other option is a direct disk swap where you disable rebalancing, remove the failed osd, add a new osd on the new disk (which will claim the same ID as the old one), and enable rebalancing again. If the new osd has the same ID and capacity, the CRUSH map shouldn't change, so it won't have to move data all over the cluster like a normal osd add/delete would do.
      The ZFS special vdev contains the uberblock and all of the metadata (and meta-metadata), so its health is more important than the data drives. Once you add the special device, new metadata goes to the special device only until it's full, so failure of the special device means failure of the whole pool. Single-sector failures might not be quite as bad, since ZFS triplicates meta-metadata and doubles metadata so there are backup copies (it does not do this for file data by default). ZFS doesn't do tiered write caching at all, but separating metadata onto faster storage makes most operations faster, since it's faster to look up directories and file block maps.
      The SLOG is different: it exists purely to write the ZIL (ZFS intent log) to disk faster so it can return a sync write guarantee with lower latency. It is NOT a write cache; the transaction group is still kept in memory and still needs to be written to the data disks before it's removed from the dirty data buffer, meaning it will still add write pressure while the data is on the ZIL but not yet on the data disks. The ZIL is only read when the system has a hard shutdown, and if you lose it, you only lose the transaction groups which weren't completed on the data disks (a few seconds of data, but data ZFS guaranteed to the application was safe).
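
      The direct disk swap described above, as a rough command-line sketch (OSD ID 3 and /dev/sdb are example values; the Proxmox GUI can handle parts of this too):

        ceph osd set norebalance                            # stop data from shuffling during the swap
        ceph osd destroy 3 --yes-i-really-mean-it           # retire the failed OSD but keep its ID in CRUSH
        ceph-volume lvm create --osd-id 3 --data /dev/sdb   # rebuild the OSD on the new disk, reusing the ID
        ceph osd unset norebalance                          # let recovery refill the new disk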

  • @ilducedimas
    @ilducedimas 2 months ago

    ceph is tough

  • @AdrianuX1985
    @AdrianuX1985 2 years ago +1

    +1

  • @zippytechnologies
    @zippytechnologies 1 year ago

    Now we just need a Ceph NAS setup for Samba, using a VM to manage it...

    • @apalrdsadventures
      @apalrdsadventures  1 year ago +2

      I'm working on CephFS (and erasure coded pools in RBD in Proxmox), but the video was already getting too long to include all of that information at once

    • @zippytechnologies
      @zippytechnologies 1 year ago

      @@apalrdsadventures sounds like a new video shall soon be made, no?

    • @apalrdsadventures
      @apalrdsadventures  1 year ago +1

      The next Ceph video is going to be erasure coded RBD pools. No guarantees on timing of that. CephFS will come after that video.

  • @naturkotzladen
    @naturkotzladen 1 year ago

    Besides all the valuable tech tips, please set up some affiliate links for your shirt collection. I would buy the make-fail shirt NOW... ;-)

    • @apalrdsadventures
      @apalrdsadventures  1 year ago +1

      Almost all of the shirts I wear are from other YouTube creators - the make-fail-make-fail-make one is from Evan & Katelyn - shopevanandkatelyn.com/products/make-fail-mens-tee

  • @mtartaro
    @mtartaro 2 years ago

    FYI erasure coding is really zero suppression and usually requires a minimum number of nodes.

    • @apalrdsadventures
      @apalrdsadventures  2 years ago +2

      You can build an erasure coded pool with 3 nodes as well, you just don't get the same storage efficiency or failure tolerance as you can get with a much larger setup (if you stick with host-level redundancy instead of OSD-level).
      The only option would be a 2/1 pool (2 data shards + 1 coded shard). For the same efficiency, a 4/2 pool would handle twice as many failures, or you could go even higher (i.e. 5/1 or 14/4 or whatever) to get good storage efficiency if your cluster is big enough to spread out the shards.
      Erasure coding is functionally equivalent to RAID5/6 or RAIDZ1/2/3+ but with a lot more control over how the data is split and how much failure tolerance you have, so like all of Ceph, it's a great solution if you have enough data to make it worthwhile.
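
      A hedged sketch of what that 2+1 pool looks like on the command line for RBD (names, PG count, and sizes are examples; RBD still keeps its metadata in a small replicated pool alongside the EC data pool):

        ceph osd erasure-code-profile set ec-2-1 k=2 m=1 crush-failure-domain=host
        ceph osd pool create ec-data 32 erasure ec-2-1
        ceph osd pool set ec-data allow_ec_overwrites true   # required for RBD on EC pools
        ceph osd pool application enable ec-data rbd
        # image metadata lives in the replicated 'rbd' pool, data goes to the EC pool
        rbd create --size 32G --data-pool ec-data rbd/vm-test-disk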

  • @yatokanava
    @yatokanava 1 year ago

    Thank you! A very clear and illustrative video!

  • @AamraNetworksAWS
    @AamraNetworksAWS 1 year ago

    Hi, I have a question: while installing the Ceph dashboard on Proxmox version 7.3-3 I'm getting an error. I have followed your other video czcams.com/video/nyhIqewyDBk/video.html but could not manage to install it. If possible, please share a step-by-step tutorial or a guide to follow.

  • @thinguyen937
    @thinguyen937 10 months ago

    I use the command: ceph mgr module enable dashboard

  • @FrontLineNerd
    @FrontLineNerd 1 year ago

    Nope. The dashboard does not work as shown. Can't create the cert, can't create the user. Those commands don't work.
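
    In newer Ceph releases the dashboard refuses plaintext passwords on the command line, which is one common reason those commands fail; a hedged sketch of the current form (username, password, and file path are examples):

      ceph dashboard create-self-signed-cert
      echo -n 'ChangeMe123' > /root/dash_pass.txt
      ceph dashboard ac-user-create admin -i /root/dash_pass.txt administrator
      rm /root/dash_pass.txt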

  • @moogs
    @moogs 1 year ago

    Love ceph but it’s slow AF…

    • @apalrdsadventures
      @apalrdsadventures  1 year ago +1

      For the decrease in performance, you gain guarantees that writes are committed across the cluster with host-level redundancy, which is a tradeoff a lot of clusters are willing to make for data security and to allow lower-cost, less failure-resistant hosts.

  • @bober1019
    @bober1019 8 months ago

    WTF is this? A tutorial on getting the worst performance possible with regard to storage?
    Post your benchmarks if you dare.

  • @mattblakely7036
    @mattblakely7036 9 months ago

    Is Ceph possible with 2 nodes, using a QDevice for quorum? I'm really interested in having 1 VM and 1 CT online 100% of the time with zero downtime across 2 nodes in HA. This would allow the VM or CT to migrate without rebooting or starting up again.

  • @mzimmerman1988
    @mzimmerman1988 2 years ago

    Thanks!