Progress, but at what Cost?

NFS Permissions and My Failure to Understand their Early Importance

I have avoided Unix permissions for as long as possible, accepting defaults without question and "securing" files and folders with quick "chmod 0640"s since my early years in high school. It has finally caught up to me, and I needed to sit down and properly design an NFS structure using FreeIPA accounts as the basis. I had no idea where to start, so I made a bold decision and asked Claude to explain it to me. It has worked well for me in the past for generating starting points and letting me rubber-duck-troubleshoot with something that knows even a tiny bit about what I'm talking about (my friends are non-technical).

So off I was to the land of planning Kerberos ticketing and diagramming the restructuring of all 35 of my existing VMs to use Kerberos accounts and proper automatic ticketing via k5start. TrueNAS was already configured to support my Kerberos architecture, so all I had to do was redo all of the permissions for each service in my stack... yay! Naturally, I distracted myself from that reality for a while with something a bit more fun...

Clustering should be easy

The fun thing I decided to distract myself with was to finally set up the other two cluster nodes in my lab. They've been sitting there, just waiting to be utilized. Don't think I ran out of available RAM for VMs a week ago, 'cause that's DEFINITELY not what happened...

When joining a node to a Proxmox cluster via the GUI, there is an automated system that manages the communication between nodes. For one reason or another (read: I still haven't had time to sit down and figure it out; you'll read why in a moment), my nodes did not like each other's self-signed certificates. That meant I couldn't log in on node 2. I couldn't even select a Realm to log in with. A quick search turned up a "simple enough fix"™, and I was off to the races, running an un-vetted command on my cluster nodes.
"Oh sweet lord, I hope it's not the command I think it is," I think I can hear a few of you muttering. So, I ran a quick "pvecm updatecerts -f", and it stalled my progress relatively quickly: it complained about quorum for a few minutes, then quit. In my haste to get back to a usable Proxmox VE instance, and in a bout of stupidity/unrelenting stubbornness, I abandoned that fix and kept searching for other solutions. Not solutions to the initial cluster join issue or to the lack of quorum, but to why a cluster node would not let me log in after joining. Imagine my joy when one search even turned up a nice command to reset everything on the cluster to its pre-cluster state!

I goofed. Bad...

I will admit, I attempted the incredibly bold (read: stupid) move of creating a Proxmox cluster without prior knowledge of how to manage it. I mean, how hard could it be? I've administered Hyper-V for a while now at work; it can't be much different... Let me tell you how my assumed prior knowledge of mechanisms that I did not truly understand broke my cluster, as well as a bit of my sanity, and definitely nuked my willpower for the next few hours.

The command I decided to run at 9am on a Sunday was "pvecm delnode #". Some of you may recognize this as the command that removes a node from a cluster. In the Hyper-V world, this isn't a massive deal, as the hypervisor continues running its VMs and the cluster basically just becomes unaware of the node (I'm paraphrasing; there's a bit more, including (SPOILER:) role draining and High Availability shifts). Altogether, not terribly impactful.

So there I was, peanut butter on my... toast... and I told my cluster to delete node 2 from node 1. The command immediately failed, as it couldn't find that it was part of a functioning cluster. Hey, fair enough! So I did the dumbest thing imaginable and re-ran it from node 2, telling it to delete node 1. Failure again, damn!
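A hindsight aside on that quorum complaint: as far as I understand it, a Proxmox cluster (through corosync) only accepts configuration changes when a strict majority of the expected votes is reachable. In a two-node cluster where the nodes can't talk to each other properly, each side holds one vote out of two, so neither side has a majority. I can't say for certain that this is exactly what my nodes were doing, but the arithmetic is easy to sketch in Python. The function names below are made up for illustration and are not part of any Proxmox tooling.

```python
def votes_needed(total_votes: int) -> int:
    """Strict majority: more than half of all expected votes."""
    return total_votes // 2 + 1


def has_quorum(votes_seen: int, total_votes: int) -> bool:
    """True if the votes this node can see form a strict majority."""
    return votes_seen >= votes_needed(total_votes)


if __name__ == "__main__":
    # Assumption: every node carries exactly one vote (hypothetical setup).
    for total in (2, 3, 4):
        seen = total - 1  # pretend a single node is unreachable
        status = "quorate" if has_quorum(seen, total) else "NOT quorate"
        print(f"{total} nodes, 1 unreachable: "
              f"need {votes_needed(total)}, see {seen}, {status}")
```

Under that one-vote-per-node assumption, two nodes is the awkward case: losing sight of a single node is enough to lose the majority, while a three-node cluster can tolerate it.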
Thinking nothing more of it, I decided to try rebooting nodes 1 and 2 one last time to see if the Realms would come back. Reboots solve so much in the Windows Server world; maybe there was some cluster service that needed to be started in order on the cluster? How foolish can a man be, amirite?

Logging back into node 1, I was surprised at how slow the node was to populate its sidebar of containers, VMs, storage, etcetera. The storage was acting weirdly, not showing in the pane, but I could see its disks and it was in the ZFS section... wait a minute...

I was slowly realizing I had deleted all 35 of the VMs and containers I had spent the past two months configuring. In an undeniable moment of panic, my unknowledgeable ass decided to see about recovering the VMs' OS disks by re-creating the LVM volume. Whatever hellish spirit willed me to choose that option, I curse ye. I wiped my OS disks in the blink of an eye, not making the mental connection that I still had the data, just not the piece that tells Proxmox it was there. I just needed to re-add the volumes that contained the OS disks and then redo the VM/container/storage configs. If I had had backups in place, this would have been trivial... Man, hindsight is a wonderful thing.

Let this be a warning for any administrators: panicking is the LAST thing that you ever want to do. There is a way out of your situation; it just requires a bit more thinking and understanding than the first StackOverflow post you come across will give you. If you don't know, that just means you have to find out. Don't take any rash actions; map out what you did and form an action plan first.

The Silver Lining

I had a breakthrough whilst commiserating with some more technical folks I was venting to about this. Since I have to start over with all of my systems, I can test how effective my note-taking is and improve my process as a result.
I've got a few of the VMs and containers back, and my notes are starting to fail on some of the more complex installs, where I needed them to be the most effective. The Apache Guacamole install is horrible, in particular. I'm perpetuating my stupidity at the moment by saying backups with PVE Backup are too hard, and I need to stop that. Maybe I will have them set up by next post; maybe not, if my notes are doing a lot of the lifting and I can get Ansible up before the next incident. We shall see.

I plan on continuing to add blog posts, and I have added an RSS feed so you can add my awful ramblings to your RSS tracker. Expect nothing, but updates may eventually be done.
Site Crudely Constructed By: Dextano