Puppetserver in a can — Docker versus Podman

Things have been too exciting here lately. Attempting to set up Ceph, having a physical machine’s OS get broken, having a virtual machine get totally trashed and rebuilt from the ground up.

It was this last that gave me an unexpected challenge. When I built the new container VM, I moved it from CentOS 7 to AlmaLinux 8. Should be no problem, right?

I sicced Ansible on it to provision it — creating support files, network share connections… and containers.

One thing that’s different between CentOS 7 and CentOS (IBM Linux) 8 systems is that the actual container manager changed from Docker to Podman.

For the most part, that change is transparent to the point that if you give a Docker command, it is passed more or less verbatim to Podman.
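On AlmaLinux 8 that compatibility typically comes from the podman-docker package, which installs a docker command that just hands everything off to Podman. A rough illustration:

# Install the Docker-compatibility shim (AlmaLinux/CentOS 8)
dnf install podman-docker

# "docker" commands now run through Podman
docker ps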

But one container proved difficult: the Puppetserver. It steadfastly refused to come up. It claimed that it couldn’t find the cert it was supposed to have constructed. Or that the cert didn’t match itself, if you prefer.

Tried it with Podman on my desktop according to instructions. No problem. Changed my Ansible provisioning from “docker_container” to “podman_container”. No luck. Did an extra evil trick that allowed me to freeze startup so I could dial into the container. Cert directories were incomplete or empty!
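If you want to do the same sort of freeze-and-inspect, overriding the entrypoint so the container just idles works; the image name and certificate path below are the stock defaults, so adjust to taste:

# Start the container idle so nothing gets a chance to blow up
podman run -d --name puppet-debug --entrypoint /bin/sh docker.io/puppet/puppetserver -c 'sleep infinity'

# Poke around the certificate directories from inside
podman exec -it puppet-debug ls -lR /etc/puppetlabs/puppet/ssl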

I hacked deeper and deeper into the initialization process, but no enlightenment came. So finally I tried manually running the container with minimal options from the VM’s command line.

It still failed. Just for giggles, I did one more thing. Since Docker requires a root environment and I was trying to keep changes to a minimum, I had been running Podman as root. I tried launching the Puppetserver container from an ordinary user account.

And it worked!
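For reference, the minimal rootless test was nothing fancier than roughly this; the image name and port are just the stock defaults:

# As an ordinary user -- no sudo, no root shell
podman run -d --name puppet -p 8140:8140 docker.io/puppet/puppetserver

# Watch it generate its CA and certificates
podman logs -f puppet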

I’m not sure why, although I suspect a couple of things. I believe that Puppet moved the locations of some of its credential files and was possibly counting on references to the old locations to redirect. Maybe they didn’t, because some of this stuff apparently uses internal network operations, and networking works differently when Podman runs rootless.

At any rate, simply running the container non-rooted was all it took. Happiness abounds and I’ve got my recalcitrant server back online!

Ceph — Moving up

OK. I moved from Ceph Octopus to Pacific.

The reason I originally settled on Octopus was that I thought it was the newest release with direct CentOS 7 support for my older servers. I was wrong. CentOS 7 actually maxed out at Nautilus.

Octopus has been a problematic release. A bug kept scheduled cephadm tasks from executing if even the most trivial warnings existed. And trivial warnings were endemic, since there were major issues in trying to get rgw running, caused in part, I think, by a mish-mash of setups from older-style and cephadm component deployments. Certain directories are not in the same place between the two.
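If you run into the same scheduler stall, it at least helps to know exactly which warnings are tripping it; muting them is one blunt workaround. For example (the health code and duration here are only illustrations):

# See exactly which warnings are active and why
ceph health detail

# Temporarily silence one specific warning code
ceph health mute CEPHADM_STRAY_DAEMON 4h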

I also couldn’t remove failed rbd pool setups. They kept coming back after I deleted them, no matter how brutally I did so.

And finally, a major lock-up crept into the NFS server subsystem. It turns out, though, that the Nautilus ceph mount utility works just fine for me against Octopus servers, so I was able to redefine a lot of my NFS mounts to use Ceph directly.
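Mounting CephFS straight from the clients with the kernel driver is what made that practical. The general shape, with placeholder monitor, mountpoint, and keyring paths:

# Kernel-client CephFS mount, no NFS/Ganesha in the path
mkdir -p /mnt/ceph
mount -t ceph ceph01:6789:/ /mnt/ceph -o name=admin,secretfile=/etc/ceph/admin.secret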

Before I proceed, I should note that despite all my whinging and whining, the Ceph filesystem share has been rock solid. Much more so than Gluster was. Despite all the mayhem and random blunderings, the worst I ever did to the filesystem was trash performance when I under-deployed OSDs. Never did I have to wipe it all out and restore from backups.

OK. So I finally bit the bullet. I’d fortuitously run across the instructions to level up all the Ceph deployments to be cephadm-based the day before. Now I followed the simple upgrade instructions.
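For anyone following along, the leveling-up step is cephadm’s adoption procedure, and the upgrade itself is a single orchestrator command. In rough outline (the daemon names and Pacific point release here are examples):

# Convert each legacy daemon to cephadm management, host by host
cephadm adopt --style legacy --name mon.ceph01
cephadm adopt --style legacy --name mgr.ceph01

# Then kick off the rolling upgrade
ceph orch upgrade start --ceph-version 16.2.7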

I had concerns because I had a number of failed rgw daemons I couldn’t get rid of, but actually the worst thing that happened was that it spent all night flailing because an OSD had gone offline. Restarted that and things began to happen. Lots of progress messages. Occasional stalls (despite what one source told me, it’s NOT desirable to have my Ganesha pool have no replicas!).
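Nudging the dead OSD and keeping an eye on progress was all routine orchestrator housekeeping, roughly:

# Kick the offline OSD (the id is an example)
ceph orch daemon restart osd.5

# Check how far along the upgrade is, and tail its progress messages
ceph orch upgrade status
ceph -W cephadm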

But amazingly it was done! Now deleted rgw dæmons stay deleted! I might even get them hooked into my VM deployments! And since one of Pacific’s improvements had to do with NFS, I hope Ganesha will be happy again.

Ceph – another mystery error

Cannot place on myhost: Unknown hosts

Ran into this when attempting to use ceph orch to manage mgrs.

It’s actually quite simple and has nothing (directly) to do with /etc/hosts or DNS. Only with the list of hosts that ceph knows internally.

And the problem — yet again — is that Ceph doesn’t manage multiple hostnames!!!

The hostname was registered as “myhost.mousetech.com”, but I had requested it like this:

ceph orch apply mgr --placement="ceph01 myhost"

“ceph01” was OK, since that’s how Ceph had been set up. But when I added “myhost”, I did so with the fully-qualified domain name. The bloody names must match what Ceph has on file exactly. Grrr!

So the proper command was:

ceph orch apply mgr --placement="ceph01 myhost.mousetech.com"
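The quick way to see exactly which names Ceph has on file is to ask the orchestrator:

# Lists the hostnames (and addresses) that placements must match
ceph orch host ls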

Ceph

I migrated from Gluster to Ceph a while back, since Gluster appears to be headed towards oblivion. It’s a shame, since Ceph is massive overkill for a small server farm like mine, but that’s life.

On the whole, Ceph is well-organized and well-documented. It does have some warts however:

  • The processes for installing and maintaining Ceph are still in flux as of Ceph Octopus, which is the most recent my CentOS 7 servers can easily support. By my count, we’ve gone through the following options:
    • Brute force RPM install (at least on Red Hat™-style systems)
    • Ansible install (deprecated)
    • ceph-deploy command. Apparently not included in my installation???
    • cephadm command. The presumed wave of the future and definitely an easy way to get bootstrapped
  • Certain error messages don’t convey enough information to understand or correct serious problems. For example:
    • monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]. What does this even mean? How do I make it go away???
  • A number of systemd dæmons can fail with nothing more informative than “Start request repeated too quickly.” Often the best way to get usable (or sometimes useless) information on why the dæmon failed is to run the associated service manually and in the foreground (see the sketch after this list), especially since neither the system journal nor the default log levels give any hints whatsoever.
  • Ceph handles the idea that a host can have multiple long and short names very poorly.
  • I have managed to severely damage the system multiple times, and I suspect a lot of it came from picking the wrong (obsolete) set of maintenance instructions even when I looked under a specific Ceph documentation release. Fortunately the core system is fairly bulletproof.
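On the run-it-in-the-foreground point: the trick is to pull the real command line out of the systemd unit and run it yourself so the errors land on your console. An example for a classic (non-cephadm) OSD unit, with the OSD id as a placeholder:

# See what systemd actually runs for the failing dæmon
systemctl cat ceph-osd@3.service | grep ExecStart

# Then run that command by hand, in the foreground
/usr/bin/ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph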

As a result, I’ve been posting information pages related to my journey. Hopefully they’ll be of help.