Ceph “Module ‘progress’ has failed: Not found or unloadable”

Ceph is fairly easy to manage, as complex data management systems go. From primitive roots it has evolved into a well-organized set of components with flexible monitoring and control options.

But there is no system so foolproof that fools cannot break it. So I did, and this is the sordid tale of the long painful route that I took to finally achieve a clean system.

It all began because Ceph doesn’t appreciate the fact that servers can have multiple hostnames, including short hostnames and host domain names.

When I first installed Ceph, I was using short hostnames, but additional hosts were added using host domain names. In the end, one host appeared twice in the host list, under both names. Some services were assigned to one hostname, some to the other, and the admin services whined about orphans. So, in short, it was a mess.
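A duplicate like this can be spotted by reducing every name in the host list to its short form and looking for collisions. What follows is a minimal sketch under stated assumptions: the function names are hypothetical (they are not part of Ceph), and the input is assumed to be one hostname per line, for example the first column of "ceph orch host ls", whose exact layout may differ between releases.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Ceph): find hosts that are listed under
# both a short hostname and a host domain name.

# Strip everything after the first dot: host01.example.com -> host01
short_name() {
    printf '%s\n' "${1%%.*}"
}

# Read one hostname per line on stdin and print every short name that is
# shared by more than one distinct entry.
find_duplicate_hosts() {
    sort -u | while IFS= read -r host; do
        [ -n "$host" ] && short_name "$host"
    done | sort | uniq -d
}

# Example usage (assumes the first line is a header and the first column
# is the hostname; adjust for your release's output):
#   ceph orch host ls | awk 'NR > 1 { print $1 }' | find_duplicate_hosts
```

Any name this prints is a host that Ceph is likely treating as two separate machines.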

And it got messier.

As I tried moving services back and forth between the alternative hostnames, I ended up breaking critical stuff and ultimately got 2 fatal errors:

Module 'progress' has failed: Not found or unloadable
Module 'cephadm' has failed: Not found or unloadable

At this point, I had lost control of the one and only ceph mgr component. I had already destroyed its alternates, and I couldn’t add new or repaired mgr units anywhere.

Worse, without the “progress” and “cephadm” modules, I couldn’t run most orchestrator and admin functions. What a disaster!

It wasn’t really viable to nuke everything, reconstruct the entire system and restore from backups, so I looked at restoring the ceph-mgr functions. While ceph-mgr isn’t absolutely essential to keeping the data resources available, without it, monitoring and control were drastically reduced.

The logs really didn’t tell me anything except for this useless message:

ceph-278fcd86-0861-11ee-a7df-9c5c8e86cff8@mgr.host01.service: Start request repeated too quickly.
ceph-278fcd86-0861-11ee-a7df-9c5c8e86cff8@mgr.host01.service: Failed with result 'exit-code'.
Failed to start Ceph mgr.host01 for 278fcd86-0861-11ee-a7df-9c5c8e86cff8.
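“Start request repeated too quickly” is systemd’s start rate limit kicking in after a string of failed restarts. It says that the unit kept failing, not why. A minimal sketch of how one might dig further, assuming the unit naming pattern visible in the log above (ceph-<fsid>@<daemon>.service): the helper function is hypothetical, while "journalctl -u" and "systemctl reset-failed" are standard systemd commands.

```shell
#!/bin/sh
# Hypothetical helper: build the systemd unit name for a Ceph daemon from
# the cluster fsid and the daemon name, following the pattern in the log.
ceph_unit_name() {
    fsid=$1
    daemon=$2
    printf 'ceph-%s@%s.service\n' "$fsid" "$daemon"
}

# Example usage (run as root on the affected host):
#   unit=$(ceph_unit_name 278fcd86-0861-11ee-a7df-9c5c8e86cff8 mgr.host01)
#   journalctl -u "$unit" --no-pager -n 50   # earlier lines may hold the real error
#   systemctl reset-failed "$unit"           # clear the rate-limit/failed state
#   systemctl start "$unit"
```

Whether the journal shows anything more useful depends on how far the daemon gets before it dies.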

So what is the ceph mgr? How is it implemented? It’s obviously started from systemd, but where does it go from there?

A lot of ceph these days runs in Docker/Podman containers. So I looked to see if there was a container, what its state was, and what its logs had to say. However, at least in Ceph Octopus, the mgr doesn’t run containerized. Instead it runs (at least as I installed it) as a native OS application: /usr/bin/ceph-mgr.

I tried launching this program manually, but all I did was end up with 4 instances of ceph-mgr and no way to kill them, because they kept changing their process-ids!

Modules

The operative word in all this was “module”. The Ceph docs helpfully explain that modules add functionality to ceph-mgr, and that you can actually write your own modules in Python. Certain modules are “built in/always on”; others are selectable, such as which orchestrator to use.

So obviously, the thing to do was find out why these modules were “not found”, which meant figuring out where they were supposed to be.

At first I expected them to be in the master container image that underpins most ceph services, but the ceph-mgr does not run containerized. Further inspection pointed to /usr/share/ceph/mgr, but there was no progress module there. As it turned out, there should have been.
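A quick way to confirm this kind of damage is to check whether each expected module has a directory under the module root. This is a minimal sketch, assuming each mgr module lives in its own subdirectory of /usr/share/ceph/mgr as it did on this install; the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every listed mgr module that has
# no directory under the given module root.
missing_mgr_modules() {
    root=$1
    shift
    for module in "$@"; do
        [ -d "$root/$module" ] || printf '%s\n' "$module"
    done
}

# Example usage:
#   missing_mgr_modules /usr/share/ceph/mgr progress cephadm
```

No output means every named module directory is present, although it says nothing about whether the files inside are intact.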

On AlmaLinux 8, Ceph Octopus and cephadm conspire to install many ceph components via RPM packages. A quick repository search indicated 3 mgr-related packages of especial interest:

ceph-mgr
ceph-mgr-modules-core
ceph-mgr-cephadm

The rpm “-qil” option shows a package’s information along with the files it installs. The progress module is in ceph-mgr-modules-core, which contains the pre-installed/always-on modules. Since that package was already installed and couldn’t be uninstalled without major mayhem, the alternative was to use a hammer and force it to install clean using a downloaded copy of the RPM:

rpm -ivh --force ceph-mgr-modules-core-xxxxx.rpm

“xxxxx” is the usual version and architecture part of the name and will vary.
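Before reaching for the hammer, rpm can also report which files a package expects but can no longer find. The sketch below assumes the usual "rpm -V" output, where such files appear on lines beginning with the word "missing"; the function name is hypothetical and it assumes paths contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper: read `rpm -V` output on stdin and print the paths of
# files reported as missing. Assumes paths contain no spaces.
list_missing_files() {
    awk '$1 == "missing" { print $NF }'
}

# Example usage:
#   rpm -V ceph-mgr-modules-core | list_missing_files
#   rpm -q ceph-mgr ceph-mgr-modules-core ceph-mgr-cephadm   # which are installed at all?
```

If the first command lists files under the module directory, a forced reinstall of that package is a reasonable next step.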

That replaced the missing progress module. But somehow the cephadm RPM had been completely uninstalled, so a full install fixed that.

The system didn’t immediately notice the repairs, so I rebooted, and that set everything to rights.

Aftermath

For a deliriously short while, I had a totally clean “ceph health” display. Alas, this didn’t last, as the original orphan service message crept back in. But now I had enough functioning ceph CLI to set up alternate managers (and remove the orphan manager systemd file). Hopefully, as time progresses, the errors will slowly slink away. Sometimes that’s how ceph works.