How to run a container that crashes on startup

One of the most frustrating things about working with containers is when a container fails immediately on startup.

This can happen for a number of reasons, and not all of them record errors in logfiles to help diagnose them.
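
Checking whatever output the container did produce is still the obvious first step; with docker or podman that’s the “logs” subcommand (the container name here is just a placeholder):

docker logs my_container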

However, there’s a fairly easy way to get around this.

Start the container with the “interactive” option and override the “entrypoint” option to execute “/bin/sh”. This will do two things.

  1. Instead of running the normal container startup, it will start the container executing the command shell.
  2. The “interactive” option holds the container open. Without it, the command shell sees an immediate end-of-file and shuts down the container.
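
Putting that together, the run command might look something like this (podman accepts the same options; my_image and my_container are placeholders for whatever you’re actually debugging, and the -d simply keeps the held-open container running in the background so you can exec into it):

docker run -d -i --name my_container --entrypoint /bin/sh my_image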

At this point, you can then use the docker/podman “exec” command to dial in to the container like so:

docker exec -it my_container /bin/sh

At that point, you can inspect files, run utilities, and do whatever is necessary to diagnose/repair the image.

An additional helper becomes available once you have a tame container running: the docker/podman “cp” command, which copies files into and out of the container. Many container images are built on minimal OS bases and have neither an installed text editor nor a package manager to install one. So you can pull a faulty file out of the container, fix it on the host, and push it back. The changes will persist as long as you restart the same container and don’t start a new instance from the original image.
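
For instance, assuming the broken file is a config at /etc/myapp/app.conf (a made-up path for illustration), you can pull it out, edit it locally, and push it back:

docker cp my_container:/etc/myapp/app.conf ./app.conf
docker cp ./app.conf my_container:/etc/myapp/app.conf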

When things go really bad

Most of the time when Linux has serious problems, it can still run in a reduced capacity. But sometimes that capacity is reduced to a really painful point: specifically, when the root filesystem gets mounted read-only.

This usually happens if the root filesystem is detected as corrupt, although recently I had it happen when a core mount failed — and because the root filesystem was read-only, I couldn’t edit the failing mount!

There’s an old technique I used to use for things like this, but changes in the boot process (grub) and system initialization (systemd) mean it’s no longer usable.

Fortunately, there’s a modern-day alternative.

When your system’s grub menu comes up, edit the OS kernel options line. That’s the line that references the “vmlinuz” kernel image (it usually begins with “linux”).
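On a typical Fedora/RHEL-style system that line looks something like the following; the kernel version, root device, and extra options are placeholders and will differ on your machine:

linux /vmlinuz-<version> root=/dev/mapper/rhel-root ro quiet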

Add the option rd.break to that line and proceed with booting. The “break” option causes the initial ramdisk to load and set up a minimal OS environment capable of loading and initializing the full kernel, but it halts at the point where the primary filesystem root has been temporarily mounted under /sysroot. From there you can unmount it and run filesystem repairs if needed, or chroot into it to reset the root password or fix show-stopping errors (like my bad fstab file!), and then reboot the repaired system.
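A typical sequence at the resulting emergency shell looks roughly like this; the remount is needed because /sysroot starts out read-only, and vi /etc/fstab stands in for whatever repair you actually need (passwd for a root password reset, and so on):

mount -o remount,rw /sysroot
chroot /sysroot
vi /etc/fstab
exit
exit

The first exit leaves the chroot and the second lets the boot continue with the repaired root filesystem. If you reset the root password this way on an SELinux system, it’s usual to also run touch /.autorelabel inside the chroot before exiting so the filesystem gets relabeled on the next boot.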

Of course, this is only one recovery option. For best reliability it’s a good idea to keep a stand-alone boot “disk” (USB or whatever) and/or a recovery PXE boot.

For much more detailed information, look here: https://docs.fedoraproject.org/en-US/fedora/latest/system-administrators-guide/kernel-module-driver-configuration/Working_with_the_GRUB_2_Boot_Loader/

Note that on RHEL 9, this process apparently broke for a while. An alternative was to boot directly into /bin/bash from the grub kernel command line (typically by adding init=/bin/bash).