Save Your Linux Machine From Certain Death

Recovering your root password and more

Apr 30 · 7 min read

Photo by Regine Tholen on Unsplash

Troubleshooting damaged systems is an essential skill for every SysAdmin, SRE, or DevOps engineer. All of us run into OS-related issues from time to time, and it's better to be prepared when things go terribly wrong.

Being able to identify and act on an issue quickly is especially valuable, because it limits the damage. To help with that, this article goes over a few common problems you might encounter, along with ways to gather information, troubleshoot, and solve them.

Note: This article uses RHEL 8 / CentOS, but the examples and concepts below can be applied, with small adjustments, to most Linux distributions.

Recovering the Root Password

What if you lose the root password and you don't have access to a privileged user? If you still have access to the machine, then there is a way to solve this inconvenient situation.

First, start by rebooting the machine. When the machine starts, hit any key to access the boot menu:

[Figure: Boot Menu Screenshot]

In the boot menu, hit e to edit the boot options. Using the arrow keys, move to the line starting with linux and append rd.break to it. This interrupts the boot process early, before control is handed over from the initramfs to the real system.

Optionally, you can also append enforcing=0 to put SELinux into permissive mode for this boot. Next, hit CTRL+X to let the machine boot.

[Figure: Edited Boot Menu Screenshot]

After a few seconds of booting, you should get a shell. At this point, the root filesystem of the installed system is mounted read-only under /sysroot.

So, to change anything in the system, such as the root password, we need to make the filesystem read-write. We can do that by running mount -o remount,rw /sysroot.

The next thing we need to do is enter the root jail using chroot /sysroot, which changes the root of the filesystem to /sysroot instead of /. This is required so that any further commands we run operate on the installed system under /sysroot. Now we can change the root password using passwd.

If you added enforcing=0 to the boot options, you can now hit CTRL+D (or type exit) twice, once to leave the chroot and once to leave the emergency shell, and let the system fully boot. If not, run touch /.autorelabel first to trigger an SELinux relabel of the filesystem.

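This is needed because changing the password leaves /etc/shadow with an incorrect SELinux security context. Therefore, the whole filesystem has to be relabeled during the next boot (this can take some time, depending on the size of the filesystem). If you booted with enforcing=0 instead, the system comes up in permissive mode, so once you are logged in, restore the context with restorecon /etc/shadow and switch enforcing back on with setenforce 1.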

[Figure: Password Change Commands]
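The whole procedure can be collected into a few lines of shell. This is only a sketch that assumes the rd.break shell described above. The function names run and reset_root_password and the DRY_RUN variable are made up for illustration. Because chroot is given a command here instead of starting an interactive shell, the relabel marker is written as /sysroot/.autorelabel, which is the same file as /.autorelabel inside the chroot.

```shell
#!/bin/sh
# Sketch: reset the root password from the rd.break shell.
# run, reset_root_password and DRY_RUN are illustrative names, not standard tools.

# Print a command, then run it unless DRY_RUN is set.
run() {
    echo "+ $*"
    [ -n "${DRY_RUN:-}" ] || "$@"
}

reset_root_password() {
    run mount -o remount,rw /sysroot || return 1   # make the real root writable
    run chroot /sysroot passwd root  || return 1   # prompts for the new password
    # Request a full SELinux relabel on the next boot. This step can be
    # skipped if enforcing=0 was appended to the kernel line.
    run touch /sysroot/.autorelabel
}
```

Calling DRY_RUN=1 reset_root_password only prints the three commands, which is a safe way to review them before running the function for real and leaving the shell with exit.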

As an alternative, you can use systemd's debug shell. Again, access GRUB during boot, but this time append systemd.debug-shell instead of rd.break.

When you let the system boot with this option, you end up at the normal login prompt, which isn't very helpful on its own. If you switch to terminal 9 using CTRL+ALT+F9, however, you get the debug shell with full root permissions.

Here, you can change the password normally using passwd. Then switch back to the normal terminal (CTRL+ALT+F1) and log in.

Don't forget to stop the debug shell afterwards, though, as it is a huge security hole. You can do that by running systemctl stop debug-shell.service (you can still switch to terminal 9 afterwards, but it will be unresponsive).

[Figure: debug-shell]

Fixing Unmountable Filesystems

Creating new partitions, creating filesystems, mounting filesystems, etc. are common tasks for most SysAdmins.

Even though these are basic tasks, it's easy to make a mistake that renders your system unbootable. Let's see how you can solve problems related to unmountable filesystems.

As with the previous solutions, we start by rebooting the machine, accessing the boot menu, and editing it, this time appending systemd.unit=emergency.target. This tells systemd to boot into the emergency target instead of the default one (multi-user or graphical).

When the system boots and we get the shell, we log in as root and again remount the root filesystem read-write using mount -o remount,rw /. Now we can try mounting all filesystems by running mount -a.

If there is a problem with mounting a specific filesystem, you might see an error message like mount: /wrong-mount: mount point does not exist. or mount: /wrong-mount: special device /dev/sdb1 does not exist. These kinds of issues need to be fixed in the corresponding entry in /etc/fstab.

After fixing the issue in /etc/fstab, run systemctl daemon-reload so that systemd picks up the changes. Now, run mount -a again. If the issue was indeed fixed, you should see no errors (no news is good news). You can now exit using CTRL+D and let the system boot normally.

Aside from a mistyped device or mount point name, you might also encounter issues with VDO (Virtual Data Optimizer) or Stratis volumes, which require an extra mount option in /etc/fstab: x-systemd.requires=vdo.service or x-systemd.requires=stratisd.service, respectively. Without it, the system won't boot properly.

Another common and easily fixable mistake is a missing quote when using UUID="..." to specify the device (syntax highlighting for /etc/fstab in your editor can save you a lot of trouble).
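Some of these mistakes can be caught before the next reboot. Below is a rough sketch of a checker, not a full validator. check_fstab is a hypothetical helper, and it assumes that a valid entry has between four and six whitespace-separated fields and an even number of double quotes.

```shell
#!/bin/sh
# Sketch: flag two common /etc/fstab mistakes.
# check_fstab is an illustrative helper, not a standard tool.
check_fstab() {
    # $1: file to inspect (defaults to /etc/fstab)
    awk '
        NF == 0 || $1 ~ /^#/ { next }    # skip blank lines and comments
        {
            # An odd number of double quotes means one is missing,
            # as in UUID="... without the closing quote.
            line = $0
            if (gsub(/"/, "", line) % 2 == 1) {
                printf "line %d: unbalanced double quote\n", NR
                bad = 1
            }
            # Assumption: a valid entry has 4 to 6 fields.
            if (NF < 4 || NF > 6) {
                printf "line %d: expected 4 to 6 fields, found %d\n", NR, NF
                bad = 1
            }
        }
        END { exit bad + 0 }             # non-zero exit if anything was flagged
    ' "${1:-/etc/fstab}"
}
```

Running check_fstab with no argument inspects /etc/fstab and prints nothing when neither problem is found. It does not check that devices or mount points exist, so mount -a remains the real test. On systems with a recent util-linux, findmnt --verify performs a more thorough check.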

Troubleshooting SELinux Problems

This one is not a life-and-death kind of situation, but it can cause a lot of problems, so it's beneficial to be able to identify it quickly when it happens.

It’s important to realize that most of the time, SELinux is doing its job correctly. But it might just happen that you are trying to achieve something SELinux doesn’t expect.

Some of the problems you might encounter may include issues with incorrect file context, for example, after moving a file from one place to another. Sometimes the issue might be with overly restrictive policies (SELinux booleans) or blocked service ports.

You can troubleshoot all of these problems by first temporarily switching SELinux to permissive (non-enforcing) mode using setenforce 0 and retrying the action that wasn't working previously.

If the problem was fixed by switching SELinux to non-enforcing mode, then we know that the problem was caused by an SELinux violation.

Now, if we switch SELinux back to enforcing mode using setenforce 1, we can try to analyze and fix the violation.
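The permissive-mode test can be wrapped in a small helper so that enforcing mode is never left switched off by accident. This is a sketch under a few assumptions: blocked_by_selinux is a made-up name, it has to run as root, and the command passed to it must be safe to run twice.

```shell
#!/bin/sh
# Sketch: does a failing command succeed once SELinux is permissive?
# blocked_by_selinux is an illustrative helper name. Run it as root.
blocked_by_selinux() {
    if [ "$(getenforce)" != "Enforcing" ]; then
        echo "SELinux is not enforcing, nothing to compare"
        return 2
    fi
    if "$@"; then
        echo "command succeeds in enforcing mode"
        return 1
    fi

    setenforce 0                 # permissive: violations are logged, not blocked
    retry_status=0
    "$@" || retry_status=$?
    setenforce 1                 # switch enforcing back on in every case

    if [ "$retry_status" -eq 0 ]; then
        echo "works only in permissive mode: likely an SELinux violation"
        return 0
    fi
    echo "fails in permissive mode too: probably not SELinux"
    return 1
}
```

For the httpd example used later in this section, a call could look like blocked_by_selinux systemctl restart httpd. Any output of the command itself is printed alongside the verdict.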

First, install setroubleshoot-server using yum -y install setroubleshoot-server. It watches the audit messages written to /var/log/audit/audit.log and sends short summaries to /var/log/messages.

Next, to find these summaries, run grep sealert /var/log/messages. Each matching line describes one violation and ends with a sealert -l command, followed by the ID of the alert, that prints the complete report.

As an example, I configured httpd to listen on port 8012, which SELinux blocks because the port is not among those the service is allowed to use. If we were not aware of this, it would be quite hard to find the root cause of the issue.

The grep output helps with that. It shows a description of the SELinux violation as well as the sealert -l command that helps us troubleshoot further, so let's run that command next.

This produces a full report of what caused the violation, including a suggested (not necessarily the most appropriate) fix.

If you have some experience with SELinux, you might realize that the most appropriate way to fix this issue is to add the port to the SELinux port type used by the service (http_port_t). This can be done by running semanage port -a -t http_port_t -p tcp 8012.

This pattern of replicating the violation, looking for sealert messages in /var/log/messages, and viewing and analyzing the report can be applied to any SELinux violation, not just the example above.

Alternatively, you can search /var/log/audit/audit.log directly using ausearch. The specific command you would want to run is ausearch -m AVC -ts recent, which shows all denials from the last ten minutes.

The output contains the same information, but in a less user-friendly form.
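To confirm that the port was really added, the type's port list can be printed right after the change. A minimal sketch, assuming root privileges; allow_http_port is a hypothetical wrapper, while the semanage subcommands are the ones used above plus semanage port -l for listing.

```shell
#!/bin/sh
# Sketch: allow httpd to bind to an extra TCP port and confirm the result.
# allow_http_port is an illustrative helper name. Run it as root.
allow_http_port() {
    # $1: TCP port number, e.g. 8012
    semanage port -a -t http_port_t -p tcp "$1" || return 1
    # List the ports labelled http_port_t; the new one should be included.
    semanage port -l | grep -w '^http_port_t'
}
```

With the example from this section, the call would be allow_http_port 8012. If the port is already assigned to another type, semanage port -a fails, and semanage port -m would be needed instead.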

Getting Logs From a Crashing System

By default, the journal is kept in /run/log/journal, which is not persisted across reboots. That becomes a problem if you need to inspect logs from a crashing system.

To preserve journal logs, we need to modify /etc/systemd/journald.conf, more specifically the Storage parameter, which is commented out and set to auto by default.

By uncommenting Storage and changing it to persistent, we tell journald to store all logs in /var/log/journal. Aside from this change, we also need to run systemctl restart systemd-journald to make sure that the change takes effect.

Even though this change will persist logs on your system, it won't keep all of them forever. By default, journald is configured to use no more than 10% of the filesystem and to leave at least 15% of it free.
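The edit to journald.conf can also be scripted, which helps when the same change has to be made on several machines. The following is a sketch: enable_persistent_journal is a made-up helper that only touches an existing Storage line, commented out or not, and reports failure if the file has none.

```shell
#!/bin/sh
# Sketch: set Storage=persistent in a journald configuration file.
# enable_persistent_journal is an illustrative helper name.
enable_persistent_journal() {
    journald_conf=${1:-/etc/systemd/journald.conf}
    edited=$(mktemp) || return 1
    # Replace "#Storage=..." or "Storage=..." with the persistent setting.
    sed -E 's/^#?Storage=.*/Storage=persistent/' "$journald_conf" > "$edited" &&
        cat "$edited" > "$journald_conf"   # overwrite in place, keeping owner and labels
    status=$?
    rm -f "$edited"
    # Succeed only if the setting is now present in the file.
    [ "$status" -eq 0 ] && grep -q '^Storage=persistent$' "$journald_conf"
}
```

Called without arguments as root, the function edits /etc/systemd/journald.conf. The file is overwritten through a redirect rather than replaced, so its ownership and SELinux context stay as they were. journald still has to pick up the new setting afterwards.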

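Now, to actually inspect the previously stored logs, first switch to the root user and run journalctl --list-boots. This lists one line per recorded boot, with an offset relative to the current boot (0 for the current boot, -1 for the previous one, and so on), the boot ID, and the times of the first and last log entries.

Based on the dates and times, choose the boot you want to see logs from. For example, to view logs from the boot with offset -2 with log level err or higher, run journalctl -b -2 -p err.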

If the logs above are not enough to troubleshoot your issues, then there are other log files to check:

/var/log/messages
/var/log/boot.log

Alternatively, if you can't boot your machine normally, you can boot into emergency.target as shown above and view logs there in the same way.

Conclusion

There is a lot more that can go wrong with a Linux machine than what I have shown in the sections above. These examples/approaches, however, can be applied to a variety of other problems that you might encounter.

Also, not all of them are life-and-death situations, but it's always preferable to be able to solve them quickly, especially if the problematic machine is a production system.

Solving most of these issues depends on getting the right information and being able to restore previous configurations. Therefore, it's crucial to always store logs and to back up critical files on your system before modifying them.

This article was originally posted at martinheinz.dev
