Skip to content

Xfce4 opening links in Chromium despite Firefox having been set as the default browser

Debian

Installing a laptop with the shiny new Debian Bookworm release finds a few interesting things broken that I probably had fixed in the past already on the old laptop.

One, that was increadibly unintuitive to fix, was lots of applications (like xfce4-terminal or Telegram) opening links in Chromium despite Firefox being set as the preferred webbrowser everywhere.

update-alternatives --config x-www-browser was pointing at Firefox already, of course.
The Xfce4 preferred application from settings was Firefox, of course.
xdg-mime query default text/html delivered firefox-esr.desktop, of course.

Still nearly every link opens in ███████ Chromium...

As usually the answer is out there. In this case in a xfce4-terminal bug report from 2015.

The friendly "runkharr" has debugged the issue and provides the fix as well. As usually, all very easy once you know where to look. And why to hate GTK again a bit more:

The GTK function gtk_show_uri() uses glib's g_app_info_launch_default_for_uri() and that - of course - cannot respect the usual mimetype setting.

So quoting "runkharr" verbatim:

1. Create a file `exo-launch.desktop´ in your `~/.local/share/applications´ directory with something like the following content:

    [Desktop Entry]
    Name=Exo Launcher
    Type=Application
    Icon=gtk-open
    Categories=Desktop;
    Comment=A try to force 'xfce4-terminal' to use the preferred application(s)
    GenericName=Exo Launcher
    Exec=exo-open %u
    MimeType=text/html;application/xhtml+xml;x-scheme-handler/http;x-scheme-handler/https;x-scheme-handler/ftp;application/x-mimearchive;
    Terminal=false
    OnlyShowIn=XFCE;

2. Create (if not already existing) a local `defaults.list´ file, again in your `~/.local/share/applications´ directory. This file must start with a "group header" of

    [Default Applications]

3. Insert the following three lines somewhere below this `[Default Applications]´ group header [..]:

    x-scheme-handler/http=exo-launch.desktop;
    x-scheme-handler/https=exo-launch.desktop;
    x-scheme-handler/ftp=exo-launch.desktop;

And ... links open in Firefox again. Thank you "runkharr"!

Linux kernel USB errors -71 and -110

Linux

After an upgrade of my PC's mainboard BIOS the boot would take a minute or more to complete and sometimes the lightdm login screen would sit there but not accept keyboard input for another minute or so. Then the keyboard got enabled and I could log in normally. Everything worked fine after that bootup struggle completed. This was fully reproducible and persisted across reboots. Weird.

The kernel dmesg log showed entries that looked suspicious:

dmesg log excerpt showing USB error messages

Googleing these error -110 and error -71 is a bit hard. Now why the USB driver does not give useful error messages instead of archaic errno-style numbers escapes me. This is not the 80s anymore.

Citation needed (Wikipedia style) The wisdom of the crowd says error -110 is something around "the USB port power supply was exceeded" [source].

Now lsusb -tv shows device 1-7 ... to be my USB keyboard. I somehow doubt that wants more power than the hub is willing to provide.

The Archlinux BBS Forums recommend to piece together information from drivers/usb/host/ohci.h and (updated from their piece which is from 2012) /tools/include/uapi/asm-generic/errno.h. This is why some people then consider -110 to mean "Connection timed out". Nah, not likely either.

Reading through the kernel source around drivers/usb/host did not enlighten me either. To the contrary. Uuugly. There seems to be no comprehensive list what these error codes mean. And the numbers are assigned to errors conditions quite arbitrarily. And - of course - there is no documentation. "It was hard to do, so it should be hard to understand as well."

Luckily some of the random musings I read through contained some curious advice: power cycle the host. So I did and that did not make the error go away. Other people insisted on removing cables out of wall sockets, unplugging everything and conducting esoteric rituals. That made it dawn on me, the mainboard of course nicely powers the USB in "off" state, too. So switching the power supply off (yes, these have a separate switch, go find yours), waiting a bit for capacitors to drain and switching things back on and ... the errors were gone, the system booted within seconds again.

So the takeaway message: If you get random error messages like

device descriptor read/64, error -110
device not accepting address 42, error -71

on devices that previously worked fine ... completely remove power from the host, the hubs and the USB devices. So they forget they saw each other on the bus before. And when they see each other after that blackout, they will happily go through negotiating protocol details with each other again successfully.

Install "kept back" updates on Ubuntu

Linux

Canonical has implemented a staged roll-out for some Ubuntu package updates.

I find that rather annoying at times, e.g. when preparing the laptop for traveling.

So for my memory and for the benefit of others:

# disable the phased roll-out feature on this apt upgrade run
sudo apt -o "APT::Get::Always-Include-Phased-Updates=true" dist-upgrade

Screenshot of apt with the option to disable staged rollouts

This can - for permanent use - be put into a config file, e.g. Gerrit Heim puts it into /etc/apt/apt.conf.d/99-Phased-Updates [German]. Some other options around this staged roll-out feature are "documented" on a thread in the Ubuntu discourse forum.

Thunderbird gpg key import

Open Source

Thunderbird, srsly?

5MB (or 4.8MiB) import limit. Sure. My modest pubring (111 keys) is 18MB. The Debian keyring is 28MB.

May be, just may be, add another 0 to that if statement?

So, until that happens, workarounds ...

Option 1:

Export each pubkey into a separate file. The import dialog allows to select them all in one go. But - of course - it will ask confirmation for each. So prepare some valerian tea.

gpg --with-colons --list-public-keys | grep ^pub | cut -d : -f 5 | xargs -I {} -n 1 gpg -ao {}.pub --export {};

Option 2:

Strip all the signatures, so Thunderbird gets a smaller file to chew on. This uses pgp-clean from signing-party.

gpg --with-colons --list-public-keys | grep ^pub | cut -d : -f 5 | xargs pgp-clean -s >> there_you_go_thunderbird.pub

Option 1 will retain the signatures on individual keys, Option 2 will not.

Getting gpg to import signatures again

Open Source

The GnuPG (gpg) ecosystem has been played with a bit in 2019 by adding fake signatures en masse to well known keys. The main result is that the SKS Keyserver network based on the OCaml software of the same name is basically history. A few other keyservers have come up like Hagrid (Rust) and Hockeypuck (Go) but there seems to be no clear winner yet. In case you missed it in 2019, see my take on cleaning these polluted keys.

Now the changed defaults in gpg to "mitigate" this issue are trickling down to even the conservative distributions. Debian Bullseye has self-sigs-only on gpg 2.2.27 and it looks like Debian Bookworm will get gpg 2.2.40. This would add import-clean but Daniel Kahn Gillmor patched it out. He argues correctly that this new default could delete data from good locally stored pubkeys.

This all ends in you getting some random combination of self-sigs-only and / or import-clean depending on which Linux distribution and version you happen to use.

Better be explicit. I recommend to add:

# disable new gpg defaults
keyserver-options no-self-sigs-only
keyserver-options no-import-clean

to your ~/.gnupg/gpg.conf to make sure you can manage signatures yourself and receive them from keyservers or local imports as intended.

In case you care: See info gnupg --index-search=keyserver-options for the fine documentation. Of course apt install info first to be able to read info pages. 'cause who still used them in 2023? Oh, wait...

The Stallman wars

Open Source

So, 2021 isn't bad enough yet, but don't despair, people are working to fix that:

Welcome to the Stallman wars

Team Cancel: https://rms-open-letter.github.io/ (repo)

Team Support: https://rms-support-letter.github.io/ (repo)

Current Final stats are:

Team Cancel:  3019 signers from 1415 individual commit authors
Team Support: 6853 signers from 5418 individual commit authors

Git shortlog (Top 10):

rms_cancel.git (Last update: 2021-08-16 00:11:15 (UTC))
  1230  Neil McGovern
   251  Joan Touzet
    99  Elana Hashman
    73  Molly de Blanc
    36  Shauna
    19  Juke
    18  Stefano Zacchiroli
    17  Alexey Mirages
    16  Devin Halladay
    14  Nader Jafari

rms_support.git (Last update: 2021-09-29 07:14:39 (UTC))
  1821  shenlebantongying
  1585  nukeop
  1560  Ivanq
  1057  Victor
   880  Job Bautista
   123  nekonee
   101  Victor Gridnevsky
    41  Patrick Spek
    25  Borys Kabakov
    17  KIM Taeyeob

(data as of 2021-10-01)

Technical info:
Signers are counted from their "Signed / Individuals" sections. Commits are counted with git shortlog -s.
Team Cancel also has organizational signatures with Mozilla, Suse and X.Org being among the notable signatories. The 16 original signers of the Cancel petition are added in their count. Neil McGovern, Juke and shenlebantongying need .mailmap support as they have committed with different names.

Further reading:

12.04.2021 statements from the accused:

18.04.2021 Debian General Resolution

The Debian General Resolution (GR) vote of the developers has concluded to not issue a public statement at all, see https://www.debian.org/vote/2021/vote_002#outcome for the results.

It is better to keep quiet and seem ignorant than to speak up and remove all doubt.

See Quote Investigator for the many people that rephrased these words over the centuries. They still need to be recalled more often as too many people in the FLOSS community have forgotten about that wisdom...

01.10.2021 Final stats

It seems enough dust has settled on this unfortunate episode of mob activity now. Hence I stopped the cronjob that updated the stats above regularly. Team Support has kept adding signature all the time while Team Cancel gave up very soon after the FSF decided to stand with Mr. Stallman. So this battle was decided within two months. The stamina of the accused and determined support from some dissenting web devs trumped the orchestrated outrage of well known community figures and their publicity power this time. But history teaches us that does not mean the war is over. There will a the next opportunity to call for arms. And people will call. Unfortunately.

Compiling and installing the Gentoo Linux kernel on emerge without genkernel (part 2)

Gentoo

The first install of a Gentoo kernel needs to be somewhat manual if you want to optimize the kernel for the (virtual) system it boots on.

In part 1 I laid out how to improve the subsequent emerges of sys-kernel/gentoo-sources with a small drop in script to build the kernel as part of the ebuild.

Since end of last year Gentoo also supports a less manual way of emerging a kernel:

The following kernel blends are available:

  • sys-kernel/gentoo-kernel (the Gentoo kernel you can configure and compile locally - typically this is what you want if you run Gentoo)
  • sys-kernel/gentoo-kernel-bin (a pre-compiled Gentoo kernel similar to what genkernel would get you)
  • sys-kernel/vanilla-kernel (the upstream Linux kernel, again configurable and locally compiled)

So a quick walk-through for the gentoo-kernel variant:

1. Set up the correct package USE flags

We do not want an initrd and we want our own config to be re-used so:

echo "sys-kernel/gentoo-kernel -initramfs savedconfig" >> /etc/portage/package.use/gentoo-kernel

2. Preseed the saved config

The current kernel config needs to be saved as the initial savedconfig so it is found and applied for our emerge below:

mkdir -p /etc/portage/savedconfig/sys-kernel
cp -n "/usr/src/linux-$(uname -r)/.config" /etc/portage/savedconfig/sys-kernel/gentoo-kernel

3. Emerge the new kernel

emerge sys-kernel/gentoo-kernel

4. Update grub and reboot

Unfortunately this ebuild does not update grub, so we have to run grub-mkconfig manually. This can again be automated via a post_pkg_postinst() script. See the step 7 below.

But for now, let's do it manually:

grub-mkconfig -o /boot/grub/grub.cfg
# All fine? Time to reboot the machine:
reboot

5. (Optional) Prepare for the next kernel build

Run etc-update and merge the new kernel config entries into your savedconfig.

Screenshot of etc-update

The kernel should auto-build once new versions become available via portage.

Again the etc-update can be automated if you feel that is sufficiently safe to do in your environment. See step 7 below for details.

6. (Optional) Remove the old kernel sources

If you want to switch from the method based on gentoo-sources to the gentoo-kernel one, you can remove the kernel sources:

emerge -C "=sys-kernel/gentoo-sources-5*"

Be sure to update the /usr/src/linux symlink to the new kernel sources directory from gentoo-kernel, e.g.:

rm /usr/src/linux; ln -s "/usr/src/$(uname -r)" /usr/src/linux

This may be a good time for a bit more house-keeping: Clean up a bit in /usr/src/ to remove old build artefacts, /boot/ to remove old kernels and /lib/modules/ to get rid of old kernel modules.

7. (Optional) Further automate the ebuild

In part 1 we automated the kernel compile, install and a bit more via a helper function for post_pkg_postinst().

We can do the similarly for what is (currently) missing from the gentoo-kernel ebuilds:

Create /etc/portage/env/sys-kernel/gentoo-kernel with the following:

post_pkg_postinst() {
        etc-update --automode -5 /etc/portage/savedconfig/sys-kernel
        grub-mkconfig -o /boot/grub/grub.cfg
}

The upside of gentoo-kernel over gentoo-sources is that you can put "config override files" in /etc/kernel/config.d/. That way you theoretically profit from config improvements made by the upstream developers. See the Gentoo distribution kernel documentation for a sample snippet. I am fine with savedconfig for now but it is nice that Gentoo provides the flexibility to support both approaches.

Compiling and installing the Gentoo Linux kernel on emerge without genkernel (part 1)

Gentoo

Gentoo emerges of sys-kernel/gentoo-sources will nicely install the current kernel into /usr/src/linux-* but it will not compile them.

The Gentoo wiki kernel documentation has a script snippet to automate the kernel build with genkernel.

I do not like to use genkernel as it brings in lots of firmware files to build initrds that are not needed on virtual hardware. It also makes building the kernel slower.

So, the plain approach:

Make emerge sys-kernel/gentoo-sources symlink the latest kernel to /usr/src/linux so we can find it easily:

echo "sys-kernel/gentoo-sources symlink" >> /etc/portage/package.use/gentoo-sources

Create /etc/portage/env/sys-kernel/gentoo-sources with the following:

post_pkg_postinst() {
        CURRENT_KV=$(uname -r)
        unset ARCH
        if [[ -f "${EROOT:-/}usr/src/linux-${CURRENT_KV}/.config" ]] ; then
                cp -n "${EROOT:-/}usr/src/linux-${CURRENT_KV}/.config" "${EROOT:-/}usr/src/linux/.config"
                cd "${EROOT:-/}usr/src/linux/" && \
                make olddefconfig && \
                make -j5 && make modules_install && make install && \
                grub-mkconfig -o /boot/grub/grub.cfg
        fi
}

This will compile the next kernel on the basis of the config of the currently running kernel, install the modules and the kernel bzImage and update grub so it knows about the new kernel for the next reboot.

If you forget to unset ARCH the Linux build system will complain like:

Makefile:583: arch/amd64/Makefile: No such file or directory
make: *** No rule to make target 'arch/amd64/Makefile'.  Stop.

You can test the new magic by re-emerging the latest kernel, e.g. currently emerge =sys-kernel/gentoo-sources-5.4.80-r1:

Installing System Rescue (CD) to a flash drive

Linux

System Rescue, the project formerly known as System Rescue CD, has moved from being based on Gentoo to being built on Arch Linux packages.

With this their ISO layout changed substantially so when updating my trusty recue USB flash drive, I could not just update the kernel, initrd and the root filesystem image as I had typically done every other year before.

The "Installing on a USB memory stick" documentation is good for Windows (use Rufus, it's nice) but rather useless for Linux. They recommend a dd or the fancy graphical version of that, called usbimager.

I much prefer to have a flash drive that I can write to over an image of a CD (ISO) written 1:1 onto the flash media.

The basic idea is to use the bulk of the System Rescue ISO contents but amend these with your own grub and syslinux so they work as intended over the supplied ones that are bound to the ISO layout a bit too much.

I did this on Debian Buster but with some adjustments to paths and what packages to install, any recent Linux distribution should do:

Continue reading "Installing System Rescue (CD) to a flash drive"

Upgrading Limesurvey with (near) zero downtime

Open Source

Limesurvey is an online survey tool. It is very powerful and commonly used in academic environments because it is Free Software (GPLv2+), allows for local installations protecting the data of participants and allowing to comply with data protection regulations. This also means there are typically no load-balanced multi-server szenarios with HA databases. But simple VMs where Limesurvey runs and needs upgrading in place.

There's an LTS branch (currently 3.x) and a stable branch (currently 4.x). There's also a 2.06 LTS branch that is restricted to paying customers. The main developers behind Limesurvey offer many services from template design to custom development to support to hosting ("Cloud", "Limesurvey Pro"). Unfortunately they also charge for easy updates called "ComfortUpdate" (currently 39€ for three months) and the manual process is made a bit cumbersome to make the "ComfortUpdate" offer more attractive.

Due to Limesurvey being an old code base and UI elements not being clearly separated, most serious use cases will end up patching files and symlinking logos around template directories. That conflicts a bit with the opaque "ComfortUpdate" process where you push a button and then magic happens. Or you have downtime and a recovery case while surveys are running.

If you do not intend to use the "ComfortUpdate" offering, you can prevent Limesurvey from connecting to http://comfortupdate.limesurvey.org daily by adding the updatable stanza as in line 14 to limesurvey/application/config/config.php:

  1. return array(
  2.  [...]
  3.          // Use the following config variable to set modified optional settings copied from config-defaults.php
  4.         'config'=>array(
  5.         // debug: Set this to 1 if you are looking for errors. If you still get no errors after enabling this
  6.         // then please check your error-logs - either in your hosting provider admin panel or in some /logs directory
  7.         // on your webspace.
  8.         // LimeSurvey developers: Set this to 2 to additionally display STRICT PHP error messages and get full access to standard templates
  9.                 'debug'=>0,
  10.                 'debugsql'=>0, // Set this to 1 to enanble sql logging, only active when debug = 2
  11.                 // Mysql database engine (INNODB|MYISAM):
  12.                  'mysqlEngine' => 'MYISAM'
  13. ,               // Update default LimeSurvey config here
  14.                 'updatable' => false,
  15.         )
  16. );

The comma on line 13 is placed like that in the current default limesurvey config.php, don't let yourself get confused. Every item in a php array must end with a comma. It can be on the next line.

The basic principle of low risk, near-zero downtime, in-place upgrades is:

  1. Create a diff between the current release and the target release
  2. Inspect the diff
  3. Make backups of the application webroot
  4. Patch a copy of the application in-place
  5. (optional) stop the web server
  6. Make a backup of the production database
  7. Move the patched application to the production webroot
  8. (if 5) Start the webserver
  9. Upgrade the database (if needed)
  10. Check the application

So, in detail:

Continue reading "Upgrading Limesurvey with (near) zero downtime"

Wiping harddisks in 2019

Linux

Wiping hard disks is part of my company's policy when returning servers. No exceptions.

Good providers will wipe what they have received back from a customer, but we don't trust that as the hosting / cloud business is under constant budget-pressure and cutting corners (wipefs) is a likely consequence.

With modern SSDs there is "security erase" (man hdparm or see the - as always well maintained - Arch wiki) which is useful if the device is encrypt-by-default. These devices basically "forget" the encryption key but it also means trusting the devices' implementation security. Which doesn't seem warranted. Still after wiping and trimming, a secure erase can't be a bad idea :-).

Still there are three things to be aware of when wiping modern hard disks:

  1. Don't forget to add bs=4096 (blocksize) to dd as it will still default to 512 bytes and that makes writing even zeros less than half the maximum possible speed. SSDs may benefit from larger block sizes matched to their flash page structure. These are usually 128kB, 256kB, 512kB, 1MB, 2MB and 4MB these days.1
  2. All disks can usually be written to in parallel. screen is your friend.
  3. The write speed varies greatly by disk region, so use 2 hours per TB and wipe pass as a conservative estimate. This is better than extrapolating what you see initially in the fastest region of a spinning disk.
  4. The disks have become huge (we run 12TB disks in production now) but the write speed is still somewhere 100 MB/s ... 300 MB/s. So wiping servers on the last day before returning is not possible anymore with disks larger than 4 TB each (and three passes). Or 12 TB and one pass (where e.g. fully encrypted content allows to just do a final zero-wipe).

hard disk size one pass three passes
1 TB2 h6 h
2 TB4 h12 h
3 TB6 h18 h
4 TB8 h24 h (one day)
5 TB10 h30 h
6 TB12 h36 h
8 TB16 h48 h (two days)
10 TB20 h60 h
12 TB24 h72 h (three days)
14 TB28 h84 h
16 TB32 h96 h (four days)
18 TB36 h108 h
20 TB40 h120 h (five days)

Hard disk wipe animation


  1. As Douglas pointed out correctly in the comment below, these are IT Kilobytes and Megabytes, so 210 Bytes and 220 Bytes. So Kibibytes and Mebibytes for those firmly in SI territory. 

Apple Time Machine backups on Debian 9 (Stretch)

Debian
Warning: Superseded by v3.1.13. Do not use the below packages anymore. Update from 28.04.2022: Do not use the packages below any more. There is Netatalk 3.1.13 out with fixes for multiple remote code execution (RCE) bugs. Use packages from recent Debian again, they have been updated.

Netatalk 3.1.12 has been released which fixes an 18 year old RCE bug. The Medium write up on CVE-2018-1160 by Jacob Baines is quite an entertaining read.

The full release notes for 3.1.12 are unfortunately not even half as interesting.

Warning: Read the original blog post before installing for the first time. Be sure to read the original blog post if you are new to Netatalk3 on Debian Jessie or Stretch!
You'll get nowhere if you install the .debs below and don't know about the upgrade path from 2.2.x which is still in the Debian archive. So RTFA.

For Debian Buster (Debian 10) we'll have Samba 4.9 which has learnt (from Samba 4.8.0 onwards) how to emulate a SMB time machine share. I'll make a write up how to install this once Buster stabilizes. This luckily means there will be no need to continue supporting Netatalk in normal production environments. So I guess bug #690227 won't see a proper fix anymore. Waiting out problems helps at times, too :/.

Update instructions and downloads:

Continue reading "Apple Time Machine backups on Debian 9 (Stretch)"

Xfce 4.12 not suspending on laptop-lid close

Linux

Xfce 4.12 as default in Ubuntu/Xubuntu 18.04 LTS did not suspend a laptop after closing the lid. In fact running xfce4-power-manager --quit ; xfce4-power-manager --no-daemon --debug showed that xfce4 wasn't seeing a laptop lid close event at all.

To the contrary acpi_listen nicely finds button/lid LID close and button/lid LID open events when folding the screen and opening it up again.

As so often the wonderful docs / community of Arch Linux to the rescue. This forum thread from 2015 received the correct answer in 2017:

Xfce4 basically recognizes systemd and thus disables its built-in power-management options for handling these "button events" (but doesn't tell you so in the config UI for power-manager). Systemd is configured to handle these events by default (/etc/systemd/logind.conf has HandleLidSwitch=suspend but for unknown reasons decides not to honor that).

So best is to teach Xfce4 to handle the events again as in pre-systemd times:

xfconf-query -c xfce4-power-manager -p /xfce4-power-manager/logind-handle-lid-switch -s false

Now the UI options will work again as intended and the laptop suspends on lid close and resumes on lid open.

Update:

07.01.19: Changed XFCE -> Xfce as per Corsac's suggestion in the comments below. Thank you!

Background info:

The name "XFCE" was originally an acronym for "XForms Common Environment", but since that time it has been rewritten twice and no longer uses the XForms toolkit. The name survived, but it is no longer capitalized as "XFCE", but rather as "Xfce". The developers' current stance is that the initialism no longer stands for anything specific. After noting this, the FAQ on the Xfce Wiki comments "(suggestion: X Freakin' Cool Environment)".

(quoted from Wikipedia's Xfce article also found in the Xfce docs FAQ).

Openssh taking minutes to become available, booting takes half an hour ... because your server waits for a few bytes of randomness

Linux

So, your machine now needs minutes to boot before you can ssh in where it used to be seconds before the Debian Buster update?

Problem

Linux 3.17 (2014-10-05) learnt a new syscall getrandom() that, well, gets bytes from the entropy pool. Glibc learnt about this with 2.25 (2017-02-05) and two tries and four years after the kernel, OpenSSL used that functionality from release 1.1.1 (2018-09-11). OpenSSH implemented this natively for the 7.8 release (2018-08-24) as well.

Now the getrandom() syscall will block1 if the kernel can't provide enough entropy. And that's frequenty the case during boot. Esp. with VMs that have no input devices or IO jitter to source the pseudo random number generator from.

First seen in the wild January 2017

I vividly remember not seeing my Alpine Linux VMs back on the net after the Alpine 3.5 upgrade. That was basically the same issue.

Systemd. Yeah.

Systemd makes this behaviour worse, see issues #4271, #4513 and #10621.
Basically as of now the entropy file saved as /var/lib/systemd/random-seed will not - drumroll - add entropy to the random pool when played back during boot. Actually it will. It will just not be accounted for. So Linux doesn't know. And continues blocking getrandom(). This is obviously different from SysVinit times2 when /var/lib/urandom/random-seed (that you still have lying around on updated systems) made sure the system carried enough entropy over reboot to continue working right after enough of the system was booted.

#4167 is a re-opened discussion about systemd eating randomness early at boot (hashmaps in PID 0...). Some Debian folks participate in the recent discussion and it is worth reading if you want to learn about the mess that booting a Linux system has become.

While we're talking systemd ... #10676 also means systems will use RDRAND in the future despite Ted Ts'o's warning on RDRAND [Archive.org mirror and mirrored locally as 130905_Ted_Tso_on_RDRAND.pdf, 205kB as Google+ will be discontinued in April 2019].
Update: RDRAND doesn't return random data on pre-Ryzen AMD CPUs (AMD CPU family <23) as per systemd bug #11810. It will always be 0xFFFFFFFFFFFFFFFF (264-1). This is a known issue since 2014, see kernel bug #85991.

Debian

Debian is seeing the same issue working up towards the Buster release, e.g. Bug #912087.

The typical issue is:

[    4.428797] EXT4-fs (vda1): mounted filesystem with ordered data mode. Opts: data=ordered
[ 130.970863] random: crng init done

with delays up to tens of minutes on systems with very little external random sources.

This is what it should look like:

[    1.616819] random: fast init done
[    2.299314] random: crng init done

Check dmesg | grep -E "(rng|random)" to see how your systems are doing.

If this is not fully solved before the Buster release, I hope some of the below can end up in the release notes3.

Solutions

You need to get entropy into the random pool earlier at boot. There are many ways to achieve this and - currently - all require action by the system administrator.

Kernel boot parameter

From kernel 4.19 (Debian Buster currently runs 4.18 [Update: but will be getting 4.19 before release according to Ben via Mika]) you can set RANDOM_TRUST_CPU at compile time or random.trust_cpu=on on the kernel command line. This will make recent Intel / AMD systems trust RDRAND and fill the entropy pool with it. See the warning from Ted Ts'o linked above.

Update: Since Linux kernel build 4.19.20-1 CONFIG_RANDOM_TRUST_CPU has been enabled by default in Debian.

Using a TPM

The Trusted Platform Module has an embedded random number generator that can be used. Of course you need to have one on your board for this to be useful. It's a hardware device.

Load the tpm-rng module (ideally from initrd) or compile it into the kernel (config HW_RANDOM_TPM). Now, the kernel does not "trust" the TPM RNG by default, so you need to add

rng_core.default_quality=1000

to the kernel command line. 1000 means "trust", 0 means "don't use". So you can chose any value in between that works for you depending on how much you consider your TPM to be unbugged.

VirtIO (KVM, QEMU, ...)

For Virtual Machines (VMs) you can forward entropy from the host (that should be running longer than the VMs and have enough entropy) via virtio_rng.

So on the host, you do:

kvm ... -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0,bus=pci.0,addr=0x7

and within the VM newer kernels should automatically load virtio_rng and use that.

You can confirm with dmesg as per above.

Or check:

# cat /sys/devices/virtual/misc/hw_random/rng_available
virtio_rng.0
# cat /sys/devices/virtual/misc/hw_random/rng_current
virtio_rng.0

Patching systemd

The Fedora bugtracker has a bash / python script that replaces the systemd rnd seeding with a (better) working one. The script can also serve as a good starting point if you need to script your own solution, e.g. for reading from an entropy provider available within your (secure) network.

Chaoskey

The wonderful Keith Packard and Bdale Garbee have developed a USB dongle, ChaosKey, that supplies entropy to the kernel. Hard- and software are open source.

Jitterentropy_RNG

Kernel 4.2 introduced jitterentropy_rng which will use the jitter in CPU timings to generate randomness.

modprobe jitterentropy_rng

This apparently needs a userspace daemon though (read: design mistake) so

apt install jitterentropy-rngd (available from Buster/testing).

The current version 1.0.8-3 installs nicely on Stretch. dpkg -i is your friend.

But - drumroll - that daemon doesn't seem to use the kernel module at all.

That's where I stopped looking at that solution. At least for now. There are extensive docs if you want to dig into this yourself.

Update: The Linux kernel 5.3 will have an updated jitterentropy_rng as per Commit 4d2fa8b44. This is based on the upstream version 2.1.2 and should be worth another look.

Haveged

apt install haveged

Haveged is a user-space daemon that gathers entropy though the timing jitter any CPU has. It will only run "late" in boot but may still get your openssh back online within seconds and not minutes.

It is also - to the best of my knowledge - not verified at all regarding the quality of randomness it generates. The haveged design and history page provides and interesting read and I wouldn't recommend haveged if you have alternatives. If you have none, haveged is a wonderful solution though as it works reliably. And unverified entropy is better than no entropy. Just forget this is 2018 2019 :-).

early-rng-init-tools

Thorsten Glaser has posted newly developed early-rng-init-tools in a debian-devel thread. He provides packages at http://fish.mirbsd.org/~tg/Debs/dists/sid/wtf/Pkgs/early-rng-init-tools/ .

First he deserves kudos for naming a tool for what it does. This makes it much more easily discoverable than the trend to name things after girlfriends, pets or anime characters. The implementation hooks into the early boot via initrd integration and carries over a seed generated during the previous shutdown. This and some other implementation details are not ideal and there has been quite extensive scrutiny but none that discovered serious issues. Early-rng-init-tools look like a good option for non-RDRAND (~CONFIG_RANDOM_TRUST_CPU) capable platforms.

Linus to the rescue

Luckily end of September Linus Torvalds was fed up with the entropy starvation issue and the non-conclusive discussions about (mostly) who's at fault and ... started coding.

With the kernel 5.4 release on 25.11.2019 his patch has made it into mainline. He created a try_to_generate_entropy function that uses CPU jitter to generate seed entropy for the PRNG early in boot.

In the merge commit Linus explains:

This is admittedly partly "for discussion". We need to have a way forward for the boot time deadlocks where user space ends up waiting for more entropy, but no entropy is forthcoming because the system is entirely idle just waiting for something to happen.

While this was triggered by what is arguably a user space bug with GDM/gnome-session asking for secure randomness during early boot, when they didn't even need any such truly secure thing, the issue ends up being that our "getrandom()" interface is prone to that kind of confusion, because people don't think very hard about whether they want to block for sufficient amounts of entropy.

The approach here-in is to decide to not just passively wait for entropy to happen, but to start actively collecting it if it is missing. This is not necessarily always possible, but if the architecture has a CPU cycle counter, there is a fair amount of noise in the exact timings of reasonably complex loads.

We may end up tweaking the load and the entropy estimates, but this should be at least a reasonable starting point.

So once this kernel is available in your distribution, you should be safe from entropy starvation at boot on any platform that has hardware timers (I haven't encountered one that does not in the last decade).

Ted Ts'o reviewed the approach and was fine and Ahmed Dawish did some testing of the quality of randomness generated and that seems fine, too.

Updates

14.01.2019

Stefan Fritsch, the Apache2 maintainer in Debian, OpenBSD developer and a former Debian security team member stumbled over the systemd issue preventing Apache libssl to initialize at boot in a Debian bug #916690 - apache2: getrandom call blocks on first startup, systemd kills with timeout.

The bug has been retitled "document getrandom changes causing entropy starvation" hinting at not fixing the underlying issue but documenting it in the Debian Buster release notes.

Unhappy with this "minimal compromise" Stefan wrote a comprehensive summary of the current situation to the Debian-devel mailing list. The discussion spans over December 2018 and January 2019 and mostly iterated what had been written above already. The discussion has - so far - not reached any consensus. There is still the "systemd stance" (not our problem, fix the daemons) and the "ssh/apache stance" (fix systemd, credit entropy).

The "document in release notes" minimal compromise was brought up again and Stefan warned of the problems this would create for Buster users:

> I'd prefer having this documented in the release notes:
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=916690
> with possible solutions like installing haveged, configuring virtio-rng,
> etc. depending on the situation.

That would be an extremely user-unfriendly "solution" and would lead to 
countless hours of debugging and useless bug reports.

This is exactly why I wrote this blog entry and keep it updated. We need to either fix this or tell everybody we can reach before upgrading to Buster. Otherwise this will lead to huge amounts of systems dead on the network after what looked like a successful upgrade.

Some interesting tidbits were mentioned within the thread:

Raphael Hertzog fixed the issue for Kali Linux by installing haveged by default. Michael Prokop did the same for the grml distribution within its December 2018 release.

Ben Hutchings pointed to an interesting thread on the debian-release mailing list he kicked off in May 2018. Multiple people summarized the options and the fact that there is no "general solution that is both correct and easy" at the time.

Sam Hartman identified Debian Buster VMs running under VMware as an issue, because that supervisor does not provide virtio-rng. So Debian VMs wouldn't boot into ssh availability within a reasonable time. This is an issue for real world use cases albeit running a proprietary product as the supervisor.

16.01.2019

Daniel Kahn Gillmor wrote in to explain a risk for VMs starting right after the boot of the host OS:

If that pool is used by the guest to generate long-term secrets because it appears to be well-initialized, that could be a serious problem.
(e.g. "Mining your P's and Q's" by Heninger et al -- https://factorable.net/weakkeys12.extended.pdf)
I've just opened https://bugs.launchpad.net/qemu/+bug/1811758 to report a way to improve that situation in qemu by default.

So ... make sure that your host OS has access to a hardware random number generator or at least carries over its random seed properly across reboots. You could also delay VM starts until the crng on the host Linux is fully initialized (random: crng init done).
Otherwise your VMs may get insufficiently generated pseudo-random numbers and won't even know.

12.03.2019

Stefan Fritsch revived the thread on debian-devel again and got a few more interesting tidbits out of the developer community:

Ben Hutchings has enabled CONFIG_RANDOM_TRUST_CPU for Debian kernels from 4.19.20-1 so the problem is somewhat contained for recent CPU AMD64 systems (RDRAND capable) in Buster.

Thorsten Glaser developed early-rng-init-tools which combine a few options to try and get entropy carried across boot and generated early during boot. He received some scrutiny as can be expected but none that would discourage me from using it. He explains that this is for early boot and thus has initrd integration. It complements safer randomness sources or haveged.

16.04.2019

The Debian installer for Buster is running into the same problem now as indicated in the release notes for RC1. Bug #923675 has details. Essentially choose-mirror waits serveral minutes for entropy when used with https mirrors.

08.05.2019

The RDRAND use introduced in systemd to bypass the kernel random number generator during boot falls for a AMD pre-Ryzen bug as RDRAND on these systems doesn't return random data after a suspend / resume cycle. Added an update note to the systemd section above.

03.06.2019

Bastian Blank reports the issue is affecting Debian cloud images now as well as cloud-init generates ssh keys during boot.

10.07.2019

Added the update of jitterentropy_rng to a version based on upstream v2.1.2 into the Jitterentropy section above.

16.09.2019

The Linux Kernel Mailing List (LKML) is re-iterating the entropy starvation issue and the un-willingness of systemd to fix its usage of randomness in early boot. Ahmed S. Darwish has reported the issue leading to ext4 reproducibly blocking boot with Kernel 5.3-r8. There are a few patches floated and the whole discussion it worth reading albeit non-conclusive as of now.

Ted Ts'o says "I really very strongly believe that the idea of making getrandom(2) non-blocking and to blindly assume that we can load up the buffer with 'best efforts' randomness to be a terrible, terrible idea that is going to cause major security problems that we will potentially regret very badly. Linus Torvalds believes I am an incompetent systems designer." in this email.

In case you needed a teaser to really start reading the thread! Linus Torvalds also mentions the issue (and a primer on what "never break userspace" means) in the Linux kernel 5.3 release notes.

18.09.2019

... and Martin Steigerwald kindly noticed that I update this blog post with the relevant discussions I come across as this entropy starvation mess continues to haunt us.

25.11.2019

Added the "Linus to the rescue" section after the Linux kernel 5.4 has been released.

02.04.2020

I ran into the same issue on a Gentoo system today. Luckily OpenRC handeled this gracefully but it delayed booting: syslog-ng actually hangs the boot for some time ... waiting for entropy. Argh. The Gentoo forums thread on the topic clearly listed the options:

  1. Make syslog-ng depend on haveged by adding rc_syslog_ng_need="haveged" to /etc/rc.conf (and obviously having haveged installed)
  2. Re-compiling the kernel with CONFIG_RANDOM_TRUST_CPU=y where that is an option

  1. it will return with EAGAIN in the GRND_NONBLOCK use case. The blocking behaviour when lacking entropy is a security measure as per Bug #1559 of Google's Project Zero

  2. Update 18.12.2018: "SysVinit times" ::= "The times when most Linux distros used SysVinit over other init systems." So Wheezy and previous for Debian. Some people objected to the statement, so I added this footnote as a clarification. See the discussion in the comments below. 

  3. there is no Buster branch in the release notes repository yet (17.12.2018). Update: I wrote a section for the release notes 06.05.2019 and Paul Gevers amended and committed that. So when users of affected systems read the release notes before upgrading to Buster they will hopefully not be surprised (and worried) by the long boot delays.