OK, I probably should have split this commit into many more, to ease review and improve bisectability. But since I didn't, I'll at least try to summarize here what was changed by this big commit, from the more complex additions down to the trivial ones:

- Added a log submission system; for now, it doesn't submit logs anywhere, since we don't have a well-defined API for that. But it saves the logs locally in a tar.zst, using timestamps to mark the moment of log collection, and clears the old files. The name of the compressed log tarball makes use of the Deck serial number and the most recently logged-in Steam account (thanks @TonyP for that idea).

- Still on the log submission topic: we first try pstore logs, and if we have none of them, we check kdump logs. In case we have a vmcore, we save it locally (after renaming), but for now we consider that this kind of file won't be submitted.

- Since we now have makedumpfile in Holo (thanks @xexaxo), I've removed this binary from here and made this package depend on makedumpfile. Also adjusted the respective scripts.

- Added some error messages through logger, in case we fail to load pstore/kdump or fail in the log submission script.

- Changed the systemd service to just do the pstore/kdump loading and call the submit_report.sh script, which is then "disowned" so that the systemd service finishes as fast as possible; boot delays due to this service wouldn't be nice.

- Increased the "crashkernel" memory recommended in the documentation, since I've noticed the most recent version of the kernel requires a bit more - let's play safe!

- Changed timestamps to use the UTC tz - this is due to kdump collection happening so early that the timezone is not set yet, so let's stick with UTC in all cases.

- Checked scripts with shellcheck[0] and improved the README.

[0] Accepted most suggestions, but some are contentious and may introduce issues, so the scripts do not fully pass shellcheck and I don't expect them to in the future; it's just a minor style improvement and a failsafe for multiple shells.

Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
# SPDX-License-Identifier: LGPL-2.1+
#
# Copyright (c) 2021 Valve.
#
# Code by Guilherme G. Piccoli <gpiccoli@igalia.com>
#
#
# ###########################################################################
# ############################ SteamOS Kdump ##############################
# ###########################################################################
#
# This is the first version of SteamOS Kdump infrastructure. The goal is to
# collect data whenever a kernel crash is detected. There is a lightweight
# collection, that only grabs dmesg, and a more complete setting to grab the
# whole (compressed) vmcore. The tunings are available at /etc/default/kdump.
#
# Also, the infrastructure is able to configure and save pstore-RAM logs;
# this is the default option.
#
# After installation and a reboot, things should be all set EXCEPT for GRUB
# config - please check the CAVEATS/INSTRUCTIONS section below. Note that the
# package is under active development; this version should still be considered
# a kind of "Proof Of Concept" - improvements are expected in the near future.
# Thanks for testing!!!
#
#
# CAVEATS / INSTRUCTIONS
# ###########################################################################
# (a) For now, we don't automatically edit any GRUB config, so the minimum
# necessary action after installing this package is to add "crashkernel=192M"
# to your GRUB config so that subsequent boots pick up this setting and reserve
# the memory, or else kdump cannot work. The memory amount was empirically
# determined - 144M wasn't enough and 160M is unstable, so 192M seems good enough.
# If you prefer to rely on pstore-RAM, no GRUB setting should be required; this
# is currently the default (see /etc/default/kdump).
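#
# A quick way to confirm that the reservation from (a) was picked up after a
# reboot is to look for the parameter in the kernel command line. The sketch
# below is only an illustration: the helper name has_crashkernel is
# hypothetical and not part of this package.

```shell
#!/bin/sh
# Hypothetical helper (not shipped by this package): succeed if a kernel
# command line contains a crashkernel= parameter. It reads /proc/cmdline
# unless a command-line string is passed as the first argument.
has_crashkernel() {
    cmdline="${1:-$(cat /proc/cmdline)}"
    # Pad with spaces so the parameter is matched as a whole word start.
    case " $cmdline " in
        *" crashkernel="*) return 0 ;;
    esac
    return 1
}
```

# For example, 'has_crashkernel || echo "crashkernel= is not set"' could be run
# by hand after editing the GRUB config and rebooting.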
#
# (b) It requires (obviously) a RW rootfs - we've used tune2fs in order to make
# it read-write, since it's RO by default. Also, we assume the NVMe partition
# scheme is the default one across all versions and wasn't changed by updates,
# for example - both kdump and pstore rely on mounting partitions, etc.
#
# (c) Due to a post-transaction hook executed by libalpm (90-dracut-install.hook),
# unfortunately after installing the kdump-steamos package *all* initramfs images
# are recreated - this is not necessary, we're thinking about how to prevent it,
# but for now be prepared: the installation takes some (long) minutes due to that ={
#
# (d) Unfortunately makedumpfile is not available in the official Arch Linux
# repos, only in the AUR. But it is available on Holo, so we make use of that.
# Also, a discussion was started to get it included in the official repos:
# https://lists.archlinux.org/pipermail/aur-general/2022-January/036767.html
# https://aur.archlinux.org/packages/makedumpfile/#comment-843853
#
#
# TODOs (for now - we expect to have more after some testing by the colleagues)
#
# (1) We'd like to be able to automatically edit GRUB and recreate its config
# file - after some future discussion on the proper parameters, this is expected
# to be added to the package.
#
# (2) Hopefully we can fix/prevent the unnecessary re-creation of all initramfs
# images - it happens because our pkg installs files in
# /usr/lib/dracut/modules.d, which triggers this initramfs recreation.
#
# (3) We have a "fragile" way of determining a mount point required for kdump;
# this is something to improve in order to make the kdump more reliable.
#
# (4) Add a more reliable reboot mechanism - we have seen issues with "reboot -f"
# in the past, and relying on sysrq reboot as a quirk proved to be a safe option,
# so this is something to think about here. Should be easy to implement.
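#
# The sysrq reboot mentioned in (4) could look roughly like the sketch below.
# The function name sysrq_reboot and its optional path argument are
# hypothetical; the argument only exists so the function can be dry-run
# against a regular file instead of the real trigger.

```shell
#!/bin/sh
# Hypothetical helper (not shipped by this package): request an immediate
# reboot through the sysrq trigger file. With no argument it writes to
# /proc/sysrq-trigger, which needs root and reboots the machine at once.
sysrq_reboot() {
    trigger="${1:-/proc/sysrq-trigger}"
    # 'b' reboots immediately, without syncing or unmounting filesystems.
    printf 'b' > "$trigger"
}
```

# Since 'b' does not sync disks, any collected logs would need to be flushed
# (for example with sync) before calling it.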
#
# (5) The log submission mechanism is incomplete - we save the logs as tar.zst
# files, but they are not submitted to any remote server, etc.
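#
# As an illustration of how such a tar.zst file could be named with a UTC
# timestamp, here is a minimal sketch. The name format, the "steamos-" prefix
# and the function log_tarball_name are assumptions made for this example;
# they are not the scheme actually used by the package scripts.

```shell
#!/bin/sh
# Hypothetical naming helper (format is an assumption, see above).
#   $1: device serial number
#   $2: Steam account identifier
#   $3: optional timestamp; defaults to the current time in UTC
log_tarball_name() {
    serial="$1"
    account="$2"
    # date -u forces UTC, so the name does not depend on the local timezone.
    stamp="${3:-$(date -u +%Y%m%d%H%M%S)}"
    printf 'steamos-%s-%s_%s.tar.zst\n' "$serial" "$account" "$stamp"
}
```

# With GNU tar built with zstd support, the name could then be used as in
# 'tar --zstd -cf "$(log_tarball_name "$serial" "$account")" -C "$logdir" .',
# where serial, account and logdir are placeholders set by the caller.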
#
# (6) Pstore ramoops backend has some limitations that we're discussing with
# the kernel community - right now we can only collect ONE dmesg and its
# size is truncated to "record_size" bytes, not allowing a file split like
# efi-pstore; hopefully we can improve that.
#
# (7) Maybe a good idea would be to allow creating the minimum image for any
# specified kernel, not only for the running one (which is what we do now).
# Low-priority idea, easy to implement.
#