03e916405f3852b273432b9edf5212fc27f1e2db
Currently a lot of our configuration settings and panic sysctls are highly specific to SteamOS, so let's make them distro-agnostic, more focused on generic HW and on the usage of a regular Arch Linux user. The affected configs/sysctls are detailed below:

(a) Pstore memory settings: since on Steam Deck we have somewhat pre-reserved RAM for pstore (~15M, due to kernel memory alignment rounding), it makes sense to make a bit more of that memory effectively available for pstore. In the general case, though, users will likely need to reserve it manually, so 4M of total memory with a 1M buffer seems more than enough to collect a dmesg, especially considering point (e) below.

(b) The log storage folder was tuned for the Deck, in which we have an A/B partitioning scheme and a persistent /home, but in general (following standard kdump tools "on the market", like Debian's/Fedora's) /var is used for that, so we follow the trend here.

(c) The GRUB file location was also special on SteamOS, so let's make it follow the default /boot/grub/grub.cfg here.

(d) Kdump-specific tunings: the goal for people using kdump (not pstore!) is usually to collect the vmcore of the panicked kernel to explore it, using tools like crash/drgn. This is not the main goal on SteamOS, in which we want to collect as much info as we can get *on dmesg*, and that's it for most cases. With that in mind, we needed the "crash_kexec_post_notifiers" parameter to dump more info on dmesg during a panic (a potentially problematic parameter on some HW, BTW, but tested in depth on the Deck), and we disabled vmcore saving by default as well. So, let's "revert" that here, enabling vmcore capturing by default and dropping the post_notifiers parameter (see the next point as well).

(e) About the sysctls: we are more aggressive about panicking on the Deck (like panicking on soft lockups), and the goal there is to collect as much info as we can on dmesg, so we needed to enable panic_print to dump tasks and whatnot on dmesg during a panic event. In the general case, people who want as much information as possible would go with kdump, collecting the vmcore, not with pstore collecting just the dmesg. With that said, "reduce" panic_print here to only show memory info and all CPUs' backtraces, and disable the soft lockup panic. (We also cleaned up the file, dropping the mention of our choices of *not* panicking on hung tasks or RCU stalls.)

Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
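For reference, kernel.panic_print is a bitmask. Per the kernel's sysctl documentation, bit 1 prints system memory info and bit 6 prints all CPUs' backtraces (where the architecture supports it), so "memory info plus CPU backtraces" maps to 2 + 64 = 66. Below is a minimal sketch of that arithmetic; the helper and variable names are hypothetical, and the exact value set by the shipped sysctl file is not quoted here.

```shell
#!/bin/sh
# kernel.panic_print bits, per the kernel's sysctl documentation:
PANIC_PRINT_SYS_MEM_INFO=$((1 << 1))    # bit 1: system memory info (2)
PANIC_PRINT_ALL_CPU_BT=$((1 << 6))      # bit 6: all CPUs' backtraces (64)

# Hypothetical helper: the mask for "memory info + CPU backtraces".
panic_print_value() {
    echo $((PANIC_PRINT_SYS_MEM_INFO | PANIC_PRINT_ALL_CPU_BT))
}

# Illustrative use, as root (equivalent sysctl.d lines would be
# "kernel.panic_print = 66" and "kernel.softlockup_panic = 0"):
#   sysctl -w kernel.panic_print="$(panic_print_value)"
#   sysctl -w kernel.softlockup_panic=0
```

Building the value from named bits, rather than hard-coding 66, keeps the intent readable if more bits are added later.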
# ###########################################################################
# ########################## Arch Kdump / Pstore ###########################
# ###########################################################################
#
#
# This is the Arch Kdump/Pstore infrastructure; the goal is to collect
# data whenever a kernel crash is detected. There is a lightweight
# collection mode, which only grabs dmesg, and a more complete one that
# grabs the whole (compressed) vmcore. See the DETAILS section below for
# more info.
#
#
# ############################ HOW-TO USE IT ##############################
#
# 1. Install the package with pacman if it's not available in your system;
# to check whether it's already installed, look at the list of installed
# packages ('pacman -Q'). Also, make sure the systemd service was properly
# loaded by checking 'systemctl status kdump-init.service'.
#
# 2. In a crash event, the dmesg log is collected; by default this happens
# via the Pstore mechanism, i.e., no extra memory needs to be reserved and no
# GRUB change is required. If 'lsmod' shows "ramoops", then Pstore is in use.
# Some extra files are collected besides dmesg, like dmidecode output and the
# "/etc/os-release" file.
#
# 3. The logs are stored in a ZIP file in the "$MOUNT_FOLDER/logs" folder
# (see the config file); this file is named "kdump-TIMESTAMP.zip", where
# TIMESTAMP is the time of the collection, in UTC.
#
# 4. (IMPORTANT) Please test the infrastructure, checking that a dummy crash
# log is collected, before relying on it to debug complex issues. To do
# that, log in to a shell and execute, as the root user:
# 'echo 1 > /proc/sys/kernel/sysrq ; echo c > /proc/sysrq-trigger'
#
# This action will trigger a dummy crash and reboot the system; check if
# there is a ZIP file with the crash logs in the directory described in (3).
#
# 5. Various tunables are available in the "/usr/share/kdump.d/*" files; for
# example, users can choose Kdump instead of Pstore (USE_PSTORE_RAM) and,
# if using Kdump, collect the full vmcore (FULL_COREDUMP). The vmcore is
# not stored in the ZIP file; it is saved in "$MOUNT_FOLDER/crash".
# NOTICE that if Kdump is used instead of Pstore (either by the user's
# choice or due to some Pstore failure), a reboot is necessary before Kdump
# is usable, so that the crashkernel memory is effectively reserved.
#
# 6. Error and success messages are sent to the systemd journal, so running
# 'journalctl -b | grep kdump' should show the relevant information.
#
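# The checks from steps (2), (3), (4) and (5) above can be scripted; below is
# a minimal sketch, not part of the tool itself. All function names are
# hypothetical, MOUNT_FOLDER must be taken from the config file, and the
# kdump.d files are assumed to hold shell-style VAR=value lines. Checking
# /sys/module/ramoops (rather than 'lsmod') also catches a built-in ramoops.
# The step (4) marker lives in /var/tmp because /tmp is usually cleared on
# boot.

```shell
#!/bin/sh
# Hypothetical helpers for the HOW-TO steps above; not shipped with the tool.

# Step (2): succeed if ramoops is present, i.e., Pstore is likely in use.
# Loaded modules, and built-in code with module parameters, both appear
# under /sys/module/<name>. SYSFS_ROOT is overridable only to ease testing.
pstore_ramoops_present() {
    [ -d "${SYSFS_ROOT:-/sys}/module/ramoops" ]
}

# Step (3): print the newest kdump-*.zip in a logs directory (by mtime).
# Parsing 'ls' is acceptable here since the file names contain no spaces.
latest_kdump_zip() {
    ls -1t "$1"/kdump-*.zip 2>/dev/null | head -n 1
}

# Step (4): succeed if some kdump-*.zip in $1 is newer than marker file $2,
# which is created right before triggering the dummy crash.
new_crash_log_since() {
    find "$1" -maxdepth 1 -name 'kdump-*.zip' -newer "$2" | grep -q .
}

# Step (5): print USE_PSTORE_RAM / FULL_COREDUMP assignments as they are,
# without assuming which values the tool accepts.
show_kdump_tunables() {
    grep -hE '^[[:space:]]*(USE_PSTORE_RAM|FULL_COREDUMP)=' \
        "${1:-/usr/share/kdump.d}"/* 2>/dev/null
}

# Example session, as root (MOUNT_FOLDER taken from the config file):
#   pstore_ramoops_present && echo "Pstore (ramoops) in use"
#   show_kdump_tunables
#   touch /var/tmp/kdump-test.marker
#   echo 1 > /proc/sys/kernel/sysrq ; echo c > /proc/sysrq-trigger
#   # ...after the reboot:
#   new_crash_log_since "$MOUNT_FOLDER/logs" /var/tmp/kdump-test.marker &&
#       latest_kdump_zip "$MOUNT_FOLDER/logs"
```

# Sorting by modification time, instead of parsing TIMESTAMP, avoids
# assuming any particular timestamp format in the file names.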
#
# ############################## DETAILS ##################################
# CAVEATS / INSTRUCTIONS
# ###########################################################################
# (a) We automatically edit the GRUB config if Pstore fails or if the user
# chooses Kdump. However, one reboot is required for the crashkernel memory
# to be effectively reserved by the kernel.
#
# When Kdump is used, the necessary crashkernel memory was determined
# empirically: 144M wasn't enough and 160M was unstable, so 192M seems good
# enough. This amount might change in future kernel versions, requiring new
# tests using the approach suggested in step (4) above.
#
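# After that reboot, a quick way to confirm the reservation is to read
# /sys/kernel/kexec_crash_size, which reports the reserved crashkernel size
# in bytes (0 if nothing was reserved); 192M is 192 * 1024 * 1024 =
# 201326592 bytes. Below is a minimal sketch; the helper name is
# hypothetical, and the optional file argument only exists to ease testing.

```shell
#!/bin/sh
# Hypothetical helper: succeed if at least $1 MiB of crashkernel memory is
# reserved, according to the kernel's kexec_crash_size file (in bytes).
crashkernel_reserved_at_least() {
    min_mib=$1
    size_file=${2:-/sys/kernel/kexec_crash_size}
    # A missing/unreadable file makes the assignment fail, so we bail out.
    reserved=$(cat "$size_file" 2>/dev/null) || return 1
    [ "$reserved" -ge $((min_mib * 1024 * 1024)) ]
}

# Usage, after rebooting with the GRUB change applied:
#   grep -o 'crashkernel=[^ ]*' /proc/cmdline
#   crashkernel_reserved_at_least 192 && echo "crashkernel memory reserved"
```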
#
# TODOs
# ###########################################################################
# * It would be interesting to have a clean-up mechanism that keeps only the
# N most recent ZIP log files, instead of keeping all of them forever.
#
# * The Pstore ramoops back-end has some limitations that we're discussing
# with the kernel community: right now we can only collect ONE dmesg, and
# its size is truncated to "record_size" bytes, with no way to split it
# across records like efi-pstore does; thankfully we can still collect a
# 2MiB dmesg, but hopefully we can improve that upstream.
#
# * Add a more reliable reboot mechanism: we have seen issues in the past
# with "reboot -f", and relying on the sysrq reboot as a quirk proved to be a
# safe option, so this is something to think about. It should be easy to
# implement.
#
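# The first TODO above (keeping only the N most recent ZIP files) could take
# a shape like the minimal sketch below, which is not part of the current
# tool: list the ZIPs newest first by modification time and delete all but
# the first N. The function name is hypothetical, and it assumes the
# "kdump-TIMESTAMP.zip" names contain no whitespace.

```shell
#!/bin/sh
# Hypothetical clean-up helper: keep only the $2 newest kdump-*.zip files in
# directory $1, deleting the older ones. Not part of the current tool.
prune_kdump_logs() {
    logs_dir=$1
    keep=$2
    # Newest first; skip the first $keep entries and remove the rest.
    # Splitting on newlines is safe because the names contain no whitespace.
    ls -1t "$logs_dir"/kdump-*.zip 2>/dev/null |
        tail -n +"$((keep + 1))" |
        while IFS= read -r old_zip; do
            rm -f -- "$old_zip"
        done
}

# Usage (MOUNT_FOLDER taken from the config file):
#   prune_kdump_logs "$MOUNT_FOLDER/logs" 5
```

# Such a helper could run right after a new ZIP is written, so the logs
# folder never grows beyond N files.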
Description
Fork of https://gitlab.freedesktop.org/gpiccoli/kdumpst that works if you use btrfs with subvolumes