Guilherme G. Piccoli 03e916405f config/sysctl: Adjust some configuration options/sysctls to be more agnostic
Currently a lot of our configuration settings and panic sysctls are
highly specific to SteamOS, so let's make them distro-agnostic, more
focused on generic HW / usage of a regular Arch Linux user. The affected
configs/sysctls are detailed below:

(a) Pstore memory settings: since on Steam Deck we have somewhat
pre-reserved RAM for pstore (~15M due to kernel memory alignment
rounding), it makes sense to have a bit more of such memory effectively
available for pstore. In the general case though, users will likely
need to reserve it manually, so 4M of total memory with a 1M buffer
seems more than enough to collect a dmesg, especially considering point
(e) below.

(b) The log storage folder was tuned for Deck, in which we have an A/B
partitioning scheme and a persistent /home, but in general (following
standard kdump tools "on the market", like Debian's/Fedora's), /var
is used for that, so we follow the trend here.

(c) Grub file location was also special on SteamOS, so let's make
it follow the default /boot/grub/grub.cfg here.

(d) Kdump-specific tunings: the goal for people using kdump (not pstore!)
is usually to collect the vmcore of the panicked kernel to explore it,
using tools like crash/drgn. This is not the main goal on SteamOS, in
which we want to collect as much info as we can get *on dmesg* and that's
it for most cases...

With that in mind, we needed the "crash_kexec_post_notifiers" parameter
to dump more info on dmesg during a panic (a potentially problematic
parameter on some HW BTW, but tested in depth on Deck) and we disabled
the vmcore saving by default as well. So, let's "revert" it here, having
vmcore capturing enabled by default and dropping the post_notifiers
parameter (see next point as well).

(e) About the sysctls, we are more aggressive on panicking on Deck
(like panic on soft lockups) and the goal is to collect the most info
we can on dmesg, so we needed to enable panic_print to dump tasks and
whatnot on dmesg during a panic event. In the general case, people
that wish to have as much information as possible would go with
kdump, collecting vmcore, not with pstore collecting just the dmesg.

With that said, "reduce" panic_print here to only show memory info
and CPUs backtraces, and disable soft lockup panic.
(We also cleaned up the file, dropping the mentions of the choices of
*not* panicking on hung tasks or RCU stalls.)

Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
2023-03-31 15:34:42 -03:00
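
Point (e) of the commit message describes reducing panic_print to memory
info plus CPU backtraces. As a rough illustration only, the bitmask can be
computed as below, using the bit values from the kernel's sysctl
documentation. The variable and function names are hypothetical, and the
value actually shipped by this commit is not shown on this page, so treat
the result as an assumption to check against the package's sysctl file.

```shell
#!/bin/sh
# Bit values as documented for the kernel.panic_print sysctl.
PANIC_PRINT_MEM_INFO=2      # bit 1: print system memory info
PANIC_PRINT_ALL_CPU_BT=64   # bit 6: print all CPUs backtrace (if available)

# Hypothetical helper: prints the combined mask for "memory info + CPU
# backtraces", the reduced setting described in point (e).
panic_print_mask() {
    echo $(( PANIC_PRINT_MEM_INFO | PANIC_PRINT_ALL_CPU_BT ))
}

# Possible usage, as root (not executed here):
#   sysctl -w kernel.panic_print="$(panic_print_mask)"
#   sysctl -w kernel.softlockup_panic=0   # do not panic on soft lockups
```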

#  ###########################################################################
#  ########################## Arch Kdump / Pstore  ###########################
#  ###########################################################################
#
#
#  This is the Arch Kdump/Pstore infrastructure; the goal is to collect
#  data whenever a kernel crash is detected. There is a lightweight
#  collection, that only grabs dmesg, and a more complete setting to grab the
#  whole (compressed) vmcore. See the DETAILS section below for more info.
#
#
#  ############################  HOW-TO USE IT  ##############################
#
#  1. Install the package with pacman if not available on your system; to check
#  if it's already installed, look at the pacman installed package list. Also,
#  be sure the systemd service was properly loaded by checking
#  'systemctl status kdump-init.service'.
#
#  2. In a crash event, the dmesg log is collected, and by default this happens
#  via the Pstore mechanism, i.e., no extra memory should be reserved and no
#  GRUB change is required. If 'lsmod' shows "ramoops", then Pstore is in use.
#  Some extra files are collected besides dmesg, like dmidecode output and the
#  "/etc/os-release" file.
#
#  3. The logs are stored in a ZIP file in the folder at "$MOUNT_FOLDER/logs"
#  (see the config file); this file is named as: "kdump-TIMESTAMP.zip",
#  where TIMESTAMP is the current timestamp (tz is UTC).
#
#  4. (IMPORTANT) Please, test the infrastructure in order to see if a dummy
#  crash log is collected before using it to try debugging complex issues.
#  In order to do that, log in to a shell and execute, as root user:
#  'echo 1 > /proc/sys/kernel/sysrq ; echo c > /proc/sysrq-trigger'
#
#  This action will trigger a dummy crash and reboot the system; check if
#  there is a ZIP file with the crash logs in the directory described in (3).
#
#  5. Various tunings are available at "/usr/share/kdump.d/*" files; for
#  example, the users can choose Kdump instead of Pstore (USE_PSTORE_RAM),
#  and if using Kdump, collect the full vmcore (FULL_COREDUMP). The vmcore is
#  not stored in the ZIP file, but it's saved in "$MOUNT_FOLDER/crash".
#  NOTICE that, if Kdump is used instead of Pstore (either per user's choice
#  or due to some failure in Pstore), a reboot is necessary before kdump is
#  usable, in order to effectively reserve crashkernel memory.
#
#  6. Error and success messages are sent to the systemd journal, so running
#  'journalctl -b | grep kdump' would hopefully bring some information.
#
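#  A minimal sketch of how steps (2) and (3) above could be checked from a
#  shell. Both helper functions are hypothetical (they are not shipped by this
#  package) and assume the logs directory is passed explicitly, for example
#  "$MOUNT_FOLDER/logs" with MOUNT_FOLDER taken from the config file.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if an 'lsmod'-style listing read from stdin
# contains the ramoops module (the Pstore RAM back-end).
pstore_in_use() {
    grep -q '^ramoops[[:space:]]'
}

# Hypothetical helper: prints the most recently modified "kdump-*.zip" in
# the given directory, or prints nothing if there is none.
latest_kdump_zip() {
    logs_dir="$1"
    # 'ls -t' sorts by modification time, newest first.
    ls -t "$logs_dir"/kdump-*.zip 2>/dev/null | head -n 1
}

# Possible usage after the test crash from step (4), not executed here:
#   lsmod | pstore_in_use && echo "Pstore (ramoops) is in use"
#   latest_kdump_zip "$MOUNT_FOLDER/logs"
```

#  Sorting by modification time is a choice made for this sketch, since the
#  exact TIMESTAMP format in the file name is not described here.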
#
#  ##############################  DETAILS  ##################################
#  CAVEATS / INSTRUCTIONS
#  ###########################################################################
#  (a) We automatically edit GRUB config in case Pstore fails or if the user's
#  choice is to use Kdump. But it requires one reboot in order for the
#  crashkernel memory to be effectively reserved by the kernel.
#
#  In case Kdump is used, the necessary crashkernel memory was empirically
#  determined; setting 144M wasn't enough, 160M is unstable, so 192M seems
#  good enough. This amount might change in future kernel versions, requiring
#  tests using the approach suggested in step (4) above.
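#
#  After the reboot mentioned in (a), one way to confirm that a crashkernel
#  reservation was requested is to look for the parameter in the kernel
#  command line. The helper below is a hypothetical sketch, not part of this
#  package; it only parses a command line string given as its argument.

```shell
#!/bin/sh
# Hypothetical helper: prints the value of the crashkernel= parameter found
# in the given kernel command line, or returns non-zero if it is absent.
crashkernel_value() {
    # Intentionally unquoted: split the command line on whitespace.
    for arg in $1; do
        case "$arg" in
            crashkernel=*)
                printf '%s\n' "${arg#crashkernel=}"
                return 0
                ;;
        esac
    done
    return 1
}

# Possible usage on a live system (not executed here):
#   crashkernel_value "$(cat /proc/cmdline)"
#   cat /sys/kernel/kexec_crash_size   # bytes currently reserved
```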
#
#
#  TODOs
#  ###########################################################################
#  * Would be interesting to have a clean-up mechanism, to keep up to N most
#  recent ZIP log files, instead of keeping all of them forever.
#
#  * Pstore ramoops back-end has some limitations that we're discussing with
#  the kernel community - right now we can only collect ONE dmesg and its
#  size is truncated at "record_size" bytes, not allowing a file split like
#  efi-pstore; thankfully we still can collect 2MiB dmesg, but hopefully we can
#  improve that upstream.
#
#  * Add a more reliable reboot mechanism - we have seen issues in the past
#  with "reboot -f", and relying on sysrq reboot as a quirk managed to be a safe
#  option, so this is something to think about. Should be easy to implement.
#
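#  The first TODO item above (keeping only the N most recent ZIP log files)
#  could be approached along the lines of the sketch below. It is not part of
#  this package: the function name and its arguments are hypothetical, and it
#  assumes the "kdump-*.zip" naming from the HOW-TO section and that
#  modification time reflects the collection order.

```shell
#!/bin/sh
# Hypothetical helper: keeps the N most recently modified "kdump-*.zip"
# files in a directory and deletes the older ones.
#   $1 - logs directory (e.g. "$MOUNT_FOLDER/logs")
#   $2 - number of ZIP files to keep
prune_kdump_zips() {
    logs_dir="$1"
    keep="$2"
    # List newest first, skip the first $keep entries, remove the rest.
    ls -t "$logs_dir"/kdump-*.zip 2>/dev/null |
        tail -n +"$((keep + 1))" |
        while IFS= read -r old_zip; do
            rm -f -- "$old_zip"
        done
}

# Possible usage (not executed here):
#   prune_kdump_zips "$MOUNT_FOLDER/logs" 5
```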
Description
Fork of https://gitlab.freedesktop.org/gpiccoli/kdumpst that works if you use btrfs with subvolumes
Readme LGPL-2.1 844 KiB