Big but self-explanatory commit: rename the tool. The name choice was
kdumpst, since it's a tool to enable both kdump and pstore setting, also
it's a silly wordplay with the superlative of kdump, as in "kdumpest".
It's an invasive change (touches most of the files), but should
offer no functional change other than logging messages showing
kdumpst now, instead of kdump, and some filenames.
Notice it doesn't touch documentation, which will be done in
a subsequent commit.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
There is a bunch of improvements done here in the file collecting
loops during the log saving routine; special thanks to Clayton
Craft (craftyguy) for the suggestions that led to most of this
refactor.
(a) We were using the LOOP_CNT variable, and it proved to
be unnecessary; so removed it.
(b) Overall we were looping on a log counter and running
"find/grep" at each iteration. Hereby we reworked this logic
to loop on top of the file entries directly, improving both
performance and (likely) the code readability.
(c) Fixed some comments and a shellcheck complaint about not
guarding LOGS_FOUND with quotes ¬¬
With that refactor, we fixed a "counting" issue: if there are 2+
pstore logs (a console dump plus oops/panic dump), even though we only
save the dmesg from panic, it was saved as log#1 instead of log#0.
Really minor, but...fixed now.
Also, fixed here another relevant issue: we were always removing
the saved logs, even when the zip compression failed; now we
bail-out if the compressed blob wasn't created, preserving the
logs collected.
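
A minimal sketch of that bail-out, just to illustrate the idea (the
function and argument names are made up for this example, they are not
the ones used in the scripts):

```bash
#!/bin/bash
# Hypothetical helper: drop the collected logs only if the compressed
# blob really exists and is non-empty; otherwise keep them and bail out.
remove_logs_if_compressed() {
    local blob="$1"      # path of the compressed file (assumed name)
    local logs_dir="$2"  # folder holding the collected logs (assumed name)

    if [ ! -s "${blob}" ]; then
        echo "Compressed blob ${blob} not found, keeping ${logs_dir}" >&2
        return 1
    fi
    rm -rf -- "${logs_dir}"
}
```

The caller is expected to check the return value and stop there, so the
raw logs remain available for a later retry.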
P.S. Worth mentioning the use of here-strings in the code to
make global variables modified inside the loop available after
the loop - very interesting/useful trick but bash-only,
unfortunately.
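
For reference, the here-string trick looks roughly like the sketch
below. LOGS_SAVED and count_saved_logs are names invented for this
example; only the "done <<< ..." construct is the point here:

```bash
#!/bin/bash
LOGS_SAVED=0   # global counter, name chosen for this example only

count_saved_logs() {
    local logs_found="$1"  # newline-separated list of file paths
    local entry

    LOGS_SAVED=0
    # "cmd | while read" would run the loop in a subshell and lose the
    # update to LOGS_SAVED; the here-string keeps it in the current shell.
    while IFS= read -r entry; do
        if [ -n "${entry}" ]; then
            LOGS_SAVED=$((LOGS_SAVED + 1))
        fi
    done <<< "${logs_found}"
}

count_saved_logs "$(printf '%s\n' /tmp/log-a /tmp/log-b)"
echo "saved ${LOGS_SAVED} logs"   # prints "saved 2 logs"
```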
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Currently a lot of our configuration settings and panic sysctls are
highly specific to SteamOS, so let's make them distro-agnostic, more
focused on generic HW / usage of a regular Arch Linux user. The affected
configs/sysctls are detailed below:
(a) Pstore memory settings: since on Steam Deck we have somewhat
pre-reserved RAM for pstore (~15M due to kernel memory alignment
rounding), makes sense to have a bit more of such memory effectively
available for pstore. In the general case though, users will likely
need to reserve it manually, so 4M of total memory with 1M buffer
seems more than enough to collect a dmesg, especially considering point
(e) below.
(b) The log storage folder was tuned for Deck, in which we have A/B
partitioning scheme and a persistent /home, but in general (following
standard kdump tools "on the market", like Debian's/Fedora's), /var
is used for that, so we follow the trend here.
(c) Grub file location was also special on SteamOS, so let's make
it follow the default /boot/grub/grub.cfg here.
(d) Kdump-specific tunings: the goal for people using kdump (not pstore!)
is usually to collect the vmcore of the panicked kernel to explore it,
using tools like crash/drgn. This is not the main goal on SteamOS, in
which we want to collect as much info as we can get *on dmesg* and
that's it for most cases...
With that in mind, we needed "crash_kexec_post_notifiers" parameter
to dump more info on dmesg during a panic (a potentially problematic
parameter in some HWs BTW, but tested in depth on Deck) and we disabled
the vmcore saving by default as well. So, let's "revert" it here, having
vmcore capturing enabled by default and dropping the post_notifiers
parameter (see next point as well).
(e) About the sysctls, we are more aggressive on panicking on Deck
(like panic on soft lockups) and the goal is to collect the most info
we can on dmesg, so needed to enable panic_print to dump tasks and
whatnot on dmesg during a panic event. In the general case, people
that wish to have as much information as possible would go with
kdump, collecting vmcore, not with pstore collecting just the dmesg.
With that said, "reduce" panic_print here to only show memory info
and CPUs backtraces, and disable soft lockup panic.
(Also we cleared the file to drop mentioning the choices of *not*
panicking on hung tasks or RCU stalls).
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Also document a bit better why some parameters are added and why
we remove huge pages parameters, for example.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Zstd is way faster to compress/decompress and produces smaller archives
(compared to deflate/gzip in general), but it is only supported in
kernels 5.9+. Notice we fall back to the default "--compress=gzip" in
case the kernel does not support zstd.
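
One possible way to do such a kernel version test is sketched below.
This is not necessarily the check used in the script; the function name
is invented, and it assumes a "sort" that supports -V (GNU coreutils
does):

```bash
#!/bin/bash
# Print the compression to use for a given kernel release string.
pick_initrd_compression() {
    local kver="${1%%-*}"  # "6.1.0-arch1-1" -> "6.1.0"
    local lowest

    # sort -V orders version strings; if 5.9 sorts first (or is equal),
    # the given kernel is 5.9 or newer.
    lowest="$(printf '%s\n' "5.9" "${kver}" | sort -V | head -n 1)"
    if [ "${lowest}" = "5.9" ]; then
        echo "zstd"
    else
        echo "gzip"
    fi
}

echo "--compress=$(pick_initrd_compression "$(uname -r)")"
```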
Special thanks to Clayton (craftyguy) for the kernel version test.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Currently we just look at the command line for the systemd BOOT_IMAGE
entry and try to load the vmlinux as described there.
But in the field we noticed we have 2 patterns: either this cmdline
entry is complete/correct and maps to the vmlinux binary, OR it shows
the name of the file but not the full path (Arch case). So, let's
hereby attempt prepending /boot in case the validation for the first
case fails - if both ways don't work, we just fail kdump for now.
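
The two-step lookup can be sketched as below; resolve_boot_image is a
name made up for this example, and the parsing is simplified (it assumes
a plain path in BOOT_IMAGE, with no spaces):

```bash
#!/bin/bash
# Print the path of the kernel image named in a kernel command line,
# trying the value as-is first and then prefixed with the boot folder.
resolve_boot_image() {
    local cmdline="$1"
    local boot_dir="${2:-/boot}"
    local image

    image="$(printf '%s\n' "${cmdline}" | tr ' ' '\n' |
             sed -n 's/^BOOT_IMAGE=//p' | head -n 1)"
    if [ -z "${image}" ]; then
        return 1
    fi

    if [ -f "${image}" ]; then
        echo "${image}"                    # complete/correct path
    elif [ -f "${boot_dir}/${image#/}" ]; then
        echo "${boot_dir}/${image#/}"      # file name only (Arch case)
    else
        return 1                           # both ways failed
    fi
}

# Typical call: resolve_boot_image "$(cat /proc/cmdline)"
```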
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
The way kdump-load parses /proc/iomem currently expects it has only
a single RAM buffer - but what if we have more, how to choose?
This commit enables multi RAM buffers parsing, and the tool now
copes with various available buffers, selecting the biggest one.
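
A rough sketch of how picking the biggest buffer could look; the
function name is invented and the real script may parse the file
differently. It assumes the usual "start-end : RAM buffer" layout of
/proc/iomem:

```bash
#!/bin/bash
# Read /proc/iomem-like text on stdin and print "<range> <size in bytes>"
# for the largest "RAM buffer" entry found.
biggest_ram_buffer() {
    local line range start end size
    local best_size=0 best_range=""

    while IFS= read -r line; do
        case "${line}" in
            *"RAM buffer"*) ;;
            *) continue ;;
        esac
        # Keep only "start-end", dropping the label and the indentation.
        range="$(printf '%s' "${line%% :*}" | tr -d '[:space:]')"
        start="${range%-*}"
        end="${range#*-}"
        size=$(( 0x${end} - 0x${start} + 1 ))
        if [ "${size}" -gt "${best_size}" ]; then
            best_size="${size}"
            best_range="${range}"
        fi
    done

    if [ -z "${best_range}" ]; then
        return 1
    fi
    echo "${best_range} ${best_size}"
}

# Typical call (as root, or else the addresses show up zeroed):
#   biggest_ram_buffer < /proc/iomem
```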
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
This is one of the major changes/refactors so far, touches a lot of
files, and more important, it completely changes some premises.
With this patch, we now support fully both dracut-based and initcpio
initramfs systems.
For that to happen, we needed to decouple the initramfs creation from
scripts, by using alpm-hooks. These hooks allow scripts to be run on
events like kernel package installation or the installation of the very
package responsible for creating the initramfs image. We still have the
"kdump-load create-initrd" command though.
One of the biggest modifications here was in the Makefile, that now
composes multiple files by changing keywords (like INITRD) to the
respective initramfs system (dracut or mkinitcpio). Notice that this
brought some extra complexity to the package.
The logic used for supporting both initramfs systems was basically to
de-duplicate all possible code (keeping the shared code in common files),
using Makefile tricks to merge such files and have the unique
bits in dracut/initcpio specific files. We currently support dracut
and both mkinitcpio and mkinitcpio-git packages.
Caveats: currently the initramfs specific package removal is not handled
here. So, if the user has dracut and installs kdump, we install the
dracut hooks. In case this user decides to remove dracut and installs
mkinitcpio, we install the mkinitcpio hooks and all should work, but
the previous dracut hooks installed are not uninstalled by us; likely
the dracut package removal would drop the files itself.
This was a deliberate move to avoid even more alpm-hooks, should be
a rare case and as said, the package removal should clear the files
itself, without requiring our interaction. Also, by using the
alpm-hooks, we see "errors" (warnings really) about the other
initramfs package not being present - not sure if it's possible to
disable this behavior.
Finally, while at it:
* Added a new approach to dracut initramfs creation to pick the most
common block drivers - since it's hostonly, it doesn't add the ones
that aren't loaded, hence image is not bloated by that.
* Changed the "command -v makedumpfile" validation to something
more elegant - thanks for the suggestion Clayton (@craftyguy).
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
The kdump-load script is the only one that has more than a
single functionality, potentially being invoked in different ways.
With that in mind, just add a small usage/help for users'
information (thanks Emil for the suggestion).
While at it, changed the additional commands to be more friendly
and require exactly one command to be passed - changes summary:
s/initrd/create-initrd
s/clear/clear-initrd
s//load
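
The "exactly one command" rule boils down to something like the sketch
below (function names and the usage text are illustrative only, not
copied from the script):

```bash
#!/bin/bash
usage() {
    echo "Usage: kdump-load <load|create-initrd|clear-initrd>"
}

# Validate the arguments and print the command to be executed.
parse_command() {
    if [ "$#" -ne 1 ]; then
        usage >&2
        return 1
    fi
    case "$1" in
        load|create-initrd|clear-initrd)
            echo "$1"
            ;;
        *)
            usage >&2
            return 1
            ;;
    esac
}
```

So the old "initrd" and "clear" spellings, as well as calling it with
no argument at all, end up in the usage message.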
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
This is somewhat a big refactor. The early design of kdump/pstore was
meant to support the A/B scheme of Steam Deck and a dracut-based initrd
only. In this scheme, we had a DEVNODE (like nvme partition or a devlink)
that represented the device node to be mounted early in boot when kdump
was in use. Also, we had a folder defined in the config file on top of
such dev node, and a temporary file used to "carry" the composition of
the full kdump path across boot time scripts. Yeah, kinda complex setup.
We are now moving to a more generic approach, and for that, the design
choice was a more convenient/simple one for the common cases, that
requires some operations to properly work on SteamOS dracut-based initrd.
Now we have only a single path on config file, which should be accessible
in a R/W filesystem by both scripts executed in the systemd service. No
devnode information or temp file is used anymore.
But with that, comes the need of discovering the proper devnode and base
folder for kdump'ing early in boot, from the initrd. Using the findmnt
tool we manage to derive all the necessary data during the initrd
preparation phase. Also, while at it we manage to fix an "inconsistency"
of our dracut initrd creation script: installkernel() should be responsible
for dealing with DRM modules removal, not install().
On top of this (already big) change, now our dracut initrd excludes not
only amdgpu driver/FWs, but radeon, nvidia and i915 as well. And due to
our refactor of the mount point information (using findmnt to collect info
during dracut initrd creation), we also allow now arbitrary filesystem
drivers to be included, i.e., we don't hardcode/limit for ext4 only.
Again, mea culpa for not splitting this in multiple atomic/simple commits,
the burden to keep a pretty git log is starting to consume precious time.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
The main goal of this tooling is to collect panic logs, but
pstore/ramoops could be set to use other backends, like the
console or ftrace ones. With that, save-dumps may end up
collecting these other logs as well, which is out of
scope - hence, fix it to only deal with "dmesg-ramoops" logs.
While at it, check the existence of the pstore file; the loop
might end up having a null element, so just mimic the kdump
loop and check if it's a valid/existing file.
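
In shell terms, the filtering plus the existence check is roughly the
following (a sketch with an invented function name, not the exact code
in save-dumps):

```bash
#!/bin/bash
# Print only the panic logs (dmesg-ramoops-*) found in a pstore folder.
list_panic_logs() {
    local pstore_dir="$1"
    local entry

    for entry in "${pstore_dir}"/dmesg-ramoops-*; do
        # If nothing matches, the pattern itself is what we get here,
        # so make sure the entry is a real file before using it.
        if [ -f "${entry}" ]; then
            echo "${entry}"
        fi
    done
}

# Typical call: list_panic_logs /sys/fs/pstore
```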
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Based on Emil's (xexaxo) feedback, we now have a common.sh file
that contains the implementation of the routine to read all config
files for kdump/pstore, and we use Makefile to join the files,
having the same implementation in all users.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Tried here to give a brief reasoning on why we followed Debian;
or at least, have it explicitly mentioned in the comments, also
mentioning Ubuntu. These parameters make sense since the Debian/Ubuntu
parameter approach is quite simple, hence mimicked here.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Add hereby "set -uo pipefail", with the goal of improving
reliability (suggested by Emil/@xexaxo). Notice that the
suggestion included "-e", but we rely on commands returning
non-zero and check their status ourselves, so instead of
refactoring the code just to have this option, the choice was
to not have it.
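
To illustrate the kind of pattern that "-e" would break (the has_match
helper and the strings are invented for this example):

```bash
#!/bin/bash
set -uo pipefail   # deliberately no "-e"

# Returns the pipeline status: 0 on a match, non-zero otherwise.
has_match() {
    printf '%s\n' "$1" | grep "$2" > /dev/null
}

has_match "Kernel panic - not syncing" "Oops"
RET=$?
# With "set -e" the script would have exited on the call above; without
# it, we get to inspect the status and decide what to do next.
if [ "${RET}" -ne 0 ]; then
    echo "no oops found (status ${RET}), moving on"
fi
```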
Also, make use of bash as the shell to execute the tools - after
some analysis, we make use of a few bashisms that are a bummer to
change; since a lot of scripts in SteamOS make use of bash and
in general it is a very common shell, let's just go along with it.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Currently, for no reason we have a different folder structure in
the kdump initrd compared to the installed package. Change it here,
so both directory structures match now.
We also changed the copy command for the config files, removing
some unnecessary quotes.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
So far, many changes were implemented to accommodate the
upstreaming of the kdump/pstore tool, so let's hereby
update the docs to match that effort.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
This is not really related to the upstreaming effort, but many
changes happened so it's a lot easier to just add this patch on
top of it.
There are two issues: first, if the kdump directory is empty, the
"find" tool complains, which is harmless but unnecessary log
pollution. But more important, the second issue is related to
writing the kernel versions file, which was unreliable - we now
fix that by being more explicit in the usage of stdout redirection.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
As part of the upstreaming effort, we need to add some extra tunings
in the package, especially related to GRUB autosetting and Pstore
memory settings:
(a) Currently the ramoops record size and memory amount are hardcoded
in the kdump-load script - we change it here, by having these settings
on the kdump config file;
(b) GRUB autosetting is pretty simple and everything is hardcoded.
We hereby add a bunch of configurable settings in the kdump conf file,
so that we can customize the GRUB handling, to make it work in both
Arch and SteamOS.
While at it, fixed some related comments and renamed some variables,
usually dropping KDUMP_ name when it applies to pstore as well.
Also, bumped the crashkernel memory from 192M to 256M - recent kernels
demand more memory, let's play safe.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
We have been way too conservative to prevent boot flaws, returning
always success even on failures. This is changed now.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
This is a somewhat intrusive change, but necessary if we want
to upstream the kdump tooling while allowing a great extent of
customization on SteamOS.
With this change, we have now a kdump.d folder on /usr/share,
that holds configuration files in the same way sysctl.d does.
In other words, we can easily override default settings by
just having more configuration files, which are sourced
following natural name sorting, i.e., we have now the concept
of config file precedence in kdump.
Our default config file is called 00-default, so we eventually
might have e.g. a 01-steamos, with Deck's custom settings.
This is planned for another package though.
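
The precedence idea can be sketched as below. load_config_dir is a
name invented for this example; the real routine may differ in the
details:

```bash
#!/bin/bash
# Source every config file in a kdump.d-like folder, in name order, so
# later files override settings from earlier ones.
load_config_dir() {
    local conf_dir="$1"
    local cfg

    # Pathname expansion returns the entries already sorted by name,
    # so 00-default is sourced before 01-steamos and so on.
    for cfg in "${conf_dir}"/*; do
        if [ -f "${cfg}" ]; then
            # shellcheck disable=SC1090
            . "${cfg}"
        fi
    done
}

# Typical call: load_config_dir /usr/share/kdump.d
```

A setting present only in 00-default keeps its default value, while
one redefined in a later file takes the later value.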
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Remove Steam/SteamOS references from things like headers,
journal error messages, etc.
While at it, also improve wording in some points.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
This is a pretty big refactor in the logic / goals of this kdump
implementation.
* WHY?
We want to decouple completely the log submission mechanism from the
kdump tooling, for mainly two reasons: reuse this submission API/mechanism
in other log collection tools, and to allow upstreaming the kdump tooling
for Arch Linux generically, not embedding SteamOS particulars to it.
* HOW:
First of all, we dropped the log submission bits from this codebase.
We also deleted the particulars of SteamOS/Deck in the log naming,
like collecting the serial of the device if "Jupiter" model is found
in the DMI info or getting the Steam user account via the VDF file.
All of that will happen in a later stage of the log processing, done by
*another tool* that shall rename the logs and transmit them to the
Valve servers.
While at it, we've done other small changes in the logic to make this
kdump tool more generic and reliable, like allowing the collection
of kdump *AND* pstore logs (not choosing one of them).
* CAVEATS / TODO:
More to come in this front, we still definitely need to remove more
references to SteamOS and clear a bit the code from its particulars.
Important also is to update the README to reflect the changes made
by the upstreaming effort.
Mea culpa: these changes are invasive, switch some logic and
expectations around the package, so making them fully bisectable
would be way harder than not. Hence, please take that into account:
this series should be tested/merged as a whole, it's not guaranteed
that individual patches work correctly in a standalone fashion.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Change the service target to basic instead of multi-user; this
allows coping with a subsequent log submission service, for
example, that could run on multi-user target.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Drop "steamos" references as well as "submit/submitter" wording
given that we are decoupling the submission mechanism of the
logs' collection.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Currently we have a loader script, that just calls the dump saving
script and exits successfully. This was meant to prevent potential
boot stalls due to systemd delays, but it proved to be just
unnecessary...so, let's drop it here.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Or else, the service doesn't work properly (yet it doesn't
fail to prevent boot stalls).
Reported-by: John Schoenick <johns@valvesoftware.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
I found a small RCU stall in v6.1-rc3 that is quite insignificant,
but panicked the system. I feel it's risky to have the panic on RCU
stalls enabled by default; it seems better to stay on the "safe" side
of the trade-off and eventually lose some report than to panic
on a meaningless stall and affect user experience.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
As per Emil's (@xexaxo) suggestion, change the dracut mechanism
added in a recent commit for log pollution prevention.
It is waaay simpler to just directly use the variable that prevents
dracut from flooding the output with harmless xattr complaints.
The previous approach was an unfortunate braino of mine.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
The kdump-steamos tooling creates an initrd file based on the running
kernel version for kdump, during package installation (even if kdump
is not the default crash collection mechanism). Such file lives in
the /home partition.
When SteamOS image is upgraded, with a new kernel, there is a mechanism
to create a new initrd either manually or automatically, just before the
kdump load; in the end, we may have lots of initrds (one per kernel
version ever installed), but we don't have currently a way to clear that.
Well, until now. We hereby introduce such a simple mechanism, to prevent
waste of precious /home space with useless initrd files. It works by
comparing installed kernels [0] with the kdump-initrd-* files in /home,
and if we have some of these kdump initrds that have no match with any
installed kernel, they are removed (and such operation is logged in the
systemd journal).
[0] Definition: installed kernel in our context is a kernel that
has a modules folder in /lib/modules .
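
A sketch of the clean-up logic follows. The function name and the
".img" suffix are assumptions made for this example, and it just prints
a message where the real script logs to the journal:

```bash
#!/bin/bash
# Remove kdump initrds that have no matching folder in the modules dir.
clear_stale_initrds() {
    local modules_dir="$1"  # e.g. /lib/modules
    local initrd_dir="$2"   # folder holding the kdump-initrd-* files
    local initrd kver

    for initrd in "${initrd_dir}"/kdump-initrd-*; do
        if [ ! -f "${initrd}" ]; then
            continue
        fi
        # "kdump-initrd-6.1.0.img" -> "6.1.0" (suffix is an assumption)
        kver="${initrd##*/kdump-initrd-}"
        kver="${kver%.img}"
        if [ ! -d "${modules_dir}/${kver}" ]; then
            rm -f -- "${initrd}"
            echo "Removed stale initrd: ${initrd}"
        fi
    done
}
```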
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Due to some xattr vs. btrfs issues, we see a lot of warnings
when creating the initrd. These are harmless, but pollute logs
and may cause some unnecessary concern for the users.
Let's just suppress these warnings in the kdump initrd creation.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
If we don't have makedumpfile, it doesn't make sense to construct
the kdump initrd and let it be loaded; it's going to fail in the
kdump dmesg collection, during a panic event, with no clear traces
for users to diagnose the issue.
So, let's bail-out if we don't have makedumpfile, forcing the
kdump load to fail instead, which is clearly warned in journalctl.
Also, change the approach for the kdump.conf file as well, in
order to fail creating the initrd if any of the files are missing.
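
The bail-out is conceptually as simple as the sketch below
(require_tool is a made-up helper name; the dracut module has its own
way of reporting the failure):

```bash
#!/bin/bash
# Fail early, with a clear message, if a required tool is missing.
require_tool() {
    if ! command -v "$1" > /dev/null 2>&1; then
        echo "$1 not found, refusing to create the kdump initrd" >&2
        return 1
    fi
}

# Typical call: require_tool makedumpfile || exit 1
```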
While at it, fix a trailing space in the module-setup.sh file.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
There might be a (rare) case of missing initrd when loading a kdump.
It's rare mainly for 2 reasons:
(a) Pstore is the default log collection mechanism, kdump should only
be used as a fallback;
(b) When the package is installed, initrd is created for the
running kernel.
But imagine the user installs a new kernel with no Deck image upgrade;
this would cause the issue of a missing initrd if/when kdump is loaded.
We hereby fix it by attempting to create the initrd before kdump load,
in case it doesn't exist.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
This patch's main goal is to "un-Debianize" the configuration file
for kdump-steamos - thanks Emil (@xexaxo) for the discussions; it
is with a bit of a heavy heart that I do that, but let's comply with
the modern distros ;-)
We hereby put the config file in a more standard path: /usr/share.
Usually users could override that with /etc/ file, but not in this
case, or at least, not for now. Kdump/pstore is expected to work
quietly, with no users' interference. Advanced users might want to
play with the configs though; and those can just go ahead and edit
the /usr/share/kdump/kdump.conf - it's all documented in the README.
In the future we can improve that by having the override mechanism
with the /etc file, let's see if we have a demand for that.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Currently we hardcode the account-related VDF file path in kdump,
and expose it in "/etc/default/kdump" - this is unnecessary since
this path is not expected to change nor users to mess with it; thanks
Emil (@xexaxo) for this suggestion.
So, this patch improves things in some ways:
(a) Do not expose VDF path or Valve's server URL in user configurable
file - no reasons for users to mess with that.
(b) Generate "/home" mount point based on DEVNODE, also determine
the username based on "getent'ing" the passwd database. See CAVEAT
below.
(c) Move the VDF parsing to a separate function to clean up the
submit log path on submit-report.sh .
No functional change is expected after this commit.
CAVEAT: Notice that "getent passwd" is *VERY* slow, and if we follow
a generic approach of doing it for UID_MIN..UID_MAX, it takes quite
some time. So, instead we simplify and just query the user 1000; this
might be a bit incomplete, but it's still better than hardcoding a
username as it's done until now.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Thanks to Emil's (@xexaxo) suggestion, we hereby implement a less fragile
way of obtaining the "/home" mount point. Emil suggested that instead of
using the device name directly, we could use the generic link, as in:
"/dev/disk/by-partsets/shared/home".
In principle the change would be simple, but it proved to be a bit tricky
due to the early boot stage in which kdump executes - at that point we
don't have this link available, so we need to rely on the full device
name directly on kdump collection. We achieve that by saving this
information in the kdump initrd - this is not completely safe, see the
CAVEAT below.
Also, we improved kdump loading script by using "findmnt", a less
fragile / more elegant way of getting the "/home" mount point.
CAVEAT: NVMe multipathing introduced a "randomness" level to device
naming on Linux, so "nvme0n1" could be "nvme1n1" in some boots, if we
have more than one device. There is a kernel parameter to avoid that
("nvme_core.multipath=0"), see [0] for more information.
Due to this reason, we could in theory have different NVMe device
names between regular kernel boot and the kdump one, hence causing a
failure in kdump collection.
But this is pretty much safe since we don't have multiple NVMe
devices, also we could disable multipath in kernel config
(CONFIG_NVME_MULTIPATH) or use the above cmdline.
[0] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1792660/
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Currently half of the files are hyphenated, while the rest use
underscore. Just move everything to hyphens.
Signed-off-by: Emil Velikov <emil.velikov@collabora.com>
Users may want to not submit logs to Valve servers, either for debug
purposes or just due to their preference. This patch adds a setting
for that, exposed in /etc/default/kdump . Information about this
tuning was added to README as well.
Finally, this commit also improves journal output when we bail-out
without submitting the logs to Valve servers, showing a friendly
message pointing to the locally saved file.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
With this addition, kdump-steamos is now capable of editing grub.cfg
to automatically add the required kdump parameters, in case kdump
is used. If Pstore ends up being used and the grub.cfg was modified
by kdump-steamos automatically, we're also able to undo the change
and save users from memory waste.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>