This is a somewhat intrusive change, but it is necessary if we want
to upstream the kdump tooling while still allowing a great extent of
customization on SteamOS.
With this change, we now have a kdump.d folder in /usr/share,
which holds configuration files in the same way sysctl.d does.
In other words, we can easily override default settings by
just adding more configuration files, which are sourced
following natural name sorting, i.e., we now have the concept
of config file precedence in kdump.
Our default config file is called 00-default, so we might eventually
have e.g. a 01-steamos, with the Deck's custom settings.
That is planned for another package, though.
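As a rough illustration of the precedence idea, here is a minimal sketch of
such a loader. The function name load_kdump_config and the MODE variable in
the usage note are hypothetical, not taken from the actual scripts; only the
kdump.d folder and the 00-default / 01-steamos names come from this change.

```shell
# Hypothetical loader: sources every regular file found in a kdump.d-style
# directory. Glob results come back sorted by name, so a later file
# (e.g. 01-steamos) overrides values set by an earlier one (00-default).
load_kdump_config() {
    local dir="$1" cfg
    for cfg in "$dir"/*; do
        # Skip directories and the literal pattern left by an unmatched glob.
        [ -f "$cfg" ] || continue
        # shellcheck disable=SC1090
        . "$cfg"
    done
}
```

With a 00-default setting MODE=default and a 01-steamos setting MODE=steamos,
calling "load_kdump_config /usr/share/kdump.d" would leave MODE=steamos.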
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Remove Steam/SteamOS references from things like headers,
journal error messages, etc.
While at it, also improve wording in some points.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
This is a pretty big refactor in the logic / goals of this kdump
implementation.
* WHY?
We want to completely decouple the log submission mechanism from the
kdump tooling, mainly for two reasons: to reuse this submission API/mechanism
in other log collection tools, and to allow upstreaming the kdump tooling
for Arch Linux generically, without embedding SteamOS particulars in it.
* HOW:
First of all, we dropped the log submission bits from this codebase.
We also deleted the particulars of SteamOS/Deck in the log naming,
like collecting the serial of the device if "Jupiter" model is found
in the DMI info or getting the Steam user account via the VDF file.
All of that will happen in a later stage of the log processing, done by
*another tool* that shall rename the logs and transmit them to the
Valve servers.
While at it, we've done other small changes in the logic to make this
kdump tool more generic and reliable, like allowing the collection
of kdump *AND* pstore logs (not choosing one of them).
* CAVEATS / TODO:
More to come on this front; we definitely still need to remove more
references to SteamOS and clean the code of its particulars a bit.
Important also is to update the README to reflect the changes made
by the upstreaming effort.
Mea culpa: these changes are invasive, switch some logic and
expectations around the package, so making them fully bisectable
would be way harder than not. Hence, please take that into account:
this series should be tested/merged as a whole, it's not guaranteed
that individual patches work correctly in a standalone fashion.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Change the service target to basic instead of multi-user; this
accommodates, for example, a subsequent log submission service
that could run on the multi-user target.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Drop "steamos" references as well as "submit/submitter" wording,
given that we are decoupling the submission mechanism from the
log collection.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Currently we have a loader script that just calls the dump saving
script and exits successfully. This was meant to prevent potential
boot stalls due to systemd delays, but it proved to be just
unnecessary... so, let's drop it here.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Or else, the service doesn't work properly (yet it doesn't
report failure, in order to prevent boot stalls).
Reported-by: John Schoenick <johns@valvesoftware.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
I found a small RCU stall in v6.1-rc3 that is quite insignificant,
but it panicked the system. I feel it is risky to have this
enabled by default; it seems better to stay on the "safe" side
of the trade-off and eventually lose some reports than to panic
on a meaningless stall and affect user experience.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
As per Emil's (@xexaxo) suggestion, change the dracut mechanism
added in a recent commit for log pollution prevention.
It is waaay simpler to just directly use the variable that prevents
dracut from flooding the output with harmless xattr complaints.
The previous approach was an unfortunate braino of mine.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
The kdump-steamos tooling creates an initrd file for kdump, based on the
running kernel version, during package installation (even if kdump
is not the default crash collection mechanism). This file lives in
the /home partition.
When the SteamOS image is upgraded with a new kernel, there is a mechanism
to create a new initrd either manually or automatically, just before the
kdump load; in the end, we may have lots of initrds (one per kernel
version ever installed), but we currently have no way to clean them up.
Well, until now. We hereby introduce such a simple mechanism, to prevent
waste of precious /home space with useless initrd files. It works by
comparing installed kernels [0] with the kdump-initrd-* files in /home,
and if we have some of these kdump initrds that have no match with any
installed kernel, they are removed (and such operation is logged in the
systemd journal).
[0] Definition: installed kernel in our context is a kernel that
has a modules folder in /lib/modules .
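The comparison described above can be sketched roughly as follows. This is
not the actual implementation: the function name prune_kdump_initrds, the
assumption that the file name ends in the kernel version plus an optional
".img" suffix, and the journal message text are all hypothetical.

```shell
# Hypothetical cleanup helper.
#   $1: directory holding the kdump-initrd-* files (somewhere under /home)
#   $2: modules directory (/lib/modules on a real system)
prune_kdump_initrds() {
    local initrd_dir="$1" modules_dir="$2" initrd version
    for initrd in "$initrd_dir"/kdump-initrd-*; do
        [ -f "$initrd" ] || continue
        # Assumed naming: kdump-initrd-<kernel version>[.img]
        version="${initrd##*/kdump-initrd-}"
        version="${version%.img}"
        # A kernel counts as installed if it has a modules folder.
        if [ ! -d "$modules_dir/$version" ]; then
            rm -f "$initrd"
            # Record the removal in the journal; ignore logger failures.
            logger -t kdump "removed unused initrd: $initrd" 2>/dev/null || true
        fi
    done
}
```

Passing both directories as arguments keeps the sketch easy to try against
a scratch directory before pointing it at /home and /lib/modules.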
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Due to some xattr vs. btrfs issues, we see a lot of warnings
when creating the initrd. These are harmless, but pollute logs
and may cause some unnecessary concern for the users.
Let's just suppress these warnings in the kdump initrd creation.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
If we don't have makedumpfile, it doesn't make sense to construct
the kdump initrd and let it be loaded; it's going to fail in the
kdump dmesg collection, during a panic event, with no clear traces
for users to diagnose the issue.
So, let's bail out if we don't have makedumpfile, forcing the
kdump load to fail instead, which is clearly warned in journalctl.
Also, change the approach for the kdump.conf file as well, in
order to fail creating the initrd if any of the files are missing.
While at it, fix a trailing space in the module-setup.sh file.
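For illustration only, a generic pre-flight check in the same spirit is
sketched below. It is not the dracut module code; the function name
require_prereqs and its message format are assumptions.

```shell
# Hypothetical pre-flight check: succeed only if the given tool is in PATH
# and every listed file exists.
#   $1: tool name (e.g. makedumpfile); remaining args: required files
require_prereqs() {
    local tool="$1" f
    shift
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "kdump: $tool not found, refusing to build the initrd" >&2
        return 1
    fi
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "kdump: required file $f is missing" >&2
            return 1
        fi
    done
    return 0
}
```

A caller could run something like "require_prereqs makedumpfile
/usr/share/kdump/kdump.conf || exit 1" before creating the initrd.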
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
There might be a (rare) case of missing initrd when loading a kdump.
It's rare mainly for 2 reasons:
(a) Pstore is the default log collection mechanism, kdump should only
be used as a fallback;
(b) When the package is installed, initrd is created for the
running kernel.
But imagine the user installs a new kernel with no Deck image upgrade;
this would cause the issue of a missing initrd if/when kdump is loaded.
We hereby fix it by attempting to create the initrd before kdump load,
in case it doesn't exist.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
The main goal of this patch is to "un-Debianize" the configuration file
for kdump-steamos - thanks Emil (@xexaxo) for the discussions; it
is with a bit of a heavy heart that I do this, but let's comply with the
modern distros ;-)
We hereby put the config file in a more standard path: /usr/share.
Usually users could override that with a file in /etc, but not in this
case, or at least, not for now. Kdump/pstore is expected to work
quietly, with no users' interference. Advanced users might want to
play with the configs though; and those can just go ahead and edit
the /usr/share/kdump/kdump.conf - it's all documented in the README.
In the future we can improve that by having the override mechanism
with the /etc file, let's see if we have a demand for that.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Currently we hardcode the account-related VDF file path in kdump,
and expose it in "/etc/default/kdump" - this is unnecessary, since
this path is not expected to change, nor are users expected to mess
with it; thanks Emil (@xexaxo) for this suggestion.
So, this patch improves things in some ways:
(a) Do not expose VDF path or Valve's server URL in user configurable
file - no reasons for users to mess with that.
(b) Generate "/home" mount point based on DEVNODE, also determine
the username based on "getent'ing" the passwd database. See CAVEAT
below.
(c) Move the VDF parsing to a separate function to clean up the
submit log path on submit-report.sh .
No functional change is expected after this commit.
CAVEAT: Notice that "getent passwd" is *VERY* slow, and if we follow
a generic approach of doing it for UID_MIN..UID_MAX, it takes quite
some time. So, instead we simplify and just query the user 1000; this
might be a bit incomplete, but it's still better than hardcoding a
username, as was done until now.
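A minimal sketch of the single-UID lookup, assuming a helper named
user_for_uid (hypothetical, not the name used in the scripts):

```shell
# Hypothetical helper: print the login name for one UID (1000 by default),
# instead of scanning the whole UID_MIN..UID_MAX range.
user_for_uid() {
    local uid="${1:-1000}"
    # passwd entries are colon-separated; the login name is the first field.
    getent passwd "$uid" | cut -d: -f1
}
```

Only one getent call is made, which is what keeps this approach fast.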
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Thanks to Emil (@xexaxo) suggestion, we hereby implement a less fragile
way of obtaining the "/home" mount point. Emil suggested that instead of
using device name directly, we could use the generic link, as in:
"/dev/disk/by-partsets/shared/home".
In principle the change would be simple, but it proved to be a bit tricky
due to the early boot stage at which kdump executes - at that point we
don't have this link available, so we need to rely on the full device name
directly during kdump collection. We achieve that by saving this information
in the kdump initrd - this is not completely safe, see the CAVEAT below.
Also, we improved kdump loading script by using "findmnt", a less
fragile / more elegant way of getting the "/home" mount point.
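As an illustration of the findmnt usage (the helper names below are
hypothetical, and the exact invocation in the loading script may differ):

```shell
# Hypothetical helpers built on findmnt(8).
# Print the mount point of the filesystem that contains the given path.
mount_point_of() {
    findmnt -f -n -o TARGET --target "$1"
}

# Print the source device backing the filesystem that contains the path.
mount_source_of() {
    findmnt -f -n -o SOURCE --target "$1"
}
```

For instance, "mount_source_of /home" would print the device node that
backs "/home" in the running system.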
CAVEAT: NVMe multipathing introduced a "randomness" level to device
naming on Linux, so "nvme0n1" could be "nvme1n1" in some boots, if we
have more than one device. There is a kernel parameter to avoid that
("nvme_core.multipath=0"), see [0] for more information.
Due to this reason, we could in theory have different NVMe device
names between regular kernel boot and the kdump one, hence causing a
failure in kdump collection.
But this is pretty much safe since we don't have multiple NVMe
devices, also we could disable multipath in kernel config
(CONFIG_NVME_MULTIPATH) or use the above cmdline.
[0] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1792660/
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Currently half of the files are hyphenated, while the rest use
underscore. Just move everything to hyphens.
Signed-off-by: Emil Velikov <emil.velikov@collabora.com>
Users may want to not submit logs to Valve servers, either for debug
purposes or just due to their preference. This patch adds a setting
for that, exposed in /etc/default/kdump . Information about this
tuning was added to README as well.
Finally, this commit also improves journal output when we bail out
without submitting the logs to Valve servers, showing a friendly
message pointing to the locally saved file.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
With this addition, kdump-steamos is now capable of editing grub.cfg
to automatically add the required kdump parameters, in case kdump
is used. If Pstore ends up being used and grub.cfg was modified
by kdump-steamos automatically, we're also able to undo the change
and save users from wasting memory.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Finally we have a functional mechanism to upload the crash logs
to Valve servers (special thanks to TonyP for the API help).
Documentation is present in the README.MD as usual.
NOTE: it is worth reinforcing here what was already mentioned in the
README: kdump-steamos doesn't perform any significant validation
against malicious usage of the log submission mechanism, like DDoS,
submitting a malicious ZIP binary, a very large file, etc.
All of this is expected to be handled by the server side.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
In order to make the dump "tarball" easy for consumption in both
Windows and Linux (and maybe more OS styles), we changed here the
compression format to ZIP; also, we're not creating a tarball anymore,
just a simple ZIP file with all the logs collected, no directory
structure.
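A flat archive like this can be produced with zip's -j option; the sketch
below assumes a hypothetical pack_logs helper and is not the actual script.

```shell
# Hypothetical helper: pack the given log files into a flat ZIP archive.
#   $1: output archive; remaining args: files to include
pack_logs() {
    local out="$1"
    shift
    # -j stores only the file names (no directory components); -q is quiet.
    zip -j -q "$out" "$@"
}
```

Since -j discards the paths, the input files need distinct base names.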
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
...like DMI data, os-release build information, running kernel
version - in the future, we should also collect Steam application /
Proton / Games logs.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Before this patch, the sysctl parameters for panic on oops, lockups, etc
were tracked in the "image-recipes/jupiter" gitlab repositories. After
MR [0] this changed and now we must keep track of the sysctls inside
the kdump-steamos package.
Hence, this commit adds the sysctl config file into "/usr/lib/sysctl.d",
also introducing the "panic_print" sysctl, to enable dumping more info
on panic events.
[0] https://gitlab.steamos.cloud/jupiter/jupiter/-/merge_requests/1
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
OK, I should likely have split this commit into many more, to ease
reviews and improve bisectability. But since I didn't, I'll at least
try to summarize here what was changed by this big commit, going
from the more complex additions to the trivial ones:
- Added a log submission system; for now, it doesn't submit logs
to anywhere since we don't have such API well-defined. But it saves
the logs locally in a tar.zst, using timestamps to define the moment
of log collection and clears the old files. The naming of the log
compressed tarball makes use of the Deck serial number and Steam
most recent logged account (thanks @TonyP for that idea).
- Still on the log submission topic: we try first pstore logs, and
if we have none of them, we check kdump logs. In case we have a
vmcore, we save it locally (after renaming), but for now we consider
that this kind of file won't be submitted.
- Since we now have makedumpfile in Holo (thanks @xexaxo), I've
removed this binary from here and made this package dependent on
makedumpfile. Also, adjusted the respective scripts.
- Added some error messages through logger, in case we fail to
load pstore/kdump or fail in the log submission script.
- Changed systemd service to just do the pstore/kdump loading
and call the submit_report.sh script, which is then "disowned"
so that the systemd service finishes as fast as possible; boot
delays due to this service wouldn't be nice.
- Increased the "crashkernel" memory recommended in the documentation,
since I've noticed the most recent version of the kernel requires
a bit more - let's play safe!
- Changed timestamps to use the UTC timezone - kdump collection
happens so early that the timezone is not set yet, so let's stick
with UTC in all cases.
- Checked scripts with shellcheck[0] and improved the README.
[0] Accepted most suggestions, but some are controversial and may introduce
issues, so the scripts do not fully pass shellcheck and I don't
expect them to in the future; it's just a minor style improvement
and a failsafe for multiple shells.
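To illustrate the UTC timestamp item from the list above, a minimal sketch
follows. The helper name utc_timestamp and the YYYYMMDDhhmmss layout are
assumptions, not necessarily what the scripts use.

```shell
# Hypothetical helper: timestamp that does not depend on the local timezone.
utc_timestamp() {
    # -u makes date(1) use UTC regardless of TZ or /etc/localtime.
    date -u +%Y%m%d%H%M%S
}
```

The same value is produced whether or not a timezone is configured, which
is the point during early-boot kdump collection.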
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Check README.MD and /etc/default/kdump for instructions on
pstore usage - should be simple, it's automatically configured.
Notice that we expect all units to have the same e820 memory
map, hence to have the RAM buffer available. This point should
be better clarified by the team working with firmware.
Also, the package now enables the kdump systemd service
automatically, in a post-installer hook.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Hopefully there's a way to throw this in some area on the repo
other than intermixed with code - I'll check.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>