bb7a1b67c7301f85640d2bdca6278b71215a5766
The kdump-steamos tooling creates an initrd file for kdump, based on the
running kernel version, during package installation (even if kdump is not the
default crash collection mechanism). This file lives in the /home partition.
When the SteamOS image is upgraded with a new kernel, there is a mechanism to
create a new initrd, either manually or automatically, just before the kdump
load. In the end, we may have lots of initrds (one per kernel version ever
installed), and until now we had no way to clear them.

We hereby introduce a simple mechanism to prevent wasting precious /home space
on useless initrd files. It works by comparing the installed kernels [0] with
the kdump-initrd-* files in /home; any kdump initrd that matches no installed
kernel is removed, and the removal is logged in the systemd journal.

[0] Definition: an installed kernel, in our context, is a kernel that has a
modules folder in /lib/modules.

Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
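The comparison described in this commit message could look roughly like the
sketch below. It is not the code from this commit: the ".img" suffix, the
function name and the two directory arguments are assumptions made for
illustration, and the "logger" tool is used here only as one possible way to
reach the systemd journal.

```shell
#!/bin/bash
# Sketch only: remove kdump initrds that match no installed kernel.
# Assumptions (not from the source): initrds are named
# "kdump-initrd-<kernel version>.img" and both directories are passed in.

cleanup_stale_initrds() {
    local modules_dir="$1"   # e.g. /lib/modules
    local initrd_dir="$2"    # hypothetical directory holding the kdump initrds
    local initrd kver

    for initrd in "${initrd_dir}"/kdump-initrd-*.img; do
        [ -e "${initrd}" ] || continue        # the glob matched nothing
        kver="${initrd##*/kdump-initrd-}"     # strip the path and the prefix
        kver="${kver%.img}"                   # strip the assumed suffix

        # "Installed kernel" means: it has a modules folder in modules_dir.
        if [ ! -d "${modules_dir}/${kver}" ]; then
            rm -f -- "${initrd}"
            # Log the removal; a missing or failing logger is ignored.
            logger -t kdump "removed stale initrd ${initrd}" 2>/dev/null || true
        fi
    done
}
```

A call such as 'cleanup_stale_initrds /lib/modules DIR', with DIR being
wherever the kdump initrds are kept, would leave only the initrds whose kernel
version still has a modules folder.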
# SPDX-License-Identifier: LGPL-2.1+
#
# Copyright (c) 2021 Valve.
#
# Maintained by Guilherme G. Piccoli <gpiccoli@igalia.com>
#
#
# ###########################################################################
# ############################ SteamOS Kdump ##############################
# ###########################################################################
#
#
# This is the SteamOS Kdump/Pstore infrastructure; the goal is to collect
# data whenever a kernel crash is detected. There is a lightweight
# collection, that only grabs dmesg, and a more complete setting to grab the
# whole (compressed) vmcore. See the DETAILS section below for more info.
#
#
# ############################ HOW-TO USE IT ##############################
#
#
# 1. Install the package with pacman if not available in your image; to check
# if it's already installed, look at the pacman installed package list. Also,
# be sure the systemd service was properly loaded by checking
# 'systemctl status kdump-steamos.service'.
#
# 2. In a crash event, the dmesg log is collected, and by default this happens
# via the Pstore mechanism, i.e., no extra memory should be reserved and no
# GRUB change is required. If 'lsmod' shows "ramoops", then Pstore is in use.
# Besides the dmesg with some extra information (like tasks running, memory
# usage on crash, etc), more logs are collected like the image build version,
# running kernel version and dmidecode.
#
# 3. The logs are stored in a ZIP file at "/home/.steamos/offload/var/kdump/";
# if this ZIP file was successfully submitted to Valve servers, this file is
# then moved into the sub-folder "sent_logs/". The file is named
# "steamos-SERIAL-STEAM_USER.timestamp.zip", where SERIAL is the machine
# serial (from dmidecode), STEAM_USER is the Steam account name (based on the
# last logged Steam user) and the timestamp is in UTC.
#
# 4. (IMPORTANT) Please test the infrastructure, to see if a dummy crash log
# is collected, before using it to try debugging complex issues.
# In order to do that, log in to a shell and execute, as root user:
# 'echo 1 > /proc/sys/kernel/sysrq ; echo c > /proc/sysrq-trigger'
#
# This action will trigger a dummy crash and reboot the system; check if
# there is a ZIP file with the crash logs in the directory described in (3).
#
# 5. Some tunings are available at "/usr/share/kdump/kdump.conf"; e.g. users
# can choose Kdump instead of Pstore (USE_PSTORE_RAM), and if using Kdump,
# collect the full vmcore (FULL_COREDUMP). The vmcore is not stored in the
# ZIP file, but it's saved in "/home/.steamos/offload/var/kdump/crash/".
# NOTICE that, if Kdump is used instead of Pstore (either per user's choice
# or due to some failure in Pstore), a reboot is necessary before kdump is
# usable, in order to effectively reserve crashkernel memory.
#
# 6. Error and success messages are sent to the systemd journal, so running
# 'journalctl | grep kdump' would hopefully bring some information. Also,
# the ZIP file collected is automatically submitted to Valve servers (unless
# the feature is disabled by the user); see below under DETAILS/LOG SUBMISSION
# for API details, decisions made, how to disable the feature, etc.
#
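# The checks from steps (1) to (3) above can be sketched as a couple of shell
# helpers. This is a minimal sketch, not part of the package: the function
# names are made up for illustration, and only the package name, the service
# name, the "ramoops" module and the log directory come from the text above.

```shell
#!/bin/bash
# Sketch only: quick checks for the how-to steps above.
KDUMP_DIR="/home/.steamos/offload/var/kdump"   # directory from step (3)

# Succeeds if "ramoops" shows up in lsmod-style output read from stdin.
pstore_in_use() {
    grep -q '^ramoops[[:space:]]'
}

# Prints the collected ZIP logs found directly in the given directory
# (defaults to KDUMP_DIR); files already moved to "sent_logs/" are skipped.
list_crash_zips() {
    local dir="${1:-$KDUMP_DIR}"
    find "${dir}" -maxdepth 1 -type f -name 'steamos-*.zip' 2>/dev/null | sort
}

# Usage on a SteamOS system:
#   pacman -Q kdump-steamos
#   systemctl status kdump-steamos.service
#   lsmod | pstore_in_use && echo "Pstore (ramoops) is in use"
#   list_crash_zips
```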
#
# ############################## DETAILS ##################################
#
#
# CAVEATS / INSTRUCTIONS
# ###########################################################################
# (a) We automatically edit the GRUB config in case Pstore fails or if the
# user's choice is to use Kdump. But it requires one reboot for the crashkernel
# memory to be effectively reserved by the kernel.
#
# In case Kdump is used, the necessary crashkernel memory was empirically
# determined; setting 144M wasn't enough, 160M is unstable, so 192M seems
# good enough. This amount might change in future kernel versions, requiring
# tests using the approach suggested in step (4) above.
#
# (b) The kdump-steamos package requires a RW rootfs in case it's not currently
# embedded in your image. Users can make use of 'tune2fs' or 'steamos-readonly'
# in order to make the rootfs RW, since it's RO by default. Also, we assume the
# nvme partitioning scheme is default across all versions (A/B, nvme0n1p4 / p5
# are the root ones, etc) and didn't change with new updates, for example. Both
# Kdump and Pstore facilities rely on mounting partitions.
#
# (c) Due to a post-transaction hook exec'ed by libalpm (90-dracut-install.hook)
# unfortunately after installing the kdump-steamos package *all* initramfs
# images are recreated - this is not necessary, and we're thinking about how to
# prevent that, but for now be prepared: the installation takes some (long)
# minutes only due to that...
#
# (d) Unfortunately makedumpfile from Arch Linux is not available on official
# repos, only in AUR. But it is available on Holo, so we make use of that.
# Also, a discussion was started to get it included on official repos:
# https://lists.archlinux.org/pipermail/aur-general/2022-January/036767.html
# https://aur.archlinux.org/packages/makedumpfile/#comment-843853
#
# (e) NVMe multipathing introduced a "randomness" level to device naming on
# Linux, so "nvme0n1" could be "nvme1n1" in some boots, if we have more than
# one NVMe device. There's a kernel parameter to avoid that
# ("nvme_core.multipath=0"). So, since we rely on getting the NVMe device name
# to be used in kdump during the regular boot process, we could in theory have
# different names between regular kernel boot and the kdump one, hence causing
# a failure in kdump collection. But this is pretty much safe now since we
# don't have multiple NVMe devices, also we could disable multipath in kernel
# config (CONFIG_NVME_MULTIPATH) or use the above cmdline.
#
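# For item (a) above, one way to confirm after the reboot that the crashkernel
# memory was really reserved is sketched below. The helper names are made up
# for illustration; "/proc/cmdline" and "/sys/kernel/kexec_crash_size" are
# standard kernel interfaces, not something provided by this package.

```shell
#!/bin/bash
# Sketch only: report the crashkernel setting and whether memory is reserved.

# Prints the value of crashkernel= from the kernel command line given as $1.
crashkernel_param() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^crashkernel=//p' | head -n 1
}

# Succeeds if the given size file (default: the sysfs one) reports > 0 bytes.
crashkernel_reserved() {
    local size_file="${1:-/sys/kernel/kexec_crash_size}"
    [ -r "${size_file}" ] && [ "$(cat "${size_file}")" -gt 0 ]
}

# Usage:
#   crashkernel_param "$(cat /proc/cmdline)"
#   crashkernel_reserved && echo "crashkernel memory is reserved"
```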
#
# TODOs
# ###########################################################################
# * It would be interesting to have a clean-up mechanism, to keep up to N most
# recent ZIP log files, instead of keeping all of them forever.
#
# * Hopefully we can fix/prevent the unnecessary re-creation of all initramfs
# images - it happens due to our package installing files on directory
# "/usr/lib/dracut/modules.d" which triggers the unfortunate initramfs rebuild.
#
# * VDF parsing would benefit from some improvement, it's at least "fragile"
# for now, to be generous...but that seems a bit complicated.
#
# * Pstore ramoops back-end has some limitations that we're discussing with
# the kernel community - right now we can only collect ONE dmesg and its
# size is truncated on "record_size" bytes, not allowing a file split like
# efi-pstore; thankfully we still can collect 2MiB dmesg, but hopefully we can
# improve that upstream.
#
# * Add a more reliable reboot mechanism - we have seen issues in the past
# with "reboot -f", and relying on sysrq reboot as a quirk proved to be a safe
# option, so this is something to think about. Should be easy to implement.
#
# * Maybe a good idea would be to allow creating the minimum image for any
# specified kernel, not only for the running one (which is what we do now).
# Low-priority idea, easy to implement.
#
#
# LOG SUBMISSION
# ###########################################################################
# The logs collected and compressed in the ZIP file are kept in the system,
# but they provide valuable data to Valve in order to determine issues in the
# field, and hopefully fix them, so users are happy. Hence, kdump-steamos
# is capable of submitting logs to Valve servers, through an API. If users wish
# to disable this feature, just set LOG_SUBMISSION=0 in the config file
# "/usr/share/kdump/kdump.conf". The API is described below, but first it is
# worth mentioning some assumptions / decisions made in the log submission
# mechanism:
#
# * First of all, we attempt to verify network connectivity by pinging the
# host "steampowered.com" - quick pings (2 packets, 0.5s between each one)
# are attempted, and if after 99 such attempts the network is still
# considered unreliable, the log submission is aborted; the ZIP file is
# kept locally, of course.
#
# * The 'curl' tool is used to submit the requests to Valve servers; for
# that, some temporary files named ".curl_XXX" are saved in the kdump
# folder - mentioned in the point (3) above. These files are deleted
# if the log submission mechanism works fine, or else they're currently
# kept for debug purposes, along with a new ".curl_err" file.
#
# * It is assumed that any throttling / anti-DoS mechanism comes from the
# server side, so kdump-steamos doesn't perform any significant
# validations in this respect, only basic correctness validations.
#
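# The connectivity check from the first bullet above can be sketched as a
# bounded retry loop. This is an illustration under assumptions, not the
# package's code: the function names are hypothetical, and the probe command
# can be swapped out, which also makes the loop easy to try without a network.

```shell
#!/bin/bash
# Sketch only: retry a connectivity probe before attempting log submission.

# Default probe: 2 packets, 0.5s apart, as described above.
probe_network() {
    ping -q -c 2 -i 0.5 steampowered.com >/dev/null 2>&1
}

# wait_for_network MAX_ATTEMPTS [PROBE_CMD...]
# Returns 0 as soon as the probe succeeds, 1 after MAX_ATTEMPTS failures.
wait_for_network() {
    local max="$1"
    local attempt=1
    shift
    [ "$#" -gt 0 ] || set -- probe_network
    while [ "${attempt}" -le "${max}" ]; do
        if "$@"; then
            return 0
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Usage:
#   wait_for_network 99 || echo "network unreliable, keeping the ZIP locally"
```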
#
# => The API details: it works by a first POST request to Valve servers,
# which, on success, returns 3 main components in the response. We use
# these values to perform a PUT request with the ZIP compressed file, and
# finally a last POST request is necessary to finish the transaction.
# Below, the specific format of such requests:
#
# The first POST takes the following fields:
#
# steamid = user Steam ID, based on the latest Steam logged user;
# have_dump_file = 0/1 - should be 1 when sending a ZIP file;
# dump_file_size = the ZIP file size, in bytes;
# product = "holo" (hard-coded for now);
# build = the SteamOS build ID, from '/etc/os-release' file;
# version = running kernel version;
# platform = "linux" (hard-coded for now);
# crash_time = the timestamp (epoch) of log collection/submission;
# stack = a really concise call trace summary, only functions/addrs;
# note = summary of the dmesg crash info, specifically a full stack trace;
# format = "json" (hard-coded for now).
#
# The response of a successful POST will have multiple fields, that can
# be split in 3 categories:
#
# PUT_URL = a new URL to be used in the PUT request;
# GID = special ID used to finish the submission process in the next POST;
# header name/value pairs = multiple pairs of name/value fields used as
# headers in the PUT request.
#
# After parsing the response, we perform a PUT request to the PUT_URL, with
# the ZIP file as a "--data-binary" component and the additional headers that
# were collected in the first POST's response. Finally, we just POST the GID
# to the finish URL ("gid=GID_NUM") and the process is terminated.
#
# Notice we heavily use the 'jq' tool to parse the response, so we assume
# the response is JSON and that its format does not change over time.
#
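# The three requests described above can be sketched with 'curl' and 'jq' as
# follows. This is only an illustration of the POST / PUT / POST sequence, not
# the package's implementation. The text above does not give the endpoint URLs,
# the request encoding or the JSON layout of the response, so START_URL,
# FINISH_URL, the form-encoded bodies and the keys "put_url", "gid" and
# "headers" are all hypothetical placeholders.

```shell
#!/bin/bash
# Sketch only: the POST -> PUT -> POST flow described above.
# Hypothetical (NOT from the source): both URLs, form-encoded POST bodies and
# a JSON response shaped like
#   {"put_url": "...", "gid": "...", "headers": [{"name": "..", "value": ".."}]}

START_URL="https://example.invalid/start"     # placeholder URL
FINISH_URL="https://example.invalid/finish"   # placeholder URL

# Prints the first POST's fields, one "key=value" per line.
# Arguments: steamid, ZIP size, build ID, kernel version, epoch timestamp.
# The "stack" and "note" fields are left out of this sketch for brevity.
first_post_fields() {
    printf '%s\n' \
        "steamid=$1" "have_dump_file=1" "dump_file_size=$2" \
        "product=holo" "build=$3" "version=$4" "platform=linux" \
        "crash_time=$5" "format=json"
}

# submit_log ZIP STEAMID BUILD_ID
submit_log() {
    local zip="$1" steamid="$2" build="$3"
    local fields resp put_url gid hdrs line

    fields="$(first_post_fields "${steamid}" "$(stat -c %s "${zip}")" \
        "${build}" "$(uname -r)" "$(date +%s)")"

    # First POST: one --data-urlencode argument per field.
    set -- -s
    while IFS= read -r line; do
        set -- "$@" --data-urlencode "${line}"
    done <<EOF
${fields}
EOF
    resp="$(curl "$@" "${START_URL}")" || return 1

    # Parse the three components out of the (assumed) JSON response.
    put_url="$(printf '%s' "${resp}" | jq -r '.put_url')"
    gid="$(printf '%s' "${resp}" | jq -r '.gid')"
    hdrs="$(printf '%s' "${resp}" | jq -r '.headers[] | "\(.name): \(.value)"')"

    # PUT the ZIP file, adding one -H per header pair from the response.
    set -- -s -X PUT --data-binary "@${zip}"
    while IFS= read -r line; do
        if [ -n "${line}" ]; then
            set -- "$@" -H "${line}"
        fi
    done <<EOF
${hdrs}
EOF
    curl "$@" "${put_url}" || return 1

    # Final POST: sending the GID finishes the transaction.
    curl -s --data-urlencode "gid=${gid}" "${FINISH_URL}"
}
```
#
# Building the curl arguments with "set --" keeps each header and field as a
# separate, properly quoted argument without relying on shell arrays.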
Description
Fork of https://gitlab.freedesktop.org/gpiccoli/kdumpst that works if you use btrfs with subvolumes