This patch main goal is to "un-Debianize" the configuration file for kdump-steamos - thanks Emil (@xexaxo) for the discussions; it is with a bit of a heavy heart I do that, but let's comply with the modern distros ;-) We hereby put the config file in a more standard path: /usr/share. Usually users could override that with /etc/ file, but not in this case, or at least, not for now. Kdump/pstore is expected to work quietly, with no users' interference. Advanced users might want to play with the configs though; and those can just go ahead and edit the /usr/share/kdump/kdump.conf - it's all documented in the README. In the future we can improve that by having the override mechanism with the /etc file, let's see if we have a demand for that. Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
198 lines
10 KiB
Markdown
198 lines
10 KiB
Markdown
```
|
|
# SPDX-License-Identifier: LGPL-2.1+
|
|
#
|
|
# Copyright (c) 2021 Valve.
|
|
#
|
|
# Maintained by Guilherme G. Piccoli <gpiccoli@igalia.com>
|
|
#
|
|
#
|
|
# ###########################################################################
|
|
# ############################ SteamOS Kdump ##############################
|
|
# ###########################################################################
|
|
#
|
|
#
|
|
# This is the SteamOS Kdump/Pstore infrastructure; the goal is to collect
|
|
# data whenever a kernel crash is detected. There is a lightweight
|
|
# collection, that only grabs dmesg, and a more complete setting to grab the
|
|
# whole (compressed) vmcore. See the DETAILS section below for more info.
|
|
#
|
|
#
|
|
# ############################ HOW-TO USE IT ##############################
|
|
#
|
|
#
|
|
# 1. Install the package with pacman if not available in your image; to check
|
|
# if it's already installed look the pacman installed package list. Also, be
|
|
# sure the systemd service was properly loaded by checking
|
|
# 'systemctl status kdump-steamos.service'.
|
|
#
|
|
# 2. In a crash event, the dmesg log is collected, and by default this happens
|
|
# via the Pstore mechanism, i.e., no extra memory should be reserved and no
|
|
# GRUB change is required. If 'lsmod' shows "ramoops", then Pstore is in use.
|
|
# Besides the dmesg with some extra information (like tasks running, memory
|
|
# usage on crash, etc), more logs are collected like the image build version,
|
|
# running kernel version and dmidecode.
|
|
#
|
|
# 3. The logs are stored in a ZIP file at "/home/.steamos/offload/var/kdump/";
|
|
# if this ZIP file was successfully submitted to Valve servers, this file is
|
|
# then moved into the sub-folder "sent_logs/". This file is named as:
|
|
# "steamos-SERIAL-STEAM_USER.timestamp.zip", where SERIAL is the machine
|
|
# serial (from dmidecode), STEAM_USER is the Steam account name (based on the
|
|
# last logged Steam user) and timestamp tz is UTC.
|
|
#
|
|
# 4. (IMPORTANT) Please, test the infrastructure in order to see if a dummy
|
|
# crash log is collected before using it to try debugging complex issues.
|
|
# In order to do that, login to a shell and execute, as root user:
|
|
# 'echo 1 > /proc/sys/kernel/sysrq ; echo c > /proc/sysrq-trigger'
|
|
#
|
|
# This action will trigger a dummy crash and reboot the system; check if
|
|
# there is a ZIP file with the crash logs in the directory described in (3).
|
|
#
|
|
# 5. Some tunings are available at "/usr/share/kdump/kdump.conf"; e.g. users
|
|
# can choose Kdump instead of Pstore (USE_PSTORE_RAM), and if using Kdump,
|
|
# collect the full vmcore (FULL_COREDUMP). The vmcore is not stored in the
|
|
# ZIP file, but it's saved in "/home/.steamos/offload/var/kdump/crash/".
|
|
# NOTICE that, if Kdump is used instead of Pstore (either per user's choice
|
|
# or due to some failure in Pstore), a reboot is necessary before kdump is
|
|
# usable, in order to effectively reserve crashkernel memory.
|
|
#
|
|
# 6. Error and succeeding messages are sent to systemd journal, so running
|
|
# 'journalctl | grep kdump' would hopefully bring some information. Also,
|
|
# the ZIP file collected is automatically submitted to Valve servers (unless
|
|
# the feature is disabled by the user); see below under DETAILS/LOG SUBMISSION
|
|
# for API details, decisions made, how to disable the feature, etc.
|
|
#
|
|
#
|
|
# ############################## DETAILS ##################################
|
|
# CAVEATS / INSTRUCTIONS
|
|
# ###########################################################################
|
|
# (a) We automatically edit GRUB config in case Pstore fails or if the user's
|
|
# choice is to use Kdump. But it requires one reboot in order the crashkernel
|
|
# memory is effectively reserved by kernel.
|
|
#
|
|
# In case Kdump is used, the crashkernel necessary memory was empirically
|
|
# determined; setting 144M wasn't enough, 160M is unstable, so 192M seems
|
|
# good enough. This amount might change in future kernel versions, requiring
|
|
# tests using the approach suggested in the step (4) above.
|
|
#
|
|
# (b) The kdump-steamos package requires a RW rootfs in case it's not currently
|
|
# embedded in your image. Users can make use of 'tune2fs' or 'steamos-readonly'
|
|
# in order to make the rootfs RW, since it's RO by default. Also, we assume the
|
|
# nvme partitioning scheme is default across all versions (A/B, nvme0n1p4 / p5
|
|
# are the root ones, etc) and didn't change with new updates, for example. Both
|
|
# Kdump and Pstore facilities relies in mounting partitions.
|
|
#
|
|
# (c) Due to a post-transaction hook exec'ed by libalpm (90-dracut-install.hook)
|
|
# unfortunately after installing the kdump-steamos package *all* initramfs
|
|
# images are recreated - this is not necessary, we're thinking how to prevent
|
|
# that, but for now be prepared: the installation take some (long) minutes only
|
|
# due to that...
|
|
#
|
|
# (d) Unfortunately makedumpfile from Arch Linux is not available on official
|
|
# repos, only in AUR. But it is available on Holo, so we make use of that.
|
|
# Also, a discussion was started to get it included on official repos:
|
|
# https://lists.archlinux.org/pipermail/aur-general/2022-January/036767.html
|
|
# https://aur.archlinux.org/packages/makedumpfile/#comment-843853
|
|
#
|
|
# (e) NVMe multipathing introduced a "randomness" level to device naming on
|
|
# Linux, so "nvme0n1" could be "nvme1n1" in some boots, if we have more than
|
|
# one NVMe device. There's a kernel parameter to avoid that
|
|
# ("nvme_core.multipath=0"). So, since we rely in getting the NVMe device name
|
|
# to be used in kdump during the regular boot process, we could in theory have
|
|
# different names between regular kernel boot and the kdump one, hence causing
|
|
# a failure in kdump collection. But this is pretty much safe now since we
|
|
# don't have multiple NVMe devices, also we could disable multipath in kernel
|
|
# config (CONFIG_NVME_MULTIPATH) or use the above cmdline.
|
|
#
|
|
#
|
|
# TODOs
|
|
# ###########################################################################
|
|
# * Would be interesting to have a clean-up mechanism, to keep up to N most
|
|
# recent ZIP log files, instead of keeping all of them forever.
|
|
#
|
|
# * Hopefully we can fix/prevent the unnecessary re-creation of all initramfs
|
|
# images - it happens due to our package installing files on directory
|
|
# "/usr/lib/dracut/modules.d" which triggers the unfortunate initramfs rebuild.
|
|
#
|
|
# * VDF parsing would benefit from some improvement, it's at least "fragile"
|
|
# for now, to be generous...but that seems a bit complicated.
|
|
#
|
|
# * Pstore ramoops back-end has some limitations that we're discussing with
|
|
# the kernel community - right now we can only collect ONE dmesg and its
|
|
# size is truncated on "record_size" bytes, not allowing a file split like
|
|
# efi-pstore; thankfully we still can collect 2MiB dmesg, but hopefully we can
|
|
# improve that upstream.
|
|
#
|
|
# * Add a more reliable reboot mechanism - we had seen issues in the past
|
|
# with "reboot -f", and relying in sysrq reboot as a quirk managed to be a safe
|
|
# option, so this is something to think about. Should be easy to implement.
|
|
#
|
|
# * Maybe a good idea would be to allow creating the minimum image for any
|
|
# specified kernel, not only for the running one (which is what we do now).
|
|
# Low-priority idea, easy to implement.
|
|
#
|
|
#
|
|
# LOG SUBMISSION
|
|
# ###########################################################################
|
|
# The logs collected and compressed in the ZIP file are kept in the system,
|
|
# but they provide valuable data to Valve in order to determine issue in the
|
|
# field, and hopefully fix them, so users are happy. Hence, the kdump-steamos
|
|
# is capable to submit logs to Valve servers, through an API. If users wish
|
|
# to disable this feature, just set LOG_SUBMISSION=0 in the config file
|
|
# "/usr/share/kdump/kdump.conf". Below such API is described, but first worth
|
|
# to mention some assumptions / decisions made in the log submission mechanism:
|
|
#
|
|
# * First of all, we attempt to verify network connectivity by pinging the
|
|
# URL "steampowered.com" - quick pings (2 packets, 0.5s between each one)
|
|
# are attempted, but if after 99 of such pings network is considered not
|
|
# reliable, the log submission is aborted, but the ZIP file is kept
|
|
# locally of course.
|
|
#
|
|
# * The 'curl' tool is used to submit the requests to Valve servers; for
|
|
# that, some temporary files named ".curl_XXX" are saved in the kdump
|
|
# folder - mentioned in the point (3) above. These files are deleted
|
|
# if the log submission mechanism works fine, or else they're currently
|
|
# kept for debug purposes, along with a new ".curl_err" file.
|
|
#
|
|
# * It is assumed that any throttling / anti-DoS mechanism comes from the
|
|
# server portion, so the kdump-steamos doesn't perform any significant
|
|
# validations with this respect, only basic correctness validations.
|
|
#
|
|
#
|
|
# => The API details: it works by a first POST request to Valve servers,
|
|
# which, when succeed, returns 3 main components in the response. We use
|
|
# these values to perform a PUT request with the ZIP compressed file, and
|
|
# finally a last POST request is necessary to finish the transaction.
|
|
# Below, the specific format of such requests:
|
|
#
|
|
# The first POST takes the following fields:
|
|
#
|
|
# steamid = user Steam ID, based on the latest Steam logged user;
|
|
# have_dump_file = 0/1 - should be 1 when sending a ZIP file;
|
|
# dump_file_size = the ZIP file size, in bytes;
|
|
# product = "holo" (hard-coded for now);
|
|
# build = the SteamOS build ID, from '/etc/os-release' file;
|
|
# version = running kernel version;
|
|
# platform = "linux" (hard-coded for now);
|
|
# crash_time = the timestamp (epoch) of log collection/submission;
|
|
# stack = a really concise call trace summary, only functions/addrs;
|
|
# note = summary of the dmesg crash info, specifically a full stack trace;
|
|
# format = "json" (hard-coded for now).
|
|
#
|
|
# The response of a succeeding POST will have multiple fields, that can
|
|
# be split in 3 categories:
|
|
#
|
|
# PUT_URL = a new URL to be used in the PUT request;
|
|
# GID = special ID used to finish the submission process in the next POST;
|
|
# header name/value pairs = multiple pairs of name/value fields used as
|
|
# headers in the PUT request.
|
|
#
|
|
# After parsing the response, we perform a PUT request to the PUT_URL, with
|
|
# the ZIP file as a "--data-binary" component and the additional headers that
|
|
# were collected in the first POST's response. Finally, we just POST the GID
|
|
# to the finish URL ("gid=GID_NUM") and the process is terminated.
|
|
#
|
|
# Notice we heavily use 'jq' tool to parse the JSON response, so we assume
|
|
# this format is the response one and that it's not changing over time.
|
|
#
|
|
```
|