
Patching the Linux kernel at run-time

10 Jul 2017
How to patch the Linux kernel with a custom mechanism and without having to reboot the system.

Download the source - kp.zip

Introduction

How many times have new functionalities been released after deep testing, just to be broken once more by the classic demo effect we all know? Coding is done by humans, and humans can fail; this is a universal truth, and if you think otherwise you are either too full of yourself or you don't have enough coding experience.

Software bugs are not the only reason for patching existing code: over time, new requirements can emerge for your software, which must now support functionalities that were not previously defined. While this is usually handled by recompiling the kernel/module and shipping it to the running machine, it can happen that the machine cannot be rebooted yet, while the fix is still required for improved/correct behavior (think of a backdoor that must be sanitized on a server).

This document presents some strategies to perform the patching and solve your problems. There is no single way to get this job done, and some solutions fit better than others under certain conditions.

So let's see one of the ways to do this...

Background

The concepts in this article are presented in a soft way, to allow every reader to understand the idea and the technology behind the magic. If you want to fully understand, use, and adapt this code so that it works on your kernel (the code presented here has been written by a human, after all), then you will need deeper knowledge of Linux internals and of the kernel in general.

The boring method: using something already done.

The Linux kernel comes with livepatch, a feature which allows you to patch your kernel live. Live patching means that the modification happens while the kernel is running. This technology grew out of Red Hat's kpatch, and internally it leverages the ftrace mechanism to reroute the execution path of kernel routines.

Now, since everything revolves around this ftrace mechanism, how does it work?

Ftrace does its job thanks to the GCC instrumentation option "-pg", which adds, at compile time, a call to the special function mcount at the beginning of every routine. As you can guess, performing an additional call for every routine is an expensive operation, since it involves pushing/popping data on the stack. For this reason, the kernel is usually compiled with the option CONFIG_DYNAMIC_FTRACE, which replaces the calls to mcount with NOP opcodes at boot time, and indexes all such points to keep track of where patching can take place.

You can examine this directly by creating a dummy application and switching this compiler option ON and OFF.
Let's take the following dummy application as an example:

#include <stdio.h>
#include <stdlib.h>

/******************************************************************************
 * Entry point.                                                               *
 ******************************************************************************/

int a() 
{
    return 0;
}

int main(int argc, char ** argv)
{
    return 0;
}
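If you want to try this yourself, the two builds can be produced and inspected as follows (a sketch; file and binary names are mine, and depending on your GCC version the instrumentation call may show up as mcount or, with -mfentry, as __fentry__):

```shell
# Recreate the dummy application from the listing above.
cat > dummy.c <<'EOF'
int a(void)    { return 0; }
int main(void) { return 0; }
EOF

# A plain build, and a build with the -pg instrumentation option.
gcc -O0 -o dummy dummy.c
gcc -O0 -pg -o dummy-pg dummy.c

# Only the instrumented binary contains calls to mcount (or __fentry__).
objdump -d dummy-pg | grep -E 'mcount|fentry' | head -n 4
```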

All you have to do is compile the application with and without the -pg flag and examine how GCC organizes the binary file. In the case of a normal compilation, without the instrumentation option, we have the following organization of the procedures:

(gdb) disas a
Dump of assembler code for function a:
   0x00000000004004ed <+0>:     push   %rbp
   0x00000000004004ee <+1>:     mov    %rsp,%rbp
   0x00000000004004f1 <+4>:     mov    $0x0,%eax
   0x00000000004004f6 <+9>:     pop    %rbp
   0x00000000004004f7 <+10>:    retq
End of assembler dump.
(gdb) disas main
Dump of assembler code for function main:
   0x0000000000400503 <+0>:     push   %rbp
   0x0000000000400504 <+1>:     mov    %rsp,%rbp
   0x0000000000400507 <+4>:     mov    %edi,-0x4(%rbp)
   0x000000000040050a <+7>:     mov    %rsi,-0x10(%rbp)
   0x000000000040050e <+11>:    mov    $0x0,%eax
   0x0000000000400513 <+16>:    pop    %rbp
   0x0000000000400514 <+17>:    retq
End of assembler dump.

If you instead compile the dummy application with the instrumentation option, what you get is something like:

(gdb) disas a
Dump of assembler code for function a:
   0x00000000004005fd <+0>:     push   %rbp
   0x00000000004005fe <+1>:     mov    %rsp,%rbp
   0x0000000000400601 <+4>:     callq  0x4004b0 <mcount@plt>
   0x0000000000400606 <+9>:     mov    $0x0,%eax
   0x000000000040060b <+14>:    pop    %rbp
   0x000000000040060c <+15>:    retq
End of assembler dump.
(gdb) disas main
Dump of assembler code for function main:
   0x000000000040061d <+0>:     push   %rbp
   0x000000000040061e <+1>:     mov    %rsp,%rbp
   0x0000000000400621 <+4>:     sub    $0x10,%rsp
   0x0000000000400625 <+8>:     callq  0x4004b0 <mcount@plt>
   0x000000000040062a <+13>:    mov    %edi,-0x4(%rbp)
   0x000000000040062d <+16>:    mov    %rsi,-0x10(%rbp)
   0x0000000000400631 <+20>:    mov    $0x0,%eax
   0x0000000000400636 <+25>:    leaveq
   0x0000000000400637 <+26>:    retq
End of assembler dump.

More information on how this works can be found on the Internet and in the Linux kernel documentation.

If you take a look at the file samples/livepatch/livepatch-sample.c in the Linux kernel source code, you can find an easy-to-understand example of how the whole mechanism works. This simple example just patches the proc entry which lets you read the arguments used to boot the operating system.

static int livepatch_cmdline_proc_show(struct seq_file *m, void *v) 
{
        seq_printf(m, "%s\n", "this has been live patched"); 
        return 0; 
}

static struct klp_func funcs[] = { 
        {
                .old_name = "cmdline_proc_show",
                .new_func = livepatch_cmdline_proc_show, 
        }, 
        { /* NULL entry */ } 
};

static struct klp_object objs[] = { 
        {         
                /* name being NULL means vmlinux */
                .funcs = funcs, 
        }, 
        { /* NULL entry */ } 
};

static struct klp_patch patch = {
        .mod = THIS_MODULE,
        .objs = objs, 
};

static int livepatch_init(void) 
{
        int ret; 
        
        ret = klp_register_patch(&patch);         

        if (ret) 
                return ret;
        
        ret = klp_enable_patch(&patch);

        if (ret) {
                WARN_ON(klp_unregister_patch(&patch)); 
                return ret;
        }

        return 0; 
}

The code above shows how the livepatch mechanism is used: the patch is contained in a kernel module which is loaded into the system. First the patch is registered with the livepatch subsystem; after that, you can enable/disable it at will. The init procedure of the module performs in fact both these steps.

The livepatch structures make it possible to perform multiple adjustments at once, all contained in a single patch to apply (because the modification can span multiple existing functions that have to be fixed). The old_name field of the klp_func structure holds the symbol name of the procedure to patch, while new_func contains the address of the procedure which will replace the old one. Note that the new procedure must have the same signature (return value and arguments) as the old one, or your system can end up with very bad stack corruption.

static void livepatch_exit(void) 
{
        WARN_ON(klp_disable_patch(&patch));
        WARN_ON(klp_unregister_patch(&patch)); 
}

The exit point of the module shows how to disable the patch and then remove it from the system.

Recap: the livepatch mechanism is a ready-to-use way to patch your system (if you know what you are doing), and relies on well-known and well-tested mechanisms such as ftrace and kernel modules. In order to have this functionality, you must compile your kernel with both these options. In today's distributions this is usually the case, since the system is going to run on a general-purpose desktop or laptop. The problem arises when you need to set up a more constrained system, where the memory footprint and other requirements force you to disable the options necessary to use this tool.

In this case you "need to do it yourself". While going down this road can be really dangerous and may result in reinventing the wheel, for academic purposes it is really interesting to deal with such complexity in order to understand in depth how to make it work. Use a custom-made mechanism only if you are forced to.

The interesting method: do it yourself!

Performing the patching requires solving a series of challenges that can be overwhelming if taken all at once. So let's split the task into a series of smaller problems to be solved one by one. At the end, all the solutions can be merged into one big API-like mechanism that does the entire job one step after the other.

We assume now that:

  • The kernel to patch has not been compiled with the livepatch mechanism.
  • The kernel to patch has not been compiled with CONFIG_DYNAMIC_FTRACE or other ftrace symbols.
  • The kernel to patch has no idea that it is going to be patched.

Patching the kernel

For patching the kernel, operations like memcpy, memset, and others of the same family are enough to get the job done (thanks to architecture abstractions), but a problem arises when you try to use them on kernel code. Just like normal applications, the kernel resides in memory, and its code pages are protected against writes. Code normally must not change itself, just as data usually must not be executed. This means that, apparently, it is not possible to change the execution flow inside the kernel area.

#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/string.h>

#include "utils.h"

int target()
{
    return 1;
}

static int __init kp_init(void)
{
    char * m = ((char *)target) + 0xa;

    /* Read our target procedure in order to evaluate the state before
     * patching it's content.
     */
    kp_print(target, 16);

    /* Try to modify the return value to 2.
     *
     * NOTE:
     * This will cause a crash!
     */
    *m = 2;

    return 0;
}

static void __exit kp_exit(void)
{
    return;
}

module_init(kp_init);
module_exit(kp_exit);

MODULE_LICENSE("GPL");

The kernel module above (read-only.c) tries to patch the return value of the target procedure. I already disassembled the procedure and found out which part must be changed in order to turn its return value from 1 into 2 (the 10th byte after the procedure start). Running it as-is will just crash your machine, or in the best case the module will fail to load, leaving behind a graceful trace. If you intend to give it a try, run it only on a testing machine, since a corruption/crash of the kernel can result in data loss on some systems.

root@debian:/home/user/kp# insmod ro.ko
Killed
root@debian:/home/user/kp# dmesg

[....].

[  122.661372] Dumping memory at ffffffffc05b7000:
[  122.661373]        00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15

[  122.661378] 0000   0f 1f 44 00 00 55 48 89 e5 b8 01 00 00 00 5d c3
[  122.661393] BUG: unable to handle kernel paging request at ffffffffc05b700a
[  122.661414] IP: [<ffffffffc02fb025>] kp_init+0x25/0x1000 [ro]
[  122.661430] PGD 4ce0a067
[  122.661436] PUD 4ce0c067
[  122.661461] PMD 36f81067
[  122.661465] PTE 7b63c161

[  122.661476] Oops: 0003 [#1] SMP
[  122.661484] Modules linked in: ro(O+) vboxsf(O) vboxvideo(O) vboxguest(O) joydev ppdev snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm sg snd_timer snd soundcore intel_powerclamp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_rapl_perf pcspkr evdev serio_raw ttm drm_kms_helper drm battery parport_pc parport video ac button acpi_cpufreq ip_tables x_tables autofs4 ext4 crc16 jbd2 crc32c_generic fscrypto ecb mbcache hid_generic usbhid hid sr_mod cdrom sd_mod ata_generic crc32c_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd psmouse ata_piix ahci libahci ohci_pci ehci_pci ohci_hcd ehci_hcd usbcore usb_common i2c_piix4 e1000 libata scsi_mod
[  122.661661] CPU: 0 PID: 3691 Comm: insmod Tainted: G        W  O    4.9.0-3-amd64 #1 Debian 4.9.30-2+deb9u2
[  122.661682] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  122.661699] task: ffff9dfe7b2f8040 task.stack: ffffc0cac1458000
[  122.661712] RIP: 0010:[<ffffffffc02fb025>]  [<ffffffffc02fb025>] kp_init+0x25/0x1000 [ro]
[  122.661731] RSP: 0018:ffffc0cac145bcc8  EFLAGS: 00010282
[  122.661742] RAX: ffffffffc05b700a RBX: 0000000000000000 RCX: 0000000000000006
[  122.661757] RDX: 0000000000000000 RSI: 0000000000000297 RDI: ffff9dfe7fc0de20
[  122.661772] RBP: ffffc0cac145bcd0 R08: 0000000000000001 R09: 0000000000008cc4
[  122.661787] R10: ffffffffacb13540 R11: 0000000000000001 R12: ffff9dfe369c2760
[  122.661810] R13: ffff9dfe7a446080 R14: ffffffffc05b9000 R15: ffffffffc05b9050
[  122.661826] FS:  00007fadb6903700(0000) GS:ffff9dfe7fc00000(0000) knlGS:0000000000000000
[  122.661843] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  122.661855] CR2: ffffffffc05b700a CR3: 000000007cb7c000 CR4: 00000000000406f0
[  122.661873] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  122.661888] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  122.661902] Stack:
[  122.661908]  ffffffffc05b700a ffffffffc02fb000 ffffffffabe0218b ffff9dfe7ffeb5c0
[  122.661926]  000000000000001f 3dd28ef37e71a3ae 0000000000000286 ffff9dfe79b99400
[  122.661945]  ffffffffabfc3b5d 3dd28ef37e71a3ae ffffffffc05b9000 3dd28ef37e71a3ae
[  122.661963] Call Trace:
[  122.661971]  [<ffffffffc05b700a>] ? target+0xa/0x10 [ro]
[  122.661983]  [<ffffffffc02fb000>] ? 0xffffffffc02fb000
[  122.662418]  [<ffffffffabe0218b>] ? do_one_initcall+0x4b/0x180
[  122.662837]  [<ffffffffabfc3b5d>] ? __vunmap+0x6d/0xc0
[  122.663240]  [<ffffffffabf7a58c>] ? do_init_module+0x5b/0x1ed
[  122.663650]  [<ffffffffabeff8d2>] ? load_module+0x2602/0x2a50
[  122.664061]  [<ffffffffabefc030>] ? __symbol_put+0x60/0x60
[  122.664468]  [<ffffffffabefff66>] ? SYSC_finit_module+0xc6/0xf0
[  122.664903]  [<ffffffffac40627b>] ? system_call_fast_compare_end+0xc/0x9b
[  122.665377] Code: <c6> 00 02 b8 00 00 00 00 c9 c3 00 00 00 00 00 00 00 00 00 00 00 00
[  122.665789] RIP  [<ffffffffc02fb025>] kp_init+0x25/0x1000 [ro]
[  122.666144]  RSP <ffffc0cac145bcc8>
[  122.666485] CR2: ffffffffc05b700a
[  122.666815] fbcon_switch: detected unhandled fb_set_par error, error code -16
[  122.667695] fbcon_switch: detected unhandled fb_set_par error, error code -16
[  122.668571] ---[ end trace 4d41b1ec10988ac5 ]---

As you can see in this kernel trace, the error is triggered when a write paging request is issued at 0xffffffffc05b700a (the result of 0xffffffffc05b7000 + 0xa), which is exactly where our kp_init is trying to replace the return value.

What we need to do now is disable the write protection of those memory pages, apply the changes, and then restore the previous access flags. By digging into the Linux kernel memory subsystem, you will eventually find out that there is a set of routines that can help you do this without messing directly with pages or other raw details. Such routines are:

  • set_memory_rw, which turns off the write protection of a memory range.
  • set_memory_ro, which protects the memory again against unwanted writes.

Unfortunately, such procedures are not exported for use by kernel modules.

Using kernel private procedures from kernel modules

So, is a "not exported procedure" a blocker for a kernel hacker? Of course not. :-)

To get the job done, we just need to know where in the kernel such a procedure is located; having the source code available (open source) lets us know the procedure's signature (return value and required arguments). If we manage to obtain the address of the procedure, we are done. As always, there are multiple ways to get around this obstacle, and I will show just two of them here, depending on the kernel configuration.

If your kernel has been compiled with the CONFIG_KALLSYMS symbol, you just have to use this kernel subsystem to obtain a pointer to the procedure. The API is quite simple and just needs a string with the name of the procedure to look up.

unsigned long rw = kallsyms_lookup_name("set_memory_rw");

The result of this operation is the address at which the procedure is located. As a final step, we just need to assign it to a function pointer to get the job done.

int (* smem_rw) (unsigned long addr, int pages);
int (* smem_ro) (unsigned long addr, int pages);

static int __init kp_init(void)
{
    unsigned long rw = kallsyms_lookup_name("set_memory_rw");
    unsigned long ro = kallsyms_lookup_name("set_memory_ro");

    if(!rw || !ro) {
        printk(KERN_INFO "Cannot resolve set_memory_* procedures!\n");
        return 0;
    }

    smem_rw = (void *)rw;
    smem_ro = (void *)ro;

    [...]
}

It can of course happen that CONFIG_KALLSYMS was not set at kernel compile time; in this case the solution is a little more complicated, but still possible. All the symbols of procedures and variables shared within the kernel are actually already present on your machine, readable by anyone (even non-root users). The file /boot/System.map-* (where the last part is the version of the target kernel) lists all the important symbols of the kernel.

root@debian:/home/user/kp# cat /boot/System.map-4.9.0-3-amd64
0000000000000000 D __per_cpu_start
0000000000000000 D irq_stack_union
0000000000000000 A xen_irq_disable_direct_reloc
0000000000000000 A xen_save_fl_direct_reloc
00000000000001c5 A kexec_control_code_size
0000000000004000 d exception_stacks
0000000000009000 D gdt_page
000000000000a000 D espfix_waddr
000000000000a008 D espfix_stack
000000000000a040 D cpu_info
000000000000a140 D cpu_llc_shared_map
000000000000a180 D cpu_core_map

...

ffffffff810635c0 T _set_memory_wc
ffffffff81063650 T set_memory_wc
ffffffff81063710 T _set_memory_wt
ffffffff81063750 T set_memory_wt
ffffffff81063810 T _set_memory_wb
ffffffff81063850 T set_memory_ro
ffffffff81063890 T set_memory_rw
ffffffff810638d0 T set_memory_np
ffffffff81063910 T set_memory_4k
ffffffff81063950 T set_pages_ro
ffffffff810639c0 T set_pages_rw

What is necessary here is just to pass this information from user space to kernel space, for example by using a simple misc device or sysfs entry to share address and name.

Once you have located the necessary procedures within the running kernel, you just need to call them with the proper arguments and the job is done; you will have your page accessible for writing, and the possibility to switch it back to read-only mode too. The working module can be found in the source file simple-patch-kallsyms.c, which is built as the spk.ko module. It locates the memory location where we want to apply the patch, makes that page writable, applies the simple patch, and turns the permissions back to their original value. A record of the modifications can be found in the kernel log, in the following format:

[   35.550397] Target procedure at ffffffffc058202d returned 1
[   35.550398] Dumping memory at ffffffffc058202d:
[   35.550399]        00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15

[   35.550404] 0000   0f 1f 44 00 00 55 48 89 e5 b8 01 00 00 00 5d c3
[   35.550421] Target procedure at ffffffffc058202d returned 2
[   35.550421] Dumping memory at ffffffffc058202d:
[   35.550422]        00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15

[   35.550426] 0000   0f 1f 44 00 00 55 48 89 e5 b8 02 00 00 00 5d c3

Notice that before the actual patching, the target procedure located at ffffffffc058202d was returning 1, and the memory dump shows this: after the usual procedure prologue (opcodes 55 48 89 e5), the CPU moves the value 1 into the EAX register (opcodes b8 01 00 00 00) and then returns to the caller (c3). In the second memory dump you can see that the modification has been applied, and the same procedure now returns a different value.

Recap: we have now overcome the restriction of not being able to use private procedures from within kernel modules. After that, the spk kernel module showed us that modifying the "text" part of the kernel is possible and no longer produces an error. But modifying code like this is not enough, since it is really simple and relies on deep knowledge of how the procedure is translated into assembly code. What we need is to replace a set of functionalities with another one, which can have a complex organization that calls other routines: basically, replacing a procedure with an improved version.

Patching existing procedures with custom ones.

Patching a procedure is nothing too different from, or much more complicated than, what we have already seen; the idea is that the old references to the procedure to patch remain where they are, but the body of the procedure is changed in order to route the execution path somewhere else. This means that when you invoke the patched procedure, you end up executing the new one. But how do you make it work like that?

This part is related to the architecture the kernel is running on, since you have to modify code at that low level. The general idea is to inject a jump at the procedure start (technically, a trampoline) which points to where your new patch is located.

The jump operation I will use is the E9 opcode, which requires an additional 32-bit immediate value: the relative offset to jump to, counted from the first byte after the 5-byte instruction. The value itself is a signed double word, which means you have the possibility to jump 2,147,483,647 bytes forward and 2,147,483,648 bytes backward; this should be more than enough for the current layout of Linux kernel memory.

Injecting such a jump is really easy: first we need to compute how much space there is between our new patch and the old procedure, then we need to put the opcode in place, with the immediate value just after it. The procedure-patch.c file shows how the trick is done:

/* Size of the jmp opcode with its displacement value. */
#define KP_X86_64_JMP_SIZE    5

/*
 * Procedure which will patch the original one:
 */

int do_something_new(void)
{
    printk(KERN_INFO "do_something_new() invoked\n");

    return 1;
}

/*
 * Original procedure:
 */

int do_something(void)
{
    printk(KERN_INFO "do_something() invoked\n");

    return 0;
}

/*
 * Entry/exit points:
 */

static int __init kp_init(void)
{
    int32_t new =
        (int32_t)((unsigned long)do_something_new -
        (unsigned long)do_something);

    char * ptr = (char *)do_something;

    /* Initialize kp subsystem. */
    if(kp_resolve_procedures()) {
        return -EFAULT;
    }

    /* Original call... */
    kp_print(do_something, 16);
    do_something();

/* ---- HERE MEMORY is WRITABLE --------------------------------------------- */
    kp_memrw(do_something, KP_X86_64_JMP_SIZE);
    /* Jump to... */
    ptr[0] = 0xe9;
    /* ... here. */
    memcpy(ptr + 1, &new, sizeof(int32_t));
    kp_memro(do_something, KP_X86_64_JMP_SIZE);
/* -------------------------------------------------------------------------- */

    kp_print(do_something, 16);
    do_something();

    return 0;
}

Resolving the procedure addresses and switching memory to writable and back to read-only has now been moved into a utility source file (utils.c). As you can see, with a simple cast to a byte pointer (char *), the E9 jump is injected as the first operation, while the immediate value is put just after it using the memcpy procedure. What we end up with when the module is loaded is something like:

[   48.852597] Dumping memory at ffffffffc04c7049:
[   48.852598]        00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15

[   48.852603] 0000   0f 1f 44 00 00 55 48 89 e5 48 c7 c7 46 80 4c c0
[   48.852608] do_something() invoked
[   48.852621] Dumping memory at ffffffffc04c7049:
[   48.852621]        00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15

[   48.852626] 0000   e9 e4 ff ff ff 55 48 89 e5 48 c7 c7 46 80 4c c0
[   48.852631] do_something_new() invoked

Invoking the do_something procedure for the first time results in a normal execution of the procedure, while after the patch the execution path is redirected to the new function. In the second dump you can see the newly injected jump (e9 e4 ff ff ff) with the offset to the new entry point.

I expect at this point that some of you will have noticed that, as I previously said, the procedure prologue is identified by the opcodes 55 48 89 e5. So: what are the first 5 bytes always shown in the kernel log? Well, the pattern 0f 1f 44 00 00 is basically a 5-byte-long NOP (do nothing), which is exactly the space needed to host a jump. These opcodes are there because I am too lazy to recompile a whole kernel and I'm using a standard one for testing, which means that (even though I'm not using it) my kernel actually does support live patching.

So let's see what happens if I move the patch to the place where it would go if no ftrace mechanism were included in the kernel. By modifying the patch to skip the first 5 bytes, I end up applying the modification straight onto the procedure prologue:

int32_t new =
    (int32_t)((unsigned long)do_something_new -
    ((unsigned long)do_something + 5));

[...]

/* Jump to... */
ptr[5] = 0xe9;
/* ... here. */
memcpy(ptr + 1 + 5, &new, sizeof(int32_t));

And this will result in the following log:

[   53.351702] Dumping memory at ffffffffc03e1049:
[   53.351703]        00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15

[   53.351710] 0000   0f 1f 44 00 00 55 48 89 e5 48 c7 c7 46 20 3e c0
[   53.351716] do_something() invoked
[   53.351729] Dumping memory at ffffffffc03e1049:
[   53.351729]        00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15

[   53.351735] 0000   0f 1f 44 00 00 e9 df ff ff ff c7 c7 46 20 3e c0
[   53.351741] do_something_new() invoked

The drawback of this operation is that the patched procedure's prologue is lost (the prologue plus a piece of the following opcode, 55 48 89 e5 48) and cannot be restored to its original values. This problem can of course be solved if, just before patching, we save that data in another location bound to the patched address (a mapping like ffffffffc03e1049 --> 55 48 89 e5 48). This way, if we want to roll back the modification, we just need to write those bytes back in place.

Recap: not only can we now write to kernel code, we can actually inject a trampoline which moves the execution flow to an arbitrary position in memory. This can be done regardless of whether ftrace is in place or not; the new mechanism is unbound from it and under our total control. This operation makes the patched procedure unusable on its own, but by saving the overwritten opcodes just before applying the patch, we can easily roll back the modification and reset the memory to its original state.

Performing the patching in a transparent way.

Problem solved, case closed? Not at all, we have just started. The next problem is actually nastier and more difficult to handle, and comes from the fact that the kernel to patch has no idea that it is going to be patched.

A lot of things happen in the kernel: interrupts can be raised by the hardware to serve requests, kernel threads get preempted, and user-space system calls need to be served. In this complex scenario, how can you be sure you are not patching a procedure while, in the meantime, another thread is trying to run it? Can you always assume that the patcher will not be interrupted during its operations (for example, halfway through the write)? If such an event occurs, your kernel is likely to crash, in the best case. In the worst case you lose alignment with the code and perform operations never intended for that data, which again results in random crashes that are way more difficult to debug.

You need a synchronization mechanism that prevents the CPUs from executing that code until it has been patched. You obviously can't use any spinlock, semaphore, or atomic operation, since that would mean modifying the kernel code in advance to support something like this:

if (spin_lock(&something_lock)) {
    do_something();
}

This would mean that the kernel expects to be patched, and we said it won't.

The way I decided to go is to put the machine's set of online CPUs into a controlled deadlock.

This raw method allows us to take control of every computation core of the machine, thus blocking every operation for a limited amount of time. This lock must not be held for long; during my preliminary tests I locked my PC for 5 seconds (a lot of time for a kernel), and the Linux scheduler detected the out-of-order time and started to complain (entering RT throttling). Despite the long lock time, the only visible problem was that the PC froze completely, then resumed after some time. No additional drawbacks were registered on that machine during the test. Despite this initially successful run, it is not necessary to lock for such a long time, and shorter periods make the offline time of the kernel almost impossible to detect.

The code for this transparent patching can be found in the atomic-procedure-patch.c source file.

struct my_task {
    struct task_struct * t;
    unsigned int cpu;
};

/* Number of detected CPUs. */
int nof_cpus = 0;

/* IDs of the CPUs. */
int cpu_ids[KP_MAX_CPUS] = {-1};

/* State of the CPUs;
 * no locking required since each CPU only touches its own slot.
 */
int cpu_s[KP_MAX_CPUS]   = {0}; /* 0 - unlocked, 1 - locked. */

/* Kernel threads info. */
struct my_task cpu_tasks[KP_MAX_CPUS] = {{0}};

/* In case of error kill anyone. */
int kp_die = 0;
/* The elected patching CPU. */
int kp_patcher = -1;
/* Switching this to 1 will triggers our threads. */
int kp_proceed = 0;

First, we set up some global variables which will be filled during the initialization steps: the number of online CPUs, their IDs, their states, and some additional flags. A single CPU among the detected ones will be assigned as the patcher, i.e., the actor which will perform the modifications in memory (by default the first detected CPU).

static int __init kp_init(void)
{
    int i;
    int cpu;

    /* Initialize kp subsystem. */
    if(kp_resolve_procedures()) {
        return -EFAULT;
    }

    /* Original call... */
    kp_print(do_something, 16);
    do_something();

    for_each_cpu(cpu, cpu_online_mask) {
        if(nof_cpus >= KP_MAX_CPUS) {
            printk(KERN_ERR "Too many CPUs to handle!\n");
            return -1;
        }

        cpu_ids[nof_cpus] = cpu;
        nof_cpus++;

        printk(KERN_INFO "CPU %u is online...\n", cpu);
    }

    /* The first one is our patcher. */
    kp_patcher = cpu_ids[0];

    for(i = 0; i < nof_cpus; i++) {
        cpu_tasks[i].cpu = cpu_ids[i];
        cpu_tasks[i].t = kthread_create(
            kp_thread, &cpu_tasks[i], "kp%d", cpu_ids[i]);

        if(IS_ERR(cpu_tasks[i].t)) {
            printk(KERN_ERR "Error while starting %d.\n",
                cpu_ids[i]);

            /* Kill the threads which already started. */
            kp_die = 1;
            return -1;
        }

        /* Bind the task to that CPU... */
        kthread_bind(cpu_tasks[i].t, cpu_tasks[i].cpu);
        wake_up_process(cpu_tasks[i].t);
    }

    /* Wait a "little". */
    schedule();

    kp_proceed = 1;

    return 0;
}

The initialization procedure now fills almost all the global variables at once: first it detects the CPUs and saves their IDs, then it promotes one of them to patcher (by default the first one). After this, it starts a thread for each CPU, which will run only on that core (kthread_bind does the job). Finally it schedules itself, to wait a small amount of time, and then orders the threads to proceed with the synchronization and patching steps.

int kp_thread(void *arg)
{
    int i;
    int p;

    struct my_task * t = (struct my_task *)arg;

    int32_t new =
        (int32_t)((unsigned long)do_something_new -
        (unsigned long)do_something);

    unsigned char * ptr = (unsigned char *)do_something;

    printk(KERN_INFO "Thread of CPU %u running...\n", t->cpu);

wait:
    /* Wait until someone (kp_init) tells us to proceed with the job. */
    while(!kp_proceed) {
        if(kp_die) {
            /* Kill this thread... */
            goto out;
        }

        /* Be scheduled. */
        schedule();
    }

    if(kp_patcher < 0) {
        /* No patching CPU elected; this is CRITICAL.
        goto wait;
         */
        printk(KERN_CRIT "No patching CPU elected (%d)!\n", kp_patcher);
        goto out;
    }

/* ---- CPU LOCKED ---------------------------------------------------------- */
    get_cpu();
    cpu_s[t->cpu] = 1;

    do {
        p = 1;

        /* The patching process is up to us! */
        if(t->cpu == kp_patcher) {
            for(i = 0; i < nof_cpus; i++) {
                p &= cpu_s[i];
            }

            /* If 1, everyone is locked. */
            if(p) {
                kp_memrw(do_something, 5);
                ptr[0] = 0xe9;
                memcpy(ptr + 1, &new, sizeof(int32_t));
                kp_memro(do_something, 5);

                kp_proceed = 0;
            }
        }

        if(!kp_proceed) {
            break;
        }
    } while(1);

    cpu_s[t->cpu] = 0;
    put_cpu();
/* -------------------------------------------------------------------------- */

    /* Back to the waiting state where you wait for another patch time.
    goto wait;
     */

    /* Every CPU will call do_something now.*/
    do_something();

out:
    printk(KERN_INFO "Thread of CPU %u died...\n", t->cpu);

    if(t->cpu == kp_patcher) {
        kp_print(do_something, 16);
    }

    return 0;
}

Finally, each thread (one per CPU) runs the kp_thread routine shown above. The logic makes the thread wait until it is ordered to proceed (via the kp_proceed variable); if something goes wrong during this wait, a die order is issued and all the threads immediately terminate. When the proceed order arrives, a last check makes sure that a patcher has actually been elected, and then the thread locks its CPU. As CPUs get locked, contention inside the kernel rises (the same amount of work now shares fewer cores), which ensures that the remaining CPUs will quickly get locked one after the other.

Once in the locked state, each thread sets a dedicated byte to 1 to signal that its CPU is locked; the patcher alone iterates over the array to make sure that nobody else is running kernel code. When that is the case, the patch is applied and the patcher signals all the others to release their cores, giving them back to the kernel.
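The patch itself is just a 5-byte near jump (opcode 0xe9 followed by a 32-bit displacement). As a userspace sketch, not the module itself, this is how such a trampoline is assembled; the canonical displacement is relative to the end of the jmp instruction, i.e. target - (source + 5), and the addresses used here are made up for illustration:

```c
#include <stdint.h>
#include <string.h>

/* Assemble the 5-byte "jmp rel32" that the patcher writes over the
 * first bytes of the victim procedure. */
static void build_jmp(uint8_t *site, uint64_t source, uint64_t target)
{
    /* rel32 is measured from the end of the 5-byte instruction. */
    int32_t disp = (int32_t)(target - (source + 5));

    site[0] = 0xe9;                        /* near jmp opcode */
    memcpy(site + 1, &disp, sizeof(disp)); /* little-endian rel32 */
}
```

Note that the article's snippet instead uses the raw distance do_something_new - do_something as displacement; since the functions in the dumps begin with a 5-byte ftrace nop (the 0f 1f 44 00 00 visible at offset 0), the jump then lands just past the callee's nop, which is harmless here.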

The logic sits in a do-while loop so that a proper exit condition can be added for the extreme case in which the wait takes too long (typically because some CPU does not get locked for some reason). For the moment I used "1", which means forever, but this is not a proper way to handle such an event; a timeout mechanism should be inserted here to avoid an infinite locking loop. Unfortunately, procedures like getnstimeofday are of no use here, since all the CPUs are locked and cannot process interrupts coming from timers and high-resolution timers.
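Under that constraint (no timer interrupts while every CPU is locked), one possible bound is a raw spin count, or alternatively a TSC deadline read with rdtsc. A minimal sketch, where KP_SPIN_LIMIT and the helper name are hypothetical:

```c
/* Hypothetical tuning knob: how many spins before giving up. */
#define KP_SPIN_LIMIT 1000000UL

/* Spin until *proceed drops to 0; return 0 on success, -1 on timeout. */
static int kp_wait_patch(volatile int *proceed)
{
    unsigned long spins = 0;

    while (*proceed) {
        if (++spins > KP_SPIN_LIMIT)
            return -1; /* give up: the caller should release its CPU */
    }

    return 0;
}
```

In the real kp_thread the patcher would also have to reset kp_proceed on failure, so that the other cores unwind instead of spinning forever.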

At the end, each thread runs our do_something procedure to see what happened and whether the patch has been applied, resulting in the following log:

[  349.718009] Dumping memory at ffffffffc04c30af:
[  349.718010]        00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15

[  349.718015] 0000   0f 1f 44 00 00 55 48 89 e5 48 c7 c7 fe 40 4c c0
[  349.718020] do_something() invoked
[  349.718021] CPU 0 is online...
[  349.718561] Thread of CPU 0 running...
[  349.718570] do_something_new() invoked
[  349.718571] Thread of CPU 0 died...
[  349.718571] Dumping memory at ffffffffc04c30af:
[  349.718572]        00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15

[  349.718577] 0000   e9 e4 ff ff ff 55 48 89 e5 48 c7 c7 fe 40 4c c0

On a multi-core VM we get, instead:

[   39.865623] Dumping memory at ffffffffc075d094:
[   39.865624]        00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15

[   39.865630] 0000   0f 1f 44 00 00 55 48 89 e5 48 c7 c7 fe e0 75 c0
[   39.865635] do_something() invoked
[   39.865636] CPU 0 is online...
[   39.865636] CPU 1 is online...
[   39.865637] CPU 2 is online...
[   39.865637] CPU 3 is online...
[   39.865854] Thread of CPU 0 running...
[   39.865999] Thread of CPU 1 running...
[   39.866116] Thread of CPU 2 running...
[   39.866488] Thread of CPU 3 running...
[   39.866563] do_something_new() invoked
[   39.866563] do_something_new() invoked
[   39.866564] do_something_new() invoked
[   39.866564] do_something_new() invoked
[   39.866565] Thread of CPU 2 died...
[   39.866565] Thread of CPU 1 died...
[   39.866566] Thread of CPU 3 died...
[   39.866566] Thread of CPU 0 died...
[   39.866567] Dumping memory at ffffffffc075d094:
[   39.866567]        00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15

[   39.866588] 0000   e9 e4 ff ff ff 55 48 89 e5 48 c7 c7 fe e0 75 c0

This code is of course preliminary and should be polished further before use in more corporate-like systems, but it is functional enough to get the job done.

Summary: after overcoming all the preliminary problems, we finally reached a mechanism that allows us to patch a procedure with another one by injecting a trampoline that redirects the execution flow. To avoid problems with SMP architectures and preemption, the patching operation is atomic across all the CPUs: nobody runs that procedure, or kernel code in general, while the quick patching process is ongoing. This guarantees that no crashes will occur while the memory is being written with the new values.

History

07/06/2017 - Initial release of the article
07/11/2017 - Added direct link to download the sources

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here