RegreSSHion: Remote Code Execution in OpenSSH

On July 1st, 2024, the security firm Qualys published their discovery of a vulnerability within the OpenSSH daemon that allowed for unauthenticated, network-exploitable, remote code execution [1]. Because OpenSSH is a ubiquitous means of secure access to remote servers, the vulnerability could give attackers unfettered superuser access to a majority of servers running Linux, which makes up roughly 60% of all servers [2]. With a severity of 8.1 out of 10 [3], RegreSSHion was one of the most significant vulnerabilities of 2024. But who was affected, how did it work, and how severe was it in practice?

OpenSSH

OpenSSH is an implementation of the SSH protocol, allowing users to securely connect to a remote environment. Unlike Telnet, communication in SSH is done over an encrypted channel, mediated through a public-private key scheme. Authentication can be done either by providing the username and password of a Unix user on the server, or by installing the public component of a trusted key pair on the server (usually via the ssh-copy-id command). Users can then copy files via scp, transfer files interactively via sftp, or access a remote shell environment to run commands on the server via ssh [8].

With server management dominated by cloud platforms like AWS and Azure, OpenSSH is a pivotal component that gives server administrators access to their remote servers, and it is the default SSH implementation on Linux systems and OpenBSD.

Who is Affected?

The name of the vulnerability, RegreSSHion, refers to the fact that it is a regression: the flaw was previously discovered, patched, and then reintroduced in later versions. Therefore, all versions of OpenSSH before 4.4p1, and all versions from 8.5p1 up to, but not including, 9.8p1, are considered vulnerable [4].
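The version ranges above can be sketched as a small check. This is only an illustration: the names SshVersion, parse_version, before, and in_affected_range are hypothetical, the check looks only at the leading "major.minor" of the version string, and it treats 8.5p1 itself as affected, following the Qualys advisory [4]. Distributions often backport fixes without changing the upstream version number, so a match here is a hint, not proof.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper type for this sketch; not part of OpenSSH.
struct SshVersion {
    int major;
    int minor;
};

// Parse the leading "major.minor" of a string such as "9.7p1".
// Returns {-1, -1} if the string does not start with that pattern.
constexpr SshVersion parse_version(std::string_view text) {
    int parts[2] = {0, 0};
    std::size_t pos = 0;
    for (int i = 0; i < 2; ++i) {
        std::size_t start = pos;
        while (pos < text.size() && text[pos] >= '0' && text[pos] <= '9') {
            parts[i] = parts[i] * 10 + (text[pos] - '0');
            ++pos;
        }
        if (pos == start) return {-1, -1};  // no digits where a number was expected
        if (i == 0) {
            if (pos >= text.size() || text[pos] != '.') return {-1, -1};
            ++pos;  // skip the dot between major and minor
        }
    }
    return {parts[0], parts[1]};
}

// True when v is strictly older than major.minor.
constexpr bool before(SshVersion v, int major, int minor) {
    return v.major < major || (v.major == major && v.minor < minor);
}

// True when the upstream version number falls in one of the listed ranges.
constexpr bool in_affected_range(std::string_view text) {
    SshVersion v = parse_version(text);
    if (v.major < 0) return false;                          // unparseable input
    bool original_bug = before(v, 4, 4);                    // older than 4.4p1
    bool regression = !before(v, 8, 5) && before(v, 9, 8);  // 8.5p1 <= v < 9.8p1
    return original_bug || regression;
}
```

Under these assumptions, in_affected_range("9.7p1") is true while in_affected_range("9.8p1") and in_affected_range("8.4p1") are false.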

While OpenSSH is available for most major operating systems, the RegreSSHion vulnerability targets a weakness specific to glibc, the GNU implementation of the C standard library, which narrows the affected systems down to glibc-based Linux.

How Does it Work?

To understand how RegreSSHion was able to achieve remote code execution, we must first understand the two concepts that were exploited: Process Signals and Dynamic Memory Allocation.

Process Signals

In Unix-based systems, the Signal is an essential aspect of IPC, or inter-process communication. It allows a process, such as a task manager, to issue a signal to another process, for example instructing that process to die [5]. Signals are very similar to Exceptions or Interrupts, where the logical flow of a program is interrupted and control moves to a special region of code known as the Interrupt Handler, which handles the interrupt before returning to where execution branched off [6].

In higher level languages, such as C++ or Python, exceptions are managed by the language, but when dealing with raw Assembly, care needs to be taken to ensure that the state of the CPU is identical to what it was before the Interrupt Handler was invoked. Because Interrupts and Exceptions can occur at any time, a program can be interrupted in the middle of an important operation that relies on values being in certain locations, values that an Interrupt Handler may change.

Signals, like Exceptions and Interrupts, cause a branch in the normal flow of program execution, but rather than being handled at the CPU level, they are delivered by the Kernel [5]. While some Signals, particularly SIGKILL (which forcefully kills the program), cannot be caught or ignored, most Signals allow the application to handle the request through a Signal Handler. For example, consider a program that writes data to files, where an interruption of that process could lead to corruption. If a user wants to stop the process prematurely by sending SIGINT (Ctrl+C in the terminal), the process will want to catch that signal and then close the file gracefully. By using the signal function, we can trivially set a function that should be run when a process receives a signal:

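#include <signal.h>
#include <cstdlib>
#include <iostream>

static void cleanup(int signo) {
	// Illustration only: std::cout and exit() are themselves not
	// async-signal-safe, so production handlers should avoid them.
	std::cout << "Cleanup!" << std::endl;
	// Handle any cleanup related steps.
	exit(1);
}

int main() {
	// Set cleanup to be the SIGINT Signal Handler.
	signal(SIGINT, cleanup);
	// Do work.
}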

However, it is very important to understand that because Signal Handlers can branch execution at any point, a Signal Handler must act like an Interrupt Handler: when it returns and normal execution resumes, the state the interrupted code depends on must not have been modified. Consider a rather trivial example:

#include <signal.h>
#include <iostream>
#include <chrono>
#include <thread>

// volatile sig_atomic_t is the type that may safely be written from a handler.
volatile sig_atomic_t a = 2;

static void handle(int signo) {
       a = 3;
}

int main() {
       // Set handle to be the SIGINT Signal Handler.
       signal(SIGINT, handle);

       // Do work
       std::this_thread::sleep_for(std::chrono::seconds(5));

       std::cout << a * a << std::endl;
       return 0;
}

If we do not send a SIGINT, we get the expected output of 4, but if we do send SIGINT during the sleep, the program prints 9. The key takeaway is that Signal Handlers must take extreme precaution to ensure they do not modify state in a way the interrupted code does not expect.
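One common way to respect that rule is to have the handler do nothing except set a flag, and to leave all real work to the normal program flow. The following is a minimal sketch of that pattern, not code from OpenSSH; the names stop_requested, request_stop, install_stop_handler, and run_until_stopped are hypothetical.

```cpp
#include <cassert>
#include <csignal>

// The flag is the only state shared with the handler. volatile sig_atomic_t
// is the type the C and C++ standards allow a handler to write safely.
static volatile std::sig_atomic_t stop_requested = 0;

// The handler only records that the signal arrived; it touches nothing else.
extern "C" void request_stop(int) {
    stop_requested = 1;
}

// Register request_stop as the SIGINT handler.
void install_stop_handler() {
    std::signal(SIGINT, request_stop);
}

// Runs up to max_units units of work, checking the flag between units.
// Returns the number of units completed before a stop was requested.
int run_until_stopped(int max_units) {
    assert(max_units >= 0);
    int completed = 0;
    while (completed < max_units && !stop_requested) {
        // One unit of real work would go here.
        ++completed;
    }
    // Cleanup (closing files, flushing buffers) belongs here, in normal code,
    // where the program state is known to be consistent.
    return completed;
}
```

A caller would run install_stop_handler() once at startup and then call run_until_stopped() from main. Once SIGINT arrives, the loop exits at the next check instead of being torn down mid-operation.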

Dynamic Memory Allocation

A key concept of both Assembly and C/C++ is the pair of core memory regions known as the Stack and the Heap. The Stack is a logically structured set of values, which typically “grows” from the top of an application’s allocated memory downward. The Stack follows a Last-In-First-Out scheme, as opposed to a First-In-First-Out structure like a queue. C/C++ compilers use the Stack to store local variables [9].

The Heap is an unorganized region of memory used for dynamically allocated data [9]. Specific functions are needed to manage this memory, for which the C standard library provides the allocators malloc, calloc, and realloc, and the deallocator free. To manage this Heap space effectively, the allocators are complicated functions (the glibc implementation of malloc is over 6000 lines of code [7]), and there are numerous ways to misuse them, causing issues like Double Frees, Segmentation Faults, and Memory Leaks. The following example shows values being initialized on both the Stack and Heap:

#include <string>
#include <cstdlib>

int main() {
       // Allocated on the stack.
       int a = 2;

       // Allocated on the stack.
       char str[] = "Hello, World!";

       // Out-of-bounds indexing is undefined behavior; uncommenting this line may
       // crash the program, for example through Stack-Smashing Detection.
       // str[99] = '.';

       // The std::string manages its own memory on the heap.
       std::string str2 = "Hello, again!";

       // Allocate 4 zero-initialized integers on the heap.
       int* int_array = static_cast<int*>(calloc(4, sizeof(int)));

       // Without this check, a failed allocation would lead to a null pointer dereference.
       if (int_array == NULL) return -1;

       // Also undefined behavior: it may appear to work, but Valgrind will report
       // an invalid write for the out-of-bounds index.
       // int_array[99] = 1;

       // Failure to manually free the allocated memory will cause a Memory Leak.
       free(int_array);
       return 0;
}

The takeaway of this section is that Dynamic Allocation Functions must be handled with caution. They are not async-signal-safe, meaning they are not designed to be re-entered while a call is already in progress, and should not be used in a Signal Handler: if the handler interrupts an allocation and then allocates memory itself, the Heap can easily be corrupted.
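To make that concrete, here is a minimal sketch of a timeout handler that stays away from the allocator entirely. It is an illustration under stated assumptions, not OpenSSH's actual fix: the names log_fd, timed_out, on_timeout, and install_timeout_handler are hypothetical, and the handler limits itself to write(2) and a flag store, both of which are safe to use inside a handler on Linux.

```cpp
#include <cassert>
#include <cerrno>
#include <csignal>
#include <unistd.h>

// State shared with the handler, kept to handler-safe types.
static volatile std::sig_atomic_t log_fd = STDERR_FILENO;
static volatile std::sig_atomic_t timed_out = 0;

// Uses only async-signal-safe operations: write(2) and flag stores.
// No malloc(), no syslog(), no stdio, no std::string.
extern "C" void on_timeout(int) {
    int saved_errno = errno;  // write() may change errno under the interrupted code
    static const char msg[] = "timeout\n";
    ssize_t ignored = write(log_fd, msg, sizeof(msg) - 1);
    (void)ignored;
    timed_out = 1;
    errno = saved_errno;
}

// Route the handler's message to fd and register it for SIGALRM.
void install_timeout_handler(int fd) {
    assert(fd >= 0);
    log_fd = fd;
    struct sigaction action = {};
    action.sa_handler = on_timeout;
    sigemptyset(&action.sa_mask);
    sigaction(SIGALRM, &action, nullptr);
}
```

A program would call install_timeout_handler(STDERR_FILENO) and then alarm(seconds). The fixed message is a deliberate trade-off: formatting a richer log line is exactly the kind of work that tends to pull in unsafe functions.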

The Exploit

Now that we understand Signals and Dynamically Allocated Memory, we can dive into the exploit itself. OpenSSH has a feature that times out a connection when a user takes excessively long to provide credentials. It is exposed through the LoginGraceTime parameter, which defaults to 120 seconds [10]. To enforce this timeout, OpenSSH uses SIGALRM to interrupt the process after the specified period, which triggers its SIGALRM handler [4]. This is not out of the ordinary for applications; however, OpenSSH logs the timeout using the syslog() function, which in the glibc implementation is not safe for use in a signal handler. If syslog() is being called for the first time, it invokes the __localtime64_r() function, which in turn calls malloc(). Insidiously, the syslog() call itself is buried in a call chain: the grace_alarm_handler function invokes the sigdie macro, which calls sshsigdie, which calls sshlogv, before finally arriving at do_log() [4]:

do_log(LogLevel level, int force, const char *suffix, const char *fmt, va_list args) {
	syslog(pri, "%.500s", fmtbuf);

Qualys found that if this SIGALRM handler interrupted a malloc() call within OpenSSH’s public-key parsing code, then the malloc() call made by syslog() inside the handler would operate on a half-updated heap and corrupt it:

Main Execution        |---> grace_alarm_handler:
...                   |      // The state of the heap is now unstable as malloc did not finish.
...                   |      syslog("Timeout") -> malloc() -> Updates the heap.
malloc(X) -> SIGALRM -| <--- return
// When this malloc resumes, the state of the heap has changed, and causes corruption.
...

Eventually, the researchers were able to use this corruption to place their own code in the heap and to overwrite glibc’s __free_hook function pointer with the address of that code, which granted them remote code execution on the next call to free() [4]. The researchers explain the process of corrupting the heap:

If […] malloc() is interrupted by SIGALRM after line 4327 but before line 4339, then the [allocated] chunk […] is already linked into the unsorted list of free chunks, but its size field is under our control, […] and this artificially enlarged […] chunk overlaps with the following small hole. […] [W]hen the SIGALRM handler calls syslog(), [malloc() allocates] the small hole for its FILE structure, and [malloc() allocates] a 4KB read buffer. […] [W]e therefore overwrite parts of the FILE structure with the internal header of this small remainder chunk.

We were able to make 27 pairs of such large and small holes in [the] heap […]: Achieving this complex heap layout was extremely painful and time-consuming, but the [highlights is]: We abuse [OpenSSH’s] public-key parsing code to perform arbitrary sequences of malloc() and free() calls.

[4]

How Severe Was It?

An unauthenticated, network-exploitable remote code execution vulnerability that grants superuser privileges is exceedingly severe. However, despite its 8.1 out of 10 severity score, RegreSSHion received an Exploitability subscore of only 2.2 [3], on a CVSS scale where that subscore tops out at 3.9. Why was this? In short, RegreSSHion took a considerable amount of time to carry out successfully, and could only be carried out on specific architectures [4].
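The exploitability figure can be reproduced from the CVSS v3.1 formula, which multiplies a constant of 8.22 by the weights of four metrics. The sketch below assumes the vector for this CVE is Attack Vector: Network, Attack Complexity: High, Privileges Required: None, and User Interaction: None; the weights come from the CVSS v3.1 specification, and the constant and function names are hypothetical.

```cpp
// Metric weights from the CVSS v3.1 specification.
constexpr double kAttackVectorNetwork = 0.85;
constexpr double kAttackComplexityLow = 0.77;
constexpr double kAttackComplexityHigh = 0.44;
constexpr double kPrivilegesRequiredNone = 0.85;
constexpr double kUserInteractionNone = 0.85;

// CVSS v3.1 exploitability subscore, before rounding to one decimal place.
constexpr double exploitability(double attack_vector, double attack_complexity,
                                double privileges_required, double user_interaction) {
    return 8.22 * attack_vector * attack_complexity *
           privileges_required * user_interaction;
}

// Assumed vector for this CVE: AV:N / AC:H / PR:N / UI:N, roughly 2.22.
constexpr double kRegresshionExploitability =
    exploitability(kAttackVectorNetwork, kAttackComplexityHigh,
                   kPrivilegesRequiredNone, kUserInteractionNone);

// The same vector with low attack complexity, roughly 3.89 (the maximum).
constexpr double kSameVectorLowComplexity =
    exploitability(kAttackVectorNetwork, kAttackComplexityLow,
                   kPrivilegesRequiredNone, kUserInteractionNone);
```

Under that assumption, the gap between 2.2 and the 3.9 maximum comes from the Attack Complexity metric alone, which matches the difficulties described in the following sections.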

Luck and Patience

There was a lot of luck involved, as the exploiters had to time the SIGALRM handler to land within a narrow range of 12 lines of code inside the malloc() call made by OpenSSH’s public-key parsing code. Coupled with the default 120 second timeout for each SIGALRM, the authors remark that it took ~10,000 tries to win the race condition and cause the intended Heap corruption. This translated to 3-4 hours for remote code execution with 100 active connections to the server [4].

ASLR

A security feature that severely hindered the exploit was ASLR, or Address Space Layout Randomization. The Linux Kernel automatically randomizes the location of key parts of the program, including glibc. When attacking a 32-bit machine (with 4 byte memory addresses), this caused glibc, and with it the __free_hook pointer, to be loaded at one of two base addresses, 0xb7200000 or 0xb7400000, giving the attackers a 50% chance of guessing correctly. This alone effectively doubles the time needed to exploit RegreSSHion, up to 6-8 hours [4].
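The time estimates in this section and the previous one follow from simple arithmetic, sketched below. It assumes each of the 100 connections yields one attempt per 120 second grace period and that an attempt only counts when the address guess is right; the function name expected_hours is hypothetical.

```cpp
// Expected wall-clock hours to win the race, under the stated assumptions.
//   tries:             attempts needed on average to win the race condition
//   connections:       simultaneous connections, one attempt each per window
//   grace_seconds:     length of one LoginGraceTime window
//   guess_probability: chance that the guessed glibc address is correct
constexpr double expected_hours(double tries, double connections,
                                double grace_seconds, double guess_probability) {
    double windows = tries / connections;                          // full timeout windows
    double seconds = windows * grace_seconds / guess_probability;  // scaled by ASLR guess
    return seconds / 3600.0;
}

// Race condition alone: 10,000 tries at 100 per 120 s, about 3.3 hours.
constexpr double kRaceOnlyHours = expected_hours(10000.0, 100.0, 120.0, 1.0);

// With a 50% chance of guessing the address, about 6.7 hours.
constexpr double kWithAslrHours = expected_hours(10000.0, 100.0, 120.0, 0.5);
```

Both results fall inside the 3-4 hour and 6-8 hour ranges quoted from the advisory [4].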

AMD64

The most important reason RegreSSHion was not easily exploitable is the significant improvement in ASLR between the 4 byte memory addresses of 32-bit machines and the 8 byte memory addresses of modern 64-bit machines: a larger address space leaves far more possible locations to guess. The authors of the paper were unable to use this exploit on a 64-bit computer, but estimated that the time required was upward of an entire week [4]. Because the exploit relies on syslog(), the very function that logs these timeouts, any astute server administrator would quickly notice the flood of timeouts and every one of the 100 permitted connections being in use, and a simple restart of the server or the OpenSSH daemon would reset the Heap to a safe state, undoing all of the attacker’s work. With 64-bit computers long since having overtaken 32-bit as the dominant architecture, and the researchers unable to exploit them in any reasonable time, the scope of the exploit is reduced significantly.

Conclusion

RegreSSHion revealed a significant vulnerability in the ubiquitous OpenSSH. By abusing a Signal Handler tied to a timeout, which called the async-signal-unsafe glibc implementation of syslog(), attackers could methodically corrupt the heap of a privileged child process and eventually achieve arbitrary code execution, exploitable from the network and at superuser privilege. While newer versions of the software have since patched the vulnerability, its emergence as a regression of a prior vulnerability reveals the critical importance of testing and stringent code review for any application that users and administrators rely on for security. Even minor issues, such as calling a standard library function that just so happens to use a dynamic memory allocator in a Signal Handler, can be exploited through patience and ingenuity.

References

1: https://blog.qualys.com/vulnerabilities-threat-research/2024/07/01/regresshion-remote-unauthenticated-code-execution-vulnerability-in-openssh-server
2: https://www.fortunebusinessinsights.com/server-operating-system-market-106601
3: https://nvd.nist.gov/vuln/detail/CVE-2024-6387
4: https://www.qualys.com/2024/07/01/cve-2024-6387/regresshion.txt
5: https://www.man7.org/linux/man-pages/man7/signal.7.html
6: https://tldp.org/LDP/lkmpg/2.6/html/x1256.html
7: https://github.com/kraj/glibc/blob/master/malloc/malloc.c
8: https://www.openssh.com/
9: https://www.learncpp.com/cpp-tutorial/the-stack-and-the-heap/
10: https://www.man7.org/linux/man-pages/man5/sshd_config.5.html

Image References:

Featured Image: https://www.securityweek.com/wp-content/uploads/2024/07/regreSSHion.jpg
OpenSSH: https://www.openssh.com/images/openssh.gif
Process Signals: https://devopedia.org/images/article/197/5091.1562685662.png
Dynamic Memory Allocation: https://cdn-images-1.medium.com/max/1200/1*8b9-Z3FV6X9SP9We8gSC3Q.jpeg

Join the Conversation

6 Comments

Emeka Nnamdi says:

September 10, 2024 at 6:20 pm

Great Job Kyle,
This highlighted the impact of having a continuous incident response plan.
As newer versions of OpenSSH patch this vulnerability, it’s a stark reminder of the importance of thorough scrutiny and constant vigilance in software development. The persistence and ingenuity required to exploit such vulnerabilities also underline the need for continuous improvement and vigilance in maintaining software security. For developers and administrators, this serves as a crucial lesson: always anticipate potential issues, no matter how minor they may seem, and maintain a robust approach to testing and review.

Maria Isabel German says:

September 11, 2024 at 10:34 pm

Very informative, Kyle! Were there any reports if anyone got affected by this? It’s good that they were able to release a patch and fixed the issue because just imagine how severe the impact will be! This is really a good example of why regression testing is crucial in every project, regardless of how meticulous you are in aiming for a “flawless” implementation.

  Kyle Kernick says:
  
  September 12, 2024 at 12:15 pm
  
  From the news articles I’ve read, Qualys responsibly disclosed the vulnerability and was in contact with the OpenSSH team, so the patch was immediately available and users could update to a safe version before bad actors could attempt to take advantage. According to Kaspersky (https://www.kaspersky.com/blog/openssh-vulnerability-mitigation-cve-2024-6387-regresshion/51603/), another security measure that I hadn’t mentioned is that because this exploit requires so many connections to the server, DDoS protection mechanisms like Cloudflare can thwart it as well, so it’s exceedingly unlikely that a bad actor was able to abuse this exploit on a high-value target.
  
Ankita Ankita says:

September 12, 2024 at 3:00 pm

Hello Kyle, excellent post! The explanation of the RegreSSHion vulnerability and its exploitation methods is remarkably detailed. The complexity involved in exploiting such vulnerabilities is notable. It’s interesting to see how the interaction between signal handlers and memory allocation functions can lead to significant security breaches. The in-depth analysis of process signals and dynamic memory management clarifies how the exploit was executed. Have you encountered any details regarding the difficulties faced during the patching of the RegreSSHion vulnerability?

Devanshu Paresh Parikh says:

September 14, 2024 at 12:54 pm

Attackers could use this problem to gain superuser control by controlling the heap memory during signal handling.
Great insights!
Kyle notes that while the severity was high, the elaborate exploitation took considerable time, underscoring the potential impact of such attacks. This incident can be used to advocate for more security testing in commonly used applications such as OpenSSH.

Nicole Lefebvre says:

September 20, 2024 at 5:34 pm

Great post, Kyle! Your explanations provided a clear understanding of how the attack works and the underlying technology that allowed it to happen. The most concerning thing about this vulnerability was the fact that it was previously fixed, and then reintroduced. I wonder if automated tests are used in the OpenSSH code base, and run for any proposed changes through GitHub actions or another CI pipeline, so that any reintroduction of these issues could be caught.
