An opinionated guide on how to reverse engineer software, part 2

This is the second post in a series meant to help improve your static reverse engineering skills. If you missed the first post, it’s available here.

There’s nothing new under the sun

With the exception of the blockchain, we haven’t had a new fundamental data structure in 30 years. The languages and tools we’re using have changed but how we construct software hasn’t. We still use arrays, vectors, lists, hash maps, queues, etc. It’s considered good practice to write the most boring and straightforward code you can -- it’s good for maintainability, for your junior teammates, and for your own future sanity, and often allows the compiler to produce better code.

Put yourself in the developer’s shoes

The best reverse engineer is also a very capable software developer. Why you may ask? Because almost no one reinvents the wheel. You should know how a developer should have solved a specific problem and then use your reverse engineering skills to confirm those assumptions.

Don’t let your hubris get in the way. You will be a much better reverse engineer by learning how to program in the language and discipline of your target software. Push back on anyone telling you otherwise -- those that disagree may be able to get the job done, but are they really effective?

As a novice, the absolute best way to get better at reverse engineering is to write code, envision the assembly created, and check the disassembly to verify your assumptions. Start by implementing the most common libc functions – the ones that relate to user-supplied data or external data. We’re going to concern ourselves with functions like memcpy, memcmp, strcpy, strlen, and strcmp. For each, write implementations and analyze their disassembly.

memcpy is one of the libc standard data copying functions. The POSIX standard specifies memcpy shall:

The memcpy() function shall copy n bytes from the object pointed to by s2 into the object pointed to by s1. If copying takes place between objects that overlap, the behavior is undefined.

With a function prototype of void *memcpy(void *restrict s1, const void *restrict s2, size_t n);.

A straightforward memcpy implementation could look like this:

void *memcpy(void *restrict s1, const void *restrict s2, size_t n)
{
    char *dst = s1;
    const char *src = s2;

    while (n--) {
        *dst++ = *src++;
    }

    return s1;
}

The disassembly graphs should be fairly similar, they’re both copying one byte at a time to a given length.

In each function, we check if the number of bytes specified is zero and return early if so.
We enter the bytewise copy loop
Check to see if we’ve hit our specified limit
Return the destination buffer pointer

The only difference (5) is in the strncpy, it checks for null bytes and breaks out of the loop when it encounters one – indicating that it expects to be working with C strings.

Putting it together

After you get a handle on how the classic datastructures and their manipulation functions compile to native code, you'll be able to instantly recognize structure of the data and then more quickly determine its purpose.

Homework

What would the graph for strcpy look like?
How does strcpy differ from memcpy and strncpy?
Take a shot at implementing strchr and comparing its graph to the others.