{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Prerequisite: the reader is encouraged to read the documentation of `expression` and `locationdb` before this part.\n", "\n", "# Miasm Intermediate representation\n", "The intermediate representation of Miasm represents the `side effects` of instructions in a control flow graph. To summarise, here is the correspondence between the native world and its intermediate representation:\n", "- an assembly control flow graph (`AsmCFG`) is represented in intermediate representation by an \"Intermediate representation control flow graph\": `IRCFG`\n", "- an `AsmCFG` is composed of basic blocks. In intermediate representation, the `IRCFG` is composed of Intermediate representation blocks: `IRBlock`s\n", "- a native basic block is a sequence of instructions. In intermediate representation, the `IRBlock` is a sequence of `AssignBlock`s\n", "- an `AssignBlock` is composed of parallel assignments of expressions. \"Parallel\" means that those assignments are executed at exactly the same time (as opposed to successively)\n", "\n", "Note that this does not imply that an instruction translates to a single `AssignBlock`. The translation of a native instruction can generate multiple `AssignBlock`s and even multiple `IRBlock`s. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Examples\n", "Let's take some examples of translated instructions. First of all, we will create a helper to generate intermediate representation from assembly code. 
Skip this code, it's not important for the rest of the documentation" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from miasm.analysis.machine import Machine\n", "from miasm.arch.x86.arch import mn_x86\n", "from miasm.core import parse_asm, asmblock\n", "from miasm.arch.x86.lifter_model_call import LifterModelCall_x86_32\n", "from miasm.core.locationdb import LocationDB\n", "from miasm.loader.strpatchwork import StrPatchwork\n", "from miasm.analysis.binary import Container\n", "from miasm.ir.ir import IRCFG, AssignBlock\n", "from miasm.expression.expression import *\n", "import logging\n", "\n", "# Quiet warnings\n", "asmblock.log_asmblock.setLevel(logging.ERROR)\n", "\n", "\n", "def gen_x86_asmcfg(asm):\n", "    # First, asm code\n", "    machine = Machine(\"x86_32\")\n", "\n", "    # Add dummy label \"end\" at code's end\n", "    code = asm + \"\\nend:\\n\"\n", "    loc_db = LocationDB()\n", "    # The main will be at address 0\n", "    loc_db.set_location_offset(loc_db.get_or_create_name_location(\"main\"), 0x0)\n", "\n", "    asmcfg = parse_asm.parse_txt(\n", "        mn_x86, 32, code,\n", "        loc_db\n", "    )\n", "    virt = StrPatchwork()\n", "    # Assemble shellcode\n", "    patches = asmblock.asm_resolve_final(\n", "        machine.mn,\n", "        asmcfg,\n", "    )\n", "    # Put shellcode in a string\n", "    for offset, raw in patches.items():\n", "        virt[offset] = raw\n", "    data = bytes(virt)\n", "    cont = Container.fallback_container(\n", "        data,\n", "        vm=None, addr=0,\n", "        loc_db=loc_db,\n", "    )\n", "    dis_engine = machine.dis_engine\n", "    # Disassemble back the shellcode\n", "    # Now, basic blocks are at known position, determined by\n", "    # the assembled version\n", "    mdis = dis_engine(cont.bin_stream, loc_db=cont.loc_db)\n", "    asmcfg = mdis.dis_multiblock(0)\n", "    return asmcfg\n", "\n", "def lift_x86_asm(asm, model_call=False, lifter_custom=None):\n", "    asmcfg = gen_x86_asmcfg(asm)\n", "    machine = Machine(\"x86_32\")\n", "    # Get a lifter\n",
"    if model_call and lifter_custom is None:\n", "        lifter = LifterModelCall_x86_32(asmcfg.loc_db)\n", "    elif lifter_custom is not None:\n", "        lifter = lifter_custom(asmcfg.loc_db)\n", "    else:\n", "        lifter = machine.lifter(asmcfg.loc_db)\n", "\n", "    # Translate to IR\n", "    ircfg = lifter.new_ircfg_from_asmcfg(asmcfg)\n", "    return ircfg\n", "\n", "def graph_ir_x86(asm, model_call=False, lifter_custom=None):\n", "    ircfg = lift_x86_asm(asm, model_call, lifter_custom)\n", "    return ircfg.graphviz()\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "html_table\n", "\n", "\n", "\n", "0\n", "\n", "\n", "main\n", "MOV        EAX, EBX\n", "\n", "\n", "\n", "1\n", "\n", "\n", "end\n", "IOError\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's generate the AsmCFG\n", "asmcfg = gen_x86_asmcfg(\"\"\"\n", "main:\n", " MOV EAX, EBX\n", "\"\"\")\n", "asmcfg.graphviz()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "html_table\n", "\n", "\n", "\n", "0\n", "\n", "\n", "main\n", "EAX = EBX\n", "IRDst = b'end'\n", "\n", "\n", "\n", "1\n", "\n", "\n", "end\n", "\n", "NOT PRESENT\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# And graph the corresponding IRCFG\n", "graph_ir_x86(\"\"\"\n", "main:\n", " MOV EAX, EBX\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's analyze this graph:\n", "- the first IR basic block is named `main`\n", "- it is composed of 2 `AssignBlock`s\n", "- the first `AssignBlock` contains only one assignment, `EAX = EBX`\n", "- the second one is `IRDst = loc_key_1`\n", "\n", "The `IRDst` is a 
special register which represents a kind of *program counter* in intermediate representation. Each `IRBlock` has one and only one assignment to `IRDst`. The `IRDst` assignment is not always located in the last `AssignBlock` of the `IRBlock`. In our case, the shellcode stops after the `MOV EAX, EBX`, so the next location to execute is unknown: `end`. This label has been artificially added by the script.\n", "\n", "\n", "Let's take another instruction." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "html_table\n", "\n", "\n", "\n", "0\n", "\n", "\n", "main\n", "zf = FLAG_EQ_CMP(EAX, -0x3)\n", "nf = FLAG_SIGN_SUB(EAX, -0x3)\n", "pf = parity((EAX + 0x3) & 0xFF)\n", "cf = FLAG_ADD_CF(EAX, 0x3)\n", "of = FLAG_ADD_OF(EAX, 0x3)\n", "af = ((EAX ^ 0x3) ^ (EAX + 0x3))[4:5]\n", "EAX = EAX + 0x3\n", "IRDst = b'end'\n", "\n", "\n", "\n", "1\n", "\n", "\n", "end\n", "\n", "NOT PRESENT\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "graph_ir_x86(\"\"\"\n", "main:\n", " ADD EAX, 3\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this graph, we can note that each side effect of the instruction is represented.\n", "Consider the equation:\n", "```\n", "zf = FLAG_EQ_CMP(EAX, -0x3)\n", "```\n", "The detailed version of this expression is:\n", "```\n", "ExprId('zf', 1) = ExprOp('FLAG_EQ_CMP', ExprId('EAX', 32), ExprInt(-0x3, 32))\n", "```\n", "The operator `FLAG_EQ_CMP` is a kind of *high level* representation. But you can customize the lifter in order to get the real equation of the `zf`. 
This will be presented in documentation dedicated to the modification of the intermediate representation control flow graph. The real equation is:\n", "```\n", "ExprId('zf', 1) = ExprCond(ExprId('EAX', 32) - ExprInt(-0x3, 32), ExprInt(0, 1), ExprInt(1, 1))\n", "```\n", "which is, in a simplified form:\n", "```\n", "zf = (EAX + 3) ? (0, 1)\n", "```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "html_table\n", "\n", "\n", "\n", "0\n", "\n", "\n", "main\n", "EAX = EBX\n", "EBX = EAX\n", "IRDst = b'end'\n", "\n", "\n", "\n", "1\n", "\n", "\n", "end\n", "\n", "NOT PRESENT\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "graph_ir_x86(\"\"\"\n", "main:\n", " XCHG EAX, EBX\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This one is interesting, as it perfectly demonstrates the parallel execution of multiple assignments. If you are puzzled by this notation, imagine that it describes equations which express the destination variables of an output state in terms of an input state. The equations can be rewritten:\n", "```\n", "EAX_out = EBX_in\n", "EBX_out = EAX_in\n", "```\n", "\n", "This matches the `xchg` semantics. After the execution, those variables are committed, which means that `EAX` takes the value of `EAX_out`, and `EBX` takes the value of `EBX_out`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some arbitrary choices have been made in order to match the native semantics as closely as possible. For example, let's take the instruction:\n", "```\n", "CMOVZ EAX, EBX\n", "```\n", "This conditional move is performed if the zero flag is set. So we may want to translate it as:\n", "```\n", "EAX = zf ? 
EBX : EAX\n", "```\n", "Which can be read: if `zf` is 1, `EAX` is set to `EBX`, else `EAX` is set to `EAX`, which is equivalent to no modification.\n", "\n", "This representation seems good at first, as the semantics of the conditional move seem ok. But let's question the system on the equation `EAX = zf ? EBX : EAX`:\n", "- which register is written? `EAX` is *always* written\n", "- which register is read? `zf`, `EBX`, `EAX` are read\n", "\n", "If we ask the same questions about the instruction `CMOVZ EAX, EBX`, the answers are a bit different:\n", "- which register is written? `EAX` is written only if `zf` is 1\n", "- which register is read? `zf` is *always* read, `EBX` is read only if `zf` is 1\n", "\n", "The conclusion is that the representation we gave doesn't properly represent the instruction. Here is what Miasm gives as intermediate representation for it:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "html_table\n", "\n", "\n", "\n", "0\n", "\n", "\n", "main\n", "ESP = ESP + -0x4\n", "@32[ESP + -0x4] = EAX\n", "IRDst = b'end'\n", "\n", "\n", "\n", "1\n", "\n", "\n", "end\n", "\n", "NOT PRESENT\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Here is a push\n", "graph_ir_x86(\"\"\"\n", "main:\n", " PUSH EAX\n", "\"\"\")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "html_table\n", "\n", "\n", "\n", "0\n", "\n", "\n", "main\n", "IRDst = CC_EQ(zf)?(loc_key_2,b'end')\n", "\n", "\n", "\n", "1\n", "\n", "\n", "end\n", "\n", "NOT PRESENT\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n", "2\n", "\n", "\n", "loc_key_2\n", "EAX = EBX\n", "IRDst = b'end'\n", "\n", "\n", "\n", "0->2\n", "\n", "\n", "\n", "\n", "\n", "2->1\n", "\n", 
"\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "graph_ir_x86(\"\"\"\n", "main:\n", " CMOVZ EAX, EBX\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are some remarks we can make about this version:\n", "- *one* x86 instruction has generated multiple `IRBlock`s\n", "- the first `IRBlock` only reads `zf` (we don't take the locations into account here)\n", "- `EAX` is assigned only when `zf` equals 1\n", "- `EBX` is read only when `zf` equals 1\n", "\n", "One can argue that in this form, it's harder to see what is read and what is written. But consider: if `cmovz` didn't exist (for example on older CPUs), what would the code doing this look like?" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "html_table\n", "\n", "\n", "\n", "0\n", "\n", "\n", "main\n", "EIP = CC_EQ(zf)?(b'end',loc_2)\n", "IRDst = CC_EQ(zf)?(b'end',loc_2)\n", "\n", "\n", "\n", "1\n", "\n", "\n", "end\n", "\n", "NOT PRESENT\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n", "2\n", "\n", "\n", "loc_2\n", "EAX = EBX\n", "IRDst = b'end'\n", "\n", "\n", "\n", "0->2\n", "\n", "\n", "\n", "\n", "\n", "2->1\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "graph_ir_x86(\"\"\"\n", "main:\n", " JZ end\n", " MOV EAX, EBX\n", "end:\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The conclusion is that in intermediate representation, analyzing `cmovz` is exactly as difficult as analyzing the equivalent code using `jz/mov`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So an important point is that in intermediate representation, one instruction can generate *multiple* `IRBlock`s. 
Here are some interesting examples:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "html_table\n", "\n", "\n", "\n", "0\n", "\n", "\n", "main\n", "@8[EDI[0:32]] = @8[ESI[0:32]]\n", "IRDst = df?(loc_key_3,loc_key_2)\n", "\n", "\n", "\n", "2\n", "\n", "\n", "loc_key_2\n", "ESI = ESI[0:32] + 0x1\n", "EDI = EDI[0:32] + 0x1\n", "IRDst = b'end'\n", "\n", "\n", "\n", "0->2\n", "\n", "\n", "\n", "\n", "\n", "3\n", "\n", "\n", "loc_key_3\n", "ESI = ESI[0:32] + -0x1\n", "EDI = EDI[0:32] + -0x1\n", "IRDst = b'end'\n", "\n", "\n", "\n", "0->3\n", "\n", "\n", "\n", "\n", "\n", "1\n", "\n", "\n", "end\n", "\n", "NOT PRESENT\n", "\n", "\n", "\n", "2->1\n", "\n", "\n", "\n", "\n", "\n", "3->1\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "graph_ir_x86(\"\"\"\n", "main:\n", " MOVSB\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now, the version using a repeat prefix:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "html_table\n", "\n", "\n", "\n", "0\n", "\n", "\n", "main\n", "IRDst = ECX[0:32]?(loc_key_4,b'end')\n", "\n", "\n", "\n", "1\n", "\n", "\n", "end\n", "\n", "NOT PRESENT\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n", "4\n", "\n", "\n", "loc_key_4\n", "@8[EDI[0:32]] = @8[ESI[0:32]]\n", "IRDst = df?(loc_key_3,loc_key_2)\n", "\n", "\n", "\n", "0->4\n", "\n", "\n", "\n", "\n", "\n", "2\n", "\n", "\n", "loc_key_2\n", "ESI = ESI[0:32] + 0x1\n", "EDI = EDI[0:32] + 0x1\n", "IRDst = loc_key_5\n", "\n", "\n", "\n", "5\n", "\n", "\n", "loc_key_5\n", "ECX = ECX[0:32] + -0x1\n", "IRDst = ((ECX[0:32] + -0x1)?(0x0,0x1))?(b'end',loc_key_4)\n", "\n", "\n", "\n", "2->5\n", "\n", "\n", "\n", "\n", "\n", "3\n", "\n", "\n", "loc_key_3\n", "ESI = 
ESI[0:32] + -0x1\n", "EDI = EDI[0:32] + -0x1\n", "IRDst = loc_key_5\n", "\n", "\n", "\n", "3->5\n", "\n", "\n", "\n", "\n", "\n", "4->2\n", "\n", "\n", "\n", "\n", "\n", "4->3\n", "\n", "\n", "\n", "\n", "\n", "5->1\n", "\n", "\n", "\n", "\n", "\n", "5->4\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "graph_ir_x86(\"\"\"\n", "main:\n", " REP MOVSB\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the very same way as `cmovz`, if the `rep movsb` instruction didn't exist, we would have to use more complex code." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The translation of some instructions is tricky:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "html_table\n", "\n", "\n", "\n", "0\n", "\n", "\n", "main\n", "cf = (EAX >> (0x1 + -0x1))[0:1]\n", "of = (0x1 + -0x1)?(0x0,EAX[31:32])\n", "EAX = EAX >> 0x1\n", "zf = (EAX >> 0x1)?(0x0,0x1)\n", "nf = FLAG_SIGN_SUB(EAX >> 0x1, 0x0)\n", "pf = parity((EAX >> 0x1) & 0xFF)\n", "IRDst = b'end'\n", "\n", "\n", "\n", "1\n", "\n", "\n", "end\n", "\n", "NOT PRESENT\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "graph_ir_x86(\"\"\"\n", "main:\n", " SHR EAX, 1\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the moment, nothing special. `EAX` is updated correctly, and the flags are updated according to the result (note that those side effects are in parallel here). 
But look at the next one:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "html_table\n", "\n", "\n", "\n", "0\n", "\n", "\n", "main\n", "IRDst = (zeroExt_32(ECX[0:8]) & 0x1F)?(loc_key_2,b'end')\n", "\n", "\n", "\n", "1\n", "\n", "\n", "end\n", "\n", "NOT PRESENT\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n", "2\n", "\n", "\n", "loc_key_2\n", "cf = (EAX >> ((zeroExt_32(ECX[0:8]) & 0x1F) + -0x1))[0:1]\n", "of = ((zeroExt_32(ECX[0:8]) & 0x1F) + -0x1)?(0x0,EAX[31:32])\n", "EAX = EAX >> (zeroExt_32(ECX[0:8]) & 0x1F)\n", "zf = (EAX >> (zeroExt_32(ECX[0:8]) & 0x1F))?(0x0,0x1)\n", "nf = FLAG_SIGN_SUB(EAX >> (zeroExt_32(ECX[0:8]) & 0x1F), 0x0)\n", "pf = parity((EAX >> (zeroExt_32(ECX[0:8]) & 0x1F)) & 0xFF)\n", "IRDst = b'end'\n", "\n", "\n", "\n", "0->2\n", "\n", "\n", "\n", "\n", "\n", "2->1\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "graph_ir_x86(\"\"\"\n", "main:\n", " SHR EAX, CL\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, if `CL` is zero, the destination is shifted by a zero amount. The instruction behaves (in 32 bit mode) as a `nop`, and the flags are not assigned. We could have done the same trick as in the `cmovz`, but this representation matches more accurately the instruction semantic." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is another one:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "html_table\n", "\n", "\n", "\n", "0\n", "\n", "\n", "main\n", "IRDst = ECX?(loc_key_2,loc_key_3)\n", "\n", "\n", "\n", "2\n", "\n", "\n", "loc_key_2\n", "EDX = umod({EAX 0 32, EDX 32 64}, zeroExt_64(ECX))[0:32]\n", "EAX = udiv({EAX 0 32, EDX 32 64}, zeroExt_64(ECX))[0:32]\n", "IRDst = b'end'\n", "\n", "\n", "\n", "0->2\n", "\n", "\n", "\n", "\n", "\n", "3\n", "\n", "\n", "loc_key_3\n", "exception_flags = 0x2010000\n", "IRDst = b'end'\n", "\n", "\n", "\n", "0->3\n", "\n", "\n", "\n", "\n", "\n", "1\n", "\n", "\n", "end\n", "\n", "NOT PRESENT\n", "\n", "\n", "\n", "2->1\n", "\n", "\n", "\n", "\n", "\n", "3->1\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "graph_ir_x86(\"\"\"\n", "main:\n", " DIV ECX\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This instruction may generate an exception if the divisor is zero. The intermediate representation generates a test that evaluates the divisor and, if it is zero, assigns a constant to the special register `exception_flags`. This constant represents the division by zero.\n", "\n", "Note that this is arbitrary. We could have chosen not to make the possible division by zero explicit, and to keep in mind that the `umod` and `udiv` operators may generate exceptions. This may change in a future version of Miasm. Indeed, each memory access may generate an exception, and Miasm doesn't make them explicit in the intermediate representation, as this could be misleading and very hard to analyze in a post pass. 
This is why we may choose to let both of those operators raise exceptions implicitly rather than generating such code.\n", "\n", "The same choice has been made for other instructions:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "html_table\n", "\n", "\n", "\n", "0\n", "\n", "\n", "main\n", "exception_flags = 0x2\n", "interrupt_num = 0x3\n", "IRDst = b'end'\n", "\n", "\n", "\n", "1\n", "\n", "\n", "end\n", "\n", "NOT PRESENT\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "graph_ir_x86(\"\"\"\n", "main:\n", " INT 0x3\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, memory accesses make segmentation explicit:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "html_table\n", "\n", "\n", "\n", "0\n", "\n", "\n", "main\n", "EAX = @32[segm(FS, EBX)]\n", "IRDst = b'end'\n", "\n", "\n", "\n", "1\n", "\n", "\n", "end\n", "\n", "NOT PRESENT\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "graph_ir_x86(\"\"\"\n", "main:\n", " MOV EAX, DWORD PTR FS:[EBX]\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The memory pointer uses the special operator `segm`, which takes two arguments:\n", "- the value of the segment used by the memory access\n", "- the base address\n", "\n", "Note that if you work in a flat segmentation model, you can add a post translation pass which will *simplify* `ExprOp(\"segm\", A, B)` into `B`. 
This will ease code analysis.\n", "\n", "Note: If you carefully read the documentation on `expression`s, you know that `ExprOp` is n-ary and that all of its arguments must have the same size. The operator `segm` is one of the exceptions. The register `FS` has a size of 16 bits (as a segment selector register) and `EBX` has a size of 32 bits. In this case, `ExprOp(\"segm\", FS, EBX)` has the size of `EBX`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Intermediate representation translation\n", "In this part, we will explain some manipulations which can be done during the native code *lifting*. Let's take the example of a call to a subfunction:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "html_table\n", "\n", "\n", "\n", "0\n", "\n", "\n", "main\n", "CALL       loc_11223344\n", "\n", "\n", "\n", "2\n", "\n", "\n", "loc_5\n", "MOV        EBX, EAX\n", "\n", "\n", "\n", "0->2\n", "\n", "\n", "\n", "\n", "\n", "1\n", "\n", "\n", "end\n", "IOError\n", "\n", "\n", "\n", "2->1\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "asmcfg = gen_x86_asmcfg(\"\"\"\n", "main:\n", " CALL 0x11223344\n", " MOV EBX, EAX\n", "\"\"\")\n", "asmcfg.graphviz()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "html_table\n", "\n", "\n", "\n", "0\n", "\n", "\n", "main\n", "ESP = ESP[0:32] + 0xFFFFFFFC\n", "@32[ESP[0:32] + 0xFFFFFFFC] = loc_5\n", "EIP = loc_11223344\n", "IRDst = loc_11223344\n", "\n", "\n", "\n", "3\n", "\n", "\n", "loc_11223344\n", "\n", "NOT PRESENT\n", "\n", "\n", "\n", "0->3\n", "\n", "\n", "\n", "\n", "\n", "1\n", "\n", "\n", "end\n", "\n", "NOT PRESENT\n", "\n", "\n", "\n", "2\n", "\n", "\n", "loc_5\n", "EBX = EAX\n", "IRDst = b'end'\n", "\n", 
"\n", "\n", "2->1\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "graph_ir_x86(\"\"\"\n", "main:\n", " CALL 0x11223344\n", " MOV EBX, EAX\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What happened here?\n", "- the `call` instruction has 2 side effects: stacking the return address and jumping to the subfunction address\n", "- here, the subfunction address is `0x11223344`, and the return address is located at offset `0x5`, which is represented here by `loc_5`\n", "\n", "The question is: why are there unlinked nodes in the graph? The answer is that the graph only analyzes the destinations of the `IRBlock`s, which means the value of `IRDst`. So in `main`, Miasm knows that the next `IRBlock` is located at `loc_11223344`. But as we didn't disassemble code at this address, we don't have its intermediate representation.\n", "\n", "But the disassembler engine knows (this behavior can be customized) that a `call` returns to the instruction right after the call. So the basic block at `loc_5` has been disassembled and translated. If we analyze `IRDst` only, there is no link between `main` and `loc_5`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This `raw` way of translating is useful for seeing the low-level handling of the stack and the return address, but it makes code analysis a bit hard. What we may want is to treat subcalls like an unknown operator, with arguments and side effects. This would *model* the call to a subfunction.\n", "\n", "This is the difference in Miasm between translating using `lifter` (raw translation) and `lifter_model_call` (`lifter` + call modeling), which models subfunction calls. By default, Miasm uses a basic model which is *wrong* in most cases. But this model can (and must?) 
be replaced by a user-defined one.\n", "\n", "You can observe the difference in the examples:\n", "```\n", "example/disasm/dis_binary_lift.py\n", "```\n", "and\n", "```\n", "example/disasm/dis_binary_lifter_model_call.py\n", "```\n" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "html_table\n", "\n", "\n", "\n", "0\n", "\n", "\n", "main\n", "EBX = 0x1234\n", "EAX = call_func_ret(loc_11223344, ESP)\n", "ESP = call_func_stack(loc_11223344, ESP)\n", "IRDst = loc_a\n", "\n", "\n", "\n", "2\n", "\n", "\n", "loc_a\n", "ECX = EAX\n", "ESP = ESP[0:32] + 0x4\n", "EIP = @32[ESP[0:32]]\n", "IRDst = @32[ESP[0:32]]\n", "\n", "\n", "\n", "0->2\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "graph_ir_x86(\"\"\"\n", "main:\n", " MOV EBX, 0x1234\n", " CALL 0x11223344\n", " MOV ECX, EAX\n", " RET\n", "\"\"\", True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What happened here?\n", "The translation of the `call` is replaced by two side effects which occur in parallel:\n", "- `EAX` is set to the result of the operator `call_func_ret` which has two arguments: `loc_11223344` and `ESP`\n", "- `ESP` is set to the result of the operator `call_func_stack` which has two arguments: `loc_11223344` and `ESP`\n", "\n", "The first one models the assignment of the return value in 'classic' x86 code. The second one models a possible change of the stack pointer, which depends on the function called and on the old stack pointer.\n", "Everything here can be subclassed in order to customize the translation behavior."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Subfunction call custom modeling\n", "The code responsible for modeling function calls is located in the `LifterModelCall` class (the lifter with call modeling) in `miasm/ir/analysis.py`:\n", "```python\n", "...\n", "    def call_effects(self, addr, instr):\n", "        \"\"\"Default modelisation of a function call to @addr. This may be used to:\n", "\n", "        * insert dependencies to arguments (stack base, registers, ...)\n", "        * add some side effects (stack clean, return value, ...)\n", "\n", "        Return a couple:\n", "        * list of assignments to add to the current irblock\n", "        * list of additional irblocks\n", "\n", "        @addr: (Expr) address of the called function\n", "        @instr: native instruction which is responsible of the call\n", "        \"\"\"\n", "\n", "        call_assignblk = AssignBlock(\n", "            [\n", "                ExprAssign(self.ret_reg, ExprOp('call_func_ret', addr, self.sp)),\n", "                ExprAssign(self.sp, ExprOp('call_func_stack', addr, self.sp))\n", "            ],\n", "            instr\n", "        )\n", "        return [call_assignblk], []\n", "\n", "```\n", "\n", "Some architectures subclass it to include architecture-dependent details, for example in `miasm/arch/x86/lifter_model_call.py`, in which we use a default calling convention where arguments are passed through registers:\n", "```python\n", "...\n", "    def call_effects(self, ad, instr):\n", "        call_assignblk = AssignBlock(\n", "            [\n", "                ExprAssign(\n", "                    self.ret_reg,\n", "                    ExprOp(\n", "                        'call_func_ret',\n", "                        ad,\n", "                        self.sp,\n", "                        self.arch.regs.RCX,\n", "                        self.arch.regs.RDX,\n", "                        self.arch.regs.R8,\n", "                        self.arch.regs.R9,\n", "                    )\n", "                ),\n", "                ExprAssign(self.sp, ExprOp('call_func_stack', ad, self.sp)),\n", "            ],\n", "            instr\n", "        )\n", "        return [call_assignblk], []\n", "\n", "```\n", "\n", "This is the generic code used in `x86_64` to model function calls. But you can model functions more precisely. For example, suppose you are analysing code on `x86_32` with the `stdcall` convention. 
Suppose you know the callee cleans its stack arguments. Suppose as well that you know, for each function, how many arguments it has. You can then customize the model to match the callee and compute the correct stack modification, as well as getting the arguments from the stack:\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "html_table\n", "\n", "\n", "\n", "0\n", "\n", "\n", "main\n", "EBX = 0x1234\n", "ESP = ESP + -0x4\n", "@32[ESP + -0x4] = 0x3\n", "ESP = ESP + -0x4\n", "@32[ESP + -0x4] = 0x2\n", "ESP = ESP + -0x4\n", "@32[ESP + -0x4] = 0x1\n", "EAX = call_func_ret(loc_11223344, @32[ESP + 0x0], @32[ESP + 0x4], @32[ESP + 0x8])\n", "ESP = ESP + 0xC\n", "IRDst = loc_10\n", "\n", "\n", "\n", "2\n", "\n", "\n", "loc_10\n", "ECX = EAX\n", "ESP = ESP[0:32] + 0x4\n", "EIP = @32[ESP[0:32]]\n", "IRDst = @32[ESP[0:32]]\n", "\n", "\n", "\n", "0->2\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Construct a custom lifter\n", "class LifterFixCallStack(LifterModelCall_x86_32):\n", "    def call_effects(self, addr, instr):\n", "        if addr.is_loc():\n", "            if self.loc_db.get_location_offset(addr.loc_key) == 0x11223344:\n", "                # Suppose the function at 0x11223344 has 3 arguments\n", "                args_count = 3\n", "            else:\n", "                # It's a function we didn't analyze\n", "                raise RuntimeError(\"Unknown function parameters\")\n", "        else:\n", "            # It's a dynamic call !\n", "            raise RuntimeError(\"Dynamic destination ?\")\n", "        # Arguments are taken from stack\n", "        args = []\n", "        for i in range(args_count):\n", "            args.append(ExprMem(self.sp + ExprInt(i * 4, 32), 32))\n", "        # Generate the model\n", "        call_assignblk = AssignBlock(\n", "            [\n", "                ExprAssign(self.ret_reg, ExprOp('call_func_ret', addr, *args)),\n", "                ExprAssign(self.sp, self.sp + ExprInt(args_count * 4, self.sp.size))\n", "            ],\n",
"            instr\n", "        )\n", "        return [call_assignblk], []\n", "\n", "graph_ir_x86(\"\"\"\n", "main:\n", " MOV EBX, 0x1234\n", " PUSH 3\n", " PUSH 2\n", " PUSH 1\n", " CALL 0x11223344\n", " MOV ECX, EAX\n", " RET\n", "\"\"\", lifter_custom=LifterFixCallStack)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the new graph, it's now easy to see that `EAX` depends on a custom operator `call_func_ret` with arguments:\n", "- `loc_11223344`\n", "- `@32[ESP + 0x0]`\n", "- `@32[ESP + 0x4]`\n", "- `@32[ESP + 0x8]`\n", "\n", "The stack pointer is updated: it is increased by 0xC bytes, which corresponds to the size of the arguments (we didn't model the extra 4 bytes pushed on the stack for the return address, so there is no need to take them into account in our arbitrary model)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.0" } }, "nbformat": 4, "nbformat_minor": 4 }