PDF Forensics: Introduction (Part 1)

Tho Le
6 min read · Apr 26, 2019


This article provides an overview of PDF forensics, covering (1) PDF structure, (2) PDF syntax, (3) some notable suspicious objects, (4) an overview of JavaScript analysis and (5) a tool that assists PDF investigation (mpeepdf).

PDF Forensics: Javascript analysis (part 2)

PDF Structure

To perform PDF forensics, it is essential to understand the structure of a PDF file. Luckily, it is fairly simple. I find the figure below from zbetcheckin quite representative of everything we need to know. A PDF contains 4 parts:

  • Header: the %PDF marker (e.g. %PDF-1.1 for PDF version 1.1), which must appear within the first 1024 bytes.
  • Body: contains the main content of a PDF, which is actually just a list of objects (PDF objects are covered in more detail in the PDF Syntax part).
  • Xref (cross-reference) table: contains the offset addresses of the objects in the Body and their status (“n” stands for in use, i.e. not free, and “f” stands for free). In the figure below, the first row “0 5” indicates that the table describes 5 objects, starting from object 0. The 5 rows that follow therefore contain the offset addresses and statuses of objects 0, 1, 2, 3 and 4 respectively.
  • Trailer: contains three important pieces of information: (1) the root object (/Root), which is the starting point of the PDF's logical view (for further information, I suggest checking out this article from Didier Stevens); (2) the offset address of the Xref table (the startxref value), so that a PDF viewer can locate objects; (3) a reference to other metadata such as author and creation date (the /Info dictionary).
Figure 1: PDF Structure (Source: https://github.com/zbetcheckin/PDF_analysis/blob/master/pdf_ange_albertini.png)
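To make the four parts concrete, here is a minimal Python sketch that reads them back from a file held in memory. It is an illustration under stated assumptions rather than a robust parser: it handles only a classic, uncompressed Xref table with a single subsection, and the function names (build_sample_pdf, parse_classic_xref) are my own. The sample file is assembled in code so that its offsets are known to be correct.

```python
import re

ENTRY_RE = re.compile(rb"(\d{10}) (\d{5}) ([nf])")

def build_sample_pdf() -> bytes:
    """Assemble a tiny PDF in memory so the xref offsets are correct."""
    objects = [
        b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
        b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n",
    ]
    out = bytearray(b"%PDF-1.1\n")              # Header
    offsets = []
    for obj in objects:                         # Body
        offsets.append(len(out))
        out += obj
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)  # Xref table
    out += b"0000000000 65535 f \n"             # object 0 is the free entry
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)

def parse_classic_xref(data: bytes) -> dict:
    """Return version, xref offset, xref entries and /Root reference."""
    version = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    if version is None:
        raise ValueError("no %PDF- header in the first 1024 bytes")
    # The last startxref in the file points at the most recent xref table.
    starts = re.findall(rb"startxref\s+(\d+)", data)
    if not starts:
        raise ValueError("no startxref found")
    xref_pos = int(starts[-1])
    table = data[xref_pos:]
    head = re.match(rb"xref\s+(\d+)\s+(\d+)\s+", table)
    if head is None:
        raise ValueError("offset does not point at a classic xref table")
    first, count = int(head.group(1)), int(head.group(2))
    rows = ENTRY_RE.findall(table, head.end())[:count]
    if len(rows) < count:
        raise ValueError("xref table is shorter than announced")
    entries = {
        first + i: (int(off), int(gen), status.decode())
        for i, (off, gen, status) in enumerate(rows)
    }
    root = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", table)
    return {
        "version": version.group(1).decode(),
        "xref_offset": xref_pos,
        "entries": entries,
        "root": (int(root.group(1)), int(root.group(2))) if root else None,
    }

if __name__ == "__main__":
    info = parse_classic_xref(build_sample_pdf())
    for number, (offset, generation, status) in info["entries"].items():
        print(number, offset, generation, status)
    print("root object:", info["root"])
```

Real-world files may use several xref subsections, incremental updates or cross-reference streams, none of which this sketch covers.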

PDF Syntax

Adobe's PDF reference runs to about 1,000 pages, so creating a PDF from scratch is fairly complex. Fortunately, we don’t need to know all of those details to perform malware analysis on PDFs. The figure below illustrates most (if not all :) ) of what we need to know.

PDF content is actually just a list of objects which are linked together to build a logical tree. An object starts with its object number, its generation (revision) number and the keyword obj (e.g. 4 0 obj) and ends with endobj. Because it can be referenced by other objects, it is also called an indirect object. In the figure below, for instance, object 4 refers to object 28 (28 0 R).
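As a rough illustration of how objects link together, the following Python sketch lists each indirect object and the references (N G R) found inside it. The sample bytes and the function name object_references are hypothetical. The regular expressions are a simplification: they do not understand strings or binary stream data, so they can produce false matches on real files.

```python
import re

OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")

def object_references(data: bytes) -> dict:
    """Map (object number, generation) to the list of objects it references."""
    graph = {}
    for match in OBJ_RE.finditer(data):
        obj_id = (int(match.group(1)), int(match.group(2)))
        body = match.group(3)
        graph[obj_id] = [(int(n), int(g)) for n, g in REF_RE.findall(body)]
    return graph

# Hypothetical sample: object 4 points at object 28, as in the text above.
sample = (
    b"4 0 obj\n<< /Type /Action /S /JavaScript /JS 28 0 R >>\nendobj\n"
    b"28 0 obj\n(app.alert(1))\nendobj\n"
)

if __name__ == "__main__":
    for obj_id, refs in object_references(sample).items():
        print(obj_id, "->", refs)
```

Following these references from the root object is how a viewer, or an analyst, walks the logical tree.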

Seven basic object types matter here (the specification also defines a null object); examples are shown in Figure 2:

  1. Numeric object: a number, such as the length value 10243 in Figure 2.
  2. Boolean object: either true or false.
  3. String object: a string, written in one of two forms: (1) literal text in parentheses, such as (some text here), or (2) a hex string in angle brackets, such as <abcdfe78>.
  4. Array object: a sequence of objects enclosed in square brackets [].
  5. Dictionary object: a set of key-value pairs inside << >>. Figure 2 shows two pairs: key /Filter with value /FlateDecode, and key /Length with value 10243.
  6. Name object: a unique symbol introduced by a slash (/) with no internal structure, for example /somename or /#61#62cd. A # followed by two hexadecimal digits encodes one character, so the second example is the name /abcd.
  7. Stream object: a sequence of bytes that can contain other objects, files, images, etc. A stream can also be encoded by one or more filters. Figure 2 shows object 1 as a stream with the format:
    << (stream dictionary)
    /Filter /FlateDecode (encoded with FlateDecode)
    /Length 10243 (length of this stream is 10243 bytes)
    >>
    stream
    some bytes
    endstream
Figure 2: PDF Syntax
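Since FlateDecode is zlib/deflate compression, a stream like the one described above can be decoded with Python's standard zlib module. The sketch below is a simplified illustration: it assumes /Length is a direct integer (not an indirect reference), that FlateDecode is the only filter, and that the dictionary contains no nested dictionaries. The function name decode_streams and the sample are my own.

```python
import re
import zlib

STREAM_RE = re.compile(rb"<<(.*?)>>\s*stream\r?\n", re.DOTALL)
# A direct integer length; the lookahead rejects "/Length 12 0 R".
LENGTH_RE = re.compile(rb"/Length\s+(\d+)\b(?!\s+\d+\s+R)")

def decode_streams(data: bytes) -> list:
    """Return the content of each stream, inflated when /FlateDecode is set."""
    decoded = []
    for match in STREAM_RE.finditer(data):
        dictionary = match.group(1)
        length = LENGTH_RE.search(dictionary)
        if length is None:
            continue  # indirect or missing /Length is out of scope here
        start = match.end()
        raw = data[start:start + int(length.group(1))]
        if b"/FlateDecode" in dictionary:
            try:
                raw = zlib.decompress(raw)
            except zlib.error:
                pass  # keep the raw bytes if the data is damaged
        decoded.append(raw)
    return decoded

# Hypothetical sample: a small script compressed into a stream object.
payload = b"app.alert('hello');"
compressed = zlib.compress(payload)
sample = (
    b"1 0 obj\n<< /Filter /FlateDecode /Length %d >>\nstream\n" % len(compressed)
    + compressed
    + b"\nendstream\nendobj\n"
)

if __name__ == "__main__":
    print(decode_streams(sample))
```

Malicious files often chain several filters or lie about /Length, which is where dedicated tools earn their keep.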

Some Notable Suspicious Objects

Below are some suspicious objects whose presence is often a strong indicator that a PDF deserves further analysis.

  • Embedded JavaScript: /JS, /JavaScript, /AcroForm and /XFA (the last two can contain JavaScript as well as other data used by JavaScript, in XML)
  • Embedded Flash: /RichMedia (Flash supports ActionScript)
  • Launching: /AA, /Launch and /OpenAction (perform a defined action automatically, for example upon opening the PDF)
  • Internet access: /URI and /SubmitForm
  • Embedded file: /EmbeddedFiles
  • Others: /GoTo etc.
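A first triage pass can simply count these names in the raw file. The Python sketch below does that, and also decodes #xx escapes first, because a name such as /J#61vaScript is the same name as /JavaScript. It is a rough illustration: the keyword list follows the bullets above using the PDF specification's spelling (/AcroForm, /GoTo), the function names are my own, and names hidden inside compressed streams are not seen.

```python
import re
from collections import Counter

SUSPICIOUS = {
    "/JS", "/JavaScript", "/AcroForm", "/XFA", "/RichMedia",
    "/AA", "/Launch", "/OpenAction", "/URI", "/SubmitForm",
    "/EmbeddedFiles", "/GoTo",
}

# A name runs from "/" up to the next whitespace or delimiter character.
NAME_RE = re.compile(rb"/[^\x00\t\n\x0c\r ()<>\[\]{}/%]+")
HEX_ESCAPE_RE = re.compile(rb"#([0-9A-Fa-f]{2})")

def decode_name(raw: bytes) -> str:
    """Replace #xx escapes, so /J#61vaScript becomes /JavaScript."""
    plain = HEX_ESCAPE_RE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)
    return plain.decode("latin-1")

def count_suspicious_names(data: bytes) -> Counter:
    """Count occurrences of the suspicious names in raw PDF bytes."""
    counts = Counter()
    for match in NAME_RE.finditer(data):
        name = decode_name(match.group(0))
        if name in SUSPICIOUS:
            counts[name] += 1
    return counts

# Hypothetical sample with one obfuscated name.
sample = (
    b"1 0 obj\n<< /OpenAction 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /S /J#61vaScript /JS (app.alert(1)) >>\nendobj\n"
)

if __name__ == "__main__":
    for name, count in sorted(count_suspicious_names(sample).items()):
        print(name, count)
```

A non-zero count is only a reason to look closer, not a verdict: plenty of benign PDFs use /OpenAction or /URI.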

Overview of Javascript Analysis

JavaScript is a common attack vector. Furthermore, it can be highly obfuscated, which makes it quite “interesting” to analyze. It is therefore worth spending some time discussing how to perform JavaScript analysis.

Figure 3: An example of Javascript embedded in PDF (source: https://isc.sans.edu/forums/diary/Advanced+obfuscated+JavaScript+analysis/4246/)

In general, there are two approaches:

  • Go through and understand the script (discussed further in Part 2). This approach is quite time-consuming, but it is usually helpful when dealing with a sophisticated and/or new way of obfuscating JavaScript.
  • Focus only on interesting functions that are commonly used to obfuscate JavaScript. This approach may not apply to every case, but it is quick and effective, so it is usually worth a try. Details of this method are covered in the text below.

These are some functions that I often take a closer look at for the second approach:

  • “replace” function: malware writers may use it to turn a garbage string into something meaningful. For example, "evailazkb".replace(/[azkb]/g, "") evaluates to the string "evil".
  • “unescape” function: malware writers may retrieve their final payload (e.g. shellcode) by unescaping an escaped string. For example, unescape("eval%20%28some%20evil%20codes%29") evaluates to the string "eval (some evil codes)".
  • “eval” function: runs a string as code, which opens many opportunities for JavaScript obfuscation.
    Figure 4 illustrates how obfuscated JavaScript is executed via eval(). A highly obfuscated script first constructs a string of JavaScript code, then calls eval() to execute that newly generated string. This can be repeated multiple times until the real malicious code is reached.
Figure 4: Obfuscated Javascript via Eval()
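When working offline on an extracted script, the replace and unescape tricks can be undone without running any JavaScript. The Python sketch below mirrors the two examples from the list above; js_unescape and strip_chars are my own helper names, and js_unescape only approximates JavaScript's unescape() by decoding %XX and %uXXXX sequences.

```python
import re

ESCAPE_RE = re.compile(r"%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})")

def js_unescape(text: str) -> str:
    """Approximate JavaScript unescape(): decode %uXXXX and %XX sequences."""
    def decode(match):
        return chr(int(match.group(1) or match.group(2), 16))
    return ESCAPE_RE.sub(decode, text)

def strip_chars(text: str, junk: str) -> str:
    """Python equivalent of text.replace(/[junk]/g, "") in JavaScript."""
    return re.sub("[" + re.escape(junk) + "]", "", text)

if __name__ == "__main__":
    print(strip_chars("evailazkb", "azkb"))
    print(js_unescape("eval%20%28some%20evil%20codes%29"))
    # %uXXXX units are commonly used to carry shellcode as 16-bit values;
    # encoding as UTF-16 little-endian recovers the underlying bytes.
    print(js_unescape("%u9090%u4141").encode("utf-16-le"))
```

The last line shows why %u sequences matter for shellcode hunting: each unit unpacks to two raw bytes in little-endian order.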

Figure 5 illustrates how to defeat this obfuscation technique: we simply overwrite the eval() function. In this example, I make eval() print out its code prior to execution. This can be repeated until the final code is reached, where one can look for shellcode or URLs pointing to malicious payloads (preferably via automation; mpeepdf has this feature).

Figure 5: An approach to defeat obfuscated Javascript via eval()
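The hook idea can be sketched as follows. This is a Python analogue using Python's own eval(), not the JavaScript from Figure 5; in practice the same override is written in JavaScript inside the engine that runs the extracted script. The function name deobfuscate and the layered sample string are hypothetical, and the sample is deliberately harmless arithmetic.

```python
def deobfuscate(script: str) -> list:
    """Run a layered script with eval() replaced by a version that records
    every code string before executing it, and return the recorded layers."""
    layers = []
    namespace = {}

    def logging_eval(code):
        layers.append(code)            # capture the code prior to execution
        return eval(code, namespace)   # then hand it to the real eval()

    # Inside the analysed script, the name "eval" now resolves to the hook.
    namespace["eval"] = logging_eval
    eval(script, namespace)
    return layers

# Hypothetical two-layer sample: each layer only builds and evals the next.
inner = "6 * 7"
layer2 = "eval(%r)" % inner
layer1 = "eval(%r)" % layer2

if __name__ == "__main__":
    for depth, code in enumerate(deobfuscate(layer1), start=1):
        print("layer", depth, ":", code)
```

Each recorded layer is one step closer to the final code, which is the string worth inspecting for shellcode or URLs.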

PDF Investigation with mpeepdf

Fortunately, there are quite a number of good tools for PDF analysis. Please refer here and try a few to find your favorites.

In my toolkit, I often use pdfid, pdf-parser, pdfwalker (to view PDF structures via a GUI) and especially peepdf, which is a great all-in-one tool for analyzing PDFs. Thumbs up to its author, Jose Miguel Esparza.

However, there were still some features that I wanted to add to peepdf, so I spent most of my free time over several months enhancing it with more capabilities. The result is mpeepdf (a modified version of peepdf). I have created a quick user guide for analyzing PDFs with mpeepdf. For the full list of commands, it is best to refer to the original wiki from Jose.

Figure 6 briefly presents the features with which mpeepdf can assist you when performing PDF analysis.

Figure 6: Summary of mpeepdf functions (red marks original peepdf features, green marks newly added features)

Summary

This article has hopefully introduced the fundamentals of PDF analysis. It covers PDF structure and syntax, along with some notable suspicious elements that almost always warrant further investigation. It then touches on JavaScript obfuscation and deobfuscation techniques. Finally, it presents the author's favorite tools for PDF analysis, peepdf and mpeepdf.

Written by Tho Le

Senior Cyber Security Analyst — be better than the yesterday self
