Investigate malicious PDF documents with mpeepdf — A quick user guide

Dec 22, 2020

mpeepdf is a modified version of peepdf, a powerful Python tool for analyzing PDF documents. The ultimate goal of mpeepdf is to provide a single, all-you-need framework for security researchers and analysts who investigate PDF files.

When analyzing a PDF document, there are multiple options in the toolkit, such as pdf-id, pdf-parser, pdfwalker (to view PDF structures via a GUI), and especially peepdf, which is a great all-in-one tool for PDF analysis. Thumbs up to its author, Jose Miguel Esparza. However, there were still some features I wanted to add to peepdf, so I spent most of my free time over several months extending it. The result is mpeepdf (a modified version of peepdf).

This article serves as a quick user guide to analyzing PDFs with mpeepdf. For all commands available in peepdf, it is best to refer to the original wiki by Jose.

All options

-h, --help: Shows this help message and exits.
-i, --interactive: Sets console mode.
-s SCRIPTFILE, --load-script=SCRIPTFILE: Loads the commands stored in the specified file and executes them.
-c, --check-vt: Checks the hash of the PDF file on VirusTotal.
-f, --force-mode: Sets force parsing mode to ignore errors.
-l, --loose-mode: Sets loose parsing mode to catch malformed objects.
-m, --manual-analysis: Avoids automatic Javascript analysis. Useful with eternal loops like heap spraying.
-u, --update: Updates peepdf with the latest files from the repository.
-g, --grinch-mode: Avoids colorized output in the interactive console.
-v, --version: Shows program’s version number.
-x XMLPATH, --xml=XMLPATH: Exports the document information in XML format.
-j JSONPATH, --json=JSONPATH: Exports the document information in JSON format.
-w HTMLPATH, --html=HTMLPATH: Exports the document information in HTML format.
-C COMMANDS, --command=COMMANDS: Specifies a command from the interactive console to be executed.

Basic usage

mpeepdf.py [option] PDFFile

If you see a warning that PyV8 or pylibemu is not installed, mpeepdf cannot analyze JavaScript code or shellcode, respectively.
Since malicious PDFs often do not strictly follow the standard format, it is highly recommended to always run mpeepdf with “-fl” (force mode plus loose mode), which ignores parsing errors and catches malformed objects.
mpeepdf may take from a few seconds to several minutes to parse a PDF, depending on its size. If it takes too long (possibly because of the automatic JavaScript analysis), add the “-m” option to disable automatic JS analysis.

Example: mpeepdf.py -fl test.pdf
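The advice above can be wrapped in a small Python helper. This is only a minimal sketch and not part of mpeepdf: it assumes mpeepdf.py sits in the current directory and is started with an interpreter called python, and the names build_command and run_with_fallback as well as the 120-second timeout are made up for illustration. It runs the tool with -fl first and retries with -m if the first run times out.

```python
import subprocess


def build_command(pdf_path, manual=False, tool="mpeepdf.py", interpreter="python"):
    """Build the argument list for one mpeepdf run (tool/interpreter are assumptions)."""
    command = [interpreter, tool, "-fl"]
    if manual:
        # -m disables the automatic Javascript analysis.
        command.append("-m")
    command.append(pdf_path)
    return command


def run_with_fallback(pdf_path, timeout=120):
    """Run with -fl; if that exceeds the timeout, retry once with -m added."""
    options = dict(capture_output=True, text=True, errors="replace")
    try:
        result = subprocess.run(build_command(pdf_path), timeout=timeout, **options)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising TimeoutExpired.
        result = subprocess.run(build_command(pdf_path, manual=True), **options)
    return result.stdout
```

Calling run_with_fallback("test.pdf") would return the text that mpeepdf prints, assuming the tool is available at the path given above.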

mpeepdf’s output provides basic metadata as well as advanced analysis information such as large streams, suspicious elements, a maliciousness score, JavaScript code, and shellcode (if found).

Export analysis report

Notably, mpeepdf can export its output in multiple formats: JSON (-j), XML (-x), and HTML (-w).

Example: mpeepdf.py -fl -w test.html test.pdf
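When reports are produced from a script, the format-to-flag mapping can be kept in one place. The sketch below is illustrative only: export_command and top_level_keys are hypothetical helper names, the tool location and interpreter name are assumptions, and nothing is assumed about the layout of the JSON report beyond it being valid JSON.

```python
import json

# Export flags as listed in the options above.
EXPORT_FLAGS = {"json": "-j", "xml": "-x", "html": "-w"}


def export_command(pdf_path, fmt, out_path, tool="mpeepdf.py", interpreter="python"):
    """Build the command line that exports a report in the requested format."""
    fmt = fmt.lower()
    if fmt not in EXPORT_FLAGS:
        raise ValueError("unsupported export format: %s" % fmt)
    return [interpreter, tool, "-fl", EXPORT_FLAGS[fmt], out_path, pdf_path]


def top_level_keys(json_report_path):
    """List the top-level keys of an exported JSON report without assuming its schema."""
    with open(json_report_path, encoding="utf-8") as handle:
        report = json.load(handle)
    return sorted(report) if isinstance(report, dict) else []
```

Passing the list returned by export_command to subprocess.run would perform the export; top_level_keys is a cautious first step for exploring a JSON report whose structure you have not seen yet.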

Investigate PDF in interactive mode

An analyst normally wants to interact with the parsed PDF to investigate further (option -i); hence, mpeepdf is often run with “-fli”. This section shows the commands commonly used when analyzing a malicious PDF. The exhaustive list can be found in All Commands.

Example: mpeepdf.py -fli test.pdf

The command above opens an interactive shell where an analyst can examine all PDF contents, metadata, and analysis information.

info command: shows the same output as when you first run mpeepdf, highlighting all interesting elements that may require further investigation. It is also useful for bringing back the main info after wandering around.

tree command: shows the relationships between objects/streams. For example, if a parent object contains /AA and a child contains /JS, there is a high chance that JavaScript will be executed once the PDF is opened.
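The parent/child reasoning above can be expressed as a small graph walk. This sketch does not read mpeepdf's output: the dictionary layout (object id mapped to its keys and child ids) is a hypothetical stand-in for the tree, and /OpenAction and /JavaScript are added here as further well-known trigger and script names that the text above does not mention.

```python
# Trigger and script names; /OpenAction and /JavaScript are additions for illustration.
TRIGGERS = {"/AA", "/OpenAction"}
SCRIPT_KEYS = {"/JS", "/JavaScript"}


def descendants(tree, start):
    """Return every object id reachable from start, excluding start itself."""
    seen, stack = set(), list(tree.get(start, {}).get("children", []))
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(tree.get(node, {}).get("children", []))
    seen.discard(start)
    return seen


def triggered_scripts(tree):
    """List (trigger object, script object) pairs where a trigger leads to a script."""
    findings = []
    for obj_id in sorted(tree):
        if not TRIGGERS & tree[obj_id].get("keys", set()):
            continue
        for child in sorted(descendants(tree, obj_id)):
            if SCRIPT_KEYS & tree.get(child, {}).get("keys", set()):
                findings.append((obj_id, child))
    return findings


# Hypothetical tree: object 1 carries /AA and leads, via object 3, to /JS in object 4.
example_tree = {
    1: {"keys": {"/Catalog", "/AA"}, "children": [2, 3]},
    2: {"keys": {"/Pages"}, "children": []},
    3: {"keys": {"/S"}, "children": [4]},
    4: {"keys": {"/JS"}, "children": []},
}
print(triggered_scripts(example_tree))
```

Each pair in the result is a candidate for a closer look with the object and js_code commands.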

offsets command: shows where each object is physically located in the PDF.

object command: once you have identified interesting objects to analyze, “object” is the most used command for reviewing the decoded content of an object.

rawobject command: same as “object”, but shows the raw data.

stream command: similar to “object”, it shows the decoded content of a stream.

Figure 9: stream command output

rawstream command: shows the raw content of a stream.

extract command: extracts the URLs and JavaScript found in a PDF. It can also extract URLs found in shellcode and the analyzed JavaScript code (which may contain multiple stages).

set and show commands: assign a value to a variable and show it.

> and >>: write or append a value/object/stream to a file.

$> and $>>: write or append a value/object/stream to a variable.
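The same console commands can also be run unattended through the -s option from the options list. The sketch below is an assumption-laden illustration: it assumes the script file holds one console command per line, that -s can be combined with -fl as shown, and that the tool is started as python mpeepdf.py; write_script and script_command are hypothetical helper names.

```python
from pathlib import Path


def write_script(commands, script_path):
    """Write console commands to a script file, one per line (assumed format)."""
    Path(script_path).write_text("\n".join(commands) + "\n", encoding="utf-8")
    return str(script_path)


def script_command(pdf_path, script_path, tool="mpeepdf.py", interpreter="python"):
    """Build the command line that replays a script file against a PDF."""
    return [interpreter, tool, "-fl", "-s", script_path, pdf_path]


# A first-pass triage using only commands described in this guide.
TRIAGE_COMMANDS = ["info", "tree", "offsets"]
```

For example, write_script(TRIAGE_COMMANDS, "triage.txt") followed by running script_command("test.pdf", "triage.txt") with subprocess.run would replay the three commands without opening the console.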

xor_search_pe command: looks for a PE file based on its magic headers (MZ…PE). If a key is provided, it XORs the content with that key and then looks for a PE file. Otherwise, it brute-forces all possible keys (255, since this is a 1-byte XOR); however, this may take a long time depending on the size of the content. Note that this can produce false positives, since it is just a plain text search.
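The idea behind a single-byte XOR search can be sketched in a few lines of Python. This is not mpeepdf's implementation: instead of a plain text search, this variant follows the e_lfanew field of the MZ header to check for the PE signature, which is somewhat stricter, and the function names find_pe, xor_bytes, and xor_search are chosen here for illustration.

```python
import struct


def find_pe(data):
    """Return the offset of the first plausible PE file in data, or -1."""
    start = data.find(b"MZ")
    while start != -1:
        # e_lfanew, the offset of the PE signature, is stored at MZ + 0x3C.
        if start + 0x40 <= len(data):
            (e_lfanew,) = struct.unpack_from("<I", data, start + 0x3C)
            signature_at = start + e_lfanew
            if data[signature_at:signature_at + 4] == b"PE\x00\x00":
                return start
        start = data.find(b"MZ", start + 1)
    return -1


def xor_bytes(data, key):
    """XOR every byte of data with a one-byte key using a translation table."""
    return data.translate(bytes(value ^ key for value in range(256)))


def xor_search(data, key=None):
    """Return (key, offset) pairs; without a key, try all 255 non-zero keys."""
    keys = [key] if key is not None else range(1, 256)
    hits = []
    for candidate in keys:
        offset = find_pe(xor_bytes(data, candidate))
        if offset != -1:
            hits.append((candidate, offset))
    return hits
```

Key 0 is skipped in the brute-force loop because XOR with 0 leaves the data unchanged; calling find_pe directly covers the unencoded case.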

JavaScript analysis

js_code command: shows the JavaScript code found. In the figure below, object 1 contains JavaScript; hence, the command js_code 1 displays the embedded JS.

js_eval command: executes the JavaScript code stored in the specified variable, file, or object.

js_beautiful command: beautifies obfuscated JavaScript code. It is especially useful for JavaScript from external sources (js_code also uses js_beautiful internally).

js_unescape command: unescapes escaped bytes.
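To make "unescaping" concrete, here is a simplified Python sketch of what JavaScript's unescape() does to %uXXXX and %XX sequences, with the result laid out as UTF-16 little-endian bytes, which is how such strings end up in memory (and why %u9090 is a common way to write two NOP bytes). It is an illustration under those assumptions, not the code mpeepdf uses, and unescape_to_bytes is a hypothetical name.

```python
import re
import struct

# %uXXXX (one 16-bit code unit) or %XX (one character below 0x100).
_ESCAPE = re.compile(r"%u([0-9a-fA-F]{4})|%([0-9a-fA-F]{2})")


def unescape_to_bytes(escaped):
    """Decode %uXXXX and %XX escapes into UTF-16LE bytes; other text is kept."""
    out = bytearray()
    pos = 0
    while pos < len(escaped):
        match = _ESCAPE.match(escaped, pos)
        if match is None:
            # Literal character: store it as UTF-16LE like the escaped ones.
            out += escaped[pos].encode("utf-16-le", errors="surrogatepass")
            pos += 1
            continue
        value = int(match.group(1) or match.group(2), 16)
        out += struct.pack("<H", value)  # low byte first
        pos = match.end()
    return bytes(out)
```

Because each code unit is stored low byte first, %u1234 becomes the bytes 34 12; this byte swap is easy to miss when reading escaped shellcode by eye.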

js_analyse command: uses PyV8, a wrapper around the V8 JavaScript interpreter. It overrides the “eval” method so that it only prints the decoded script it is given (the next-stage script); “js_analyse” then repeatedly evaluates each next-stage script until no further stage is found, and prints all the JavaScript stages as well as any unescaped bytes and URLs found. However, this function does not use data from other parts of the PDF to enrich its evaluation.

js_unpack command: adopts JSUnpack’s approach. It uses the V8 JavaScript interpreter directly, since V8 provides some benefits over PyV8. In principle, it is similar to “js_analyse”; however, it enriches the analyzed code with the PDF’s metadata, annotations (to provide return values for the getAnnot() and getAnnots() functions), and XML fields (if the script in the analyzed object is contained in XML; /XFA and /AcroForm may contain JavaScript in embedded XML). Furthermore, it evaluates all JavaScript code under multiple app.viewerVersion values, because some scripts only work in a certain Adobe version. Currently, mpeepdf evaluates three main versions: 9.0, 10.0, and 11.0; these are configurable. The output of this command is the stages of JavaScript code, unescaped bytes, and URLs.
Note: This command is run on every piece of JavaScript code found in a PDF (same as “js_analyse”). It is recommended to use the -w (HTML), -j (JSON), or -x (XML) option, which exports the results in the specified format.
Example: js_unpack object 1 analyzes the JavaScript in object 1; the result looks like the figures below. (For the sake of brevity, results are shortened.)


Written by Tho Le