In the previous article, I provided an introduction to PDF forensics, which explained the PDF structure and introduced some important object types to examine.
JavaScript (JS) is a common attack vector. Furthermore, it can be highly obfuscated, which makes it quite "interesting" to analyze. Hence, this article presents more details of JS analysis.
In general, there are two approaches:
- Approach 1 (code analysis): go through and understand the script. This approach is quite time-consuming, but it is usually helpful for dealing with sophisticated and/or new JavaScript obfuscation techniques.
- Approach 2 (behaviour analysis): focus only on interesting functions that are commonly used to obfuscate JavaScript. This approach may not be applicable in all cases, but it is quick and effective, so it is usually worth a try.
Approach 2 was covered in the previous article. Hence, this article goes through the code analysis approach with some tips and tricks, and then applies them to a malicious sample.
Approach 1 — Code analysis
At this point, it is assumed that you have extracted the malicious JavaScript from a PDF file (I will cover this step in the example analysis below; for more advanced details, refer to the quick guide of mpeepdf). Below are some tips for analyzing JavaScript code:
- Beautify the code: JavaScript is often obfuscated to make it hard to read, with unnecessary spaces, comments, messy indentation, etc. Hence, beautifying the code makes it easier to derive the code logic. There are multiple tools for this purpose, such as (1) js-beautify, (2) JSTool, and (3) the JSTool plugins for Visual Studio (VS) Code and Notepad++. Below is an example of raw extracted JS code vs. code beautified via the JSTool plugin in VS Code.
- Comment removal: malware writers try to overwhelm analysts by adding numerous comments to the JS code. Those comments are not part of the executed code; hence, they can be safely ignored or removed.
- String replacement: malware writers commonly play around with strings to confuse analysts. As seen in the figure below, 'fjrQoWmQCIhnajrICWondIeQ'.replace(/[QWIjn]/g,'') is an obfuscated expression of fromCharCode: removing the characters Q, W, I, j, and n leaves exactly that name.
- Variable/function rename: malware writers like to make your head spin with crazy variable/function names, such as jcoknrvrjwvktuyoedov. Hence, it is really helpful to rename variables/functions to something that reflects their purpose, such as jcoknrvrjwvktuyoedov → new_file.
- Always true/false condition: malware writers also commonly trick us with conditions that are always true or always false. For example, the expression 2e1>10e1?123:"if" always returns "if", as 2e1 (20) is smaller than 10e1 (100).
- Execute JS code: if a chunk of JS code is highly obfuscated and would take a long time to analyze, a quick way to defeat the obfuscation is to execute it in a JS interpreter. There are three main options, namely SpiderMonkey (which is used by Adobe), V8 (by Google), and Node.js.
- Debugging: I often use VS Code to debug JS code. However, I am sure there are other ways as well. Figure 5 illustrates the debugging screen, in which you can track variables, inspect the call stack, watch variables, and add a breakpoint to an interesting line.
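Two of the tips above, string replacement and always-false conditions, are easy to verify by evaluating the suspicious expressions in isolation. The snippet below is a minimal sketch rather than code from the sample: the variable names `hiddenName`, `decodeChar`, and `keyword` are made up for illustration, and the property lookup `String[hiddenName]` is only an assumption about how such a recovered name is typically used.

```javascript
// String replacement: stripping the junk characters Q, W, I, j and n
// reveals the real method name.
const hiddenName = 'fjrQoWmQCIhnajrICWondIeQ'.replace(/[QWIjn]/g, '');
console.log(hiddenName); // fromCharCode

// Assumed usage: the recovered name serves as a property lookup,
// so String[hiddenName] is simply String.fromCharCode.
const decodeChar = String[hiddenName];
console.log(decodeChar(80, 68, 70)); // PDF

// Always-false condition: 2e1 is 20 and 10e1 is 100, so the comparison
// is false and the expression always yields "if".
const keyword = 2e1 > 10e1 ? 123 : 'if';
console.log(keyword); // if
```

Evaluating small expressions like these one at a time is usually safe, because they only build strings; the risky part is whatever the script later does with them.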
Sample Analysis
At this point, we should have enough theory to analyze a malicious PDF. Let's analyze a sample found on VirusTotal with SHA256: 8860ee6c2772e66a5024a719f05d23bb597ad24832ff301c050dd37e4914e3
Let's examine the sample with the mpeepdf tool. The tool output indicates a maliciousness score of 7.0/10, and the amber lines explain why this file is considered highly likely to be malicious.
In particular, object 7 is interesting since it is a stream that contains encoded JS code. The figure below shows the JS code, which can be extracted with the command: stream 7 > obj7.js
Open obj7.js in VS Code and beautify the code. For the sake of clarity, I stripped out the long array of decimal numbers.
P/S: VS Code highlighted 2 syntax errors, which I corrected so that the whole script can be run or debugged if required.
- jR=new Date(),var h=false → jR=new Date();var h=false
- sK=function(){},c="c", → sK=function(){},c="c"
A quick scan of the code reveals some notable function calls, such as replace(), substring(), and charCodeAt(). Also, all variable and function names are obfuscated. Hence, I proceeded to rename them to something more meaningful.
Figure 8 presents the JS code after processing. This is quite a simple script that reads each decimal number in number_array and decodes it to a character. decoded_string is then passed to the eval() function for execution. At this point, there are a few ways to obtain decoded_string:
- Method overwrite: overwrite the eval() method so that it only prints its argument and exits.
- Replace eval() with print() in the code (for convenience, I used this method).
- Debug and set a breakpoint at that line to examine the content of decoded_string.
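The method-overwrite option can be sketched as follows. This is an illustration under stated assumptions, not the sample itself: `stage1` is a made-up stand-in for the cleaned-up first stage (the real number array is far longer and decodes to the exploit code), and `captureEval` is a hypothetical helper name. The idea is to run the stage with `eval` shadowed by a function that only records its argument.

```javascript
// Hypothetical first stage standing in for the cleaned-up sample:
// it decodes number_array and passes the result to eval().
const stage1 = `
  var number_array = [97, 112, 112, 46, 97, 108, 101, 114, 116, 40, 49, 41];
  var decoded_string = '';
  for (var i = 0; i < number_array.length; i++) {
    decoded_string += String.fromCharCode(number_array[i]);
  }
  eval(decoded_string);
`;

// Method overwrite: run the stage with "eval" shadowed by a function
// that only records its argument, so the decoded code is never executed.
function captureEval(source) {
  const captured = [];
  const fakeEval = (code) => { captured.push(code); };
  // new Function creates a non-strict function, so a parameter named
  // "eval" is allowed and hides the real eval inside the stage.
  new Function('eval', source)(fakeEval);
  return captured;
}

const secondStage = captureEval(stage1);
console.log(secondStage[0]); // app.alert(1)
```

Only the eval() call is neutralised here; everything else in the stage still runs, so this kind of execution belongs in an isolated analysis machine.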
The second stage is shown below. It also seems to be the final stage, since we observe buffer overflow attacks that exploit util.printf() (CVE-2008-2992), Collab.collectEmailInfo() (CVE-2007-5659), and app.doc.Collab.getIcon() (CVE-2009-0927) to execute shellcode.
Although analyzing shellcode is not the purpose of this article, some simple shellcodes do nothing more than retrieve the next payload from a URL. Hence, we can have a quick look over the readable characters to obtain some IOCs. There are multiple ways to convert the hex in the shellcode to strings; in this article, I use mpeepdf's functions as below:
As can be seen in Figure 11, a URL can be extracted from the shellcode, and it likely points to the next payload.
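As one of the other ways to convert hex to strings, the conversion can also be scripted by hand. The sketch below assumes the shellcode is stored as %uXXXX escapes (the form usually passed to unescape() in PDF exploits), which is not confirmed for this sample. The helper names `unescapeToBytes` and `printableStrings`, the `sample` value, and its example.com URL are all made up for illustration. Each %uXXXX unit is a 16-bit value laid out little-endian in memory, so the low byte comes first.

```javascript
// Convert a %uXXXX-escaped shellcode string to a byte array.
// Each unit is little-endian in memory: low byte first, then high byte.
function unescapeToBytes(escaped) {
  const bytes = [];
  const pattern = /%u([0-9a-fA-F]{4})/g;
  let match;
  while ((match = pattern.exec(escaped)) !== null) {
    const unit = parseInt(match[1], 16);
    bytes.push(unit & 0xff, unit >> 8);
  }
  return bytes;
}

// Collect runs of printable ASCII that are at least minLength long.
function printableStrings(bytes, minLength = 6) {
  const found = [];
  let current = '';
  for (const b of bytes) {
    if (b >= 0x20 && b <= 0x7e) {
      current += String.fromCharCode(b);
    } else {
      if (current.length >= minLength) found.push(current);
      current = '';
    }
  }
  if (current.length >= minLength) found.push(current);
  return found;
}

// Made-up sample: four NOP bytes, then an ASCII URL and a null terminator.
const sample = '%u9090%u9090%u7468%u7074%u2f3a%u652f%u6178%u706d' +
               '%u656c%u632e%u6d6f%u612f%u0000';

const urls = printableStrings(unescapeToBytes(sample));
console.log(urls[0]); // http://example.com/a
```

A simple printable-run filter like this only finds strings stored in plain ASCII; shellcode that encodes or encrypts its URL would need emulation or debugging instead.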