Changing Metadata with Python

Posted on Sat 08 July 2017 in Python

Election Emails

In early May, near the tail end of the contentious French presidential election, the campaign of then-candidate and now-President Emmanuel Macron was targeted by what many security experts believed to be APT28, an advanced, offensive Russian hacking group.

Less than 48 hours before the election, download links to 9GB of emails from Macron's party were anonymously published on Pastebin. Shortly thereafter, the En Marche! political party confirmed the breach in a public statement:

The En Marche! party has been the victim of a massive, coordinated act of hacking, in which diverse internal information (mails, documents, accounting, contracts) have been broadcast this evening on social networks.

But most importantly, the party warned any would-be purveyors of the stolen data that there was no way to guarantee their authenticity, implying that the attackers could have (and likely did) planted disinformation.

False Flags

The trove of data contained several MS Word documents with Cyrillic-character metadata. To some, these artifacts were the smoking gun, but metadata can be faked.

Modern MS Word documents (.docx files) use the Office Open XML format and are essentially ZIP archives: each one contains other files and folders.

If you want to quickly test this, create a "test.docx" file and rename it to "test.zip." Extract the files from the archive, and you should be left with the following directory structure.

-- test.zip
| -- [Content_Types].xml
| -- _rels/
| -- docProps/
| -- -- app.xml
| -- -- core.xml
| -- word/
The core.xml file contains the document's core metadata, such as the author and the creation and modification timestamps:

<cp:coreProperties xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcmitype="http://purl.org/dc/dcmitype/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
   <dc:title />
   <dc:subject />
   <dc:creator>goodguy@original.com</dc:creator>
   <cp:keywords />
   <dc:description />
   <cp:lastModifiedBy>goodguy@original.com</cp:lastModifiedBy>
   <cp:revision>1</cp:revision>
   <dcterms:created xsi:type="dcterms:W3CDTF">2017-07-06T19:07:00Z</dcterms:created>
   <dcterms:modified xsi:type="dcterms:W3CDTF">2017-07-06T19:07:00Z</dcterms:modified>
</cp:coreProperties>
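These fields can be read with nothing but the standard library. The sketch below assumes the archive has a docProps/core.xml member laid out like the sample above; the function name read_core_properties and the FIELDS list are illustrative choices, not part of any library.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs copied from the core.xml sample above.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Fields to report (an arbitrary selection for this example).
FIELDS = [
    "dc:creator",
    "cp:lastModifiedBy",
    "cp:revision",
    "dcterms:created",
    "dcterms:modified",
]


def read_core_properties(path):
    """Return selected core.xml fields as a dict.

    A value is None when the element is missing or empty.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))
    result = {}
    for field in FIELDS:
        element = root.find(field, NS)
        result[field] = element.text if element is not None else None
    return result
```

Note that every value comes back as a plain string, which is a reminder that nothing in the file format vouches for who actually wrote it.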

As mentioned in the article "Let's get fancy with false flags", it's trivially easy to alter this metadata.

Python

Equipped with the python-docx module (documentation here), we can quickly change any of the fields in the core.xml file with just a few lines of code.

>>> from docx import Document
>>> document = Document('test.docx')
>>> document.core_properties.author
'goodguy@original.com'
>>> document.core_properties.created
datetime.datetime(2017, 7, 6, 19, 7)
>>> document.core_properties.last_modified_by
'goodguy@original.com'
>>> document.core_properties.author = 'badguy@changed.com'
>>> document.core_properties.author
'badguy@changed.com'
>>> document.save('meta.docx')
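The session above relies on python-docx, but the same change is possible with the standard library alone, which underlines how little effort a forgery takes. This is only a sketch under a few assumptions: the helper name rewrite_creator is hypothetical, the archive is expected to contain docProps/core.xml, and the output has not been checked against Word itself.

```python
import zipfile
import xml.etree.ElementTree as ET

CORE = "docProps/core.xml"
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
}


def rewrite_creator(src, dst, new_creator):
    """Copy archive `src` to `dst`, replacing dc:creator in core.xml.

    `src` and `dst` must be different files (or file-like objects).
    """
    # Register the prefixes so the serialized XML keeps cp:, dc:, etc.
    # instead of auto-generated names such as ns0:.
    for prefix, uri in NS.items():
        ET.register_namespace(prefix, uri)

    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == CORE:
                root = ET.fromstring(data)
                creator = root.find("dc:creator", NS)
                if creator is not None:
                    creator.text = new_creator
                # xml_declaration requires Python 3.8 or later.
                data = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
            # Every other member is copied through unchanged.
            zout.writestr(item, data)
```

For example, rewrite_creator("test.docx", "meta.docx", "badguy@changed.com") would be the rough equivalent of the author assignment in the session above. It leaves lastModifiedBy and the timestamps untouched, though those could be edited in the same way.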

Attribution

Why is any of this important? Because attribution can be really, really difficult, and the geopolitical ramifications can be significant. As the security researcher x0rz points out in the aforementioned article:

"Metadata might only give us clues. It needs to be corroborated with other sources of intelligence or you’ll fail miserably at threat intel."

Although the example of MS Word documents may seem overly simplistic, it still highlights the basic principle. Allow me to reference yet another security researcher known only by their handle, the grugq, detailing the complexity of "proof" in intelligence investigations:

"It will probably be a bunch of circumstantial evidence, a complexity of timelines, snippets of information from various sources with different levels of confidentiality and reliability. This patchwork of data needs to be processed and analyzed via complicated techniques designed to reduce cognitive bias. All of this, only to arrive at a sort of high probability of maybe."