
Cache Rules Everything Around Me: Finding secrets in Python bytecode

 3 years ago
source link: https://blog.jse.li/posts/pyc/

When you import a Python file for the first time, the Python interpreter will compile it and cache the resulting bytecode in a .pyc file so that subsequent imports don’t have to deal with the overhead of parsing or compiling again.

It’s also common practice for Python projects to store configuration, keys, and passwords (collectively referred to as “secrets”) in a gitignored Python file called secrets.py, config.py, or settings.py, which other parts of the project import. This provides a nice separation between secrets and source code that gets checked in, and for the most part, this kind of setup works well. And because it reuses the language’s import mechanism, these projects don’t have to fuss around with file I/O or formats like JSON.

But for the same reason that this pattern is fast and convenient, it is also potentially insecure. Because it reuses the language’s import mechanism, which has a habit of creating and caching .pyc files, those secrets also live in the compiled bytecode! Some initial research using the GitHub API reveals that thousands of GitHub repositories contain secrets hidden inside their bytecode.

Existing tools for finding secrets in repositories (my favorite is trufflehog) skip over binary files like .pyc files and scan only plain-text files such as source code or configuration files.

A crash-course on cached source

Earlier versions of Python stored these files next to the original source files, but beginning with Python 3.2, they live in a folder called __pycache__ inside the directory that contains the imported source file. For our purposes, this doesn’t matter.
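To make the caching behaviour concrete, here is a minimal sketch that compiles a throwaway module and prints where the cached file lands. The module name demo_module and its contents are made up for illustration, and the sketch assumes a default CPython 3 setup (for example, that the PYTHONPYCACHEPREFIX environment variable is not redirecting the cache elsewhere).

```python
import py_compile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as workdir:
    # Hypothetical module, written only for this demonstration.
    source_path = Path(workdir) / "demo_module.py"
    source_path.write_text('GREETING = "hello"\n')

    # With no explicit cfile argument, py_compile writes to the default cache location
    # and returns the path of the compiled file.
    pyc_path = Path(py_compile.compile(str(source_path), doraise=True))

    cache_dir_name = pyc_path.parent.name
    # True when the cache folder sits in the same directory as the source file.
    same_parent = pyc_path.parent.parent == source_path.parent

    print(cache_dir_name)
    print(pyc_path.name)  # the interpreter tag in the name varies by Python version
```

Importing the module would produce the same kind of file; py_compile is used here only because it returns the path directly.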

Suppose we had a Python file containing this secret password:

SECRET_KEY = "Green eggs and ham"

The bytecode corresponding to that line of code looks like this:

0 LOAD_CONST               1 ('Green eggs and ham')
2 STORE_FAST               0 (SECRET_KEY)
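You can reproduce a listing like this with the standard library’s dis module. The sketch below compiles the line in memory, so nothing is written to disk; the filename passed to compile() is only a label. Exact opcodes, indices, and offsets differ between Python versions, and when the assignment sits at module level the store instruction is typically STORE_NAME (STORE_FAST is used for local variables inside a function).

```python
import dis

source = 'SECRET_KEY = "Green eggs and ham"\n'
code = compile(source, "secrets.py", "exec")

# The string literal is kept verbatim in the code object's constants table,
# and the variable name in its names table.
print(code.co_consts)
print(code.co_names)

# Human-readable disassembly; the exact output depends on the Python version.
dis.dis(code)
```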

Note that the string and variable name are reproduced in their entirety! Further, it turns out that Python bytecode often contains enough information to recover the original structure of the file. Tools like uncompyle6 can translate .pyc files back into their original forms.

$ uncompyle6 secrets.cpython-38.pyc

# uncompyle6 version 3.6.7
# Python bytecode 3.8 (3413)
# Decompiled from: Python 3.8.2 (default, Apr  8 2020, 14:31:25) 
# [GCC 9.3.0]
# Embedded file name: secrets.py
# Compiled at: 2020-05-12 17:16:29
# Size of source mod 2**32: 34 bytes
SECRET_KEY = 'Green eggs and ham'
# okay decompiling secrets.cpython-38.pyc
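A full decompiler is not strictly needed just to pull strings out. As a rough alternative sketch (not what uncompyle6 does internally), the standard library’s marshal module can load the code object that follows the .pyc header. This assumes the 16-byte header used by CPython 3.7 and later, and it only works when the file was compiled by the same Python version that runs the script. The helper names load_code and iter_strings are made up for this example.

```python
import importlib.util
import marshal
import py_compile
import tempfile
from pathlib import Path
from types import CodeType

# Assumption: CPython 3.7+ header layout (4-byte magic, 4-byte flags, 8 more bytes).
HEADER_SIZE = 16


def load_code(pyc_path):
    """Unmarshal the code object stored after the .pyc header."""
    data = Path(pyc_path).read_bytes()
    if data[:4] != importlib.util.MAGIC_NUMBER:
        raise ValueError("compiled by a different Python version; marshal may not read it")
    return marshal.loads(data[HEADER_SIZE:])


def iter_strings(code):
    """Yield string constants from a code object and any code objects nested in it."""
    for const in code.co_consts:
        if isinstance(const, str):
            yield const
        elif isinstance(const, CodeType):
            yield from iter_strings(const)


# Demo: compile a throwaway module, then recover its strings from the .pyc alone.
with tempfile.TemporaryDirectory() as workdir:
    source_path = Path(workdir) / "secrets.py"
    source_path.write_text('SECRET_KEY = "Green eggs and ham"\n')
    pyc_path = py_compile.compile(str(source_path), doraise=True)
    recovered = list(iter_strings(load_code(pyc_path)))

print(recovered)
```

This sketch ignores strings nested inside tuple constants, and marshal is not designed to be safe against maliciously crafted input, so only run it on files you are comfortable loading.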

Caching out

To investigate just how widespread this problem was, I wrote a short script to search GitHub for .pyc files and decompile them to look for secrets. I alerted any organizations that had secrets I discovered this way.

import base64
import io
import os
import tempfile
import uncompyle6
from github import Github

# GitHub API token, read from the environment
GITHUB_KEY = os.environ.get("GITHUB_KEY")

g = Github(GITHUB_KEY)
# Code search for files named secrets.pyc
items = g.search_code("filename:secrets.pyc")
for item in items:
    repo_url = f"https://github.com/{item.repository.full_name}"
    print(f"DECOMPILING REPO {repo_url}")
    print(f"OWNER TYPE: {item.repository.owner.type}")
    try:
        # The API returns file contents base64-encoded
        contents = base64.b64decode(item.content)
        # uncompyle6 reads from a path, so stage the bytes in a temporary .pyc file
        with tempfile.NamedTemporaryFile(suffix=".pyc") as f:
            f.write(contents)
            f.flush()  # make sure the bytes are on disk before the file is reopened by name

            out = io.StringIO()
            uncompyle6.decompile_file(f.name, out)
            print(out.getvalue())
    except Exception as e:
        print(e)
        print(f"COULD NOT DECOMPILE REPO {repo_url}")
        continue
    print("\n\n\n")

Try this out yourself!

This post comes with a small capture-the-flag style lab for you to try out this technique yourself.

You can find it at https://github.com/veggiedefender/pyc-secret-lab/

Takeaways

Cached bytecode is a low-level internal performance optimization, which is the kind of thing Python was supposed to free us from having to think about! The contents of .pyc files are inscrutable without special tools like a disassembler or decompiler. When these files are buried inside __pycache__ (the double underscores signal “keep out; internal use only”), they’re easy to overlook. Avoiding this requires pretty advanced knowledge of Git and Python internals.

Actionable things you can do:

  • Look through your repositories for loose .pyc files, and delete them
  • If you have .pyc files and they contain secrets, then revoke and rotate your secrets
  • Use a standard gitignore to prevent checking in .pyc files
  • Use JSON files or environment variables for configuration
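The first and last items above can be sketched in a few lines. The helper names find_pyc_files and load_secret, and the variable name DEMO_SECRET_KEY, are made up for this example; the demo builds a fake repository in a temporary directory so it can run anywhere.

```python
import os
import tempfile
from pathlib import Path


def find_pyc_files(root):
    """Return every .pyc file under root, sorted for stable output."""
    return sorted(Path(root).rglob("*.pyc"))


def load_secret(name):
    """Read a secret from the environment, failing loudly if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"environment variable {name} is not set")
    return value


# Demo: a fake repository with one stray cached bytecode file.
with tempfile.TemporaryDirectory() as repo:
    cache = Path(repo) / "__pycache__"
    cache.mkdir()
    (cache / "secrets.cpython-38.pyc").write_bytes(b"")
    (Path(repo) / "app.py").write_text("print('hi')\n")
    found = [p.relative_to(repo).as_posix() for p in find_pyc_files(repo)]

print(found)

os.environ["DEMO_SECRET_KEY"] = "placeholder-value"  # set here only so the demo runs
secret = load_secret("DEMO_SECRET_KEY")
```

Note that this walks the files on disk, not the files Git is tracking. Deleting a .pyc file from the working tree also does not remove it from the repository’s history, which is why rotating any exposed secrets still matters.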
