This is a project to use machine learning to convert raw decompiled binaries into cleaner, more readable code.
We take a large dataset of C/C++ code, compile it, decompile the resulting binaries, then train a model to translate the decompiled code back into its original source.
Afterward, we have a model which can convert ugly decompiled code into cleaner code.
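As a rough illustration of the training data this pipeline produces, the sketch below pairs decompiled code with the original source it came from. Everything here is an assumption made for illustration: the `TrainingPair` type, the `make_training_pairs` helper, and the idea of keying examples by function name are hypothetical and are not taken from this repository's code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TrainingPair:
    """One translation example: decompiled code in, original code out."""

    decompiled: str  # model input (what the decompiler emitted)
    original: str    # model target (the source that was compiled)


def make_training_pairs(
    original_by_name: Dict[str, str],
    decompiled_by_name: Dict[str, str],
) -> List[TrainingPair]:
    """Pair each decompiled snippet with its original source by key.

    Keys that appear on only one side are skipped, because they cannot
    form a complete (input, target) example. Keying by function name is
    a hypothetical choice, not necessarily what this project does.
    """
    shared = sorted(original_by_name.keys() & decompiled_by_name.keys())
    return [
        TrainingPair(decompiled=decompiled_by_name[n], original=original_by_name[n])
        for n in shared
    ]


if __name__ == "__main__":
    originals = {"add": "int add(int a, int b) { return a + b; }"}
    decompiled = {"add": "int add(int param_1, int param_2) { return param_1 + param_2; }"}
    for pair in make_training_pairs(originals, decompiled):
        print(pair.decompiled, "->", pair.original)
```

The model is then trained on the `decompiled` to `original` direction, so that at inference time it can be given decompiled code alone.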
Install dependencies using your platform's package manager (Homebrew is recommended on macOS), then clone and run:
> git clone git@github.com:Jakobeha/UnderstandableBinary.git
> cd UnderstandableBinary
> run.sh [options]...
You can also open the project in IntelliJ, which has sample run configurations. Note that you may need to change some global library locations (e.g. the path to Poetry).
This project uses many different languages and frameworks. READMEs and run.sh scripts are in subdirectories.
The root is an IntelliJ project; however, the modules are in subdirectories.
../UnderstandableBinary-data/
: The default location where the dataset is generated and stored. This cannot be in UnderstandableBinary/ because the dataset is extremely large and contains code, which confuses a lot of tools and find and makes everything a hassle. You can override the dataset dir, and you may want to make it on a separate volume with more storage.
python/
: Python scripts which use poetry for dependency management. Mainly for training and running the model since that is in Python
get-data/
: Generate dataset
local/
: Local directory where you can store scratch data which isn't the dataset. Also, some log files are stored here
ghidra_logs/
: Ghidra script log files
docs/
: documentation
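The dataset location rule above (a sibling directory by default, overridable, and never inside the repository) can be sketched as follows. This is only an illustration: the `UNDERSTANDABLE_BINARY_DATA` environment variable and the `resolve_dataset_dir` function are hypothetical names, and the project's actual override mechanism may be a command-line option or something else entirely.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Hypothetical variable name, used here only for illustration.
DATASET_DIR_ENV = "UNDERSTANDABLE_BINARY_DATA"


def resolve_dataset_dir(
    repo_root: Path, env: Optional[Mapping[str, str]] = None
) -> Path:
    """Return the dataset directory, rejecting locations inside the repo."""
    env = os.environ if env is None else env
    override = env.get(DATASET_DIR_ENV)
    if override:
        dataset_dir = Path(override).expanduser().resolve()
    else:
        # Default: a sibling of the repository, ../UnderstandableBinary-data/
        dataset_dir = (repo_root.parent / "UnderstandableBinary-data").resolve()

    root = repo_root.resolve()
    if dataset_dir == root or root in dataset_dir.parents:
        raise ValueError(
            f"dataset directory {dataset_dir} must not be inside the repository {root}"
        )
    return dataset_dir


if __name__ == "__main__":
    # Assumes this script is run from the repository root.
    print(resolve_dataset_dir(Path.cwd()))
```

Checking the location up front means a misconfigured override fails immediately, before a large dataset is written into the repository tree.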
- N-Bref: A neural-based decompiler framework and binary analysis tool
- G-3PO and GptHidra: generate an explanation for a decompiled function and suggest variable names (uses GPT3 and Ghidra)
- Gepetto: generate a doc-comment for a decompiled function, plus rename variables and add comments in the function body (uses GPT3 and IDA Pro)
Conventions:
- File and directory names are usually kebab-case unless there's another reason (e.g. Java)
- Use PEP 8 and shellcheck (IntelliJ defaults)
TODO: add more
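For the kebab-case naming convention, a minimal check might look like the sketch below. The `is_kebab_case` helper and its rule (lowercase letters and digits separated by single hyphens, ignoring a trailing slash and any file extension) are assumptions made for illustration; the project does not define or enforce an exact rule, and the convention explicitly allows exceptions.

```python
import re

# Lowercase alphanumeric words joined by single hyphens, e.g. "get-data".
KEBAB_CASE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_kebab_case(name: str) -> bool:
    """Check a file or directory name, ignoring a trailing '/' and any extension."""
    stem = name.rstrip("/").split(".", 1)[0]
    return KEBAB_CASE.fullmatch(stem) is not None


if __name__ == "__main__":
    for name in ("get-data/", "ghidra_logs/", "run.sh"):
        print(name, is_kebab_case(name))
```

Names such as `ghidra_logs/` fail this check, which is consistent with the convention being a default rather than a strict rule.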