A fork from oalieno/asm2vec-pytorch
Mostly the same (the original help below is still relevant), though some hacks and tips have been added:
- If you don't have CUDA, it is better to manually install the CPU-only build of pytorch:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
- radare2 can be found here and installed like this:
git clone https://github.com/radareorg/radare2 && radare2/sys/install.sh
- bin2asm removes functions with duplicate names. Thus, if you want a database built from several binary programs, run the script separately for each binary and save the results to different directories. Note that this script stores the disassembled data in text format, so you can inspect the code in the resulting directory.
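The per-binary runs can be scripted; a minimal sketch that derives one output directory per input binary and builds the corresponding bin2asm.py command lines (the `asm/<binary-name>` layout is my assumption, matching the wget example):

```python
from pathlib import Path

def per_binary_commands(binaries, asm_root="asm"):
    """One bin2asm.py invocation per binary, each with its own output
    directory, so duplicate-name removal never crosses binaries."""
    commands = []
    for binary in binaries:
        out_dir = Path(asm_root) / Path(binary).name  # e.g. asm/wget5
        commands.append(["python", "scripts/bin2asm.py",
                         "-i", str(binary), "-o", str(out_dir)])
    return commands

cmds = per_binary_commands(["bin/wget5", "bin/wget6"])
for cmd in cmds:
    print(" ".join(cmd))
```

Each list can then be handed to `subprocess.run(cmd, check=True)` to run the disassemblies one after another.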
- There is an example added: bin/wget5 contains the wget 1.5.3 binary, and bin/wget6 contains wget 1.6. I preprocessed them as follows:
python scripts/bin2asm.py -i bin/wget5 -o asm/wget5
python scripts/bin2asm.py -i bin/wget6 -o asm/wget6
- I recommend experimenting with different model training datasets. It seems promising, and I don't yet know the best approach.
- There is a new scripts/evaluate.py script. It compares the files in two directories and writes the comparison results to the output stream (more detail) and to a JSON file (for automated comparison). The model options are the same as in compare.py. It checks for filename differences; if that troubles you, remove the condition on line 79 to disable the check. Note that this script needs tqdm (just to make the output look nicer). The script can be called like this (matches all wget5 fragments to wget6 fragments):
python scripts/evaluate.py -i1 asm/wget5/ -i2 asm/wget6 -m model.pt
- Good luck!
TODO: ideas for making evaluate.py faster without losing accuracy or resorting to shady token tricks are appreciated.
Unofficial implementation of asm2vec using pytorch (with GPU acceleration)
The details of the model can be found in the original paper: (sp'19) Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization
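For intuition, the heart of the model is a PV-DM-style objective with negative sampling: the function's own vector plus neighboring token vectors predict a target token. A minimal, self-contained sketch of that loss in PyTorch (class name, shapes, and dimensions are illustrative only, not the repository's actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyAsm2Vec(nn.Module):
    def __init__(self, n_funcs, vocab, dim=100):
        super().__init__()
        self.func_emb = nn.Embedding(n_funcs, dim)  # one vector per function
        self.tok_in = nn.Embedding(vocab, dim)      # context token vectors
        self.tok_out = nn.Embedding(vocab, dim)     # predicted token vectors

    def forward(self, func_id, context, target, negatives):
        # average the function vector with its context token vectors
        ctx = torch.cat([self.func_emb(func_id).unsqueeze(1),
                         self.tok_in(context)], dim=1).mean(dim=1)
        pos = (ctx * self.tok_out(target)).sum(-1)  # score of the true token
        # scores of the negatively sampled tokens
        neg = torch.bmm(self.tok_out(negatives), ctx.unsqueeze(-1)).squeeze(-1)
        # negative-sampling loss: push the true token up, sampled tokens down
        return -(F.logsigmoid(pos) + F.logsigmoid(-neg).sum(-1)).mean()

model = TinyAsm2Vec(n_funcs=4, vocab=50, dim=16)
loss = model(torch.tensor([0]),             # function id
             torch.randint(50, (1, 4)),     # 4 context tokens
             torch.tensor([7]),             # target token
             torch.randint(50, (1, 5)))     # 5 negative samples
```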
python >= 3.6
packages | for |
---|---|
r2pipe | scripts/bin2asm.py |
click | scripts/* |
torch | almost all code needs it |
You also need to install radare2 to run scripts/bin2asm.py. r2pipe is just the python interface to radare2.
If you only want to use the library code, you just need to install torch
python setup.py install
or
pip install git+https://github.com/oalieno/asm2vec-pytorch.git
An implementation already exists here: Lancern/asm2vec
The following is a benchmark of training 1000 functions for 1 epoch.
Implementation | Time (s) |
---|---|
Lancern/asm2vec | 202.23 |
oalieno/asm2vec-pytorch (with CPU) | 9.11 |
oalieno/asm2vec-pytorch (with GPU) | 0.97 |
python scripts/bin2asm.py -i /bin/ -o asm/
First, generate asm files from the binaries under /bin/. You can hit Ctrl+C at any time once there is enough data.
python scripts/train.py -i asm/ -l 100 -o model.pt --epochs 100
Try training the model with only 100 functions and 100 epochs to get a taste. Then you can use more data if you want.
python scripts/test.py -i asm/123456 -m model.pt
After you train your model, try to grab an assembly function and see the result.
This script will show you how the model performs.
Once you are satisfied, you can take the embedding vector of the function and do whatever you want with it.
Usage: bin2asm.py [OPTIONS]
Extract assembly functions from binary executable
Options:
-i, --input TEXT input directory / file [required]
-o, --output TEXT output directory
-l, --len INTEGER ignore assembly code with instructions amount smaller
than minlen
--help Show this message and exit.
# Example
python bin2asm.py -i /bin/ -o asm/
Usage: train.py [OPTIONS]
Options:
-i, --input TEXT training data folder [required]
-o, --output TEXT output model path [default: model.pt]
-m, --model TEXT load previous trained model path
-l, --limit INTEGER limit the number of functions to be loaded
-d, --ebedding-dimension INTEGER
embedding dimension [default: 100]
-b, --batch-size INTEGER batch size [default: 1024]
-e, --epochs INTEGER training epochs [default: 10]
-n, --neg-sample-num INTEGER negative sampling amount [default: 25]
-a, --calculate-accuracy whether calculate accuracy ( will be
significantly slower )
-c, --device TEXT hardware device to be used: cpu / cuda /
auto [default: auto]
-lr, --learning-rate FLOAT learning rate [default: 0.02]
--help Show this message and exit.
# Example
python train.py -i asm/ -o model.pt --epochs 100
Usage: test.py [OPTIONS]
Options:
-i, --input TEXT target function [required]
-m, --model TEXT model path [required]
-e, --epochs INTEGER training epochs [default: 10]
-n, --neg-sample-num INTEGER negative sampling amount [default: 25]
-l, --limit INTEGER limit the amount of output probability result
-c, --device TEXT hardware device to be used: cpu / cuda / auto
[default: auto]
-lr, --learning-rate FLOAT learning rate [default: 0.02]
-p, --pretty pretty print table [default: False]
--help Show this message and exit.
# Example
python test.py -i asm/123456 -m model.pt
┌──────────────────────────────────────────┐
│ endbr64 │
│ ➔ push r15 │
│ push r14 │
├────────┬─────────────────────────────────┤
│ 34.68% │ [rdx + rsi*CONST + CONST] │
│ 20.29% │ push │
│ 16.22% │ r15 │
│ 04.36% │ r14 │
│ 03.55% │ r11d │
└────────┴─────────────────────────────────┘
Usage: compare.py [OPTIONS]
Options:
-i1, --input1 TEXT target function 1 [required]
-i2, --input2 TEXT target function 2 [required]
-m, --model TEXT model path [required]
-e, --epochs INTEGER training epochs [default: 10]
-c, --device TEXT hardware device to be used: cpu / cuda / auto
[default: auto]
-lr, --learning-rate FLOAT learning rate [default: 0.02]
--help Show this message and exit.
# Example
python compare.py -i1 asm/123456 -i2 asm/654321 -m model.pt -e 30
cosine similarity : 0.873684
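The reported score is an ordinary cosine similarity between the two function embeddings; with torch it reduces to the following (random placeholder vectors stand in for real embeddings):

```python
import torch
import torch.nn.functional as F

a = torch.randn(100)  # embedding of function 1 (placeholder)
b = torch.randn(100)  # embedding of function 2 (placeholder)

# cosine similarity = dot(a, b) / (|a| * |b|), in [-1, 1]
sim = F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item()
print(f"cosine similarity : {sim:.6f}")
```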