GitHub - andralet/asm2vec-pytorch: Unofficial implementation of asm2vec using pytorch ( with GPU acceleration )

This fork is mostly the same as the original (the original help given below is still relevant), with some hacks and tips added:

  1. If you don't have CUDA, it is better to install the CPU-only build of pytorch manually:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
  2. radare2 can be installed from its GitHub repository like this:
git clone https://github.com/radareorg/radare2 && radare2/sys/install.sh
  3. bin2asm removes functions with duplicate names. So if you want a database built from several binaries, you should probably run it separately for each of them and save the results to different directories. Note that this script stores the disassembled data in text format, so you can inspect the code in the resulting directory.
  4. An example is included: bin/wget5 contains the wget 1.5.3 binary and bin/wget6 contains the wget 1.6 binary. I preprocessed them as follows:
python scripts/bin2asm.py -i bin/wget5 -o asm/wget5
python scripts/bin2asm.py -i bin/wget6 -o asm/wget6
  5. I recommend experimenting with different model training datasets. It seems interesting, and I don't know what works best.
  6. A scripts/evaluate.py script has been added. It compares the files in two directories and writes the comparison results to the output stream (more detailed) and to a JSON file (for automatic comparison). The model options are the same as in compare.py. It checks the filename difference; if that gets in your way, you can remove the condition on line 79 to disable this check. Note that this script needs tqdm (just to make it look better). The script can be called like this (it matches every wget5 fragment to some wget6 fragment):
python scripts/evaluate.py -i1 asm/wget5/ -i2 asm/wget6 -m model.pt
  7. Good luck!
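
Since bin2asm drops functions with duplicate names, one way to keep several binaries apart is to generate one command per binary, each with its own output directory. The sketch below only builds the command lines; the helper name `bin2asm_commands` and the `asm/<binary name>` layout are illustrative assumptions, not part of the repository.

```python
from pathlib import Path


def bin2asm_commands(binaries, asm_root="asm", script="scripts/bin2asm.py"):
    """Build one bin2asm command per binary, each writing to its own directory.

    The asm_root/<binary name> layout is an assumption made for this sketch.
    """
    commands = []
    for binary in binaries:
        binary = Path(binary)
        out_dir = Path(asm_root) / binary.name
        commands.append(
            ["python", script, "-i", binary.as_posix(), "-o", out_dir.as_posix()]
        )
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them; pass each list to
    # subprocess.run(cmd, check=True) to execute it for real.
    for cmd in bin2asm_commands(["bin/wget5", "bin/wget6"]):
        print(" ".join(cmd))
```

For the two example binaries this reproduces the two preprocessing commands listed above.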

TODO: any ideas on how to make evaluate.py faster without accuracy loss and without shady schemes with tokens are appreciated.
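
One possible direction, assuming the function embeddings are already available as plain vectors (this is not checked against what evaluate.py actually does): normalize every vector once, so that each pairwise comparison reduces to a dot product, and then pick the best match per function. The names `normalize` and `best_matches` are hypothetical and the sketch uses only the standard library.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so cosine similarity becomes a dot product."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def best_matches(embeddings_a, embeddings_b):
    """For every function in A, return (best matching name in B, similarity).

    Both arguments map a function name to its embedding vector.
    """
    if not embeddings_b:
        raise ValueError("embeddings_b must not be empty")
    # Normalize the B side once instead of once per pair.
    unit_b = {name: normalize(vec) for name, vec in embeddings_b.items()}
    result = {}
    for name_a, vec_a in embeddings_a.items():
        unit_a = normalize(vec_a)
        score, name_b = max(
            (sum(x * y for x, y in zip(unit_a, vec_b)), name_b)
            for name_b, vec_b in unit_b.items()
        )
        result[name_a] = (name_b, score)
    return result
```

With the vectors stacked into tensors, the same idea becomes a single matrix multiplication, which is where a GPU would help.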

The part added by andralet ends here; the original README text follows.

asm2vec-pytorch

release 1.0.0 | license MIT | python

Unofficial implementation of asm2vec using pytorch ( with GPU acceleration )
The details of the model can be found in the original paper: (S&P '19) Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization

Requirements

python >= 3.6

packages    needed for
r2pipe      scripts/bin2asm.py
click       scripts/*
torch       almost all of the code

You also need to install radare2 to run scripts/bin2asm.py. r2pipe is just the python interface to radare2

If you only want to use the library code, you just need to install torch

Install

python setup.py install

or

pip install git+https://github.com/oalieno/asm2vec-pytorch.git

Benchmark

An implementation already exists here: Lancern/asm2vec
The following benchmark measures training on 1000 functions for 1 epoch.

Implementation                        Time (s)
Lancern/asm2vec                       202.23
oalieno/asm2vec-pytorch (with CPU)    9.11
oalieno/asm2vec-pytorch (with GPU)    0.97
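
To read the table as speedups rather than raw times, divide the baseline time by each measured time. The small sketch below does that arithmetic with the numbers from the table; the helper name `speedup` is just for illustration.

```python
# Timings in seconds, copied from the benchmark table above.
timings = {
    "Lancern/asm2vec": 202.23,
    "oalieno/asm2vec-pytorch (with CPU)": 9.11,
    "oalieno/asm2vec-pytorch (with GPU)": 0.97,
}


def speedup(baseline_seconds, seconds):
    """How many times faster a run is compared with the baseline."""
    return baseline_seconds / seconds


baseline = timings["Lancern/asm2vec"]
for name, seconds in timings.items():
    print(f"{name}: {speedup(baseline, seconds):.1f}x")
```

On these numbers the pytorch implementation is roughly 22 times faster on CPU and roughly 208 times faster on GPU than Lancern/asm2vec.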

Get Started

python scripts/bin2asm.py -i /bin/ -o asm/

First, generate asm files from the binaries under /bin/.
You can hit Ctrl+C anytime when there is enough data.

python scripts/train.py -i asm/ -l 100 -o model.pt --epochs 100

Try to train the model using only 100 functions and 100 epochs for a taste.
Then you can use more data if you want.

python scripts/test.py -i asm/123456 -m model.pt

After you train your model, grab an assembly function and look at the result.
This script will show you how the model performs.
Once you are satisfied, you can take out the embedding vector of the function and do whatever you want with it.

Usage

bin2asm.py

Usage: bin2asm.py [OPTIONS]

  Extract assembly functions from binary executable

Options:
  -i, --input TEXT   input directory / file  [required]
  -o, --output TEXT  output directory
  -l, --len INTEGER  ignore assembly code with instructions amount smaller
                     than minlen

  --help             Show this message and exit.
# Example
python bin2asm.py -i /bin/ -o asm/

train.py

Usage: train.py [OPTIONS]

Options:
  -i, --input TEXT                training data folder  [required]
  -o, --output TEXT               output model path  [default: model.pt]
  -m, --model TEXT                load previous trained model path
  -l, --limit INTEGER             limit the number of functions to be loaded
  -d, --ebedding-dimension INTEGER
                                  embedding dimension  [default: 100]
  -b, --batch-size INTEGER        batch size  [default: 1024]
  -e, --epochs INTEGER            training epochs  [default: 10]
  -n, --neg-sample-num INTEGER    negative sampling amount  [default: 25]
  -a, --calculate-accuracy        whether calculate accuracy ( will be
                                  significantly slower )

  -c, --device TEXT               hardware device to be used: cpu / cuda /
                                  auto  [default: auto]

  -lr, --learning-rate FLOAT      learning rate  [default: 0.02]
  --help                          Show this message and exit.
# Example
python train.py -i asm/ -o model.pt --epochs 100
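
The -c / --device option accepts cpu, cuda or auto. How the scripts resolve auto is not shown in this README; a common approach, sketched here as an assumption, is to pick cuda when it is available and fall back to cpu otherwise. `resolve_device` is a hypothetical helper, and `torch.cuda.is_available()` is the usual pytorch availability check.

```python
def resolve_device(requested, cuda_available):
    """Map a cpu / cuda / auto request to a concrete device name."""
    requested = requested.lower()
    if requested == "auto":
        return "cuda" if cuda_available else "cpu"
    if requested not in ("cpu", "cuda"):
        raise ValueError(f"unknown device: {requested!r}")
    if requested == "cuda" and not cuda_available:
        raise RuntimeError("cuda was requested but is not available")
    return requested


if __name__ == "__main__":
    try:
        import torch
        available = torch.cuda.is_available()
    except ImportError:
        # Without pytorch installed there is nothing to run on the GPU.
        available = False
    print(resolve_device("auto", available))
```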

test.py

Usage: test.py [OPTIONS]

Options:
  -i, --input TEXT              target function  [required]
  -m, --model TEXT              model path  [required]
  -e, --epochs INTEGER          training epochs  [default: 10]
  -n, --neg-sample-num INTEGER  negative sampling amount  [default: 25]
  -l, --limit INTEGER           limit the amount of output probability result
  -c, --device TEXT             hardware device to be used: cpu / cuda / auto
                                [default: auto]

  -lr, --learning-rate FLOAT    learning rate  [default: 0.02]
  -p, --pretty                  pretty print table  [default: False]
  --help                        Show this message and exit.
# Example
python test.py -i asm/123456 -m model.pt
┌──────────────────────────────────────────┐
│    endbr64                               │
│  ➔ push r15                              │
│    push r14                              │
├────────┬─────────────────────────────────┤
│ 34.68% │ [rdx + rsi*CONST + CONST]       │
│ 20.29% │ push                            │
│ 16.22% │ r15                             │
│ 04.36% │ r14                             │
│ 03.55% │ r11d                            │
└────────┴─────────────────────────────────┘

compare.py

Usage: compare.py [OPTIONS]

Options:
  -i1, --input1 TEXT          target function 1  [required]
  -i2, --input2 TEXT          target function 2  [required]
  -m, --model TEXT            model path  [required]
  -e, --epochs INTEGER        training epochs  [default: 10]
  -c, --device TEXT           hardware device to be used: cpu / cuda / auto
                              [default: auto]

  -lr, --learning-rate FLOAT  learning rate  [default: 0.02]
  --help                      Show this message and exit.
# Example
python compare.py -i1 asm/123456 -i2 asm/654321 -m model.pt -e 30
cosine similarity : 0.873684
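
For reference, cosine similarity between two embedding vectors is their dot product divided by the product of their lengths, so it ranges from -1 (opposite) to 1 (same direction). The following is a minimal standard-library sketch of that formula, not the code compare.py itself runs.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy vectors; real function embeddings have many more dimensions.
    print(f"cosine similarity : {cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.5]):.6f}")
```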
