A fork from oalieno/asm2vec-pytorch
Mostly the same (the original help below is still relevant), though some hacks and tips have been added:
- If you don't have CUDA, it is better to manually install the CPU-only build of pytorch:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
- radare2 can be found here and installed like this:
git clone https://github.com/radareorg/radare2 && radare2/sys/install.sh
- bin2asm removes functions with duplicate names. Thus, if you want a database built from several binary programs, run the script separately for each binary and save the results to different directories. Note that this script stores the disassembled data in text format, so you can inspect the code in the resulting directory.
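The per-binary runs can be scripted; a minimal sketch that derives one output directory per input binary and builds the corresponding bin2asm.py command lines (the `asm/<binary-name>` layout is my assumption, matching the wget example):

```python
from pathlib import Path

def per_binary_commands(binaries, asm_root="asm"):
    """One bin2asm.py invocation per binary, each with its own output
    directory, so duplicate-name removal never crosses binaries."""
    commands = []
    for binary in binaries:
        out_dir = Path(asm_root) / Path(binary).name  # e.g. asm/wget5
        commands.append(["python", "scripts/bin2asm.py",
                         "-i", str(binary), "-o", str(out_dir)])
    return commands

cmds = per_binary_commands(["bin/wget5", "bin/wget6"])
for cmd in cmds:
    print(" ".join(cmd))
```

Each list can then be handed to `subprocess.run(cmd, check=True)` to run the disassemblies one after another.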
- There is an example added: bin/wget5 contains the wget 1.5.3 binary, and bin/wget6 contains wget 1.6. I preprocessed them as follows:
python scripts/bin2asm.py -i bin/wget5 -o asm/wget5
python scripts/bin2asm.py -i bin/wget6 -o asm/wget6
- I recommend experimenting with different model training datasets. It seems promising, and I don't yet know the best approach.
- There is a new scripts/evaluate.py script. It compares the files in two directories and writes the comparison results to the output stream (more detail) and to a JSON file (for automated comparison). The model options are the same as in compare.py. It checks for filename differences; if that troubles you, remove the condition on line 79 to disable the check. Note that this script needs tqdm (just to make the output look nicer). The script can be called like this (matches all wget5 fragments to wget6 fragments):
python scripts/evaluate.py -i1 asm/wget5/ -i2 asm/wget6 -m model.pt
- Good luck!
TODO: ideas for making evaluate.py faster without losing accuracy or resorting to shady token tricks are appreciated.
Unofficial implementation of asm2vec using pytorch (with GPU acceleration)
The details of the model can be found in the original paper: (sp'19) Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization
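For intuition, the heart of the model is a PV-DM-style objective with negative sampling: the function's own vector plus neighboring token vectors predict a target token. A minimal, self-contained sketch of that loss in PyTorch (class name, shapes, and dimensions are illustrative only, not the repository's actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyAsm2Vec(nn.Module):
    def __init__(self, n_funcs, vocab, dim=100):
        super().__init__()
        self.func_emb = nn.Embedding(n_funcs, dim)  # one vector per function
        self.tok_in = nn.Embedding(vocab, dim)      # context token vectors
        self.tok_out = nn.Embedding(vocab, dim)     # predicted token vectors

    def forward(self, func_id, context, target, negatives):
        # average the function vector with its context token vectors
        ctx = torch.cat([self.func_emb(func_id).unsqueeze(1),
                         self.tok_in(context)], dim=1).mean(dim=1)
        pos = (ctx * self.tok_out(target)).sum(-1)  # score of the true token
        # scores of the negatively sampled tokens
        neg = torch.bmm(self.tok_out(negatives), ctx.unsqueeze(-1)).squeeze(-1)
        # negative-sampling loss: push the true token up, sampled tokens down
        return -(F.logsigmoid(pos) + F.logsigmoid(-neg).sum(-1)).mean()

model = TinyAsm2Vec(n_funcs=4, vocab=50, dim=16)
loss = model(torch.tensor([0]),             # function id
             torch.randint(50, (1, 4)),     # 4 context tokens
             torch.tensor([7]),             # target token
             torch.randint(50, (1, 5)))     # 5 negative samples
```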
python >= 3.6
packages | for |
---|---|
r2pipe | scripts/bin2asm.py |
click | scripts/* |
torch | almost all code needs it |
You also need to install radare2 to run scripts/bin2asm.py. r2pipe is just the python interface to radare2.
If you only want to use the library code, you just need to install torch
python setup.py install
or
pip install git+https://github.com/oalieno/asm2vec-pytorch.git
An implementation already exists here: Lancern/asm2vec
The following is a benchmark of training 1000 functions for 1 epoch.
Implementation | Time (s) |
---|---|
Lancern/asm2vec | 202.23 |
oalieno/asm2vec-pytorch (with CPU) | 9.11 |
oalieno/asm2vec-pytorch (with GPU) | 0.97 |
python scripts/bin2asm.py -i /bin/ -o asm/
First, generate asm files from the binaries under /bin/. You can hit Ctrl+C at any time once there is enough data.
python scripts/train.py -i asm/ -l 100 -o model.pt --epochs 100
Try training the model with only 100 functions and 100 epochs to get a taste. Then you can use more data if you want.
python scripts/test.py -i asm/123456 -m model.pt
After you train your model, try to grab an assembly function and see the result.
This script will show you how the model performs.
Once you are satisfied, you can take the embedding vector of the function and do whatever you want with it.
Usage: bin2asm.py [OPTIONS]
Extract assembly functions from binary executable
Options:
-i, --input TEXT input directory / file [required]
-o, --output TEXT output directory
-l, --len INTEGER ignore assembly code with instructions amount smaller
than minlen
--help Show this message and exit.
# Example
python bin2asm.py -i /bin/ -o asm/
Usage: train.py [OPTIONS]
Options:
-i, --input TEXT training data folder [required]
-o, --output TEXT output model path [default: model.pt]
-m, --model TEXT load previous trained model path
-l, --limit INTEGER limit the number of functions to be loaded
-d, --ebedding-dimension INTEGER
embedding dimension [default: 100]
-b, --batch-size INTEGER batch size [default: 1024]
-e, --epochs INTEGER training epochs [default: 10]
-n, --neg-sample-num INTEGER negative sampling amount [default: 25]
-a, --calculate-accuracy whether calculate accuracy ( will be
significantly slower )
-c, --device TEXT hardware device to be used: cpu / cuda /
auto [default: auto]
-lr, --learning-rate FLOAT learning rate [default: 0.02]
--help Show this message and exit.
# Example
python train.py -i asm/ -o model.pt --epochs 100
Usage: test.py [OPTIONS]
Options:
-i, --input TEXT target function [required]
-m, --model TEXT model path [required]
-e, --epochs INTEGER training epochs [default: 10]
-n, --neg-sample-num INTEGER negative sampling amount [default: 25]
-l, --limit INTEGER limit the amount of output probability result
-c, --device TEXT hardware device to be used: cpu / cuda / auto
[default: auto]
-lr, --learning-rate FLOAT learning rate [default: 0.02]
-p, --pretty pretty print table [default: False]
--help Show this message and exit.
# Example
python test.py -i asm/123456 -m model.pt
┌──────────────────────────────────────────┐
│ endbr64 │
│ ➔ push r15 │
│ push r14 │
├────────┬─────────────────────────────────┤
│ 34.68% │ [rdx + rsi*CONST + CONST] │
│ 20.29% │ push │
│ 16.22% │ r15 │
│ 04.36% │ r14 │
│ 03.55% │ r11d │
└────────┴─────────────────────────────────┘
Usage: compare.py [OPTIONS]
Options:
-i1, --input1 TEXT target function 1 [required]
-i2, --input2 TEXT target function 2 [required]
-m, --model TEXT model path [required]
-e, --epochs INTEGER training epochs [default: 10]
-c, --device TEXT hardware device to be used: cpu / cuda / auto
[default: auto]
-lr, --learning-rate FLOAT learning rate [default: 0.02]
--help Show this message and exit.
# Example
python compare.py -i1 asm/123456 -i2 asm/654321 -m model.pt -e 30
cosine similarity : 0.873684
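The reported score is an ordinary cosine similarity between the two function embeddings; with torch it reduces to the following (random placeholder vectors stand in for real embeddings):

```python
import torch
import torch.nn.functional as F

a = torch.randn(100)  # embedding of function 1 (placeholder)
b = torch.randn(100)  # embedding of function 2 (placeholder)

# cosine similarity = dot(a, b) / (|a| * |b|), in [-1, 1]
sim = F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item()
print(f"cosine similarity : {sim:.6f}")
```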