# Roko

A deep-learning-based tool for consensus polishing.

## Description

Roko is a consensus polisher that takes a draft assembly and aligned reads in BAM format and outputs a set of contigs in FASTA format. It uses a deep learning architecture to produce a high-quality consensus. Features are represented as sampled reads in a window, and labels are mapped to the draft assembly in Medaka-style fashion.

## Dependencies

- Check HTSlib dependencies.
- gcc 5.0+ and g++
- python 3.6 or 3.7 (python3-dev and venv)

## Installation

### GPU

    git clone https://github.com/lbcb-sci/roko.git roko
    cd roko
    make gpu

### CPU

    git clone https://github.com/lbcb-sci/roko.git roko
    cd roko
    make cpu

## Usage

To activate the virtual environment:

    . $PROJECT_DIR/roko/bin/activate

To generate features for model training or inference:

    python features.py [options ...] <ref> <X> <o>

        <ref>    Draft sequence in FASTA format
        <X>      Reads aligned to <ref> in BAM format
        <o>      Output name (e.g. output.hdf5)

        options:
            --Y    Truth genome aligned to <ref> in BAM format (training only)
            --t    Number of worker processes (default: 1)

To generate the BAM files for feature generation, the mini_align script from pomoxis is recommended.

To train a model:

    python train.py [options ...] <train> <out>

        <train>    Directory containing generated .hdf5 files used for training (or one .hdf5 file)
        <out>      Directory for saving the trained model

        options:
            --val       Directory containing generated .hdf5 files used for validation (or one .hdf5 file)
            --b         Batch size used for training and validation (default: 128)
            --memory    If the flag is present, training and validation data are stored in RAM (default: False)
            --t         Number of workers for each data loader: --t workers for the training
                        data loader and another --t for the validation data loader (default: 0)

To run inference:

    python inference.py [options ...] <data> <model> <out>

        <data>     Path to the generated features in .hdf5
        <model>    Path to the saved model in .pth format
        <out>      Path to the output file (FASTA format)

        options:
            --t    Number of workers for inference (default: 0)
            --b    Inference batch size (default: 128)

## Comparison

The model was trained and tested on FASTQ basecalls from Zymo R10 Native “3 Peaks”. Data was binned using Loman's script. Draft assemblies were generated using raven. The BAM files used for feature generation and for labeling were generated with the mini_align script from pomoxis.

The organisms used for training are B. subtilis, E. faecalis, E. coli, L. monocytogenes and S. enterica. P. aeruginosa was used for validation. Models were tested on S. aureus.

Results were evaluated using the pomoxis assess_assembly script. The (mean) results are given in the following table:

| Model  | Total error | Mismatch | Deletion | Insertion | Qscore |
|--------|-------------|----------|----------|-----------|--------|
| Raven  | 0.160%      | 0.040%   | 0.059%   | 0.061%    | 27.97  |
| Medaka | 0.037%      | 0.012%   | 0.007%   | 0.017%    | 34.30  |
| HELEN  | 0.066%      | 0.019%   | 0.031%   | 0.016%    | 31.78  |
| Roko   | 0.035%      | 0.013%   | 0.008%   | 0.013%    | 34.55  |

Total error may not equal the sum of the individual error types because of rounding.

## Download

The model described in the Comparison section (R10, Guppy 2.3.8) can be downloaded here.

## Contact information

This tool is still in an early development stage. All bugs and questions can be reported to: dominik.stanojevic@fer.hr, mile.sikic@fer.hr or mile_sikic@gis.a-star.edu.sg.
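
As a rough cross-check of the Qscore column in the comparison table, the sketch below converts the mean total error rates into Phred-scaled quality values. It assumes Qscore is defined as Q = -10 * log10(error rate), which this README does not state explicitly. The function name `phred_qscore` is illustrative and is not part of Roko or pomoxis.

```python
import math


def phred_qscore(error_percent: float) -> float:
    """Convert an error rate given in percent to a Phred-scaled Q-score.

    Assumption: Q = -10 * log10(error rate as a fraction).
    """
    if error_percent <= 0:
        raise ValueError("error rate must be positive")
    return -10.0 * math.log10(error_percent / 100.0)


# Mean total error (%) per model, copied from the comparison table.
total_error = {"Raven": 0.160, "Medaka": 0.037, "HELEN": 0.066, "Roko": 0.035}

for model, err in total_error.items():
    print(f"{model}: Q{phred_qscore(err):.2f}")
```

Under that assumption, the computed values fall within roughly 0.03 of the table's Qscore column. Small differences are expected because the table reports means and rounded error rates.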