Author: Nick Marze (nickmarze@gmail.com)

Last edited 7/31/2017 by Nick Marze. Corresponding PI Jeffrey Gray (jgray@jhu.edu).

# Code

• Application source code: rosetta/main/source/src/apps/public/motif_dock/make_motif_tables.cc

To run any of the modes in make_motif_tables, type the following on the command line:

[path to executable]/make_motif_tables.[platform|linux/mac][compile|gcc/ixx]release -database [path to database] @options

To run docking in Motif Dock Score mode, type the following on the command line:

[path to executable]/docking_protocol.[platform|linux/mac][compile|gcc/ixx]release -database [path to database] -mh:path:scores_BB_BB [path to score tables + score table prefix] -mh:score:use_ss1 false -mh:score:use_ss2 false -mh:score:use_aa1 true -mh:score:use_aa2 true -docking_low_res_score motif_dock_score @options

Note: default argument for -mh:path:scores_BB_BB is: [path to Rosetta main]/database/protocol_data/motif_dock/xh_16_

Note: code currently resides in branch: ssrb19/ensembledock_separate, to be merged into master in Fall 2017.
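When docking runs are launched from a script, it can help to assemble the Motif Dock Score command as an argument list rather than a single long string. The sketch below uses only the flags listed above. The executable name, the database path, and the options file name are hypothetical placeholders and must be replaced with the values for your own build.

```python
import shlex


def motif_dock_command(executable, database, score_table_prefix, options_file):
    """Assemble the Motif Dock Score docking command as an argument list."""
    return [
        executable,
        "-database", database,
        "-mh:path:scores_BB_BB", score_table_prefix,
        "-mh:score:use_ss1", "false",
        "-mh:score:use_ss2", "false",
        "-mh:score:use_aa1", "true",
        "-mh:score:use_aa2", "true",
        "-docking_low_res_score", "motif_dock_score",
        "@" + options_file,
    ]


# Hypothetical paths, for illustration only.
database = "path/to/rosetta/main/database"
cmd = motif_dock_command(
    executable="bin/docking_protocol.linuxgccrelease",
    database=database,
    # Default table prefix noted above, relative to the database directory.
    score_table_prefix=database + "/protocol_data/motif_dock/xh_16_",
    options_file="options",
)

# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run`, which avoids shell quoting problems when paths contain spaces.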

# References

We recommend the following articles for further studies of RosettaDock methodology and applications:

• Marze, N. A.; Roy Burman, S. S.; Sheffler, W.; Gray, J. J. (2018) Efficient flexible backbone protein–protein docking for challenging targets. Bioinformatics 34(20), pp 3461–3469

# Application purpose

Fast low-resolution docking using full-atom energies mapped to backbone coordinates

# Algorithm

Motif Dock Scoring relies on precalculated tables of residue-pair interaction energies mapped to backbone geometries. Tables optimized on REF15 energies can be found in database/protocol_data/motif_dock. If necessary, new tables can be generated using the make_motif_tables public app.

The score tables generated by the public app make_motif_tables are used as part of the Motif Dock Score protocol, which is accessed with a set of flags through the Docking Protocol application.

# Modes

There are five modes within make_motif_tables, each activated using a command line flag:

## Harvest motifs

Queried PDB files are read in and scored. Scores are decomposed to all residue pairs, and pairs with negative energies are stored as single-line pair motifs. Each pair motif contains fields for, among others, residue numbers, amino acid identities, the residue pair score, and the coordinates of the 6D transformation needed to superimpose one amino acid backbone onto the other. All motifs from a single PDB file are concatenated and compressed into a single .rpm.bin.gz file (not human readable). One output .rpm.bin.gz file is written for each input PDB file, tagged by the four-character PDB code.

Mode flag usage:

-mh:harvest_motifs [list of PDB files to query]

Required flags:

-out:file:o [path to rpm output directory]
-mh:filter:filter_harvest
-mh:filter:motif_type BB_BB

Suggested flag:

-mh:harvest:max_res 5000 (ignores query PDB structures with > 5000 residues)

Notes on future optimizations: Currently, make_motif_tables uses a hard-coded, modified REF15 score function. To use a new function, the full_score variable calculation must be edited with the new score function weights; additionally, the appropriate -score:weights flag and any relevant patch flags must be passed.

During optimization, including single-body energy terms gave worse performance than omitting them. This behavior arises because poor single-body term scores mask good two-body term scores, pushing the overall pair score above the 0 REU cutoff for storing the motif. Likewise, good single-body term scores mask poor two-body term scores, causing non-optimal motifs to be stored. In principle, this convolution effect could be ignored if all motifs were stored regardless of their energy. In practice, this is not feasible, as the larger number of motifs quickly becomes memory-prohibitive on current hardware (> 50 GB memory when using -3 REU cutoff and full PDB motif set).
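The masking effect of single-body terms can be illustrated with a small arithmetic sketch. The energy values below are invented for illustration and are not taken from any real structure; the function name `is_stored` is hypothetical and does not correspond to anything in make_motif_tables.

```python
CUTOFF_REU = 0.0  # a motif is stored only when its pair score is below this


def is_stored(two_body, one_body=0.0, cutoff=CUTOFF_REU):
    """Return True if the combined pair score falls below the storage cutoff."""
    return (two_body + one_body) < cutoff


# A favorable pair interaction (-2.0) hidden by a poor single-body score (+3.0):
# the sum is +1.0, so the motif would be discarded.
good_pair_masked = is_stored(two_body=-2.0, one_body=3.0)

# An unfavorable pair interaction (+1.0) admitted by a good single-body score (-2.5):
# the sum is -1.5, so a non-optimal motif would be stored.
bad_pair_admitted = is_stored(two_body=1.0, one_body=-2.5)

# With single-body terms omitted, the decision follows the two-body score alone.
good_pair_alone = is_stored(two_body=-2.0)
bad_pair_alone = is_stored(two_body=1.0)
```

In this toy case, omitting the single-body term recovers the intended behavior: the favorable pair is kept and the unfavorable pair is dropped.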

## Merge motifs

To build the score tables, all rpm files must be passed at once on the command line. When the full PDB is used as the query set, the resulting number of rpm files exceeds the typical character limit of the command line. As a workaround, the raw rpm files can be merged into a smaller number of larger motif lists. N.B.: Do not merge into a single rpm file, as this will likely crash make_motif_tables due to memory requirements. The recommended approach is to generate one merged rpm file for each 2-character pair from the middle of the four-character PDB code (e.g. merged rpm file B3 will contain motifs from 1B3F, 1B3Z, 2B3A, etc.).

Mode flag usage:

-mh:merge_motifs [list of raw .rpm.bin.gz files]

Required flag:

-mh:motif_out_file [path to merged .rpm.bin.gz file]
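The grouping recommended above can be prepared with a short script before calling the merge mode once per group. This is a minimal sketch that assumes each raw file name begins with the four-character PDB code (for example `1B3F.rpm.bin.gz`); the exact naming of the harvest output is an assumption here, and the function name is hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def group_by_middle_code(rpm_paths):
    """Group rpm files by the middle two characters of their PDB code."""
    groups = defaultdict(list)
    for path in rpm_paths:
        # Assumption: the file name starts with the four-character PDB code.
        pdb_code = Path(path).name[:4].upper()
        groups[pdb_code[1:3]].append(str(path))
    return dict(groups)


# Hypothetical file names, for illustration only.
files = [
    "rpm/1B3F.rpm.bin.gz",
    "rpm/1B3Z.rpm.bin.gz",
    "rpm/2B3A.rpm.bin.gz",
    "rpm/4HHB.rpm.bin.gz",
]
groups = group_by_middle_code(files)

for key, members in sorted(groups.items()):
    # Each group would become one -mh:merge_motifs run with its own output file.
    print(key, len(members))
```

Each resulting group is then short enough to pass as a single file list on the command line.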

## Harvest scores

Builds score tables from the motif lists. Motifs are read in one-by-one, and their 6D transformation geometry is matched to the corresponding bin in the score table. The motif score is compared to the bin score, and the lower of the two is stored as the new bin score. After all motifs are read, the score table will contain some number of populated bins, in which the lowest-energy conformation corresponding with the bin geometry is represented, and some number of empty bins, in which no matching geometry was observed in the motif set, and in which the score is 0. One score table will be generated for each amino acid pair, stored in compressed .xh.bin.gz format (not human readable).
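The table-filling rule described above (keep the lower of the bin score and the motif score, and treat unseen bins as 0) can be sketched as follows. This is a simplified stand-in, not the Rosetta implementation: it assumes the 6D transformation is given as three translations and three angles and bins them on a plain rectangular grid, which may differ from the hashing used internally. All function names are hypothetical.

```python
import math


def bin_key(transform, cart_resl, angle_resl):
    """Map a 6D transform (x, y, z, three angles) to a grid bin index."""
    x, y, z, a, b, c = transform
    return (
        math.floor(x / cart_resl),
        math.floor(y / cart_resl),
        math.floor(z / cart_resl),
        math.floor(a / angle_resl),
        math.floor(b / angle_resl),
        math.floor(c / angle_resl),
    )


def build_table(motifs, cart_resl=2.0, angle_resl=22.5):
    """Fill a sparse score table, keeping the lowest score seen in each bin."""
    table = {}
    for transform, score in motifs:
        key = bin_key(transform, cart_resl, angle_resl)
        # Unpopulated bins are treated as 0, so compare against 0.0 by default.
        table[key] = min(table.get(key, 0.0), score)
    return table


def lookup(table, transform, cart_resl=2.0, angle_resl=22.5):
    """Return the bin score for a transform, or 0.0 for an empty bin."""
    return table.get(bin_key(transform, cart_resl, angle_resl), 0.0)


# Invented motifs: the first two share a bin, the third falls in another bin.
motifs = [
    ((1.0, 0.5, 0.2, 10.0, 5.0, 3.0), -1.5),
    ((1.5, 1.0, 0.1, 12.0, 8.0, 1.0), -2.5),
    ((5.0, 0.5, 0.2, 10.0, 5.0, 3.0), -0.5),
]
table = build_table(motifs)
```

The default resolutions of 2 and 22.5 match the suggested values for hash_cart_resl and hash_angle_resl; a dictionary keeps only populated bins, so empty bins cost no memory in this sketch.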

Mode flag usage:

-mh:harvest_scores [list of merged .rpm.bin.gz files]

Required flags (suggested values in <>):

-mh:motif_out_file [path to xh output directory]
-mh:filter:filter_harvest
-mh:filter:motif_type BB_BB
-mh:harvest:sep_aa true
-mh:harvest:hash_cart_resl <2>
-mh:harvest:hash_angle_resl <22.5>

N.B.: the -mh:motif_out_file path variable will also become the prefix for the .xh.bin.gz filenames, and should not contain any “/” characters

Suggested flags:

-mh:harvest:agg_with_max TRUE
-mh:harvest:smoothing_factor 1

Notes on future optimizations: The hash_cart_resl and hash_angle_resl flags set the bin size for the translational and rotational coordinates, respectively. The former should divide evenly into 16, and the latter should divide evenly into 180, to ensure equal bin size. Setting agg_with_max to FALSE changes how the score table bins are populated: rather than storing the minimum of the bin score and the queried motif score, it stores the sum of the two. This setting is strongly discouraged, as it produces poor Motif Dock Score results, with a few highly populated bins dominating the Motif Dock Score. This population method needs extensive rebuilding to be viable in the general case. The smoothing_factor flag alters the smoothness of the score tables. Sufficiently low motif scores are also stored in neighboring score bins; higher values of smoothing_factor increase the radius of population around the central score bin, increasing the number of bins populated with the low-scoring motif.
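Two of the points above lend themselves to a quick check: whether a chosen resolution divides evenly into 16 or 180, and why summing scores lets crowded bins dominate. The sketch below uses invented motif scores, and both function names are hypothetical; the `agg_with_max` parameter only mirrors the flag name and the behavior described in this section.

```python
def divides_evenly(total, resl, tol=1e-9):
    """Return True if resl divides total into a whole number of bins."""
    n_bins = total / resl
    return abs(n_bins - round(n_bins)) < tol


def aggregate(scores, agg_with_max=True):
    """Combine the motif scores that fall in one bin.

    With agg_with_max True the lowest score is kept; with False the
    scores are summed, as described for the flag above.
    """
    return min(scores) if agg_with_max else sum(scores)


# Resolution checks for the suggested values and two poor choices.
cart_ok = divides_evenly(16, 2)        # 8 bins
angle_ok = divides_evenly(180, 22.5)   # 8 bins
cart_bad = divides_evenly(16, 3)       # 5.33 bins
angle_bad = divides_evenly(180, 25)    # 7.2 bins

# Invented example: a crowded bin of weak motifs versus one strong motif.
crowded_bin = [-0.5] * 50
sparse_bin = [-3.0]

min_crowded = aggregate(crowded_bin)                      # -0.5
min_sparse = aggregate(sparse_bin)                        # -3.0
sum_crowded = aggregate(crowded_bin, agg_with_max=False)  # -25.0
sum_sparse = aggregate(sparse_bin, agg_with_max=False)    # -3.0
```

With the minimum rule the single strong motif scores best, whereas with the sum rule the crowded bin of weak motifs outweighs it, which is the dominance effect noted above.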

## Print motifs

Prints motifs from a compressed .rpm.bin.gz file in a human-readable form.

Mode flag usage:

-mh:print_motifs [path to .rpm.bin.gz file]

## Print scores

Prints a human-readable list of populated bins within a .xh.bin.gz file.

Mode flag usage:

-mh:print_scores [path to .xh.bin.gz file]

Required flags:

-mh:harvest:hash_cart_resl <matching value of hash_cart_resl flag used in harvest_scores>
-mh:harvest:hash_angle_resl <matching value of hash_angle_resl flag used in harvest_scores>