Currently to get the number of iterations a mover, such as FastRelax does, I enable the dump_trajectory scoretype and count the models in the PDBs (code below).
This seems very very weird way of doing this, so I was wondering if there is a less silly way of doing this?
Also, is there a direct way to open a multimodel file say to get the scores of each? (Rather than extracting the models, writing to file, reading them into PyRosetta and scoring them).
# ============ Methods ======================================
DumpTrajectoryEnergy = pyrosetta.rosetta.core.energy_methods.DumpTrajectoryEnergy
#formerly: DumpTrajectoryEnergy = pyrosetta.rosetta.core.scoring.util_methods.DumpTrajectoryEnergy
def enable_trajectory(scorefxn: pyrosetta.ScoreFunction,
prefix:str='pose',
stride:int=100,
gz:bool=False) -> DumpTrajectoryEnergy:
dt_st = pyrosetta.rosetta.core.scoring.ScoreType.dump_trajectory
scorefxn.set_weight(dt_st, 1)
# ----- make the traj dump less intense -----
# settable by cmd line options... maybe better
old_traj_energy = [k for k in scorefxn.all_methods() if isinstance(k, DumpTrajectoryEnergy)][0]
scorefxn.all_methods().remove(old_traj_energy)
emo = pyrosetta.rosetta.core.scoring.methods.EnergyMethodOptions()
if gz:
# emo.dump_trajectory_gz(True)
raise RuntimeError('this just outputs a text file with gz suffix... (?!)')
emo.dump_trajectory_prefix(prefix)
emo.dump_trajectory_stride(stride)
new_traj_energy = DumpTrajectoryEnergy(emo)
scorefxn.all_methods().append(new_traj_energy)
return new_traj_energy
import os, gzip, re, pathlib, datetime
import pandas as pd
def get_traj_interations(prefix:str) -> pd.DataFrame:
raw_trajs = []
for file in os.listdir():
if re.match(prefix, file):
with open(file,'r') as fh:
file_content=fh.read()
modstamp = pathlib.Path(file).stat().st_mtime
modtime = datetime.datetime.fromtimestamp(modstamp)
models = re.findall('MODEL', file_content)
raw_trajs.append(dict(filename=file,
time=modtime,
n_models=len(models))
)
trajs = pd.DataFrame(raw_trajs).sort_values('time')
cat_df = uncontrajs.filename.str.extract(r'_([^_]+)\.pdb')
trajs['short_name'] = cat_df[0] + '_' + cat_df.groupby([0]).cumcount().add(1).astype(str)
return trajs
# ============ Run ======================================
experiment_name = 'test123'
scorefxn = pyrosetta.get_fa_scorefxn()
new_traj_energy = enable_trajectory(scorefxn=scorefxn,
prefix=experiment_name,
stride=10_000)
relax = pyrosetta.rosetta.protocols.relax.FastRelax(scorefxn, 3)
relax.apply(pose)
trajs = get_traj_interations(experiment_name)
trajs.head()
filename | time | n_models | short_name | |
---|---|---|---|---|
43 | test_unconstrained__20210720141946221_score.pdb | 2021-07-20 14:19:46.429851 | 4 | score_1 |
27 | test_unconstrained__20210720141946966_pack.pdb | 2021-07-20 14:19:48.854521 | 28 | pack_1 |
67 | test_unconstrained__20210720141949017_score.pdb | 2021-07-20 14:19:49.172959 | 4 | score_2 |
30 | test_unconstrained__20210720141949412_min.pdb | 2021-07-20 14:19:52.569970 | 8 | min_1 |
95 | test_unconstrained__20210720141952687_score.pdb | 2021-07-20 14:19:52.815413 | 4 | score_3 |
Category:
Post Situation:
I'd back up for a minute and ask what you're looking to use this information for. Usually you just count the number of apply()'s of whatever top-level mover you're using as the number of iterations, and treat what things happen internally as something of a black box (modulo whatever settings are exposed by the configuration of the top-level mover). If you have particular reasons to dig deeper into the internals, you have to carefully define the bounds of what you're interested in monitoring, because depending on what you're doing, you might count as "changes" things which aren't actually the steps/iterations you're interested in, or you may miss some of the desired steps which are done "externally" or separately.
Regarding getting scores from a multi-model file, if you can't see scoring information in text form in the file, Rosetta probably hasn't dumped it out. (You can dump out structure information without dumping scoring information, depending on how you do it.) If that's the case, you'd need to rescore the structure to get the score.
Well, in the case I was interested today I wanted to see how many repacking interations a FastRelax run was doing with and without a particular set of constraints as there was a large running time difference and I was concerned that it was that affecting the run —whereas the distribution of repacking events did change across the gradient of the cycle, there was nothing to worry about and the time difference was probably just the overhead from all the PyRosetta-defined constraints. But in general, I would argue that testing something to make sure it is behaving as hoped is better than just assuming it is.
But actually in FastRelax, one can set the max_inter getter/setter of FastRelax but the apply method does not say how many were done nor, from what I can tell, is there is no counter in pose.energies() or similar or in the log... but I am guessing there maybe one given than one can set the max_iter.
About the multi-model file, I mean simply reading it into memory. The models in a trajectory PDB are not scored, so yes, I know I need to rescore them. If one reads a multimodel PDB file, for example the PDB:1L2Y (tryptophan minicage NMR) it loads the models as separate chains into a single pose, which needs to be split up. I.e. it does not load it as a vector1_core_pose_Pose of poses or similar.
It seems rather dangerous and potentially complicated —multichain multimodel etc. Hence why I was wondering if there was a I/O specifically for multimodel PDBs.