Hi everyone, I am new to Python and need help writing a script for the following task. I am running a modified AutoDock program and need to compile the results.
I have a folder that contains hundreds of *.pdbqt files named “compound_1.pdbqt”, “compound_2.pdbqt”, etc.
Each file has a structure like this:
MODEL 1
REMARK minimizedAffinity -7.11687565
REMARK CNNscore 0.573647082
REMARK CNNaffinity 5.82644749
REMARK 11 active torsions:
#Lots of text here
MODEL 2
REMARK minimizedAffinity -6.61898327
REMARK CNNscore 0.55260396
REMARK CNNaffinity 5.86855984
REMARK 11 active torsions:
#Lots of text here
#Repeat with 10 to 20 models
I want to use a Python 3 script to extract the “MODEL”, “minimizedAffinity”, “CNNscore”, and “CNNaffinity” values of every compound in the folder into a delimited text file. This is what I have so far:
#!/usr/bin/env python
import sys
import glob

files = glob.glob('**/*.pdbqt', recursive=True)
for file in files:
    word1 = 'MODEL'
    word2 = 'minimizedAffinity'
    word3 = 'CNNscore'
    word4 = 'CNNaffinity'
    print(file)
    with open(file) as fp:
        # read all lines in a list
        lines = fp.readlines()
        for line in lines:
            # check if string present on a current line
            if line.find(word1) != -1:
                print('Line:', line)
            if line.find(word2) != -1:
                print('Line:', line)
            if line.find(word3) != -1:
                print('Line:', line)
            if line.find(word4) != -1:
                print('Line:', line)
Okay, so what do you imagine are the logical steps required to solve the problem? Where exactly are you getting stuck? It seems like you already tried to write code to find all the files, open each one, look at the lines, and see whether they match certain patterns. Right? Then - what do you suppose is the next step? For example, suppose the code is going through one of the files, and it finds a line that includes the 'MODEL' text. What should happen next?
What I have in mind is to extract the numerical value after the 'MODEL', 'minimizedAffinity', etc. strings. I also need to extract the file_name part of file_name.pdbqt, then arrange them into space-delimited rows. I don’t know what code is needed for these steps… Afterward, I want to export everything to a text file.
I’d suggest reading the file line by line and using .split to split each of them on whitespace:
>>> line = 'REMARK minimizedAffinity -7.11687565\n' # An example line.
>>> fields = line.split()
>>> print(fields)
['REMARK', 'minimizedAffinity', '-7.11687565']
>>>
Keep track of which values you’ve collected (minimizedAffinity, CNNscore, CNNaffinity), or just count them, and when you’ve got all of them (all 3), write another row to the output file.
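As a rough sketch of that bookkeeping (not the only way to do it): the function below splits each line on whitespace, remembers the current model number, and yields a row once all three values for that model have been seen. The name `parse_models` and the `sample` text are made up for illustration, and the sketch assumes every line of interest looks exactly like the ones in the question.

```python
def parse_models(lines):
    """Yield (model, minimizedAffinity, CNNscore, CNNaffinity) for each model."""
    wanted = ("minimizedAffinity", "CNNscore", "CNNaffinity")
    model = None   # model number of the block currently being read
    values = {}    # values collected so far for that model
    for line in lines:
        fields = line.split()
        if len(fields) == 2 and fields[0] == "MODEL":
            # A new model starts: remember its number and reset the collection.
            model = fields[1]
            values = {}
        elif len(fields) == 3 and fields[0] == "REMARK" and fields[1] in wanted:
            values[fields[1]] = fields[2]
            # Once all three values are present, the row is complete.
            if model is not None and len(values) == len(wanted):
                yield (model, *(values[key] for key in wanted))
                values = {}


# Hypothetical sample, copied from the structure shown in the question.
sample = """MODEL 1
REMARK minimizedAffinity -7.11687565
REMARK CNNscore 0.573647082
REMARK CNNaffinity 5.82644749
REMARK 11 active torsions:
MODEL 2
REMARK minimizedAffinity -6.61898327
REMARK CNNscore 0.55260396
REMARK CNNaffinity 5.86855984
REMARK 11 active torsions:
""".splitlines()

rows = list(parse_models(sample))
for row in rows:
    print(*row, sep="\t")
```

Lines such as `REMARK 11 active torsions:` split into four fields, so they are skipped without any special handling.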
import sys
import glob
import re
import os

word1 = 'MODEL'
word2 = 'minimizedAffinity'
word3 = 'CNNscore'
word4 = 'CNNaffinity'

files = glob.glob('**/*.pdbqt', recursive=True)
print('Compound', 'MODEL', 'minimizedAffinity', 'CNNscore', 'CNNaffinity', sep='\t')
for file in files:
    with open(file) as fp:
        # read all lines in a list
        lines = fp.readlines()
        for line in lines:
            # check if string present on a current line
            match1 = re.search('MODEL (.+)', line)
            match2 = re.search('minimizedAffinity (.+)', line)
            match3 = re.search('CNNscore (.+)', line)
            match4 = re.search('CNNaffinity (.+)', line)
            if match1:
                print(os.path.splitext(file)[0], end='\t')
                print(match1.group(1), end='\t')
            if match2:
                print(match2.group(1), end='\t')
            if match3:
                print(match3.group(1), end='\t')
            if match4:
                print(match4.group(1))
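The script above prints the rows to the screen. For the "export to a text file" step, one possible variation is sketched below. It uses the standard `csv` module with a tab delimiter and `pathlib` to take the compound name from the file name. The names `write_summary` and `summary.tsv` are hypothetical, and the temporary folder at the bottom exists only to make the example self-contained. Note that `path.stem` drops the directory part, whereas `os.path.splitext(file)[0]` keeps it.

```python
import csv
import re
import tempfile
from pathlib import Path

# Matches "MODEL 1" as well as "REMARK <keyword> <value>" for the three scores.
PATTERN = re.compile(
    r"^(?:REMARK\s+)?(MODEL|minimizedAffinity|CNNscore|CNNaffinity)\s+(\S+)"
)
HEADER = ["Compound", "MODEL", "minimizedAffinity", "CNNscore", "CNNaffinity"]


def write_summary(folder, out_path):
    """Write one tab-separated row per model found in *.pdbqt files under folder."""
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        writer.writerow(HEADER)
        for path in sorted(Path(folder).rglob("*.pdbqt")):
            row = {}
            with open(path) as fp:
                for line in fp:
                    match = PATTERN.match(line)
                    if not match:
                        continue
                    key, value = match.groups()
                    if key == "MODEL":
                        # Start a fresh row; the compound name is the file name
                        # without the .pdbqt extension.
                        row = {"Compound": path.stem}
                    row[key] = value
                    if len(row) == len(HEADER):
                        writer.writerow([row[name] for name in HEADER])
                        row = {}


# Demonstration on a throwaway folder holding one small, made-up file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "compound_1.pdbqt").write_text(
        "MODEL 1\n"
        "REMARK minimizedAffinity -7.11687565\n"
        "REMARK CNNscore 0.573647082\n"
        "REMARK CNNaffinity 5.82644749\n"
        "REMARK 11 active torsions:\n"
    )
    out_file = Path(tmp) / "summary.tsv"
    write_summary(tmp, out_file)
    summary = out_file.read_text()

print(summary)
```

For the real data, something like `write_summary('.', 'summary.tsv')` would be run from the folder that holds the compounds.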