Need help with a Python script to extract information from a list of files and write it to a log file

Hi everyone, I am new to Python and need help writing a script for this task. I am running a modified AutoDock program and need to compile the results.

I have a folder that contains hundreds of *.pdbqt files named “compound_1.pdbqt”, “compound_2.pdbqt”, etc.

Each file has a structure like this:

MODEL 1
REMARK minimizedAffinity -7.11687565
REMARK CNNscore 0.573647082
REMARK CNNaffinity 5.82644749
REMARK  11 active torsions:
#Lots of text here
MODEL 2
REMARK minimizedAffinity -6.61898327
REMARK CNNscore 0.55260396
REMARK CNNaffinity 5.86855984
REMARK  11 active torsions:
#Lots of text here
#Repeat with 10 to 20 models

I want to use a Python 3 script to extract the “MODEL”, “minimizedAffinity”, “CNNscore”, and “CNNaffinity” values of every compound in the folder into a delimited text file that looks like this:

Compound Model minimizedAffinity CNNscore CNNaffinity 
1 1 -7.11687565 0.573647082 5.82644749
1 2 -6.61898327 0.55260396 5.86855984

Currently I am stuck at this script

#! /usr/bin/env python

import sys
import glob

files = glob.glob('**/*.pdbqt', 
                   recursive = True)
for file in files:
    word1 = 'MODEL'
    word2 = 'minimizedAffinity'
    word3 = 'CNNscore'
    word4 = 'CNNaffinity'
    print(file)
    with open(file) as fp:
        # read all lines into a list
        lines = fp.readlines()
        for line in lines:
            # check if string present on the current line
            if line.find(word1) != -1:
                print('Line:', line)
            if line.find(word2) != -1:
                print('Line:', line)
            if line.find(word3) != -1:
                print('Line:', line)
            if line.find(word4) != -1:
                print('Line:', line)

Really appreciate any help.

Thank you very much.

Okay, so what do you imagine are the logical steps required to solve the problem? Where exactly are you getting stuck? It seems like you already tried to write code to find all the files, open each one, look at the lines, and see whether they match certain patterns. Right? Then - what do you suppose is the next step? For example, suppose the code is going through one of the files, and it finds a line that includes the 'MODEL' text. What should happen next?

Thank you very much.

What I have in mind is to extract the numerical value after the 'MODEL', 'minimizedAffinity', etc. strings. I also need to extract the file_name value from file_name.pdbqt, and then arrange them into space-delimited rows. I don’t know what code is needed for these steps. Afterward, I want to export everything to a text file.

I’d suggest reading the file line by line and using .split to split each of them on whitespace:

>>> line = 'REMARK minimizedAffinity -7.11687565\n' # An example line.
>>> fields = line.split()
>>> print(fields)
['REMARK', 'minimizedAffinity', '-7.11687565']
>>>

Keep track of which values you’ve collected (minimizedAffinity, CNNscore, CNNaffinity), or just count them, and when you’ve got all 3, write another row to the output file.
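To make the "collect, then write a row" idea concrete, here is a minimal sketch. The function name parse_models and the sample text are only illustrative (they are not part of any library), and it assumes every model block contains all three REMARK lines, as in the example file from the question.

```python
def parse_models(lines):
    """Yield (model, minimizedAffinity, CNNscore, CNNaffinity) tuples of strings."""
    wanted = ("minimizedAffinity", "CNNscore", "CNNaffinity")
    model = None
    values = {}
    for line in lines:
        fields = line.split()  # split on any run of whitespace
        if len(fields) >= 2 and fields[0] == "MODEL":
            # A new model starts: remember its number, forget old values
            model = fields[1]
            values = {}
        elif len(fields) >= 3 and fields[0] == "REMARK" and fields[1] in wanted:
            values[fields[1]] = fields[2]
            # Once all three values are collected, emit one row
            if model is not None and len(values) == len(wanted):
                yield (model, *(values[key] for key in wanted))
                values = {}


# Sample text copied from the question (hypothetical stand-in for a real file)
sample = """\
MODEL 1
REMARK minimizedAffinity -7.11687565
REMARK CNNscore 0.573647082
REMARK CNNaffinity 5.82644749
REMARK  11 active torsions:
MODEL 2
REMARK minimizedAffinity -6.61898327
REMARK CNNscore 0.55260396
REMARK CNNaffinity 5.86855984
REMARK  11 active torsions:
""".splitlines()

rows = list(parse_models(sample))
for row in rows:
    print(" ".join(row))
```

For a real file you would pass the open file object instead of sample, since iterating over a file also gives one line at a time.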

Could be something like this, I guess


import glob

COMPOUND = 'Compound'
MODEL = 'MODEL'
MINIMIZED_AFFINITY = 'minimizedAffinity'
CNN_SCORE = 'CNNscore'
CNN_AFFINITY = 'CNNaffinity'
REMARK_MINIMIZED_AFFINITY = 'REMARK minimizedAffinity'
REMARK_CNN_AFFINITY = 'REMARK CNNaffinity'
REMARK_SCORE = 'REMARK CNNscore'


def dump_to(outfile, *args):
    """Write to `outfile` the arguments separated by tabs"""
    outfile.write(
        '\t'.join(args)
    )
    outfile.write('\n')


with open('output.txt', 'w', encoding='utf-8') as outfile:
    outfile.write(
        f'{COMPOUND}\t{MODEL}\t{MINIMIZED_AFFINITY}'
        f'\t{CNN_SCORE}\t{CNN_AFFINITY}\n'
    )
    files = glob.glob('./*.pdbqt')
    for file in files:
        compound = file.split('_')[1].split('.')[0]
        print(file)
        with open(file, 'r', encoding='utf-8') as fp:
            # Values are kept as text, exactly as they appear in the file
            model = None
            minimized_affinity = None
            cnn_score = None
            cnn_affinity = None
            for line in fp:
                line = line.strip()
                # Check which keyword the current line starts with
                if line.startswith(REMARK_CNN_AFFINITY):
                    cnn_affinity = line.split()[2]
                elif line.startswith(REMARK_SCORE):
                    cnn_score = line.split()[2]
                elif line.startswith(REMARK_MINIMIZED_AFFINITY):
                    minimized_affinity = line.split()[2]
                elif line.startswith(MODEL):
                    # A new model begins: write out the previous one first
                    if model is not None:
                        dump_to(
                            outfile,
                            compound,
                            model,
                            minimized_affinity,
                            cnn_score,
                            cnn_affinity
                        )
                    model = line.split()[1]
            # Write the last model of the file, if the file had any
            if model is not None:
                dump_to(
                    outfile,
                    compound,
                    model,
                    minimized_affinity,
                    cnn_score,
                    cnn_affinity
                )

Thanks for everyone's input.

This is the script I have come up with:

import sys
import glob
import re
import os
word1 = 'MODEL'
word2 = 'minimizedAffinity'
word3 = 'CNNscore'
word4 = 'CNNaffinity'
files = glob.glob('**/*.pdbqt', 
                   recursive = True)
print('Compound', 'MODEL' , 'minimizedAffinity', 'CNNscore' ,'CNNaffinity', sep='\t')
for file in files:
    with open(file) as fp:
        # read all lines into a list
        lines = fp.readlines()
        for line in lines:
            # check if string present on the current line
            match1 = re.search('MODEL (.+)', line)
            match2 = re.search('minimizedAffinity (.+)', line)
            match3 = re.search('CNNscore (.+)', line)
            match4 = re.search('CNNaffinity (.+)', line)
            if match1:
                print (os.path.splitext(file)[0], end='\t')
                print (match1.group(1), end='\t')
            if match2:
                print (match2.group(1), end='\t')
            if match3:
                print (match3.group(1), end='\t')
            if match4:
                print (match4.group(1))
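
A possible refinement, sketched under assumptions: the script above prints to the screen and shows the compound as the file path without its extension, while the original request was a text file with just the compound number. The helper names compound_id and write_rows below are made up for illustration, and the sketch assumes file names follow the compound_N.pdbqt pattern from the question.

```python
import re
from pathlib import Path


def compound_id(path):
    """Return N from a name like compound_N.pdbqt, else the bare file stem."""
    stem = Path(path).stem  # file name without directory and extension
    match = re.fullmatch(r"compound_(\d+)", stem)
    return match.group(1) if match else stem


def write_rows(rows, out_path):
    """Write a header line plus one tab-separated line per row to out_path."""
    header = ("Compound", "MODEL", "minimizedAffinity", "CNNscore", "CNNaffinity")
    with open(out_path, "w", encoding="utf-8") as out:
        print(*header, sep="\t", file=out)
        for row in rows:
            print(*row, sep="\t", file=out)


print(compound_id("docked/compound_12.pdbqt"))  # prints 12
```

Passing file=out to print keeps the same print-based style as the script above but sends the text to the output file instead of the terminal. Alternatively, the existing script can be left as is and its output redirected to a file from the shell.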