String modification problem with a GenBank file

Hello Everybody,

I just started Python 3 and I'm stuck on an exercise about strings.

I want to read a GenBank file and build a single string with all the genes found. So far I have managed to recover every gene using a flag approach. But I have a new problem.

I removed the digits with a solution found online (from string import digits), but after that I still have spaces and \n between the genes ;( So I would like to delete these spaces. I tried "".join(mygenlist) but it didn't work :0

If you have any solution, tell me ^^

Here is my code:

from string import digits
# function for reading the GenBank file
def lit_genbank(nomfichier):
    drapeau=False
    lecture=""
    table = str.maketrans('', '', digits) # translation table that removes digits
    with open(nomfichier, 'r') as filegbk: # open the file
        for ligne in filegbk: # read line by line
            if ligne.find("//")!=-1: # end-of-sequence marker found
                drapeau=False # flag off
            if drapeau : # inside the ORIGIN block
                # remove the digits here
                clarifier= ligne.translate(table)
                lecture = lecture + clarifier.strip()
            if ligne.find("ORIGIN")!=-1: # ORIGIN found: the sequence starts on the next line
                drapeau=True
    return lecture
#
# Main
namefile="GenBank NC_001133.gbk"
seq=lit_genbank(namefile)
seq2= "".join(seq.split()) # does not work
print(seq2) # when I print this I still have some spaces, so .join doesn't seem to work

The GenBank file looks something like this:

ORIGIN
1 ccacaccaca cccacacacc cacacaccac accacacacc acaccacacc cacacacaca
61 catcctaaca ctaccctaac acagccctaa tctaaccctg gccaacctgt ctctcaactt
121 accctccatt accctgcctc cactcgttac cctgtcccat tcaaccatac cactccgaac
181 caccatccat ccctctactt actaccactc acccaccgtt accctccaat tacccatatc
//

The result of this Python code is:
ccacaccaca cccacacacc cacacaccac accacacacc acaccacacc cacacacacacatcctaaca ctac

cctaac acagccctaa tctaaccctg gccaacctgt ctctcaacttaccctccatt accctgcctc cactcgtt

ac cctgtcccat tcaaccatac cactccgaaccaccatccat ccctctactt actaccactc acccaccgtt a

That is very strange, the code you show should work.

>>> seq = "  accctccatt accctgcctc cactcgttac cctgtcccat  "
>>> seq2= "".join(seq.split())
>>> print(seq2)
accctccattaccctgcctccactcgttaccctgtcccat

Can you check that you have posted the correct code?

If you have, try this:

print(len(seq2.splitlines()))
print(' ' in seq2)
print('\N{NO-BREAK SPACE}' in seq2)
print([ord(c) for c in seq2[:50]])

which might give some clues as to what is happening.
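
The same checks can be wrapped in a small helper that lists every whitespace character left in a string, with its position and Unicode name. This is only a sketch: `whitespace_report` and the `sample` string are made up for illustration and are not part of the code in the question.

```python
import unicodedata

def whitespace_report(text):
    """Return (index, code point, Unicode name) for each whitespace character in text."""
    return [
        # Control characters such as "\n" have no Unicode name, hence the default.
        (i, ord(c), unicodedata.name(c, "UNNAMED"))
        for i, c in enumerate(text)
        if c.isspace()
    ]

# Hypothetical string with a normal space, a no-break space and a newline.
sample = "ccac acca\u00a0ca\n"
print(whitespace_report(sample))
```

An empty list means the string really contains no whitespace, so any gaps seen on screen would come from how the output is displayed rather than from the string itself.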

When I test just this:

seq = " START  accctccatt accctgcctc cactcgttac cctgtccca \n azlkfafa qzfzaf cctgtccca \n aezEND "
seq2= "".join(seq.split())
print(seq2)

It works XD : STARTaccctccattaccctgcctccactcgttaccctgtcccaazlkfafaqzfzafcctgtcccaaezEND

Yes, it is the correct code :confused: When I loaded it again, it worked :° I don't understand why, because I didn't change anything ><

For the things you asked for, I get this:

1
False
False
[99, 99, 97, 99, 97, 99, 99, 97, 99, 97, 99, 99, 99, 97, 99, 97, 99, 97, 99, 99, 99, 97, 99, 97, 99, 97, 99, 99, 97, 99, 97, 99, 99, 97, 99, 97, 99, 97, 99, 99, 97, 99, 97, 99, 99, 97, 99, 97, 99, 99]

This is not my cup of tea, so I'll just ask: what is the expected/desired output?

Hey !

I want to extract all the gene nucleotides into one big string (without numbers or spaces). That way, I can use it to get the length of the gene, the number/frequency of the bases A, C, T, G …
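
Once the sequence is one clean string, the length and the base counts could be computed roughly like this. It is a minimal sketch: `seq` here is just the first 20 bases of the sample shown earlier, standing in for the full string returned by the reading function.

```python
from collections import Counter

# Stand-in for the full sequence: the first 20 bases of the sample file.
seq = "ccacaccacacccacacacc"

length = len(seq)
counts = Counter(seq)  # number of occurrences of each character
# Counter returns 0 for bases that never appear, so g and t are safe to look up.
frequencies = {base: counts[base] / length for base in "acgt"}

print(length)
print(counts)
print(frequencies)
```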

I'm sure you can find more information on the Wikipedia page about the GenBank format, here:

https://en.wikipedia.org/wiki/GenBank

I visited https://www.ncbi.nlm.nih.gov/ and it seems to me that the file in FASTA format, which is available for download, is formatted the way you need.

If you have a file named genbank_sample with the following content:

ORIGIN
1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa
181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg
241 ccacactgtc attattataa ttagaaacag aacgcaaaaa ttatccacta tataattcaa
//

Then one can write the following code:

with open('genbank_sample') as f:
    rows = (row.split() for row in f)
    chunks = (word for row in rows for word in row)
    sequences = (seq for seq in chunks if seq.startswith(('a', 'c', 't', 'g')))
    gen = ''.join(sequences)

The value of gen would be:

gatcctccatatacaacggtatctccacctcaggtttagatctcaacaacggaaccattgccgacatgagacagttaggtatcgtcgagagttacaagctaaaacgagcagtagtcagctctgcatctgaagccgctgaagttctactaagggtggataacatcatccgtgcaagaccaagaaccgccaatagacaacatatgtaacatatttaggatatacctcgaaaataataaaccgccacactgtcattattataattagaaacagaacgcaaaaattatccactatataattcaa
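
Note that the startswith filter relies on the file containing only the ORIGIN block. In a full GenBank record, header words that begin with a, c, t or g (for example "gene") would presumably be picked up as well. One possible way around that, sketched here under that assumption, is to combine the flag idea from the question with the word filter. The name `extract_origin_sequence` and the shortened `sample` record are made up for illustration.

```python
def extract_origin_sequence(lines):
    """Join the sequence chunks found between an ORIGIN line and the closing //."""
    chunks = []
    in_origin = False
    for line in lines:
        if line.startswith("//"):
            in_origin = False
        elif in_origin:
            # Keep only alphabetic words; this drops the leading position number.
            chunks.extend(word for word in line.split() if word.isalpha())
        elif line.startswith("ORIGIN"):
            in_origin = True
    return "".join(chunks)

# Shortened, made-up record: the header words must not end up in the result.
sample = """\
LOCUS       example
FEATURES    gene annotation text
ORIGIN
        1 gatcctccat atacaacggt
       61 ccgacatgag acagttaggt
//
"""
print(extract_origin_sequence(sample.splitlines()))
```

Because the function only iterates over lines, an open file object can be passed in directly, e.g. `with open('genbank_sample') as f: gen = extract_origin_sequence(f)`.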