Strings problem modification from a genbankfile

robin7618 · June 3, 2021, 4:38am

Hello Everybody,

I have just started with Python 3 and I am stuck on an exercise about strings.

I want to read a GenBank file and build one string containing all the gene sequence it contains. So far I have managed to recover every sequence line using a flag approach. But now I have a new problem.

I removed the digits with a solution I found online (from string import digits), but afterwards I still have spaces and \n between the sequence chunks ;( So I would like to delete this whitespace. I tried "".join(mygenlist) but it didn't work :0

If you have any solution, tell me ^^

Here is my code:

from string import digits

# Read a GenBank file and return the sequence found between ORIGIN and //
def lit_genbank(nomfichier):
    drapeau = False  # flag: True while we are inside the ORIGIN block
    lecture = ""
    table = str.maketrans('', '', digits)  # translation table that deletes digits
    with open(nomfichier, 'r') as filegbk:  # open the file
        for ligne in filegbk:  # read line by line
            if ligne.find("//") != -1:  # end-of-record marker found
                drapeau = False
            if drapeau:
                # delete the digits, then the surrounding whitespace
                clarifier = ligne.translate(table)
                lecture = lecture + clarifier.strip()
            if ligne.find("ORIGIN") != -1:  # ORIGIN found: the sequence starts on the next line
                drapeau = True
    return lecture

# Main
namefile = "GenBank NC_001133.gbk"
seq = lit_genbank(namefile)
seq2 = "".join(seq.split())  # does not work
print(seq2)  # when I print this I still get spaces, so .join doesn't seem to work

The GenBank file looks something like this:

ORIGIN
1 ccacaccaca cccacacacc cacacaccac accacacacc acaccacacc cacacacaca
61 catcctaaca ctaccctaac acagccctaa tctaaccctg gccaacctgt ctctcaactt
121 accctccatt accctgcctc cactcgttac cctgtcccat tcaaccatac cactccgaac
181 caccatccat ccctctactt actaccactc acccaccgtt accctccaat tacccatatc
//

The result of this Python code is:
ccacaccaca cccacacacc cacacaccac accacacacc acaccacacc cacacacacacatcctaaca ctac

cctaac acagccctaa tctaaccctg gccaacctgt ctctcaacttaccctccatt accctgcctc cactcgtt

ac cctgtcccat tcaaccatac cactccgaaccaccatccat ccctctactt actaccactc acccaccgtt a

steven.daprano · June 3, 2021, 6:11am

That is very strange, the code you show should work.

>>> seq = "  accctccatt accctgcctc cactcgttac cctgtcccat  "
>>> seq2= "".join(seq.split())
>>> print(seq2)
accctccattaccctgcctccactcgttaccctgtcccat

Can you check that you have posted the correct code?

If you have, try this:

print(len(seq2.splitlines()))
print(' ' in seq2)
print('\N{NO-BREAK SPACE}' in seq2)
print([ord(c) for c in seq2[:50]])

which might give some clues as to what is happening.
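As a rough sketch of the same idea, the checks above can be bundled into one small helper that lists every character that is not an expected base letter, together with its position and code point. The name `report_odd_chars` and the default `allowed` set are assumptions made for illustration, not part of any library.

```python
def report_odd_chars(text, allowed="acgtn"):
    """Return (index, repr, code point) for each character not in `allowed`."""
    return [(i, repr(c), ord(c)) for i, c in enumerate(text) if c not in allowed]


# Hypothetical sample containing a space, a newline and a no-break space.
sample = "ccac acca\ncc\u00a0ca"
for index, shown, code in report_odd_chars(sample):
    print(index, shown, code)
```

An empty list means the string contains only the allowed letters, so any whitespace seen on screen would then come from how the terminal wraps the output rather than from the string itself.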

robin7618 · June 3, 2021, 7:00am

When I test just this:

seq = " START  accctccatt accctgcctc cactcgttac cctgtccca \n azlkfafa qzfzaf cctgtccca \n aezEND "
seq2= "".join(seq.split())
print(seq2)

It works XD : STARTaccctccattaccctgcctccactcgttaccctgtcccaazlkfafaqzfzafcctgtcccaaezEND

Yes, it is the correct code. When I loaded it again, it worked :° I don't understand why, because I did not change anything ><

For the checks you asked for, I get this:

1
False
False
[99, 99, 97, 99, 97, 99, 99, 97, 99, 97, 99, 99, 99, 97, 99, 97, 99, 97, 99, 99, 99, 97, 99, 97, 99, 97, 99, 99, 97, 99, 97, 99, 99, 97, 99, 97, 99, 97, 99, 99, 97, 99, 97, 99, 99, 97, 99, 97, 99, 99]

aivarpaalberg · June 3, 2021, 10:47am

This is not my cup of tea, so I'll just ask: what is the expected/desired output?

robin7618 · June 3, 2021, 11:55am

Hey!

I want to extract all the nucleotides of the gene into one big string (without numbers or spaces). That way, I can use it to get the length of the gene, the count/frequency of each nucleotide (A, C, T, G) …

I'm sure you can find more information on the Wikipedia page about the GenBank format, here:

https://en.wikipedia.org/wiki/GenBank
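Once the sequence is one clean string, the length and the per-base counts mentioned above could be computed along these lines. This is only a minimal sketch: `base_stats` is a hypothetical helper name, and it assumes the sequence uses lowercase letters as in the sample file.

```python
from collections import Counter


def base_stats(sequence):
    """Return the length, the per-character counts and the a/c/g/t frequencies."""
    counts = Counter(sequence)
    total = len(sequence)
    # Guard against an empty sequence to avoid dividing by zero.
    freqs = {base: counts[base] / total for base in "acgt"} if total else {}
    return total, counts, freqs


seq = "ccacaccacacccacacacc"  # first 20 bases of the sample shown earlier
length, counts, freqs = base_stats(seq)
print(length, counts["a"], counts["c"], counts["g"], counts["t"])
print(freqs)
```

`Counter` returns 0 for a missing key, so bases that never occur in the sequence still get a frequency of 0.0 instead of raising an error.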

aivarpaalberg · June 6, 2021, 3:38am

I visited https://www.ncbi.nlm.nih.gov/ and it seems to me that the file in FASTA format, which is available for download, is already formatted the way you need.

If you have a file named genbank_sample with the following content:

ORIGIN
1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa
181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg
241 ccacactgtc attattataa ttagaaacag aacgcaaaaa ttatccacta tataattcaa
//

Then one can write the following code:

with open('genbank_sample') as f:
    rows = (row.split() for row in f)
    chunks = (word for row in rows for word in row)
    sequences = (seq for seq in chunks if seq.startswith(('a', 'c', 't', 'g')))
    gen = ''.join(sequences)

The value of gen would be:

gatcctccatatacaacggtatctccacctcaggtttagatctcaacaacggaaccattgccgacatgagacagttaggtatcgtcgagagttacaagctaaaacgagcagtagtcagctctgcatctgaagccgctgaagttctactaagggtggataacatcatccgtgcaagaccaagaaccgccaatagacaacatatgtaacatatttaggatatacctcgaaaataataaaccgccacactgtcattattataattagaaacagaacgcaaaaattatccactatataattcaa
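One caveat: the startswith filter above fits a file that contains only the ORIGIN block. In a complete GenBank record, other words that begin with a, c, t or g (for example "gene") would pass the filter too. A possible variant, sketched here under the assumption that the sequence lines sit between a line starting with ORIGIN and a line starting with //, keeps only the chunks inside that block. The function name `read_origin_sequence` is hypothetical.

```python
import io


def read_origin_sequence(lines):
    """Join the sequence chunks found between an ORIGIN line and the closing // line."""
    parts = []
    in_origin = False
    for line in lines:
        if line.startswith("//"):
            in_origin = False
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif in_origin:
            # The leading position number is not alphabetic, so it is dropped here.
            parts.extend(word for word in line.split() if word.isalpha())
    return "".join(parts)


# Small in-memory stand-in for a file; a real file object works the same way.
sample = io.StringIO(
    "FEATURES             Location/Qualifiers\n"
    "     gene            1..20\n"
    "ORIGIN\n"
    "        1 gatcctccat atacaacggt\n"
    "//\n"
)
print(read_origin_sequence(sample))
```

With a file on disk, the same function could be called as `with open('genbank_sample') as f: gen = read_origin_sequence(f)`.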
