Extracting count with exact group of strings from a long row with symbols

Hello,

I have a txt file like this:

many comments here

chrY    2893596 .       C       T       .       PASS    AC=1;AN=32183;AF=3.10723e-05;popmax=afr;strings1;strings2;strings2;strings3;etc;ENSG00000129824|strings|strings|strings|intron_variant|MODIFIER|HSFY3P|ENSG00000227289|Transcript|morestrings|etc||||||||||||||||||

chrY    2893598 .       A       G       .       PASS    AC=1;AN=32183;AF=3.10723e-05;popmax=afr;strings1;strings2;strings2;strings3;etc;ENSG00000129824|strings|strings|strings|upstream_gene_variant|MODIFIER|HSFY3P|ENSG00000227289|Transcript|morestrings|etc||||||||||||||||||

The thing is that column 8 consists of a long row of many strings, separated by either “;” or pipes.

I am trying to write Python code that counts the types of variants. In this case I need to know how many “upstream_gene_variant” and “intron_variant” entries there are for each string starting with “ENSG”. Each ENSG ID has 11 digits.

Desired output is something like: Counter({('ENSG00000227289', 'upstream_gene_variant'): 1,
('ENSG00000227289', 'intron_variant'): 1}).

I started to write code:

import os
from collections import Counter
import re


files = os.listdir("./")
for file in files:
    if file.endswith('gnomad_fragment.txt'):
        with open(file) as doc:
            data = doc.read()

            occurrences = data.count("ENSG")

        print('Number of occurrences of the word :', occurrences)

Up to this point, it works fine.

Number of occurrences of the word : 10

But I am not sure how to count specific substrings for each ENSG.

Do I use dictionaries, or anything else?

Thank you!

Hi Una,

Does this data format have a name? There might be a library that already
does the work of parsing it.

I have a txt file like this:

many comments here

chrY    2893596 .       C       T       .       PASS    AC=1;AN=32183;AF=3.10723e-05;popmax=afr;strings1;strings2;strings2;strings3;etc;ENSG00000129824|strings|strings|strings|intron_variant|MODIFIER|HSFY3P|ENSG00000227289|Transcript|morestrings|etc||||||||||||||||||

chrY    2893598 .       A       G       .       PASS    AC=1;AN=32183;AF=3.10723e-05;popmax=afr;strings1;strings2;strings2;strings3;etc;ENSG00000129824|strings|strings|strings|upstream_gene_variant|MODIFIER|HSFY3P|ENSG00000227289|Transcript|morestrings|etc||||||||||||||||||

The thing is that column 8 consists of a long row of many strings, separated by either “;” or pipes.

It seems likely that this is either columns separated by semicolons,
with some columns containing values separated by pipes, OR columns
separated by pipes, with some columns containing values separated by
semicolons.

If you know about this (genetic?) data, hopefully you can decide what is
the case.

I am guessing the outermost separator is the pipes, because that is not
uncommon.

I would try to parse this file using the “csv” module (see “csv — CSV File Reading and Writing” in the Python documentation).

Open the file with csv.reader as in the first example on that page,
something like:

import csv
with open('yourfilename.txt') as csvf:
    csvr = csv.reader(csvf, delimiter='|')
    for row in csvr:
        print(row)

and see what it prints. Fiddle with the delimiter and quotechar
parameters, trying semicolon and pipe. Try without the quotechar
parameter entirely first.
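
As a rough sketch of that experiment, the snippet below runs csv.reader over a single in-memory line with three candidate delimiters and reports how many fields each one produces. The SAMPLE string is a shortened stand-in for the lines in the question, and split_with is a helper name invented here; trying the tab character as well is my own assumption, based on how the leading columns look.

```python
import csv
import io

# Shortened stand-in for one line of the file (the real lines are longer).
SAMPLE = (
    "chrY\t2893596\t.\tC\tT\t.\tPASS\t"
    "AC=1;AN=32183;popmax=afr;ENSG00000129824|intron_variant|MODIFIER|HSFY3P\n"
)


def split_with(delimiter, text=SAMPLE):
    """Return the rows csv.reader produces for text with this delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# Compare how each candidate delimiter breaks up the same line.
for delim in ("\t", ";", "|"):
    first_row = split_with(delim)[0]
    print(repr(delim), len(first_row), "fields, last one:", first_row[-1])
```

Whichever delimiter yields fields that look like meaningful units is probably the outer separator.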

Let us suppose that the delimiter=';' parameter gave you sensible
looking results.

Then you can take a field with pipes in it and use the split method to
break it up. Example:

ensg_field = row[10]
if ensg_field.startswith('ENSG'):
    ensg_words = ensg_field.split('|')
    for word in ensg_words:
        # look at word to see if it starts with
        # "upstream_gene_variant" and so on
        if word.startswith('upstream_gene_variant'):
            print(word)  # placeholder: count it here instead

I am trying to write Python code that counts the types of variants. In
this case I need to know how many “upstream_gene_variant” and
“intron_variant” entries there are for each string starting with “ENSG”.
Each ENSG ID has 11 digits.

Desired output is something like: Counter({('ENSG00000227289',
'upstream_gene_variant'): 1,
('ENSG00000227289', 'intron_variant'): 1}).

[…]

   if file.endswith('gnomad_fragment.txt'):
       with open(file) as doc:

Looks good up to here. But I’d then see if the csv module example code
above produces useful stuff for you, where “csvf” in the example
corresponds to the “doc” variable in your code, as that is the open file
object (“file” is only the filename).
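
Putting the two pieces together might look roughly like the sketch below: the directory loop from the question, with csv.reader wrapped around the open file object. The function name rows_from_matching_files and its parameters are made up for illustration, and the ';' delimiter is still only a guess.

```python
import csv
import os


def rows_from_matching_files(directory, suffix="gnomad_fragment.txt", delimiter=";"):
    """Yield (filename, row) for each csv row in files ending with suffix."""
    for name in sorted(os.listdir(directory)):
        if name.endswith(suffix):
            path = os.path.join(directory, name)
            # newline="" is what the csv documentation recommends for open().
            with open(path, newline="") as doc:
                for row in csv.reader(doc, delimiter=delimiter):
                    yield name, row


if __name__ == "__main__":
    for name, row in rows_from_matching_files("./"):
        print(name, row)
```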

But I am not sure how to count specific substrings for each ENSG.
Do I use dictionaries, or anything else?

A dictionary would be good. Then you can make an entry per substring.
You can even use a defaultdict:

from collections import defaultdict
counts = defaultdict(int)
...
...
counts[keyword] += 1
...

A defaultdict is a type of dict which autocreates new entries if you
access them, avoiding a lot of tedious “is this a new keyword” logic.
The “defaultdict(int)” makes such a dict where the new entries are made
by calling int(), which produces a zero. Thus it is a dict which is
prefilled with zeroes.
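
A tiny runnable illustration of that behaviour, with a made-up helper called tally and a hand-written word list standing in for words taken from the file:

```python
from collections import defaultdict


def tally(words):
    """Count how often each word occurs, using a defaultdict(int)."""
    counts = defaultdict(int)
    for word in words:
        counts[word] += 1  # a missing key starts out as int() == 0
    return counts


counts = tally(["intron_variant", "upstream_gene_variant", "intron_variant"])
print(dict(counts))  # {'intron_variant': 2, 'upstream_gene_variant': 1}
```

Note that merely reading a missing key, such as counts["missing"], also creates that entry with the value 0.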

The Counter class from the collections module does this kind of thing
for you, so you could use it instead of the defaultdict above.
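
For the (ENSG, variant) pairs in the desired output, a Counter keyed by tuples could look something like the sketch below. It rests on an assumption that should be checked against the real data: each variant word is paired with the next ENSG ID that follows it in the pipe-separated list, which is what the desired output in the question suggests. The names VARIANTS, LINES and count_variants are invented here, and LINES is a shortened copy of the two sample lines.

```python
from collections import Counter

# The variant words to count (taken from the question).
VARIANTS = {"intron_variant", "upstream_gene_variant"}

# Shortened copies of the two sample lines.
LINES = [
    "chrY\t2893596\t.\tC\tT\t.\tPASS\tAC=1;etc;ENSG00000129824|strings|"
    "intron_variant|MODIFIER|HSFY3P|ENSG00000227289|Transcript||",
    "chrY\t2893598\t.\tA\tG\t.\tPASS\tAC=1;etc;ENSG00000129824|strings|"
    "upstream_gene_variant|MODIFIER|HSFY3P|ENSG00000227289|Transcript||",
]


def count_variants(lines):
    """Count (ENSG ID, variant) pairs; ASSUMES the ID follows the variant."""
    counts = Counter()
    for line in lines:
        pending = None  # variant word still waiting for its ENSG ID
        for word in line.strip().split("|"):
            if word in VARIANTS:
                pending = word
            elif word.startswith("ENSG") and pending is not None:
                counts[(word, pending)] += 1
                pending = None
    return counts


print(count_variants(LINES))
```

If the ENSG ID that matters is instead the one before the variant word, the pairing logic would need to remember the most recent ID rather than the most recent variant.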

Cheers,
Cameron Simpson cs@cskk.id.au