Extracting count with exact group of strings from a long row with symbols

user1960 · October 5, 2021, 3:37pm

Hello,

I have a txt file like this:

many comments here

 chrY    2893596 .       C       T       .       PASS    AC=1;AN=32183;AF=3.10723e-05;popmax=afr;strings1;strings2;strings2;strings3;etc;ENSG00000129824|strings|strings|strings|intron_variant|MODIFIER|HSFY3P|ENSG00000227289|Transcript|morestrings|etc||||||||||||||||||

chrY    2893598 .       A       G       .       PASS    AC=1;AN=32183;AF=3.10723e-05;popmax=afr;strings1;strings2;strings2;strings3;etc;ENSG00000129824|strings|strings|strings|upstream_gene_variant|MODIFIER|HSFY3P|ENSG00000227289|Transcript|morestrings|etc||||||||||||||||||

The thing is that column 8 consists of a long run of many strings, separated by either “;” or pipes.

I am trying to write Python code that counts the types of variants. In this case I need to know how many “upstream_gene_variant” and “intron_variant” entries there are for each string starting with “ENSG”. Each ENSG identifier contains 11 digits.

Desired output is something like: Counter({('ENSG00000227289', 'upstream_gene_variant'): 1,
('ENSG00000227289', 'intron_variant'): 1}).

I started to write code:

import os
from collections import Counter
import re


files = os.listdir("./")
for file in files:
    if file.endswith('gnomad_fragment.txt'):
        with open(file) as doc:
            data = doc.read()

            occurrences = data.count("ENSG")

        print('Number of occurrences of the word :', occurrences)

Up to this point, it works fine.

Number of occurrences of the word : 10

But I am not sure how to count specific substrings for each ENSG.

Do I use dictionaries, or anything else?

Thank you!

steven.daprano · October 5, 2021, 10:57pm

Hi Una,

Does this data format have a name? There might be a library that already
does the work of parsing it.

cameron · October 5, 2021, 11:45pm

I have txt file like this:

many comments here
chrY    2893596 .       C       T       .       PASS    AC=1;AN=32183;AF=3.10723e-05;popmax=afr;strings1;strings2;strings2;strings3;etc;ENSG00000129824|strings|strings|strings|intron_variant|MODIFIER|HSFY3P|ENSG00000227289|Transcript|morestrings|etc||||||||||||||||||
chrY 2893598 . A G . PASS AC=1;AN=32183;AF=3.10723e-05;popmax=afr;strings1;strings2;strings2;strings3;etc;ENSG00000129824|strings|strings|strings|upstream_gene_variant|MODIFIER|HSFY3P|ENSG00000227289|Transcript|morestrings|etc||||||||||||||||||

The thing is that column 8 consists of a long run of many strings, separated by either “;” or pipes.

It seems likely that this is either columns separated by semicolons,
some columns containing values separated by pipes, OR columns separated
by pipes, some columns containing values separated by semicolons.

If you know about this (genetic?) data, hopefully you can decide what is
the case.

I am guessing the outermost separator is the pipes, because that is not
uncommon.

I would try to parse this file using the “csv” module (see the csv module page, “CSV File Reading and Writing”, in the Python documentation).

Open the file with csv.reader as in the first example on that page,
something like:

import csv
with open('yourfilename.txt') as csvf:
    csvr = csv.reader(csvf, delimiter='|')
    for row in csvr:
        print(row)

and see what it prints. Fiddle with the delimiter and quotechar
parameters, trying semicolon and pipe. Try without the quotechar
parameter entirely first.
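
As a rough sketch of that experiment, the loop below runs csv.reader over a single row with each candidate delimiter and reports how many fields come back. The row is a shortened, hypothetical version of the one in the question, and it assumes the leading columns are tab separated (the forum rendering does not show whether they are tabs or spaces).

```python
import csv
import io

# Hypothetical one-row sample, shortened from the row in the question.
# Assumption: the first eight columns are separated by tabs.
sample = (
    "chrY\t2893596\t.\tC\tT\t.\tPASS\t"
    "AC=1;AN=32183;popmax=afr;ENSG00000129824|strings|intron_variant|MODIFIER\n"
)

field_counts = {}
for delimiter in ("\t", ";", "|"):
    # io.StringIO stands in for the open file object.
    reader = csv.reader(io.StringIO(sample), delimiter=delimiter)
    row = next(reader)
    field_counts[delimiter] = len(row)
    print(repr(delimiter), len(row), row)
```

Whichever delimiter produces rows that look like sensible columns is probably the outermost one; the others can then be handled with str.split on the individual fields.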

Let us suppose that the delimiter=';' parameter gave you sensible
looking results.

Then you can take a field with pipes in it and use the split method to
break it up. Example:

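ensg_field = row[10]  # the index is illustrative; use whichever field holds the ENSG data
if ensg_field.startswith('ENSG'):
    ensg_words = ensg_field.split('|')
    for word in ensg_words:
        # look at word to see if it starts with
        # "upstream_gene_variant" and so on
        if word.startswith(('upstream_gene_variant', 'intron_variant')):
            print(word)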

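The nested splitting can also be done with plain str.split, without relying on a fixed field index. This is only a sketch: the function name ensg_word_lists and the sample row are made up for illustration, and it assumes the outer columns are tab separated with the semicolon-separated data in column 8.

```python
def ensg_word_lists(line):
    """Return one list of pipe-separated words per ENSG field in column 8."""
    columns = line.rstrip("\n").split("\t")   # assumed outermost separator: tabs
    info_fields = columns[7].split(";")       # column 8 is index 7
    # Keep only the fields that begin with "ENSG" and break them on pipes.
    return [field.split("|") for field in info_fields if field.startswith("ENSG")]


# Hypothetical sample row, shortened from the one in the question.
sample = (
    "chrY\t2893596\t.\tC\tT\t.\tPASS\t"
    "AC=1;AN=32183;popmax=afr;etc;"
    "ENSG00000129824|strings|intron_variant|MODIFIER|HSFY3P|ENSG00000227289|Transcript"
)

for words in ensg_word_lists(sample):
    print(words)
```

If the columns turn out to be separated by runs of spaces rather than tabs, line.split() with no argument would be the thing to try instead of split("\t").
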
I am trying to write Python code that counts the types of variants. In
this case I need to know how many “upstream_gene_variant” and
“intron_variant” entries there are for each string starting with “ENSG”.
Each ENSG identifier contains 11 digits.

Desired output is something like: Counter({('ENSG00000227289',
'upstream_gene_variant'): 1,
('ENSG00000227289', 'intron_variant'): 1}).

[…]

   if file.endswith('gnomad_fragment.txt'):
       with open(file) as doc:

Looks good up to here. But I’d then see if the csv module example code
above produces useful stuff for you, where “csvf” in the example
corresponds to the “doc” variable in your code, as that is the open file
object (“file” is only the filename).

But I am not sure how to count specific substrings for each ENSG.
Do I use dictionaries, or anything else?

A dictionary would be good. Then you can make an entry per substring.
You can even use a defaultdict:

from collections import defaultdict
counts = defaultdict(int)
...
...
counts[keyword] += 1
...

A defaultdict is a type of dict which autocreates new entries if you
access them, avoiding a lot of tedious “is this a new keyword” logic.
The “defaultdict(int)” makes such a dict where the new entries are made
by calling int(), which produces a zero. Thus it behaves like a dict
which is prefilled with zeroes.
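
A small illustration of that behaviour, using a made-up list of keywords:

```python
from collections import defaultdict

counts = defaultdict(int)

# Made-up keywords, just to show the counting pattern.
for keyword in ["intron_variant", "upstream_gene_variant", "intron_variant"]:
    counts[keyword] += 1          # a missing key starts from int(), i.e. 0

print(dict(counts))               # {'intron_variant': 2, 'upstream_gene_variant': 1}

# Merely reading a missing key also creates it with the value 0.
missing = counts["never_seen"]
print(missing, "never_seen" in counts)   # 0 True
```

The last two lines show the one thing to watch for: looking up a key that was never counted quietly adds it to the dict.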

The Counter class from the collections module does this kind of thing
for you, so you could use it instead of the defaultdict above.
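
Putting the pieces together with Counter might look like the sketch below. Several things here are assumptions rather than facts about the data: the function name count_variants is made up, columns are taken to be whitespace separated with the annotations in column 8, lines with fewer than eight columns are treated as comments, and each variant type is paired with the last ENSG identifier in its pipe-separated field (which is what the desired output in the question appears to show).

```python
import re
from collections import Counter

ENSG_ID = re.compile(r"ENSG\d{11}")   # "ENSG" followed by 11 digits
WANTED = {"upstream_gene_variant", "intron_variant"}


def count_variants(lines):
    """Count (ENSG id, variant type) pairs over an iterable of text lines."""
    counts = Counter()
    for line in lines:
        columns = line.split()
        if len(columns) < 8:
            continue                   # assumed to be a comment or blank line
        for field in columns[7].split(";"):
            if not field.startswith("ENSG"):
                continue
            words = field.split("|")
            gene_ids = [w for w in words if ENSG_ID.fullmatch(w)]
            if not gene_ids:
                continue
            for word in words:
                if word in WANTED:
                    # Assumption: pair with the last ENSG id in the field.
                    counts[(gene_ids[-1], word)] += 1
    return counts


# Hypothetical rows, shortened from the ones in the question.
rows = [
    "many comments here",
    "chrY 2893596 . C T . PASS AC=1;etc;ENSG00000129824|s|intron_variant|MODIFIER|HSFY3P|ENSG00000227289|Transcript||",
    "chrY 2893598 . A G . PASS AC=1;etc;ENSG00000129824|s|upstream_gene_variant|MODIFIER|HSFY3P|ENSG00000227289|Transcript||",
]
print(count_variants(rows))
```

An open file object can be passed in place of the list, since iterating over it yields lines. If the first identifier in the field is the one that matters, gene_ids[0] would be the change to make.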

Cheers,
Cameron Simpson cs@cskk.id.au
