Count number of substrings in String by multiple delimiters

Imagine the following example Strings

  1. ‘John @ Mary John v Mary John vs Mary’
  2. ‘John v Mary Ben v Paul John v Mary’
  3. ‘Hello World / John v Mary John @ Mary John vs Mary’
  4. ‘John v Mary John vs Mary John @ Mary John v Mary’

There are 3 identified delimiters

  • ‘ @ ’
  • ‘ v ’
  • ‘ vs ’

For every row, I would like to iterate through each delimiter, look left and right by 4 characters, concatenate left and right together, and return the count if all concatenated substrings are a match.

  1. we would end up finding ‘JohnMary’ 3 times. Return = 3
  2. we would end up finding ‘JohnMary’ twice and ‘BenPaul’ once. Since the substrings don’t all match, Return = 0
  3. we would end up finding ‘JohnMary’ 3 times. Note that the ‘Hello World’ is irrelevant, as we only look 4 characters left and right. Return = 3
  4. we would end up finding ‘JohnMary’ 4 times. Return = 4

For this I’ll need some sort of recursive/looping query to iterate through each delimiter in each row and count the number of matched substrings.

  • note, if the first 2 substrings encountered aren’t a match, we don’t need to continue checking any further and can return 0 (like in example 2)
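
That early-exit idea can be sketched with re.finditer, which yields matches lazily so we can stop scanning at the first mismatch rather than collecting everything first. This is only a rough sketch of the approach described above (the function name count_matches is my own, not from any posted solution):

```python
import re

# Matches 4 characters, one of the three delimiters, then 4 more characters
pattern = re.compile('(.{4})( @ | v | vs )(.{4})')

def count_matches(line):
    """Count delimiter matches, bailing out early if two differ."""
    first = None
    count = 0
    # finditer yields matches one at a time, so a mismatch stops the scan
    for m in pattern.finditer(line):
        key = m.group(1) + m.group(3)
        if first is None:
            first = key
        elif key != first:
            return 0  # two different substrings: no need to check further
        count += 1
    return count

print(count_matches('John v Mary Ben v Paul John v Mary'))  # 0
```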

Any help on which methods I should be using, or how to approach this? Note that there are 20 million rows in the CSV file to check this way.

Here’s an example using regex:

text = '''\
John @ Mary John v Mary John vs Mary
John v Mary Ben v Paul John v Mary
Hello World / John v Mary John @ Mary John vs Mary
John v Mary John vs Mary John @ Mary John v Mary
'''

from collections import defaultdict

import re
pattern = re.compile('(.{4})( @ | v | vs )(.{4})')

for line in text.splitlines():
    found = defaultdict(int)

    for before, sep, after in pattern.findall(line):
        found[before + after] += 1

    if len(found) == 1:
        print(list(found.values())[0])
    else:
        print(0)
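
One subtlety worth knowing about that pattern: `.{4}` captures exactly four characters, whitespace included, so on the second sample line findall picks up ‘ Ben’ with a leading space rather than ‘Ben’. That still differs from ‘JohnMary’, so the line correctly returns 0. A quick standalone check:

```python
import re

pattern = re.compile('(.{4})( @ | v | vs )(.{4})')
line = 'John v Mary Ben v Paul John v Mary'
# The middle tuple captures ' Ben' (note the leading space), not 'Ben'
print(pattern.findall(line))
```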

Thanks Matthew. How would you amend this slightly to iterate over each row of a CSV file? It currently has 2 columns, Event and ID, with Event being the target column; the end result would be a dataframe with 3 columns that I can export as CSV. Also, the code should return 0 if the substrings aren’t all a match.

You can read the CSV file using the csv module. Then just process the appropriate field. If the input and output are both CSV, I’d just read and write using the CSV module. If it’s already fast enough, I wouldn’t bother with pandas.
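
A minimal sketch of that read–process–write shape (the file handles are shown with io.StringIO so the sketch is self-contained; in practice they would be open(...) calls on the real input and output files, and the ‘Count’ column name is just a placeholder):

```python
import csv
import io

# Stand-ins for the real input and output files
fin = io.StringIO('Event\tID\nJohn v Mary John v Mary\t1\n')
fout = io.StringIO()

reader = csv.DictReader(fin, delimiter='\t')
# Copy the input columns and append one extra column for the result
writer = csv.DictWriter(fout, reader.fieldnames + ['Count'], delimiter='\t')
writer.writeheader()

for record in reader:
    # ... compute the count for record['Event'] here ...
    record['Count'] = 0  # placeholder value
    writer.writerow(record)

print(fout.getvalue())
```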

I’m able to do such things in SQL, but the issue is that SQL isn’t suited to string manipulation (so I’m told), which is why it was so unworkably slow, and why I was recommended to ask someone about Python. Sadly I’m not knowledgeable enough with Python to work from just hints like this, though.

from collections import defaultdict

import re
import csv

# Compile the pattern once, up front, rather than once per record
pattern = re.compile('(.{4})( @ | v | vs )(.{4})')

with open(r"C:\Users\xxxx\Downloads\sample.txt", newline='') as fin:
    reader = csv.reader(fin, delimiter='\t')
    for record in reader:
        text = record[0]

        found = defaultdict(int)

        for before, sep, after in pattern.findall(text):
            found[before + after] += 1

        if len(found) == 1:
            print(list(found.values())[0])
        else:
            print(0)

I have this and it’s working to read the first column of the file. I have a couple of issues from here.

  • How do I add a new column to the file with the results?
  • The result is showing as 1 for some Strings in my sample file, but the result should never be 1. e.g. ‘Charlotte Hornets vs Sacramento Kings / Sandown Race 1 - Ladbrokes Switch Handicap (@Top Fluc)’
  • I need a piece of code to return 0 if all matches aren’t the same. e.g. a row could have 3 matches of one substring and 2 of another; the return should be 0 in that case. That would probably also solve the issue above

OK, try this:

from collections import defaultdict
import re
import csv

pattern = re.compile('(.{4})( @ | v | vs )(.{4})')

with open(r"C:\Users\xxxx\Downloads\sample.txt", newline='') as fin, open(r"C:\Users\xxxx\Downloads\sample_out.txt", "w", newline='') as fout:
    reader = csv.DictReader(fin, delimiter='\t')
    writer = csv.DictWriter(fout, reader.fieldnames + ['Count'], delimiter='\t')
    writer.writeheader()

    for record in reader:
        found = defaultdict(int)

        for before, sep, after in pattern.findall(record['Event']):
            found[before + after] += 1

        if len(found) == 1 and sum(found.values()) > 1:
            record['Count'] = list(found.values())[0]
        else:
            record['Count'] = 0

        writer.writerow(record)

It gives the following error and outputs a blank file called sample_out:

Are the text files encoded using UTF-8?

On Windows it’s usual for UTF-8 files to start with a ‘signature’ (a byte order mark), but not on other OSes. So, if you open a UTF-8 file on Windows for reading, it’s probably best to specify the encoding not as “utf-8” but as “utf-8-sig”:

with open(r"C:\Users\xxxx\Downloads\sample.txt", encoding='utf-8-sig', newline='') as fin, open(r"C:\Users\xxxx\Downloads\sample_out.txt", "w", encoding='utf-8', newline='') as fout:

I think the problem is that the csv reader treats the signature as part of the first field name, but when you open the file in, say, Notepad, it simply recognises it as the signature and just shows the text.

Personally, I prefer to have UTF-8 files without the signature, and I overlooked that your files might have it.
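
A small self-contained demonstration of the difference, using bytes in memory rather than a real file:

```python
import codecs

# A UTF-8 file written by Windows tools often starts with the BOM bytes
data = codecs.BOM_UTF8 + 'Event\tID'.encode('utf-8')

# Decoding as plain utf-8 keeps the BOM as '\ufeff' at the start...
print(repr(data.decode('utf-8')))       # '\ufeffEvent\tID'
# ...while utf-8-sig strips it, so the field names come out clean
print(repr(data.decode('utf-8-sig')))   # 'Event\tID'
```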

They must be 🙂 The records were basically ripped from a database by selecting all and choosing ‘save results as’.

Awesome, that worked perfectly! I am on Windows and had to insert encoding='utf-8-sig' as you suggested. encoding='utf-8' gave an error, as below:

UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>