Search and replace from a text file as input in another text file

Hello,
I am an old guy working in the area of NLP. I cut my teeth on Perl and C. I am learning Python now and am trying to write a script that takes input from a text file, preprocessor.rul, which has the structure given below, applies it to another text file, corpus.txt, which is the file to be processed, and writes the result to a third file, corpus.out. All the files are in UTF-8 format since I work with Indic scripts.
The structure of the preprocessor.rul is as under:
The character to be changed is on the left-hand side and the changed character on the right-hand side. The two are separated by a greater-than sign. An example is given below, in Latin script for ease of comprehension:

Page>Book
.> .

The corpus.txt is a text file in UTF-8 format and the resultant output should also be in UTF-8.
I have attempted a script but it does not give the desired result,

#!/usr/bin/env python3
import fileinput
with fileinput.FileInput(preprocess.rul, inplace=True, backup='.bak') as file:
    for line in file:
        print(line.replace(text_to_search, replacement_text), end='')

Any help given to set this script working would be greatly appreciated.
Since I am new to the forum, please excuse any goof-up in posting.
Many thanks.


I am an old guy working in the area of NLP. I cut my teeth on Perl and
C. I am learning Python now and am trying to write a script that
takes input from a text file, preprocessor.rul, which has the structure
given below, applies it to another text file, corpus.txt, which is the
file to be processed, and writes the result to a third file, corpus.out.
All the files are in UTF-8 format since I work with Indic scripts.

These days almost all of us work in UTF-8 anyway (and it is usually the
default text encoding) because ASCII and friends do not cover enough.

The structure of the preprocessor.rul is as under:
The character to be changed is on the left-hand side and the changed character on the right-hand side. The two are separated by a greater-than sign. An example is given below, in Latin script for ease of comprehension:

Page>Book
.> .

The corpus.txt is a text file in UTF-8 format and the resultant output should also be in UTF-8.

That will be the default anyway. All good.
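On some platforms (older Windows setups in particular) the default text encoding follows the system locale rather than UTF-8, so passing the encoding explicitly is a cheap safeguard. The snippet below is only a sketch of that idea: the Devanagari rule is a made-up sample, and the temporary directory is there just so the example runs anywhere.

```python
import os
import tempfile

# Hypothetical sample rule in Devanagari; not taken from the real rule file.
sample = 'पृष्ठ>पुस्तक\n'

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'preprocess.rul')

    # Write and read with an explicit encoding instead of relying on the default.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(sample)
    with open(path, encoding='utf-8') as f:
        round_trip = f.read()

print(round_trip == sample)
```

The same `encoding='utf-8'` argument can be added to every `open()` call in the script.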

I have attempted a script but it does not give the desired result,

#!/usr/bin/env python3
import fileinput
with fileinput.FileInput(preprocess.rul, inplace=True, backup='.bak') as file:
    for line in file:
        print(line.replace(text_to_search, replacement_text), end='')

The first thing to do is to discard the fileinput module, since it seems
ill suited to your needs. Though I’m glad you’ve mentioned it, since I’d
not noticed it before.

I’d hardwire your filenames first to get the script going. Later,
modify your script to accept input and output on the command line or
whatever seems useful to you.

Something shaped like this:

rules = []
with open('preprocess.rul') as pref:
    for rule_line in pref:
        # decode the rule and stash it in rules
        from_text, to_text = rule_line.strip().split('>')
        rules.append( (from_text, to_text) )

with open('corpus.txt') as corpusf:
    with open('corpus.out') as outputf:
        for line in corpusf:
            for from_text, to_text in rules:
                # ... modify line according to your rule ...
                # ... note that the str.replace function is _not_ what you want ...
                pass
            # save the modified line
            outputf.write(line)

That’s all untested and omits the actual text fiddling, but hopefully
should get you started.
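For the omitted text fiddling, here is one possible sketch, not the only way to do it. It assumes the rules are literal strings (not patterns), that a rule line is split at its first ">", that blank or malformed lines can be skipped, and that all rules should apply in a single pass so the output of one rule is never rewritten by another. It uses a compiled `re` alternation rather than a chain of replace calls. The names `parse_rules` and `apply_rules` are invented for illustration.

```python
import re

def parse_rules(lines):
    """Turn 'from>to' lines into a list of (from_text, to_text) pairs."""
    rules = []
    for raw in lines:
        # Remove only the newline, so spaces inside a rule are preserved.
        rule_line = raw.rstrip('\n')
        if '>' not in rule_line:
            continue  # assumption: blank or malformed lines are ignored
        # Split on the first '>' only, so the replacement may contain '>'.
        from_text, to_text = rule_line.split('>', 1)
        if from_text:
            rules.append((from_text, to_text))
    return rules

def apply_rules(line, rules):
    """Apply every rule in one pass over the line."""
    if not rules:
        return line
    mapping = dict(rules)
    # Longest search strings first, so a longer rule wins over its prefix.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(key) for key in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], line)

rules = parse_rules(['Page>Book\n', '.> .\n'])
print(apply_rules('See Page 3.', rules))  # See Book 3 .
```

In the skeleton above, `parse_rules(pref)` could build `rules`, and the inner loop could become `line = apply_rules(line, rules)`. Compiling the pattern once outside the loop would be faster for a large corpus.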

Until you’re happy, maybe replace:

outputf.write(line)

with:

print(line, end='')

just to see the output directly.

Cheers,
Cameron Simpson cs@cskk.id.au


Of course, that should be:

with open('corpus.txt') as corpusf:
    with open('corpus.out', 'w') as outputf:

i.e. open the output file for writing. The default mode is read.
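To see the difference for yourself, a small self-contained check along these lines may help. It works in a temporary directory so nothing real is touched, and the variable names are just for illustration. Note that opening a file that does not exist in the default read mode fails earlier still, with FileNotFoundError.

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'corpus.out')

    # Mode 'w' creates (or truncates) the file and allows writing.
    with open(path, 'w', encoding='utf-8') as outputf:
        outputf.write('first line\n')

    # The default mode is 'r', so write() is refused.
    try:
        with open(path, encoding='utf-8') as outputf:
            outputf.write('second line\n')
        write_refused = False
    except io.UnsupportedOperation:
        write_refused = True

    # Only the text written in 'w' mode is in the file.
    with open(path, encoding='utf-8') as checkf:
        contents = checkf.read()

print(write_refused)
print(contents, end='')
```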

Cheers,
Cameron Simpson cs@cskk.id.au
