Regex error in Python: converting several VTT files to DOCX

pqkythuat1 · December 31, 2022, 3:36am

Hello all,

I found some code to convert VTT files to DOCX. When I run it, Python raises this error:

re.error: bad escape \R at position 40 at line 14.

Looking forward to your help. Thank you very much.

import webvtt
from docx import Document
import re
import os

# path = '/users/tdobbins/downloads/smithtxt'
path = 'D:/A'
direct = os.listdir(path)
pattern = r'^(WEBVTT|NOTE|link:|https://|\d{2}:|$).*\R'
for i in direct:
    document = Document()
    document.add_heading(i, 0)
    myfile = open('D:/A/'+i,encoding="utf8").read()
    myfile = re.sub(pattern,' ', myfile) # remove all non-XML-compatible characters
    p = document.add_paragraph(myfile)
    document.save('D:/A/'+i+'.docx')

MRAB · December 31, 2022, 3:57am

The re module doesn’t support the \R escape sequence, which matches the sequence \r\n or any single newline character. Use (?>\r\n|[\r\n]) if you’re using Python 3.11 or later, or (?:\r\n|[\r\n]) (that’s probably good enough!) if you’re using Python 3.10 or earlier.
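A minimal sketch of that substitution, assuming a made-up sample string with mixed line endings and a shortened pattern (both are illustrative, not taken from the original script):

```python
import re

# Hypothetical sample text mixing Windows (\r\n) and Unix (\n) line endings.
sample = "WEBVTT\r\nHello there\nNOTE a comment\r\nGoodbye\n"

# \R is not a valid escape in Python's re module, so compiling it raises re.error.
try:
    re.compile(r'^NOTE.*\R')
    r_supported = True
except re.error:
    r_supported = False

# Stand-in for \R: match \r\n as one unit, otherwise a lone \r or \n.
NEWLINE = r'(?:\r\n|[\r\n])'

# Shortened, illustrative version of the pattern from the question.
pattern = re.compile(r'^(?:WEBVTT|NOTE).*' + NEWLINE, flags=re.MULTILINE)

# Remove every line that starts with WEBVTT or NOTE, including its line ending.
cleaned = pattern.sub('', sample)
print(repr(cleaned))
```

The non-capturing group works on any Python 3 version; the atomic form (?>...) only compiles on 3.11 or later.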

pqkythuat1 · December 31, 2022, 4:58am

Thank you very much.
I get a good result when I use [\r\n].
May I ask one more question? I added re.MULTILINE on line 14, but it only works for the first line; the rest are unaffected. Why?

import webvtt
from docx import Document
import re
import os

# path = '/users/tdobbins/downloads/smithtxt'
path = 'D:/A'
direct = os.listdir(path)
pattern = r'^(WEBVTT|NOTE|link:|https://|\d{2}:|$).*[\r\n]'
for i in direct:
    document = Document()
    #document.add_heading(i, 0)
    myfile = open('D:/A/'+i,encoding="utf8").read()
    myfile = re.sub(r'^(WEBVTT|NOTE|link:|https://| --> |$).*[\r\n]',' ', myfile, re.MULTILINE) # remove all non-XML-compatible characters
    p = document.add_paragraph(myfile)
    document.save('D:/A/'+i+'.docx')

MRAB · December 31, 2022, 5:28pm

Check the parameters of re.sub. They are: pattern, replacement, string, count, flags. You’re passing re.MULTILINE to the count parameter not the flags parameter.

Try being explicit by using a keyword:

myfile = re.sub(r'^(WEBVTT|NOTE|link:|https://| --> |$).*[\r\n]',' ', myfile, flags=re.MULTILINE)
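The difference can be seen in a small sketch, assuming a made-up four-line string and a shortened pattern (both hypothetical, chosen only to make the effect visible):

```python
import re

# Hypothetical input: two NOTE lines that should be removed, two lines to keep.
text = "NOTE one\nkeep me\nNOTE two\nkeep me too\n"
pattern = r'^NOTE.*\n'

# The fourth positional argument of re.sub is count, so re.MULTILINE
# (the integer 8) is read as "at most 8 replacements" and no flag is set.
# Without the flag, ^ only matches at the very start of the string.
positional = re.sub(pattern, '', text, re.MULTILINE)

# Passing the flag by keyword makes ^ match at the start of every line.
keyword = re.sub(pattern, '', text, flags=re.MULTILINE)

print(repr(positional))
print(repr(keyword))
```

Only the keyword form removes both NOTE lines. Recent Python versions also emit a DeprecationWarning when count is passed positionally, which is another reason to prefer the keyword.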

pqkythuat1 · January 3, 2023, 12:49am

I finished my work with your code. Thank you very much.