The re module doesn’t support the \R escape sequence, which matches the sequence \r\n or any single newline character such as \r or \n. Use (?>\r\n|[\r\n]) if you’re using Python 3.11 or later, or (?:\r\n|[\r\n]) (that’s probably good enough!) if you’re using Python 3.10 or earlier.
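As a rough illustration of the non-capturing variant, here is a minimal sketch that should run on any Python 3 version. The name NEWLINE and the sample string are made up for the example, and the pattern only covers \r\n, \r and \n, not every character that \R matches in other regex engines.

```python
import re

# Hypothetical name: a non-capturing stand-in for \R.
# \r\n is listed first so a Windows line ending is matched as one unit.
NEWLINE = re.compile(r'(?:\r\n|[\r\n])')

# Made-up sample mixing Windows, Unix and old Mac line endings.
sample = "one\r\ntwo\nthree\rfour"

parts = NEWLINE.split(sample)      # text between the line endings
endings = NEWLINE.findall(sample)  # the line endings themselves

print(parts)    # ['one', 'two', 'three', 'four']
print(endings)  # ['\r\n', '\n', '\r']
```

Putting \r\n before the character class matters: with the alternatives the other way round, a \r\n pair would be matched as two separate newlines.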
Thank you very much.
I get a good result when I use [\r\n].
Can I ask one more question? I added re.MULTILINE in line 14 (the re.sub call), but it only works for the first line; the rest are unaffected.
import webvtt
from docx import Document
import re
import os
# path = '/users/tdobbins/downloads/smithtxt'
path = 'D:/A'
direct = os.listdir(path)
pattern = r'^(WEBVTT|NOTE|link:|https://|\d{2}:|$).*[\r\n]'
for i in direct:
    document = Document()
    #document.add_heading(i, 0)
    myfile = open('D:/A/'+i,encoding="utf8").read()
    myfile = re.sub(r'^(WEBVTT|NOTE|link:|https://| --> |$).*[\r\n]',' ', myfile, re.MULTILINE) # remove all non-XML-compatible characters
    p = document.add_paragraph(myfile)
    document.save('D:/A/'+i+'.docx')
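Check the parameters of re.sub. They are: pattern, repl, string, count, flags. You’re passing re.MULTILINE to the count parameter, not the flags parameter, so the pattern runs without multiline mode and ^ matches only at the very start of the string. Pass it by keyword instead: flags=re.MULTILINE.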
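To make the difference concrete, here is a small sketch with a made-up input string and a shortened version of the pattern (both are assumptions for illustration, not the original files). The first call reproduces the mistake by handing re.MULTILINE to count; the second passes it as flags.

```python
import re

# Made-up sample resembling a few lines of a .vtt file.
text = "WEBVTT\nNOTE something\nHello\nlink: x\nWorld\n"

# Shortened, illustrative version of the pattern from the question.
pattern = r'^(?:WEBVTT|NOTE|link:).*[\r\n]'

# What the positional call effectively does: re.MULTILINE (the integer 8)
# becomes the maximum number of substitutions, and no flag is set,
# so ^ only matches at the very start of the string.
wrong = re.sub(pattern, '', text, count=re.MULTILINE)

# Passing the flag by keyword lets ^ match at the start of every line.
right = re.sub(pattern, '', text, flags=re.MULTILINE)

print(repr(wrong))  # 'NOTE something\nHello\nlink: x\nWorld\n'
print(repr(right))  # 'Hello\nWorld\n'
```

Passing flags by keyword is the safer habit in general, since count and flags are both integers and Python cannot tell them apart by position.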