Why does Discourse sometimes add extra blank lines to my posts?

I receive many emails from Windows users, and they don’t appear this way. I’m not an expert on email message formats, but it is my understanding that all email bodies must be formatted with DOS-style CRLF line endings, and must not include an isolated CR or LF. See RFC 5322:

"""
The body of a message is simply lines of US-ASCII characters. The only two limitations on the body are as follows:

o CR and LF MUST only occur together as CRLF; they MUST NOT appear independently in the body.

o Lines of characters in the body MUST be limited to 998 characters, and SHOULD be limited to 78 characters, excluding the CRLF.
"""
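
To make those two rules concrete, here is a minimal sketch of a checker for a raw message body. The name `body_problems` and the wording of the messages are my own inventions for illustration; this is not taken from any mail library.

```python
import re

# A CR not followed by LF, or an LF not preceded by CR.
BARE_CR_OR_LF = re.compile(rb"\r(?!\n)|(?<!\r)\n")
MAX_LINE = 998  # hard limit from the quoted text, excluding the CRLF


def body_problems(body: bytes) -> list[str]:
    """Return a list of rule violations found in a raw message body."""
    problems = []
    if BARE_CR_OR_LF.search(body):
        problems.append("bare CR or LF")
    # Measure line lengths between CRLF pairs only.
    if any(len(line) > MAX_LINE for line in body.split(b"\r\n")):
        problems.append("line longer than 998 characters")
    return problems


print(body_problems(b"ok\r\nfine\r\n"))
print(body_problems(b"bad\nline\r\n"))
```

The first call should report nothing and the second should flag the bare LF.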

I daresay there are other complications to do with various MIME types etc., but I’m pretty sure it is not a simple “it’s a Windows text file” issue.

Looking more closely at one of the offending emails, I see that the email (generated by Discourse) is using quoted-printable:

Content-Type: text/plain;
 charset=UTF-8
Content-Transfer-Encoding: quoted-printable

and the body of the email seems to have escaped the carriage returns but not newlines. E.g. the first few lines of this post look like this:

=0D
=0D
Make `=CE=BB` a keyword with identical meaning to `lambda`. =0D
Keyword `=CE=BB` would be more readable because it is shorter than `lambd=
a`, besides `lambda` basically means `=CE=BB`. =0D
=0D
Now that almost every PC can type non ASCII characters, we have snippets =
in every serious editor & we have formatters like Autopep & Black, I don'=
t see any reason to restrict code to English alphabet.=0D
=0D
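
To see what a decoder makes of that, here is a small sketch that feeds two lines of the excerpt to Python's `quopri.decodestring`. I am assuming the saved message has bare LF line endings on disk, which is how it looks in my mailbox; if the file had CRLF endings the result would differ.

```python
import quopri

# Two lines from the excerpt above, assumed to end in a bare LF on disk.
raw = (
    b"Make `=CE=BB` a keyword with identical meaning to `lambda`. =0D\n"
    b"=0D\n"
)

decoded = quopri.decodestring(raw)

# Each =0D becomes a literal CR, which lands directly before the LF
# that was already ending the line.
print(decoded)
print(decoded.count(b"\r\n"))
```

So after decoding, the text itself contains CRLF pairs, even though the encoded file only ever had LF as its physical line ending.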

Looking at the behaviour of the quopri module in Python, carriage returns need not (should not?) be encoded:

>>> import quopri
>>> text = "Make `λ` a keyword with identical meaning to `lambda`.\r\n"
>>> quopri.encodestring(text.encode('UTF-8'))
b'Make `=CE=BB` a keyword with identical meaning to `lambda`.\r\n'

I assume that Python’s implementation is correct :wink:

So I have a hypothesis:

  • When Discourse emails a post containing non-ASCII characters, by default it uses quoted-printable.
  • The Discourse implementation of quoted-printable wrongly (?) encodes the carriage returns to =0D
  • mutt, following Postel’s Principle, accepts the bare LF as a line ending as if it were the mandated CRLF pair
  • and then decodes the =0D, which it displays as if it were an extra carriage return.
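
The hypothesis above can be sketched as a toy model. To be clear, this is not mutt's code and I have not read mutt's source; `strict_lines` and `lenient_lines` are hypothetical helpers that only mimic the behaviour I am guessing at.

```python
import quopri


def strict_lines(text: str) -> list[str]:
    """Split on CRLF only, as the RFC 5322 rule quoted earlier requires."""
    return text.removesuffix("\r\n").split("\r\n")


def lenient_lines(text: str) -> list[str]:
    """Toy model of the guess: LF ends a line, and any CR left over
    after decoding is then shown as one more line break."""
    shown = []
    for line in text.removesuffix("\n").split("\n"):
        shown.extend(line.split("\r"))
    return shown


# Made-up body in the same style as the excerpt: =0D before each bare LF.
raw = b"first paragraph=0D\n=0D\nsecond paragraph=0D\n"
text = quopri.decodestring(raw).decode("utf-8")

print(strict_lines(text))
print(lenient_lines(text))
```

In this model the strict reading gives three lines with one blank between the paragraphs, while the lenient reading gives six, with a blank line after every line of text. That matches the kind of doubling I am seeing, but it only shows the guess is self-consistent, not that it is what mutt actually does.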

If my hypothesis is correct, then what I am seeing is the collision between a bug(?) in Discourse’s quoted-printable implementation, and mutt trying to be helpful by accepting bare LFs as line delimiters.

It may be that encoding the CR as =0D is allowed, in which case mutt is definitely to blame here. It would be nice if other mutt users could chime in with their experiences. @cameron, you use mutt, don’t you?