Why does Discourse sometimes add extra blank lines to my posts?

steven.daprano · February 21, 2023, 9:32am

I frequently reply to Discourse threads by email, always using mutt on Linux. Discourse frequently, but not always, adds extra blank lines between paragraphs in the email sent out to people. I see it in my inbox, and I’ve had people ask me to “fix your post” because the email they receive is the same.

Why does this happen? I think it is a bug in Discourse, change my mind!

As far as I can tell, there is nothing for me to fix. The emails I send out are always from the same email client. They are always plain text, never HTML. Line endings are whatever mutt sends, which I believe is whatever plain text emails are supposed to use. The Content-Type is always one of

text/plain; charset=us-ascii (most common)
text/plain; charset=utf-8
text/plain; charset=iso-8859-1 (very rare).

There does not seem to be a correlation between the charset and whether the extra lines are inserted, as far as I can tell.

Here is an example: two posts by me, sent using the exact same content-type (us-ascii), to the same topic:

This post appears normally, with no extraneous blank lines.
But this post has extra blank lines added. It's not very obvious from the formatted post, but the raw unformatted post makes it clear.

That is definitely not how I sent the email. In my Sent mail folder, there are no blank lines between paragraphs or bullet points.

steven.daprano · February 21, 2023, 10:55am

Further to my previous post, I don’t know if this is related, but many of the emails from Discourse have extraneous Ctrl-M carriage returns in them. I get them from many other posters, but here is an example of mine from the same topic referenced in my previous post:

The extra Ctrl-M chars don’t seem to affect either the formatted or raw post in the web UI, but believe me, they are very obvious (and annoying) in a console text interface mail client like mutt.

Please don’t tell me to use a better email client. Although I will be happy if other mutt users can suggest a config that will hide the visible ^M chars in the email.

In this case, the Ctrl-M characters seem to be associated with the presence of non-ASCII characters. In this thread, I see that every post (not just mine) sent with Content-Transfer-Encoding: quoted-printable appears with extraneous Ctrl-M chars inserted between lines, while those very few sent with Content-Transfer-Encoding: 7bit appear correctly.

Note that unlike the extra blank lines issue, this one appears to affect almost all users, not just me. (Although maybe other email clients hide the ^M and so they are unaware of it.)

To summarise:

Emails sent by Discourse appear to have two (related?) bugs:

Emails containing non-ASCII chars sent as quoted-printable appear to have extraneous ^M carriage returns added.
Emails originally sent by me, and then processed and resent by Discourse, sometimes (but not always) have extra blank lines inserted between paragraphs.

(I should comment that I’ve been using mutt for approaching twenty years, and I receive tons of emails from all sorts of mail clients, mailing lists, and other software, and have never seen either of these behaviours before. This makes me reasonably confident that it is a Discourse issue.)

malemburg · February 21, 2023, 11:46am

It is likely that you are seeing Windows/DOS line endings (CRLF) on a Unix machine (LF only). The Ctrl-M corresponds to the CR character (= carriage return on an old mechanical typewriter or teletype machine).
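As a rough illustration of why a stray CR shows up as ^M, here is a minimal Python sketch that renders control bytes in caret notation while treating LF alone as the line break, roughly as a Unix console client might. The function name `caret_notation` is made up for this example; it is not mutt's actual display code.

```python
def caret_notation(data: bytes) -> str:
    """Render bytes for a Unix console: LF breaks the line, other
    control bytes are shown as ^X (so CR, 0x0D, becomes ^M)."""
    out = []
    for b in data:
        if b == 0x0A:                    # LF: the native Unix line ending
            out.append("\n")
        elif b < 0x20 or b == 0x7F:      # any other control byte
            out.append("^" + chr(b ^ 0x40))
        else:
            out.append(chr(b))
    return "".join(out)


# CRLF text viewed with LF-only conventions leaves a visible ^M on each line.
print(caret_notation(b"first line\r\nsecond line\r\n"))
```

Text that already uses LF-only line endings passes through unchanged.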

MUAs typically auto-convert these to the native line ending format, e.g. Thunderbird on Windows converts incoming emails to CRLF format. Perhaps there’s a switch to have mutt behave in the same way?

Rosuav · February 21, 2023, 12:24pm

RFC 822 and 2822 specify that, in transport, lines end with CRLF.

steven.daprano · February 21, 2023, 1:57pm

I receive many emails from Windows users, and they don’t appear this way. I’m not an expert on email message formats, but it is my understanding that all email bodies must be formatted with DOS style CRLF line endings, and must not include isolated CR or LF. See RFC 5322:

"""
The body of a message is simply lines of US-ASCII characters. The only two limitations on the body are as follows:

o CR and LF MUST only occur together as CRLF; they MUST NOT appear independently in the body.

o Lines of characters in the body MUST be limited to 998 characters, and SHOULD be limited to 78 characters, excluding the CRLF.
"""
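The two body rules quoted above can be checked mechanically. The following is only a sketch in Python: `check_body` is a hypothetical helper, and it assumes it is given the body bytes exactly as they appear on the wire. Mail stored on a Unix system has usually been converted to LF-only already, so it would be flagged throughout.

```python
import re

# A CR not followed by LF, or an LF not preceded by CR.
BARE_CR_OR_LF = re.compile(rb"\r(?!\n)|(?<!\r)\n")


def check_body(body: bytes) -> list:
    """Return a list of RFC 5322 body-format problems found in `body`."""
    problems = []
    if BARE_CR_OR_LF.search(body):
        problems.append("bare CR or LF")
    # Line lengths are measured excluding the CRLF itself.
    longest = max(len(line) for line in body.split(b"\r\n"))
    if longest > 998:
        problems.append("line longer than 998 characters")
    elif longest > 78:
        problems.append("line longer than 78 characters (SHOULD limit)")
    return problems


print(check_body(b"hello\r\nworld\r\n"))
print(check_body(b"hello\nworld\n"))
```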

I daresay there are other complications to do with various MIME types etc but I’m pretty sure it is not a simple “it’s a Windows text file” issue.

Looking more closely at one of the offending emails, I see that the email (generated by Discourse) is using quoted-printable:

Content-Type: text/plain;
 charset=UTF-8
Content-Transfer-Encoding: quoted-printable

and the body of the email seems to have escaped the carriage returns but not newlines. E.g. the first few lines of this post look like this:

=0D
=0D
Make `=CE=BB` a keyword with identical meaning to `lambda`. =0D
Keyword `=CE=BB` would be more readable because it is shorter than `lambd=
a`, besides `lambda` basically means `=CE=BB`. =0D
=0D
Now that almost every PC can type non ASCII characters, we have snippets =
in every serious editor & we have formatters like Autopep & Black, I don'=
t see any reason to restrict code to English alphabet.=0D
=0D

Looking at the behaviour of the quopri module in Python, carriage returns need not (should not?) be encoded:

>>> import quopri
>>> text = "Make `λ` a keyword with identical meaning to `lambda`.\r\n"
>>> quopri.encodestring(text.encode('UTF-8'))
b'Make `=CE=BB` a keyword with identical meaning to `lambda`.\r\n'

I assume that Python’s implementation is correct.

So I have a hypothesis:

When Discourse emails a post containing non-ASCII characters, by default it uses quoted-printable.
The Discourse implementation of quoted-printable wrongly (?) encodes the carriage returns to =0D
mutt, following Postel’s Principle, accepts the bare LF as a line ending as if it were the mandated CRLF pair
and then decodes the =0D, which it displays as if it were an extra carriage return.

If my hypothesis is correct, then what I am seeing is the collision between a bug(?) in Discourse’s quoted-printable implementation, and mutt trying to be helpful by accepting bare LFs as line delimiters.
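The hypothesis can be simulated in a few lines of Python. This is a sketch under stated assumptions: the sample bytes are a shortened imitation of the quoted-printable body shown earlier (CR encoded as =0D, each line ending in a bare LF as it might sit in a Unix mailbox), and splitting on LF stands in for whatever mutt really does.

```python
import quopri

# Shortened imitation of the Discourse body: CR encoded as =0D, bare LF line ends.
encoded = b"Make `=CE=BB` a keyword.=0D\n=0D\nNow that almost every PC=0D\n"

# Undo the quoted-printable encoding; each =0D turns back into a real CR byte.
decoded = quopri.decodestring(encoded)

# A reader that treats LF as the line ending is left with a CR on every line.
lines = decoded.split(b"\n")
for line in lines:
    print(repr(line))
```

Every decoded line ends in a CR byte, which a console client may then show as ^M. If the stored message used CRLF line endings instead, the same decoding step would yield CR CR LF.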

It may be that encoding the CR as =0D is allowed, in which case mutt is definitely to blame here. It would be nice if other mutt users could chime in with their experiences. @cameron you use mutt don’t you?

Rosuav · February 21, 2023, 4:56pm

RFC 2045: Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies says that the end of line should NOT be encoded separately. So I’d put the blame on Discourse here.

cameron · February 21, 2023, 8:55pm

Looking at the behaviour of the quopri module in Python, carriage
returns need not (should not?) be encoded:
>>> text = "Make `λ` a keyword with identical meaning to `lambda`.\r\n"
>>> quopri.encodestring(text.encode('UTF-8'))
b'Make `=CE=BB` a keyword with identical meaning to `lambda`.\r\n'
I assume that Python’s implementation is correct

Like many things, that might be context dependent.

IIRC (and I’ll need to see some explicit examples and reread some RFCs)
the transport text is CRLF delimited i.e. the on-the-wire
post-encoding text. The objective of the encodings is normally to
preserve the original source text byte level correctly, thus QP et al to
encode those bytes for transport.
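A small Python sketch of that context dependence: the same source bytes can be encoded with the CRLF treated as a line break (text mode) or as ordinary data (binary mode), and both forms decode back to the original bytes. It uses the standard `quopri` and `binascii` modules; the sample string is adapted from the example above, and nothing here is a claim about what Discourse does internally.

```python
import binascii
import quopri

source = "Make `λ` a keyword.\r\n".encode("utf-8")

# Text mode (the default): the CRLF line break is left as a literal line break.
as_text = quopri.encodestring(source)

# Binary mode: CR and LF are treated as data bytes and escaped as =0D=0A.
as_binary = binascii.b2a_qp(source, istext=False)

print(as_text)
print(as_binary)

# Either way, decoding restores the source bytes exactly.
assert quopri.decodestring(as_text) == source
assert quopri.decodestring(as_binary) == source
```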

So I have a hypothesis:

When Discourse emails a post containing non-ASCII characters, by
default it uses quoted-printable.

That should be fine, provided the content type and encoding are
correctly specified.

The Discourse implementation of quoted-printable wrongly (?) encodes the carriage returns to =0D

That’s quite possible. We’d need to make a reproducible example (which
might involve the right flow, e.g. post-via-email vs
post-via-the-web-forum).

mutt, following Postel’s Principle, accepts the bare LF as a line ending as if it were the mandated CRLF pair

and then decodes the =0D, which it displays as if it were an extra carriage return.

Maybe? Again, we need an example of good-looking and extra-blank-lines
for comparison.

If my hypothesis is correct, then what I am seeing is the collision between a bug(?) in Discourse’s quoted-printable implementation, and mutt trying to be helpful by accepting bare LFs as line delimiters.

It may be that encoding the CR as =0D is allowed, in which case mutt is
definitely to blame here. It would be nice if other mutt users could
chime in with their experiences. @cameron you use mutt don’t you?

Extensively

Can we get together a collection of:

posts by you (Steven) which went through well formatted and which got
extra lines
matching links to the web forum, to see how they render there
a little tar file of the original message (eg from your “sent” folder,
if you do that)
matching tar file of the received message (IIRC Discourse doesn’t send
you your own posts, so I can dig into my copies here for the received
versions and they’ve been through Discourse)

That way we can:

examine how it went out, and with what encodings and source bytes
maybe find a pattern, and isolate how much comes from mutt’s decoding
and display behaviour and how much comes from Discourse processing the
message
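One possible way to do that examination, sketched in Python with the standard `email` and `mailbox` modules. The helper names (`first_text_part`, `summarise`, `summarise_mbox`) and the idea of collecting the samples as mbox files are assumptions for illustration only.

```python
import email
import mailbox


def first_text_part(msg):
    """Return the first text/plain part of a message, or None."""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            return part
    return None


def summarise(msg):
    """Collect the encoding details relevant to the ^M and blank-line issues."""
    part = first_text_part(msg)
    if part is None:
        return None
    cte = str(part.get("Content-Transfer-Encoding", "7bit")).lower()
    encoded = part.get_payload(decode=False)   # body as stored, still encoded
    decoded = part.get_payload(decode=True)    # bytes after undoing the encoding
    return {
        "charset": part.get_content_charset(),
        "cte": cte,
        # =0D only means "encoded CR" when the body is quoted-printable.
        "encoded_cr": encoded.count("=0D") if cte == "quoted-printable" else 0,
        "decoded_cr": decoded.count(b"\r"),
    }


def summarise_mbox(path):
    """Summarise every message in an mbox file (hypothetical path)."""
    return [summarise(msg) for msg in mailbox.mbox(path, create=False)]


# Tiny hand-made message imitating the Discourse output seen in this thread.
raw = (b"Content-Type: text/plain; charset=UTF-8\n"
       b"Content-Transfer-Encoding: quoted-printable\n"
       b"\n"
       b"first=0D\n=0D\nsecond=0D\n")
info = summarise(email.message_from_bytes(raw))
print(info)
```

Running `summarise_mbox` over a "sent" mbox and a "received" mbox and comparing the results side by side may help show which step introduces the extra CRs.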

Cheers,
Cameron Simpson cs@cskk.id.au

steven.daprano · February 25, 2023, 1:42am

Hi Cameron, thanks for taking an interest.

If you look at the first two posts in this topic, I link to various examples. Do they help?

Discourse will send you copies of your own posts in “mailing list mode” unless you explicitly disable it.

I can assemble a few examples of outgoing and incoming mail into a pair of mbox files, but I’m not sure where to post them. I don’t think Discourse allows uploading of arbitrary files.