Semantic line breaks

One Sentence Per Line sounds like a misinterpretation of both SemBr and Kernighan’s thoughts from the early 70s. Quoting the latter:

Make lines short, and break lines at natural places, such as after commas and semicolons, rather than randomly.

I hope we end up somewhere closer to SemBr and Kernighan than OSPL.

While I appreciate shorter, easier-to-read diffs, the below are not multiple sentences on multiple lines, but one sentence with multiple clauses spread across several lines:

While the next example (the one @hugovk does not want) is, in fact, one sentence on one line:

I think it is important to remember that documents need to still be readable when in text format,

and when
the sentences are spread
across several lines
it looks like poetry
where there are pauses
at the end of lines
and it completely destroys
the flow of the prose
making it harder
to understand
the intent.

I think that in the rare case where a “semantic piece of text” happens to be, say, 81 chars wide, there’s a conflict. Which one takes priority: a maximum of 79 chars per line, or semantic line breaks?

I’m not sure how common this would be (it depends somewhat on where one delineates one ‘idea’ or clause or etc. from the next).

Under the current system, 79 would still be the law, but it would be a very pedantic editor who requested that a line be broken because of a 2-character overage. Indeed, we are currently quite lenient on URLs, whereas on a strict reading we could require these to be broken as well (reST copes with URLs broken over multiple lines).

I think this is a matter of the letter of the law versus the application of the law, although I see why you make the argument.

I think that the idea of breaking lines to match the sentence structure is reasonable on the face of it, and probably does result in better diffs. But only if done with the above principle firmly in mind, and with the strong caveat that readability wins in all cases.

Unfortunately, the tendency with any style guideline is to attract people who like to apply that rule as if it were set in stone, and what was originally a useful and sensible principle becomes a burden. Hopefully this is less likely with text than with code (because it’s harder to write linters for text that blindly enforce a rule like this) but the risk is definitely there.

So I guess I’m in favour of the idea as a principle, but would prefer not to have it written down in a style guide.

If we need to specify something in this regard, we could specify a maximum line length (which I think we already have?) and explicitly state that there is no minimum line length, and that it is not necessary to reflow an entire paragraph if its first line(s) become shorter.

IIRC, the maximum line length for docs is 80 chars.

In fairness, it’s a standard — that’s its job. Even the law recognizes that vague standards are worse than no standards at all.

Starting the post with one sentence per line was somewhat misleading. I edited the first post to make my interpretation of the concept clearer:

Semantic line breaks are an alternative to
word-wrapping paragraphs at (say) the 79th column
in formats like ReST, Markdown or HTML, which allow arbitrary line wrapping.
Instead, lines are wrapped after periods
(or commas, semicolons, semantic pauses).
There's still a *maximum* line length, but lines are broken before that.
The advantage is that editing a few words
doesn't cause the whole paragraph to reflow.
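
As a rough illustration of the rule described above, here is a minimal Python sketch that breaks a paragraph after sentence-ending punctuation and, only when a sentence is still too long, after commas, semicolons or colons. The function name `semantic_wrap`, the regular expressions and the default width of 79 are assumptions made for this example, not part of any existing tool. It will mis-split abbreviations such as "e.g." and leaves a single overlong clause untouched, so a human still has the final say.

```python
import re

MAX_WIDTH = 79  # assumed limit, taken from the "79th column" mentioned above


def semantic_wrap(paragraph: str, max_width: int = MAX_WIDTH) -> list[str]:
    """Break after sentence ends; break overlong sentences at clause marks."""
    text = " ".join(paragraph.split())  # undo any existing wrapping
    lines = []
    # Split after '.', '!' or '?' when followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if len(sentence) <= max_width:
            lines.append(sentence)
            continue
        # The sentence is too long: fall back to clause boundaries,
        # packing as many clauses on a line as the width allows.
        current = ""
        for clause in re.split(r"(?<=[,;:])\s+", sentence):
            candidate = f"{current} {clause}" if current else clause
            if current and len(candidate) > max_width:
                lines.append(current)
                current = clause
            else:
                current = candidate
        lines.append(current)
    return lines


if __name__ == "__main__":
    sample = (
        "Semantic line breaks are an alternative to word-wrapping. "
        "Instead, lines are wrapped after periods, or commas, or semicolons, "
        "or other semantic pauses, so that editing a few words does not "
        "reflow the paragraph."
    )
    print("\n".join(semantic_wrap(sample, max_width=60)))
```

Clauses are packed greedily rather than broken at every comma, which is one possible reading of "lines are broken before the maximum"; a stricter or looser policy would be just as defensible.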

I’ve worked on many doc efforts over the years, and I think semantic line breaks are the way to go when you’re using markup/Markdown, provided you’re not faced with too many users whose editors (i.e. WYSIWYG) make it hard, and as long as it’s not too pedantic. It works if it’s a guideline couched as “if you’re looking for a place to break a line, prefer a logical break over just hitting Enter any old place”. Breaking at every bit of punctuation Just Because is not great.

I guess the question is: what are the goals? Markdown-type systems (md, reST, adoc, etc.) generally exist on the principle that the document should be fully readable when viewing the “source” rather than a rendering. Thus, editors might feel free to reflow things on changes. On the other hand, when changes are done via PR, reviewers need to have their lives made at least not completely miserable. [Aside: I periodically get yelled at by a maintainer on a project I’m working on when I submit attempted surgery on doc parts and “make his eyes bleed” because it’s so hard to pick out the changes in GitHub’s diff presentation. That one is DocBook XML, which is fiddlier than the Markdown class. I do try, but some of the existing mess means it can’t always be avoided.] If making reviewing reasonable is a larger goal, then semantic line breaks, definitely.

I’ve also found Git’s --word-diff (and its several operational
modes) helps a lot for reviewing documentation commits. I know some
code review platforms like Gerrit do a fairly good job of
highlighting wording changes independently of where lines have been
re-wrapped, but as I don’t really go near others like GitHub or
GitLab all that often, I have no idea whether they do the same.

I still find that more difficult to read than, say:

As far as diffs go, I would suggest using two:

  • make the changes without reflow (so the changes are easier to see)
  • do the reflow, with the only changes being the location of line breaks
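
A reviewer could sanity-check the second of those two commits with something like the sketch below, which treats two versions as equivalent when they differ only in where lines break inside each paragraph. The helper name `is_reflow_only` is hypothetical, and the check is deliberately naive: it assumes paragraphs are separated by blank lines and ignores places where newlines or indentation are significant (code blocks, literal blocks, reST directives).

```python
import re


def is_reflow_only(before: str, after: str) -> bool:
    """True if the texts differ only in line breaks within paragraphs."""

    def paragraphs(text: str) -> list[str]:
        # Paragraphs are assumed to be separated by one or more blank lines.
        blocks = re.split(r"\n\s*\n", text.strip())
        # Collapse every run of whitespace (including newlines) to one space.
        return [" ".join(block.split()) for block in blocks]

    return paragraphs(before) == paragraphs(after)


if __name__ == "__main__":
    old = "one two three\nfour five\n\nsix seven\n"
    new = "one two\nthree four five\n\nsix\nseven\n"
    print(is_reflow_only(old, new))
    print(is_reflow_only(old, new.replace("five", "5")))
```

Merging or splitting paragraphs counts as a content change here, since it alters the rendered output.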

While I might use semantic-ish breaking myself when writing new text, I expect everything to be autoformatted upon save in many editor configs. No reasonable autoformatter is going to get semantics as “right” as a human (I’m sure some ML model could, but so what…). So I’m fine allowing it… but not in order to exceed the line limit… and absolutely no requirement for it to be preserved and maintained by anyone during future edits.

In the end we’d be best off always auto-formatting.

Diff readability problems in code reviews are a problem with the diffing tool, not with the editing. It is entirely reasonable for code review diff tools to not be line based but instead understand reflowing and highlight only the meaningfully changed bits.
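
To make the idea of a non-line-based diff concrete, here is a small sketch using Python's standard `difflib` to compare two versions word by word, so that re-wrapping alone produces no changes. The function name `word_changes` is made up for this example, and it is not how any particular review tool works; it also knows nothing about markup, so it would wrongly ignore newline changes inside code blocks.

```python
import difflib


def word_changes(old: str, new: str) -> list[tuple[str, str, str]]:
    """Return (operation, old words, new words) for each changed run of words."""
    # Splitting on any whitespace makes line breaks irrelevant.
    old_words, new_words = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append(
                (tag, " ".join(old_words[i1:i2]), " ".join(new_words[j1:j2]))
            )
    return changes


if __name__ == "__main__":
    old = "Lines are wrapped\nafter periods or commas."
    new = "Lines are wrapped after\nfull stops or commas."
    for change in word_changes(old, new):
        print(change)
```

In the sample, the moved line break is ignored and only the replaced wording is reported.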

Sounds reasonable.

How would you avoid editors with slightly different settings reflowing whole files in slightly different ways on each save?

That essentially means the diff tool must know the markup language and handle places where newlines are significant (code blocks, blank lines for paragraphs…). I don’t think it’s a realistic default.
AFAIK, GitHub (the default tool for CPython reviews) doesn’t do this.

Since this topic is pretty important to me, due to the huge amount of time, stress, cognitive load and sub-optimal content-relevant choices it’s saved me over the years in docs/website/etc. repos that switched to OSPL, and the amount of the same it costs me every day as a PEP writer and editor in those that haven’t, I’d been intending to cogently outline the detailed case for it (at least in the context of the PEPs repo) once I’d established more credibility as a PEP editor and in the community. Unfortunately, though I should have known better and refrained from mentioning it until I was ready to do so, it seems my OT aside let the cat out of the bag, and on a day when we were dealing with a severe weather situation too.

In any case, I’ve created a related thread with a detailed proposal for OSPL, at least nominally scoped to the PEPs repo initially, which also contains a section that addresses some of the merits and practical difficulties with what appears to be the nominal proposal here, to use SemBr instead. I welcome your feedback over there, and can address points relevant to SemBr over here as well. Thanks!

EDIT: Just to clarify, my position on SemBr is that it is generally an improvement over “dumb” hard wrapping, especially for reST where it is a semi- or completely manual process anyway, and it could make more sense for projects like the CPython documentation, in terms of being practically easier to adopt incrementally and non-strictly-enforced on existing, gradually-updated content that currently uses hard breaks, despite the practical downsides I highlight versus OSPL.

However, the case for OSPL is much stronger particularly for repos like the PEPs, where:

  • The benefits of OSPL are more acute (given the high, concentrated amount of rewriting, editing and review during the PEPs’ pre-draft and draft stage)
  • The difficulties in understanding, teaching and consistently enforcing SemBr come to the fore (since many authors are first-timers or don’t write PEPs regularly, versus a contributor base experienced in technical English writing)
  • The adoption issues are mostly moot (existing non-Draft PEPs are rarely edited and stay as they are, new PEPs can adopt it, and Draft PEPs can if their authors choose to).

Also, this opens up the opportunity to trial it there, and then expand it to other repos if successful, incorporating any lessons learned.

GitHub does indeed have this feature, and I’ve found it extremely useful over the years when making, reviewing and suggesting changes on documentation, website and other prose-related content. Unfortunately, as you might expect, it doesn’t go so far as to parse and dynamically reflow reST content, so the reflows resulting from arbitrarily hard-wrapping text make it almost useless for such documents, whereas it is far more useful for content with OSPL and moderately useful with SemBr (though it essentially equalizes what would otherwise be SemBr’s nominal advantage over OSPL in terms of diff granularity).

I usually try to do this, but unfortunately it requires digging into the commit history when reviewing, which also means that GitHub’s suggestion feature, which I make extensive use of in PEP, docs, etc. PRs, doesn’t work at all. It also doesn’t solve the increased likelihood of merge conflicts, the final squashed commit diff is still equally noisy, and it still means one needs to do the work of reflowing at some point.

From now on, I will always split lines between sentences in new text. For example:

``*+``, ``++``, ``?+``
  Like the ``'*'``, ``'+'``, and ``'?'`` quantifiers, those where ``'+'`` is
  appended also match as many times as possible.
  However, unlike the true greedy quantifiers, these do not allow
  back-tracking when the expression following it fails to match.
  These are known as :dfn:`possessive` quantifiers.
  For example, ``a*a`` will match ``'aaaa'`` because the ``a*`` will match
  all 4 ``'a'``s, but, when the final ``'a'`` is encountered, the
  expression is backtracked so that in the end the ``a*`` ends up matching
  3 ``'a'``s total, and the fourth ``'a'`` is matched by the final ``'a'``.
  However, when ``a*+a`` is used to match ``'aaaa'``, the ``a*+`` will
  match all 4 ``'a'``, but when the final ``'a'`` fails to find any more
  characters to match, the expression cannot be backtracked and will thus
  fail to match.
  ``x*+``, ``x++`` and ``x?+`` are equivalent to ``(?>x*)``, ``(?>x+)``
  and ``(?>x?)`` correspondingly.

It will save the souls of future editors and reviewers.

I think that’ll certainly help, but I also think there are good arguments to be made for breaking it up more semantically. For example (with annotations):

``*+``, ``++``, ``?+``   # (1)
  Like the ``'*'``, ``'+'``, and ``'?'`` quantifiers,  # (1)
  those where ``'+'`` is appended also match as many times as possible.
  However, unlike the true greedy quantifiers,
  these do not allow back-tracking
  when the expression following it fails to match.
  These are known as :dfn:`possessive` quantifiers.
  For example, ``a*a`` will match ``'aaaa'``
  because the ``a*`` will match all 4 ``'a'``s,  # (2)
  but, when the final ``'a'`` is encountered,
  the expression is backtracked so that in the end
  the ``a*`` ends up matching 3 ``'a'``s total,  # (3)
  and the fourth ``'a'`` is matched by the final ``'a'``.
  However, when ``a*+a`` is used to match ``'aaaa'``,
  the ``a*+`` will match all 4 ``'a'``,  # (4)
  but when the final ``'a'`` fails to find any more characters to match,
  the expression cannot be backtracked and will thus fail to match.
  ``x*+``, ``x++`` and ``x?+`` are equivalent to  # (5)
  ``(?>x*)``, ``(?>x+)`` and ``(?>x?)`` correspondingly.

My reasoning:

  1. Though I concede it’s not very likely, in the grand scheme of things, one of the possible changes that would force an edit to this text is the addition of a fourth form of +-expression to go with the existing three. If such an addition were made, the fact that there’s room to add it in without having to reflow either of the first two lines is a nice-to-have.

  2. Having the fourth sentence (starts with “For example…”) broken up semantically calls attention to the fact that it’s, TBH, way too long.

  3. Way, waaaay too long. It’s still going here, five lines from where it started, and the entire thing spans six lines. Because of that length, it’s a bit hard to follow, as you can easily get lost along the way from start to finish.

    The semantic divisions help with that significantly. So they both call attention to the fact that it’s overlong, and help prevent that length from confusing the reader.

  4. By the same token, IMHO the semantic breaks make it far easier to spot that the end of this line is missing a plural; it should read

    all 4 ``'a'``s,
    

    the same way the line at annotation (2) does. That’s glaringly obvious in the semantically-divided version, more so (at least to me) than in the original. So, the semantic breaks help with proofreading and copyediting.

  5. The way this text is written, the three expression forms are always together at the start of a statement, whether or not it’s the beginning of a sentence. In the semantic layout, they’re similarly always on the same line, and always at the start of the line. That holds in all four locations where they appear together. (First line, second line, and last two lines.)

    In the original, the final group of three is broken across two lines by the line wrapping. Having them always together and always aligned to the left of the text makes it far easier to keep track of what exactly is being documented here, both in the text’s current form and certainly (circling back to point 1) if it ever had to be expanded.
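
The claim that a small edit touches fewer lines when breaks follow the sentence structure can be checked with a toy experiment. The sketch below counts removed plus added lines for the same one-word edit under two layouts: a paragraph hard-wrapped with `textwrap.fill`, and one line per sentence. The sample sentences, the width of 40 and the helper names are all invented for illustration; real documents will behave less neatly.

```python
import difflib
import textwrap

WIDTH = 40  # deliberately narrow so the short sample paragraph wraps


def changed_lines(old: list[str], new: list[str]) -> int:
    """Count lines removed plus lines added between two versions."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return sum(
        (i2 - i1) + (j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )


def hard_wrap(sentences: list[str]) -> list[str]:
    """Join the sentences and re-wrap them as one paragraph."""
    return textwrap.fill(" ".join(sentences), width=WIDTH).splitlines()


SENTENCES = [
    "Lines are wrapped after periods.",
    "There is still a maximum line length.",
    "Editing a few words changes one line.",
]
# The edit: a single word inserted into the first sentence.
EDITED = ["Lines are usually wrapped after periods."] + SENTENCES[1:]

hard_wrapped_changes = changed_lines(hard_wrap(SENTENCES), hard_wrap(EDITED))
sentence_per_line_changes = changed_lines(SENTENCES, EDITED)

if __name__ == "__main__":
    print("hard-wrapped:", hard_wrapped_changes)
    print("one line per sentence:", sentence_per_line_changes)
```

With the hard-wrapped layout the inserted word pushes text onto later lines, so the edit spreads beyond the sentence that actually changed.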

The guidance in PEP 12 recently changed to:

Lines should usually not extend past column 79, excepting URLs and similar circumstances. Tab characters must never appear in the document at all.

PEP authors are free to use semantic line breaks, or use Emacs to reflow paragraphs, or do anything else – except use long lines (so text stays readable both with and without line wrapping).
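
For anyone who wants to check a draft against that wording, a minimal sketch might look like the following. The function name `check_lines` and the URL pattern are assumptions made here. Exempting any line that contains a URL is only one reading of "excepting URLs", and the "similar circumstances" exemption is not modelled at all.

```python
import re

MAX_COLUMN = 79
URL_RE = re.compile(r"https?://\S+")  # assumed pattern for "a URL"


def check_lines(text: str, max_column: int = MAX_COLUMN) -> list[tuple[int, str]]:
    """Return (line number, problem) pairs for tabs and overlong lines."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if "\t" in line:
            problems.append((number, "tab character"))
        # A line containing a URL is exempt from the length limit.
        if len(line) > max_column and not URL_RE.search(line):
            problems.append((number, f"{len(line)} columns"))
    return problems


if __name__ == "__main__":
    sample = "\n".join(
        [
            "short line",
            "x" * 80,
            "See https://example.org/" + "a" * 80,
            "a\tb",
        ]
    )
    for number, problem in check_lines(sample):
        print(f"line {number}: {problem}")
```

Since the guidance says "usually", the output is better treated as a list of lines to look at than as hard failures.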

I’m late to the game here, but I want to emphasize Ethan’s point. The original attraction of markup systems like reST and Markdown (maybe their raison d’être) was that the text with its light markup was readable as-is. As opposed to, say, LaTeX.

My recommendation is that instead of adopting a somewhat meaningless convention just to make diffs more readable on GitHub, you separate content changes from formatting changes. If you need to add/delete/change some text, do it without reflowing the paragraph. After a while, if the paragraph (or document as a whole) gets tough to read, then make an edit which is nothing more than reformatting, with no content changes. This is no different from the admonition to separate semantic and formatting changes in the C or Python source code.

It also helps to have tools which do a better job of presenting you with the actual changes. I happen to (lo, these many years) still be an Emacs user. Its ediff system presents changes cleanly, identifying both the formatting bits (light green or red text background) and the content bits (somewhat darker green or red background). Here’s a simple edit of the C API’s unicode.rst file which demonstrates both.

I don’t normally use GitHub to compare differences between two versions of a file. Does it not do something meaningful like this? What other tools do people use to edit text that make it difficult to distinguish between semantic and formatting changes?

In the end, I hope it becomes a “to each their own” sort of thing. I wouldn’t want to see any particular editing style mandated.
