Discourse archive and backup


(Christian Heimes) #1

Over the last couple of weeks and months, Python core dev has started to use Discourse as a platform to make decisions or to justify actions. Before Discourse, we used email lists to come to agreements. Email lists like python-dev and python-committers were used as the authoritative, primary channel. I’m not counting Zulip, because we just use it as chat and not as an authoritative source for decisions.

While I have been enjoying Discourse so far, I see one issue: long-term archival and backup. With email lists, we had a simple archive on the primary mailing list server and multiple clones on news servers and mirrors like Gmane, Google Mail, and so on. The distributed nature, simple file format, and simple access make mailing lists a good long-term archive.

But how are future core developers, researchers, and archivists going to access our discussions on Discourse 10, 20, 50, or even 500 years from now? Python has become an important programming language and is likely to be of interest to researchers in the future. I’m sensitive to the topic because I used to work at a company that dealt with archiving and publishing data ranging from 2000-year-old manuscripts and medieval books to modern PDFs. Digital memory loss is a big issue for archivists these days.

Should we back up, archive, and publish Discourse at a regular interval in machine-readable formats like JSON? If we publish the dump, how are we going to deal with internal, non-public areas?


(Victor Stinner) #2

@EWDurbin is doing backups, but I don’t think that they’re public.


(Victor Stinner) #3

By the way, there was a very short “discussion” about commit messages:

Gregory Szorc wrote:

My work on the Firefox and Mercurial projects has groomed me to include a summary of changes and their rationale in the commit message because - unlike links to bugs/issues - the commit message can be accessed without an Internet connection and more importantly doesn’t require the reader to peruse possibly dozens of updates/comments to glean knowledge: they just have to look in one place (the commit message) to achieve understanding. This approach facilitates easier code archeology and from my experience helps complex projects scale. So that’s why I did what I did. (…)

Python bugs moved from SourceForge to bugs.python.org and may move to GitHub. At each migration, we lose a few bits of data.

Python code moved from CVS to SVN to HG and now to Git. Fortunately, it seems that no data loss occurred during these migrations!

Yesterday, I dug into asyncio history and I found commits which were “sync asyncio with Tulip” with references to code.google.com… That website has closed, but fortunately, for the specific case of Tulip, the project has been mirrored on GitHub at https://github.com/python/asyncio/ and I was able to access https://github.com/python/asyncio/issues/195 … but the bugs are badly formatted. For example, it’s confusing to have two dates and two authors (Google Code Exporter and the real author) per message.

Sometimes, discussions occur on the bug tracker, sometimes on the PR, sometimes on both. Will we keep the full history in a consistent way in 10 years?

Linux does a better job of tracking the history by requiring long commit messages which explain the full rationale.

Right now, all links to Subversion commits in the bug tracker are broken. http://svn.python.org/view/ doesn’t work and we only have a mapping between Subversion commits and Mercurial commits… whereas the code has been migrated to Git. Well, we didn’t lose all data, but it can be painful to find an old Subversion commit in the current Git repository. From time to time, I get stuck at “rXXX”.


(Senthil) #4

On the topic of Discourse vs Mailman, I think this will need a revisit now that most of us have evaluated this platform and know its pros and cons.

I agree with @tiran. Even if I am subscribed to the mailing list mode of Discourse, there is no way I can land on a particular decision by searching the web, as is the case when the discussions are held in public forums and archived by multiple parties.

This is great! Archives are valuable. They present the most accurate window into the past state.


(Hugo) #5

@tiran One option would be to subscribe to Discourse in email mode, and send that to a mailing list or lists.

It’ll miss out on poll results and edits and other “rich content”, but would make for a more durable archive.
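As a rough illustration of the "durable archive" end of that idea, here is a minimal sketch that appends each incoming notification email to a plain mbox file using only the Python standard library. The helper names, the addresses, and the sample message are hypothetical; how the messages actually arrive (a subscribed mailbox, a pipe from the mail server) is left out, and a real deployment would want file locking around the write.

```python
import mailbox
from email.message import EmailMessage


def archive_message(mbox_path, raw_message):
    """Append one raw email message (bytes) to the mbox file at mbox_path.

    The file is created if it does not exist yet. No locking is done here;
    that is an omission of this sketch, not a recommendation.
    """
    box = mailbox.mbox(mbox_path)
    try:
        box.add(raw_message)
    finally:
        # close() flushes pending changes to disk.
        box.close()


def make_sample_message():
    """Build a stand-in for a Discourse notification email (hypothetical content)."""
    msg = EmailMessage()
    msg["From"] = "discourse@example.invalid"
    msg["To"] = "archive@example.invalid"
    msg["Subject"] = "[Discourse] Discourse archive and backup"
    msg.set_content("Plain-text body of the post goes here.")
    return msg.as_bytes()


if __name__ == "__main__":
    archive_message("discourse-archive.mbox", make_sample_message())
```

The resulting file is the same simple format that existing mailing list archives and mirrors already know how to handle.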


(Neil Schemenauer) #6

Thanks for bringing this up. It concerns me as well. I was an early adopter of Usenet and I witnessed what I think is a tragic example of digital memory loss. Some people may remember Deja News. It was a big archive of nearly all the discussions on Usenet. Google acquired it (it became Google Groups) and eventually destroyed it. AFAIK, you can’t search for the old Usenet discussions.

Here is a little example of why these archives are useful. I am interested in the history of chess AI. With Deja News, you could read the original discussions between computer chess authors (rec.games.chess.computer). There were discussions from the authors of Deep Thought (which later became Deep Blue). There was computer analysis posted during the Deep Blue games with Kasparov. The archive of rec.games.chess.computer was a gold mine if you were interested in the history of chess AI.

Should we back up, archive, and publish Discourse at a regular interval in machine-readable formats like JSON? If we publish the dump, how are we going to deal with internal, non-public areas?

I hope we can be proactive rather than just hoping it will work out. With rich UIs, it will be more difficult to back things up in a readable format. It looks like Discourse has some sort of RPC API. Could we scrape discussion content using the API, archive it, and provide an archive API?
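To make the scraping idea concrete, here is a minimal sketch under stated assumptions. It assumes that a topic can be fetched as JSON by appending `.json` to its URL (`/t/<id>.json`) and that the document carries its posts under `post_stream.posts` with fields such as `username`, `created_at`, and `cooked` (the rendered HTML). Those names should be checked against the API of the Discourse version actually running; the function names and the output record layout are my own invention. Long topics may be paginated, which this sketch does not handle, and an anonymous fetch would only see what an anonymous visitor sees, so non-public areas would stay out of the dump.

```python
import json
import urllib.request


def fetch_topic(base_url, topic_id):
    """Download one topic as JSON (assumed endpoint: <base_url>/t/<topic_id>.json)."""
    url = f"{base_url.rstrip('/')}/t/{topic_id}.json"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def topic_to_records(topic):
    """Flatten a topic document into one small, self-describing dict per post."""
    posts = topic.get("post_stream", {}).get("posts", [])
    return [
        {
            "topic_id": topic.get("id"),
            "topic_title": topic.get("title"),
            "post_number": post.get("post_number"),
            "author": post.get("username"),
            "created_at": post.get("created_at"),
            "html": post.get("cooked"),
        }
        for post in posts
    ]


def dump_records(records):
    """Serialize records as JSON Lines: one post per line, easy to diff and mirror."""
    return "\n".join(
        json.dumps(record, sort_keys=True, ensure_ascii=False) for record in records
    )
```

Usage would look like `dump_records(topic_to_records(fetch_topic(base_url, topic_id)))`, run on a schedule and written to files that can be mirrored by multiple parties, much like the old mailing list archives.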

If I had extra time, I would like to build a new version of NNTP. Make the messages support a richer markup scheme (e.g. markdown). Use HTTPS as the base protocol. I bet a lot of ideas from JMAP (IMAP replacement using JSON) could be adopted.

Mailing lists never seemed like a good replacement for NNTP. With news readers, it was easy to ignore threads you were not interested in. Because they kept track of “read” status, it was easy to catch up and be fairly sure you didn’t miss anything. I’m not foolish enough to think we can resurrect Usenet and news readers. However, I think we should be trying to emulate that system rather than implement things as mailing lists.

Edit: maybe saying Google destroyed the Usenet archive is too strong. I just checked it now and a search is able to find some old postings. I had tried it some years ago and I could not find old posts. Maybe Google has fixed it since.


(Barry Warsaw) #7

I’ve always wanted to build NNTP or read-only IMAP into GNU Mailman, but it ain’t gonna come from me these days. Still, I like the NNTP protocol and Gmane shows how useful it can be.