Discourse archive and backup


(Christian Heimes) #1

Over the last couple of weeks and months, Python core developers have started to use Discourse as a platform to make decisions or to justify actions. Before Discourse we used email lists to come to agreements. Email lists like python-dev and python-committers were used as the authoritative, primary channels. I’m not counting Zulip, because we just use it as chat and not as an authoritative source for decisions.

While I have been enjoying Discourse so far, I see one issue: long-term archival and backup. With email lists, we had a simple archive on the primary mailing list server and multiple clones on news servers and mirrors like gmane, Google Mail, and so on. The distributed nature, simple file format, and simple access make mailing lists well suited to long-term archival.

But how are future core developers, researchers, and archivists going to access our discussions on Discourse in 10, 20, 50, and even 500 years from now? Python has become an important programming language and is likely to be of interest to researchers in the future. I’m sensitive to the topic because I used to work at a company that dealt with archiving and publishing data ranging from 2000-year-old manuscripts and medieval books to modern PDFs. Digital memory loss is a big issue for archivists these days.

Should we back up, archive, and publish Discourse at regular intervals in machine-readable formats like JSON? If we publish the dump, how are we going to deal with internal, non-public areas?
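
As a rough illustration of the question above, here is a minimal sketch of a publishable JSON dump that leaves out non-public areas. It assumes the topics have already been exported into plain dictionaries. The field names (`id`, `category`, `title`, `posts`) and the category names are hypothetical and are not taken from Discourse's actual export schema.

```python
import json


def export_public_topics(topics, private_categories):
    """Serialize only the topics that are outside the private categories.

    `topics` is a list of dicts with the hypothetical keys
    "id", "category", "title" and "posts".
    """
    public = [t for t in topics if t["category"] not in private_categories]
    # A stable order and sorted keys keep successive dumps easy to diff.
    public.sort(key=lambda t: t["id"])
    return json.dumps(public, indent=2, sort_keys=True, ensure_ascii=False)


if __name__ == "__main__":
    sample = [
        {"id": 2, "category": "Committers", "title": "Internal vote", "posts": []},
        {"id": 1, "category": "Users", "title": "Discourse archive and backup", "posts": []},
    ]
    print(export_public_topics(sample, private_categories={"Committers"}))
```

Filtering by an explicit set of private categories is only one possible policy. An allow-list of public categories would fail safer if a new internal category were added later.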


(Victor Stinner) #2

@EWDurbin is doing backups, but I don’t think that they are public.


(Victor Stinner) #3

By the way, there was a very short “discussion” about commit messages:

Gregory Szorc wrote:

My work on the Firefox and Mercurial projects has groomed me to include a summary of changes and their rationale in the commit message because - unlike links to bugs/issues - the commit message can be accessed without an Internet connection and more importantly doesn’t require the reader to peruse possibly dozens of updates/comments to glean knowledge: they just have to look in one place (the commit message) to achieve understanding. This approach facilitates easier code archeology and from my experience helps complex projects scale. So that’s why I did what I did. (…)

Python bugs moved from SourceForge to bugs.python.org and may move to GitHub. At each migration, we lose a few bits of data.

Python code moved from CVS to SVN to HG and now to Git. Fortunately, it seems that no data loss occurred during these migrations!

Yesterday, I dug into asyncio history and I found commits which were “sync asyncio with Tulip” with references to code.google.com… This website closed, but fortunately for the specific case of Tulip, the project has been mirrored on GitHub as https://github.com/python/asyncio/. For example, I was able to access https://github.com/python/asyncio/issues/195 … but the bugs are badly formatted. It’s confusing to have two dates and two authors (Google Code Exporter and the real author) per message, for example.

Sometimes, discussions occur on the bug tracker, sometimes on the PR, sometimes on both. Will we keep the full history in a consistent way in 10 years?

Linux does a better job of tracking the history by requiring long commit messages that explain the full rationale.

Right now, all links to Subversion commits in the bug tracker are broken. http://svn.python.org/view/ doesn’t work and we only have a mapping between Subversion commits and Mercurial commits… whereas the code has been migrated to Git. Well, we didn’t lose all data, but it can be painful to find an old Subversion commit in the current Git repository. From time to time I get stuck at “rXXX”.


(Senthil) #4

On the topic of Discourse vs Mailman, I think this will need a revisit now that most of us have evaluated the platform and know its pros and cons.

I agree with @tiran. Even if I am subscribed to the mailing list mode of Discourse, there is no way I can land on a particular decision by searching the web, as is the case when discussions happen on public forums and are archived by multiple parties.

This is great! Archives are valuable. They present the most accurate window into the past state.


(Hugo) #5

@tiran One option would be to subscribe to Discourse in email mode, and send that to a mailing list or lists.

It’ll miss out on poll results and edits and other “rich content”, but would make for a more durable archive.
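
To make the "durable archive" idea a little more concrete, here is a minimal sketch that writes posts into an mbox file using only the Python standard library (`mailbox` and `email`). It assumes the posts are already available as plain dictionaries. The keys (`author`, `topic_title`, `topic_id`, `number`, `created`, `text`) and the `discourse.invalid` domain used for Message-IDs are hypothetical placeholders, not anything Discourse itself produces.

```python
import mailbox
from email.message import EmailMessage
from email.utils import format_datetime


def posts_to_mbox(posts, mbox_path, list_domain="discourse.invalid"):
    """Append each post to an mbox file as a plain-text email message."""
    box = mailbox.mbox(mbox_path)  # created if it does not exist yet
    try:
        for post in posts:
            msg = EmailMessage()
            msg["From"] = post["author"]
            msg["Subject"] = post["topic_title"]
            # `created` is expected to be a timezone-aware datetime.
            msg["Date"] = format_datetime(post["created"])
            msg["Message-ID"] = (
                f"<topic-{post['topic_id']}-post-{post['number']}@{list_domain}>"
            )
            if post["number"] > 1:
                # Thread every reply under the first post of its topic.
                msg["In-Reply-To"] = f"<topic-{post['topic_id']}-post-1@{list_domain}>"
            msg.set_content(post["text"])
            box.add(msg)
        box.flush()
    finally:
        box.close()
```

As noted above, a text-only archive like this would not capture poll results, edits, or other rich content.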