GitHub Issues Migration is coming soon

ambv · February 18, 2022, 8:54pm

As you might know, the Steering Council is working on migrating the data that is currently residing in Roundup at https://bugs.python.org/ (BPO) into the GitHub issues of the CPython repository hosted there. The ultimate goal is to move user- and core developer-provided issue-reporting entirely to Github. We will leave BPO running in a read-only state to ensure existing URLs online continue working. Each issue that currently exists on BPO will include metadata indicating where it was moved on Github. New issues will only exist on Github.

We hope that this will lower the bar for newer contributors and allow for a much smoother user experience than we’re currently having. More details in the accepted PEP 581.

Unfortunately, this is not an easy task technically, procedurally, or legally, as it involves coordinating with several external actors and solving technical challenges mostly unique to our current circumstances. As a result, while progress was steady, it took a long while to get to this point. I was asked by the Steering Council to take over project management on the migration. Since January I’ve been working with @ezio-melotti and our friends on the Github side to push the transition to completion.

Feedback, please

At the current stage, we’re asking you to take a look at the links and important dates below, and share any feedback you might have. To help us keep this process effective, please report concrete issues on https://github.com/psf/gh-migration/issues/. You can treat it as an exercise in using Github issues. Questions and general discussions are of course welcome on https://discuss.python.org/ and python-dev.

Multiple dry runs of the technical side of things have already been completed. Please take a look at some example migrated issues:

Issues · python/issues-test-demo-20220218 · GitHub

We are also reworking the Issue Tracker documentation in the Developer’s Guide, including a new FAQ section that is intended to answer any questions BPO users may have with regards to differences in the workflow when working with Github issues. See the rendered PR here:

https://cpython-devguide--814.org.readthedocs.build/

Look for the “Issue Tracking” section which now includes two sub-pages: Github labels and “Github issues for BPO users”. If you’d like to see the raw PR, here it is:

Document using Github issues as the issue tracker by ambv · Pull Request #814 · python/devguide · GitHub

Important dates

NOTE: these dates were changed on 3/7 and changed again on 3/31.

We now propose the following migration roadmap:

Friday, February 18th 2022: public feedback gathering period begins.
Friday, March 18th 2022: final end-to-end test migration executed with Github’s help to gather timings and ensure no blockers. We will be using 10% of issues for that test.

Final migration:

Friday, April 8th 2022 (sic): migration begins. BPO is put in read-only mode at 6pm UTC / 2pm ET / 9pm IDT. Data from BPO is exported and put in a temporary repository on Github (this takes around 22 hours with our current timings).
Saturday, April 9th 2022: Github starts transfer of the issues in the temporary repository to github.com/python/cpython/.

The migration is estimated to take anywhere from 1 to 3 days, depending on the load on Github.com. This is why we will be performing the bulk of it during the weekend to speed things up. While the migration is happening:

creating new issues WILL NOT be possible either on Github or on BPO;
creating new PRs and interacting with existing PRs will be possible on Github without interruption;
interaction with already migrated issues on Github is possible but destructive actions (changing issue titles, editing comment content, deleting comments, removal of labels) are HIGHLY DISCOURAGED as it will make it harder for us to audit whether the migration fully succeeded. In case we are unable to make the in-flight issues invisible until the migration is done, we will create pinned issues and bug tracker templates that explain the situation clearly to users.

Once the migration is over, we will notify everybody that they can interact freely with Github issues. In the unlikely case that the migration cannot be completed in 7 days, the Steering Council decided that we would abort it and re-enable BPO again.
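The dates above can be cross-checked with a few lines of date arithmetic. This is only a sketch: the fixed UTC offsets assume that both US Eastern and Israel are on daylight time in April 2022, and counting the 7-day abort threshold from the moment BPO goes read-only is an assumption, not something stated in the plan.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets (assumption: daylight time applies in both zones in April 2022).
UTC = timezone.utc
EDT = timezone(timedelta(hours=-4), "EDT")
IDT = timezone(timedelta(hours=3), "IDT")

# BPO goes read-only on Friday, April 8th 2022 at 6pm UTC.
freeze = datetime(2022, 4, 8, 18, 0, tzinfo=UTC)

# The export into the temporary repository takes around 22 hours.
export_done = freeze + timedelta(hours=22)

# Assumption: the 7-day abort threshold is counted from the freeze.
abort_deadline = freeze + timedelta(days=7)


def show(label, moment):
    """Print one milestone in UTC, ET, and IDT."""
    print(
        f"{label}: {moment:%a %b %d %H:%M} UTC / "
        f"{moment.astimezone(EDT):%H:%M} ET / "
        f"{moment.astimezone(IDT):%H:%M} IDT"
    )


show("BPO read-only", freeze)
show("Export finished (estimate)", export_done)
show("Abort deadline", abort_deadline)
```

With these inputs the export estimate lands on Saturday afternoon UTC, which is consistent with the transfer to python/cpython starting on Saturday, April 9th.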

Details of the plan along with some risk mitigation strategies are described in Migration and risk management plans · Issue #13 · psf/gh-migration · GitHub. That plan overrides what is currently written in PEP 588 at this point; we will merge the contents next week so that the PEP is up-to-date.

The LLVM project made a similar migration from Bugzilla in November 2021 and it took them 21 days to complete. We are lucky to be able to use their experience on the matter. We also have time estimates based on existing test runs and communication with Github employees. We are looking into ways of accelerating the process ~~but it looks like it might not be feasible to move our database of over 50,000 issues any faster than in 4 - 7 days~~. Github managed to figure out a way to speed up the migration so it takes less than 4 days in total.

Lastly, several legal concerns with respect to the procedure, process, and content were raised. In particular, the question of whether the Python Software Foundation should be able to move user-generated content as well as potentially personally-identifiable information from BPO to Github.com. The Steering Council together with Python Software Foundation lawyers resolved this issue with the following conclusion: the migration of BPO to Github is a ministerial, internal issue, and therefore one that doesn’t require user consent. Both BPO and Github are public-facing systems. Users actively placed their information (including PII) in the BPO system, which actively grants consent for that information to be stored, publicly accessible, and distributed on-demand. Changing our backend to Github does not revoke that permission. At the same time, the migration will not be surfacing any new user information that wasn’t previously publicly accessible in the BPO system.

Summary

Summing up, the migration is underway. We’re doing everything we can so that by PyCon US it will already be old news.

In the meantime, please look at the test runs to see if everything is clear to you. If you have any questions, please look at the devguide PR to see if they are already answered in the FAQ. If not, let us know so that we can add the question.

eric.snow · February 18, 2022, 9:20pm

Thanks for the great write-up and to everyone involved for moving this forward!

Ouch!

Is there a way to mark the (large) subset of older/closed/inactive issues (i.e. very unlikely to be modified) as read-only and migrate those first? Then proceed with the remainder per the plan you described above. Assuming the remaining issues would be much fewer than the inactive ones, I’d expect the disruptive part of the migration would be much (proportionally?) shorter.

ezio-melotti · February 18, 2022, 10:12pm

This is technically doable, but the issue IDs will end up scrambled if we import new issues first and old issues later, since the transfer tool imports them chronologically and assigns them sequential IDs starting from the ID of the last PR plus one (PRs and issues share the same namespace). The original issue IDs will be lost anyway (also due to conflicts with existing PR numbers), but I’m trying to at least preserve the assumption that the ID order matches the creation order.

If we decide that this is not important and to migrate in two stages, I would have to tweak the exporter tool to select and export the newer subset first (or just open issues), then a different subset with the remaining issues (while being careful not to miss any), then import them into two separate new repos, then transfer the newer subset to python/cpython (these will get lower IDs), then open up the repo to the public so that users will be able to open new issues (with higher IDs), and then transfer the older issues (which will get even higher IDs).
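To make the ordering concern concrete, here is a toy simulation of the numbering behaviour described above. It is not the real transfer tool: the function names, the six BPO ids, the "last PR" number, and the count of user-created issues are all made up for illustration.

```python
def assign_numbers(bpo_ids, first_free):
    """Give each BPO id the next sequential number, in the order supplied."""
    return {bpo_id: first_free + offset for offset, bpo_id in enumerate(bpo_ids)}


def order_preserved(mapping):
    """True if sorting by BPO id and sorting by new number agree."""
    by_bpo = sorted(mapping)
    by_new = sorted(mapping, key=mapping.get)
    return by_bpo == by_new


# Hypothetical data: six BPO ids in creation order, last PR number is 100.
all_issues = [11, 12, 13, 14, 15, 16]
last_pr = 100

# Single pass: everything is transferred chronologically.
single = assign_numbers(all_issues, last_pr + 1)

# Two stages: the newer subset first, then users open new issues,
# then the older subset is transferred and gets even higher numbers.
newer, older = [14, 15, 16], [11, 12, 13]
stage_one = assign_numbers(newer, last_pr + 1)
new_user_issues = 2  # hypothetical issues opened between the two stages
stage_two = assign_numbers(older, last_pr + 1 + len(newer) + new_user_issues)
two_stage = {**stage_one, **stage_two}

print("single pass keeps order:", order_preserved(single))
print("two stages keep order:", order_preserved(two_stage))
```

In the two-stage run the oldest BPO issues end up with the highest new numbers, which is the scrambling described above.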

Issue (re)numbering is discussed at Issue (re)numbering · Issue #1 · psf/gh-migration · GitHub (even though most of the ideas suggested earlier in that thread have been abandoned due to the limitations of the transfer tool).

ambv · February 18, 2022, 10:21pm

@ezio-melotti, if I understand correctly, Eric is asking about the opposite to what you described. He’d like to migrate the old issues first, only marking those as read-only on BPO.

This idea is very complicated in practice because:

as soon as those closed issues were migrated, Github issues would be open and would require policing that “this is for old issues only for now, please don’t create new ones!”;
while closed issues are closed, some still receive activity, to the point of being re-opened.

We currently don’t have the ability to keep Github issues disabled during the migration. Our friends at Github are looking into whether this could be introduced but from their initial assessment, it looks like they would need a much longer timeframe to implement changes to their migration tooling than we are willing to wait.

The idea to have two issue trackers open at the same time is making me nervous.

ezio-melotti · February 18, 2022, 10:32pm

Oops, I got that backward.

I agree that having two issue trackers open at the same time is very error prone, and something we should avoid. Starting the migration from the older issues (after we freeze bpo) is what we are currently planning to do. That’s why I thought @eric.snow was proposing the opposite, since that might indeed make the transition shorter if we are willing to sacrifice ID ordering.

eric.snow · February 18, 2022, 11:23pm

Thanks to both of you for the clear responses.

Also, I don’t mean to suggest more work for anyone. I’m only thinking of possibilities for reducing the downtime to the core workflow. 4-7 days feels like the end of the world.

FYI, I feel a little awkward offering advice like this on a project on which I’m unlikely to be putting actual effort. Just know that I support this project, I believe you’ll do your best, and you have my full support (whether or not you take my suggestions).

Yikes! Is there any indication of how this impacted contributions afterward?

Correct!

It’s a shame that we can’t disable issues temporarily.

Would it be possible to migrate the inactive issues into a different GH repo and then move the issues over to the CPython repo afterward (but just before migrating all the active ones)? IIRC, GH already has a public (UI) mechanism to move issues between repos. Or perhaps we’d need the fine folks at GitHub to do that part for us due to the scale and the race on creating new issues. Regardless, that’s an established workflow on GH so as a workaround it’s less likely to be fragile or problematic.

Agreed. That’s why I suggested marking those issues as read-only in BPO (if possible). Anyone that wants to re-open one of those issues could wait until the migration is over. This is worth it if it means the project-wide downtime is drastically reduced.

To be clear, I’m not saying we pre-migrate all closed issues. We would do it for closed issues that were closed more than 3(?) months ago and any open issue that hasn’t had any activity for 6(?) months. I’d call that “relatively inactive”. The actual thresholds would roughly match issues that are unlikely to have any activity between the start of the pre-migration and the end of the actual migration.
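As a rough illustration of that selection rule, the sketch below partitions a handful of issues into "pre-migrate" and "migrate later". The `Issue` class, the sample data, and the 90/180-day stand-ins for 3 and 6 months are all hypothetical; the real exporter would work from BPO's own data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Issue:
    bpo_id: int
    last_activity: datetime
    closed_at: Optional[datetime] = None  # None means the issue is still open


def is_relatively_inactive(issue, now,
                           closed_for=timedelta(days=90),
                           idle_for=timedelta(days=180)):
    """Closed long enough ago, or open but idle for long enough."""
    if issue.closed_at is not None:
        return now - issue.closed_at > closed_for
    return now - issue.last_activity > idle_for


# Hypothetical sample data.
now = datetime(2022, 2, 19)
issues = [
    Issue(1, datetime(2021, 1, 10), closed_at=datetime(2021, 1, 10)),
    Issue(2, datetime(2022, 1, 20), closed_at=datetime(2022, 1, 20)),
    Issue(3, datetime(2021, 6, 1)),
    Issue(4, datetime(2022, 2, 1)),
]

pre_migrate = [i for i in issues if is_relatively_inactive(i, now)]
migrate_later = [i for i in issues if not is_relatively_inactive(i, now)]

print("pre-migrate:", [i.bpo_id for i in pre_migrate])
print("migrate later:", [i.bpo_id for i in migrate_later])
```

The thresholds are keyword arguments so that they can be tuned once real activity data is available.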

Yeah, it’s definitely worth avoiding that.

jack1142 · February 19, 2022, 12:44am

I have a question about those mannequin users. Is it possible to filter issues by a specific mannequin author? author:username doesn’t seem to be enough but perhaps there’s some other way to do this.

ezio-melotti · February 19, 2022, 12:52am

Your comment already led us to consider and discuss a couple of different approaches and optimizations. We will perform a full test migration beforehand to get a more accurate time estimate, and then consider the trade-offs involved and decide whether to add optimizations or not.

If I understand correctly the actual transfer eventually took them a couple of days, but it had a few false starts and issues. I’ve been talking with the project manager of the LLVM project and a few other people that performed similar migrations in the past, so that we could learn from their mistakes and avoid them.

I can ask him about the impact of the migration, even though their goals and priorities were somewhat different from ours.

The migration plan describes this in more detail, but basically we are doing what you described:

make bpo read-only and export all issues
import all issues in an empty repo
transfer the issues from the empty repo to python/cpython

The first step takes about half an hour, the second step about a day, and the rest is taken by the third step.
The third step is already handled by GitHub from their backend and we don’t have control or opportunities to optimize it unless we decide to import a smaller subset first (e.g. only the open/active issues) and open up the issues to everyone while the older issues are still being transferred (this is possible, but will mess up the ID order). There is some room for optimization during the second step, but we have to see if it’s worth the extra time/complexity/risk.

Exporting old issues and marking them as read-only as they are exported is technically doable, but will require some work. The main issue with this is that if we can’t keep the “issues tab” disabled on GitHub during the transfer, people will see some old issues appearing on python/cpython and might start interacting with them and creating new ones before the migration is completed.

We are currently looking into possible solutions to fix this, so stay tuned.

ezio-melotti · February 19, 2022, 1:14am

Good question! GitHub supports a few different filters: author:, assignee:, commenter:, mentions:, involves: but none of these seem to work with mannequins. All nosy list members are @mentioned at the top of each issue, but apparently this is not enough to make mentions: work.

I will ask the GitHub team if this is expected and if there is any solution, and in the meanwhile I can offer two workarounds:

since all the nosy list members are listed at the top, just a plain search for the username will return all the issues the user was following. In the result list, the author is also displayed, so (assuming there aren’t too many), you could manually select the ones with a certain author.
we are planning to send out an email to bpo users after the migration with a list of issues they created and/or followed and the corresponding GitHub link. This will allow them to subscribe to those issues. See Notify bpo users once the migration is done · Issue #12 · psf/gh-migration · GitHub for more details.
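For the first workaround, the two kinds of search can be written as ordinary issue-search URLs. This is a small sketch using only the standard library; `issue_search_url` is a made-up helper and `some-user` is a placeholder username. The qualified form is the one reported above as not matching mannequins, while the plain-text form matches the nosy list printed at the top of each migrated issue.

```python
from urllib.parse import urlencode


def issue_search_url(repo, *terms):
    """Build a GitHub web search URL for issues in the given repository."""
    query = " ".join(("is:issue",) + terms)
    return f"https://github.com/{repo}/issues?" + urlencode({"q": query})


# Qualified search: reported not to work for mannequin authors.
print(issue_search_url("python/cpython", "author:some-user"))

# Plain-text search: finds issues whose text mentions the username.
print(issue_search_url("python/cpython", "some-user"))
```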

Unfortunately there doesn’t seem to be a way to automatically subscribe other people, except for mentioning them en masse generating a flood of notifications.

Edit: GitHub said that mannequins are not expected to show up in user searches (e.g. author:).

eric.snow · February 19, 2022, 1:15am

Would it be possible to first create a dummy GH issue for each BPO issue, remember the mapping, and then migrate the relevant subset (e.g. active-only) to the corresponding GH issues? Then we could “quickly” migrate all the active issues first without losing the desired ID order.

(Honestly, that sounds messy and too complicated, but it at least seems feasible.)

ezio-melotti · February 19, 2022, 1:36am

AFAIK the transfer tool can only transfer whole issues at once, and it will create a new issue with a new ID – it can’t transfer into existing issues.

iritkatriel · February 19, 2022, 9:10am

Is it part of the plan to ask people to review old issues and close them if they are no longer relevant? This migration is an opportunity to deep clean bpo.

If we migrate an issue and nobody from the nosy list chooses to follow it on github, or even just the OP, it would be nice to know why that is.

ambv · February 19, 2022, 9:11am

We haven’t thought that far ahead but this sounds like a great idea.

vstinner · February 21, 2022, 12:08pm

I’m in the nosy list of 885 BPO issues. Should I expect 885 emails, or can you disable sending email notifications to all users of the 40k+ issues?

ezio-melotti · February 21, 2022, 1:59pm

The email I was talking about would be a single email sent from bpo, listing all issues you created/followed/are assigned to. This would be especially useful for occasional contributors that follow at most one or two dozen issues, since they can go through them, review them, and resubscribe manually (if they are still interested).

However, I do realize this is not ideal for very active contributors like you (I’m in the nosy list of over 4000 issues myself, so I definitely hear you). I’m still looking for a better solution that would allow us to either preserve the nosy list subscriptions, or restore them after the migration.

vstinner · February 21, 2022, 2:58pm

Maybe give a list of the 50 most recent issues. For persons who are in the nosy list of more than 50 issues, add a link listing all issues?

Will the email contain a link to the “new” GitHub issue?
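A capped digest along those lines could look roughly like the sketch below. Everything in it is hypothetical: the `build_digest` helper, the tuple layout, the sample issues, and the `example.invalid` URLs standing in for the real GitHub and bpo links.

```python
def build_digest(issues, limit=50, full_list_url=None):
    """Format the most recent issues, plus a pointer to the full list.

    issues: list of (bpo_id, github_url, last_activity) tuples, where
    last_activity is an ISO date string so that it sorts chronologically.
    """
    recent = sorted(issues, key=lambda item: item[2], reverse=True)[:limit]
    lines = [f"bpo-{bpo_id}: {url}" for bpo_id, url, _ in recent]
    remaining = len(issues) - len(recent)
    if remaining > 0 and full_list_url:
        lines.append(f"...and {remaining} more: {full_list_url}")
    return "\n".join(lines)


# Hypothetical sample data with placeholder URLs.
sample = [
    (45000, "https://example.invalid/issues/1", "2022-02-10"),
    (12000, "https://example.invalid/issues/2", "2015-03-02"),
    (46000, "https://example.invalid/issues/3", "2022-02-18"),
]

digest = build_digest(sample, limit=2,
                      full_list_url="https://example.invalid/all-my-issues")
print(digest)
```

The limit is lowered to 2 here only to show the overflow line; the suggestion above is 50.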

ezio-melotti · February 21, 2022, 3:45pm

I updated the issue tracking this (Notify bpo users once the migration is done · Issue #12 · psf/gh-migration · GitHub) with some more comments and ideas.

The lists are already available on bpo by clicking on the sidebar, but currently there’s no easy way to open GH issues directly from the list (this can be fixed though, I added a note to Add links from bpo to GitHub · Issue #15 · psf/gh-migration · GitHub). A mail with the lists will still be useful to inform users about the migration and for users that lost their bpo access and can’t access the lists.

ezio-melotti · February 21, 2022, 4:31pm

Good idea! I added a note about this in Notify bpo users once the migration is done · Issue #12 · psf/gh-migration · GitHub

People tend to report, comment on, and follow issues when they are relevant to them. As time passes, they either find other workarounds and solutions, lose interest, forget the context of the issue, or get busy with other things, and this might lead them to ignore those issues (or in some cases actively unsubscribe, especially if they are a source of noise in their inbox). This doesn’t necessarily make the issue invalid: it might still affect people, even though they are not subscribed to the issue.

On the other hand, recently @guido shared an interesting article about noisy monitors (by Sam Schillace) that argues that the noise caused by those old issues might end up being more detrimental than the issues themselves.

Some time after the migration we can certainly review issues with no subscribers and decide whether we want to close them or not. This will also become easier if we set up an action to mark inactive issues as stale (even though, IMHO, the final judgment should be given by a human and they shouldn’t be closed automatically).

steve.dower · February 22, 2022, 2:29pm

The closed IDs aren’t going to be useful anyway though, right? All existing commits and NEWS items will still have to go through bpo anyway, or is there some redirection mapping that will handle that?

While there’s always some amount of further discussion on closed issues, the vast majority are never going to be touched again. Why recreate them?

iritkatriel · February 22, 2022, 2:47pm

If you want to search closed tickets for some error message, for instance, you want to search in only one place.

There are issues where the problem is not fixed, but the ticket has relevant discussion and workarounds.