Shared student IDs between courses

Imagine I have a spreadsheet with a campus’ courses and rosters for a specific semester. The spreadsheet has 2 columns: 1 column with course IDs, and the other column with student IDs for all students enrolled in each course.

How would I build a heat map / correlation matrix / something that will identify pairs of courses with the highest number of shared student IDs?

By Robert Brace via Discussions on Python.org at 30Aug2022 14:25:

Imagine I have a spreadsheet with a campus’ courses and rosters for a
specific semester. The spreadsheet has 2 columns: 1 column with
course IDs, and the other column with student IDs for all students
enrolled in each course.

How would I build a heat map / correlation matrix / something that will
identify pairs of courses with the highest number of shared student
IDs?

I’d do the following:

  • load the data from the spreadsheet using the builtin csv module if
    it is CSV data, openpyxl is it is an Excel spereadsheet, work
    something out for other formats
  • shared students is a set relationship between the student ids of each
    course; Python has a builtin set type for stuff like this; make a
    mapping (probably a dict) of course id to set-of-student-ids; find
    common students with a set intersection between the sets for chosen
    pairs of courses
  • highest shared might be absolute (length of the interection set) or as
    a ratio (that length versus the membership count of the courses); it
    is easy to compute both
  • you can make a grid of courses as a 2-dimensional pandas
    DataFrame, with shared sizes as the data values
  • you can plot that grid using the DataFrame’s plot method

Cheers,
Cameron Simpson cs@cskk.id.au