Imagine I have a spreadsheet with a campus’ courses and rosters for a specific semester. The spreadsheet has 2 columns: 1 column with course IDs, and the other column with student IDs for all students enrolled in each course.
How would I build a heat map / correlation matrix / something that will identify pairs of courses with the highest number of shared student IDs?
By Robert Brace via Discussions on Python.org at 30Aug2022 14:25:
Imagine I have a spreadsheet with a campus’ courses and rosters for a
specific semester. The spreadsheet has 2 columns: 1 column with
course IDs, and the other column with student IDs for all students
enrolled in each course.
How would I build a heat map / correlation matrix / something that will
identify pairs of courses with the highest number of shared student
IDs?
I’d do the following:
load the data from the spreadsheet using the builtin csv module if
it is CSV data, openpyxl is it is an Excel spereadsheet, work
something out for other formats
shared students is a set relationship between the student ids of each
course; Python has a builtin set type for stuff like this; make a
mapping (probably a dict) of course id to set-of-student-ids; find
common students with a set intersection between the sets for chosen
pairs of courses
highest shared might be absolute (length of the interection set) or as
a ratio (that length versus the membership count of the courses); it
is easy to compute both
you can make a grid of courses as a 2-dimensional pandas DataFrame, with shared sizes as the data values
you can plot that grid using the DataFrame’s plot method