Implementing in Statistics the Shapley value

Context

The Shapley value is an analysis tool for coalitional game in game theory (mathematics), but it can also be paired with the Sobol indices to create a tool to analyze strong correlations [Owen, 2014]. The main idea is that instead of analyzing the participation of each variable at once, you will compute a global-scale variable that will summarize all the contributions and cross-contributions of the variables.
For example, it can be used in PWR nuclear reactors to assign a number on the contributions of each uncertainties [Clouvel 2019].
It is described as the following:
Let phi_i be the Shapley value of player i out of a set of players N (whose cardinal is noted n).
Consider A_i the set of all subsets of N without the player i.
Consider v a function to measure the synergy (the importance) of a group of player.
phi_i is the sum over A_i of v(subset union player i) minus v(subset), multiplied by a normalizing variable (whose value is 1/n*C(n-1, cardinal(subset)) ).

A Naive Algorithm

We can start building a naive algorithm. Of course, as it relies heavily on combinatorics, itertools must be used. Here is my attempt:

import itertools
import typing
from math import comb
from collections import OrderedDict
from pprint import pprint

def shapley_individual(player_i: str,
                       set_omega: typing.Set[str],
                       my_function: typing.Callable[[typing.Set[str]], float]) -> float:
    """For each player player_i represented by a string, part of the set of players set_omega,
    computes the Shapley value of this player
    using the my_function function to evaluate the players' importance ("synergy") in the team"""
    N = len(set_omega)
    my_sum = 0

    # getting the list of subsets of set_iteration to iterate over
    set_iteration = set_omega.copy()
    set_iteration.discard(player_i)
    list_of_sets = []
    for r_value in range(N-1):
        # Trying all values of r up to N-1
        for my_subset in itertools.combinations(set_iteration, r=r_value):
            list_of_sets.append(set(my_subset))
    # list_of_sets now contains all the sets of iteration

    # Summing over each subset
    for my_subset in list_of_sets:
        # print(my_subset)
        # print(my_subset | {player_i})
        left_value: float = my_function(my_subset | {player_i})
        right_value: float = my_function(my_subset)
        # Denominator
        denominator: float = comb(N-1, len(my_subset))
        my_sum += (left_value - right_value)/denominator
    return my_sum


def my_shapley(
        set_a: typing.Set[str],
        my_function: typing.Callable[[typing.Set[str]], float]) \
        -> typing.Dict[str, float]:
    """Calls shapley_individual for each player,
    then returns the result as a dictionary."""
    dict_result = OrderedDict()
    assert my_function(set_a) == 1 # Condition of normalization
    for player_i in set_a:
        dict_result[player_i] =\
            shapley_individual(
                player_i,
                set(set_a),
                my_function)
    return dict_result


def my_sobol(my_set: typing.Set[str]) -> float:
    """
    Placeholder for the function to insert in myShapley
    Must hold the property that mySobol of the universe Omega is 1.
    There exists other properties that can be used.
    """
    # Here, in this example, with 3 elements, S(Omega) = 1 implies
    normalizing_value = 3
    return len(my_set) / normalizing_value


my_set_a = {'A', 'B', 'C'}

my_dict_result = my_shapley(my_set_a, my_sobol)
pprint(my_dict_result)

Flaws of the Naive Algorithm

This algorithm is straightforward using its mathematical definition. However, it seems far from being perfect.

Calculation

We compute twice s(A), twice s(B), but also s(‘A’,‘B’). This can seem low, but this problems scales up quickly as the number of variables increases. It could be interesting to perhaps precalculate those values?

Use of mathematical properties

The Shapley values has plenty of useful properties that could be used to simplify the calculations, especially symmetry, that could perhaps improve the algorithm.

Improve the types that can be passed

Here, the Sobol value manipulated was a float, which can seem convenient, but in this precise example, perhaps a Fraction object could be more interesting to manipulate. Also the set of elements was a set of strings, but it could on paper be a set of anything.

Scaling

Having done tests with large scales of similar versions of this naive algorithm, this algorithm does not work well with large sets.

Idea

I request to implement an efficient algorithm in Python statistics to perform such tasks, because it is, despite being very specific, quite useful. Having an efficient version of it that can be picked up by anyone is great. Perhaps implementing it in CPython with a storage of values could be interesting.

There is a third party package on PyPI that calculates Shapley values in some sense, although it is focused on applying them to calculate explanations of models: https://github.com/slundberg/shap.

To your other comment, algorithms for very specific problems like this are generally not added into the core Python project but maintained as 3rd party packages.

1 Like

I think that the Shapely value is too specialised and too advanced for the Python standard library statistics module.

It may even be too specialised for SciPy.

Can you give an example of using the Shapely value suitable for, let’s say, high school students, or first year undergraduates?

Ok, I can close this discussion, I requested it in the SciPy.

It is true that the Shapley value is not for undergraduates necessarily so yes, it doesn’t make sense to have it in the statistics module.
Regarding the slundberg/shap, I tried to answer in the SciPy issue.

Thank you all for your remarks.