Clustering
Equality of Outcome Metrics
Social Fairness Ratio: Given a centroid-based clustering, this metric computes the average distance to the nearest centroid for each group. The metric is the ratio of the resulting distance for group_a to that for group_b.
A value of 1 is desired. Lower values indicate that group_a is on average closer to its nearest centroids than group_b. Higher values indicate that group_a is on average further from its nearest centroids.
\[\text{Social Fairness Ratio} = \frac{d_{a}}{d_{b}}\]
where \(d_{g}\) is the average distance to the nearest centroid for group \(g\).
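The definition above can be sketched in NumPy as follows. This is a minimal illustration, not the library's implementation: the function name `social_fairness_ratio`, its argument order, the use of Euclidean distance, and the boolean-mask encoding of the groups are all assumptions made for the example.

```python
import numpy as np

def social_fairness_ratio(X, centroids, group_a, group_b):
    """Ratio of the mean nearest-centroid distance of group_a to that of group_b.

    Hypothetical helper: Euclidean distance and boolean group masks are assumed.
    """
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Pairwise distances, shape (n_samples, n_centroids)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    nearest = dists.min(axis=1)  # distance of each sample to its nearest centroid
    d_a = nearest[np.asarray(group_a, dtype=bool)].mean()
    d_b = nearest[np.asarray(group_b, dtype=bool)].mean()
    return d_a / d_b

# Toy data: two centroids on a line, two samples per group
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [13.0, 0.0]])
centroids = np.array([[0.0, 0.0], [10.0, 0.0]])
group_a = np.array([True, True, False, False])
group_b = ~group_a

ratio = social_fairness_ratio(X, centroids, group_a, group_b)
print(ratio)  # below 1: group_a sits closer to its centroids than group_b
```

In this toy case the nearest-centroid distances are 0, 1, 0 and 3, so group_a averages 0.5 and group_b averages 1.5.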
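Silhouette Difference: This metric computes the difference between the mean silhouette scores of the two groups.
The silhouette difference ranges from -1 to 1, and a value of 0 is desired. Negative values indicate that group_a has the lower mean silhouette score (bias against group_a), and positive values indicate that group_b has the lower mean silhouette score (bias against group_b).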
\[\text{Silhouette Difference} = \text{silhouette}_{a} - \text{silhouette}_{b}\]
where \(\text{silhouette}_{g}\) is the mean silhouette score for group \(g\).
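As a rough sketch of this metric, the snippet below computes per-sample silhouette scores with plain NumPy and then subtracts the group means. The helper names `silhouette_per_sample` and `silhouette_difference` are hypothetical, Euclidean distance is assumed, and the order of the subtraction (group_a minus group_b) is an assumption rather than something the text above fixes.

```python
import numpy as np

def silhouette_per_sample(X, labels):
    """Standard silhouette score (b - a) / max(a, b) for every sample."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # convention: a singleton cluster scores 0
        # Mean distance to the other members of the same cluster
        a = dists[i, same].sum() / (n_same - 1)
        # Smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

def silhouette_difference(X, labels, group_a, group_b):
    """Mean silhouette of group_a minus mean silhouette of group_b (assumed order)."""
    scores = silhouette_per_sample(X, labels)
    return scores[np.asarray(group_a, dtype=bool)].mean() - scores[np.asarray(group_b, dtype=bool)].mean()

X = np.array([[0.0], [1.0], [4.0], [6.0]])
labels = np.array([0, 0, 1, 1])
group_a = np.array([True, False, True, False])
group_b = ~group_a

diff = silhouette_difference(X, labels, group_a, group_b)
print(diff)  # slightly negative: group_a is clustered a little less tightly
```

The sketch requires at least two clusters, since the silhouette score is undefined otherwise.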
Equality of Opportunity Metrics
Cluster Balance: Given a clustering and a protected attribute, the cluster balance is the minimum, over all groups and clusters, of the ratio of that group's representation in that cluster to its representation in the data overall.
A value of 1 is desired, which occurs when every cluster has exactly the same group representation as the data. Lower values imply the existence of clusters where either group_a or group_b is underrepresented.
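\[\text{Cluster Balance} = \min_{g,c} \frac{N_{g,c} / N_{c}}{N_{g} / N}\]
where \(N_{g,c}\) is the number of members of group \(g\) in cluster \(c\), \(N_{c}\) is the total number of members in cluster \(c\), \(N_{g}\) is the total number of members of group \(g\), and \(N\) is the total number of samples.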
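One way to read the verbal definition above is sketched below. The function name `cluster_balance`, the boolean-mask inputs, and the assumption that every sample belongs to exactly one of the two groups are illustrative choices, not a statement of how any particular library implements the metric.

```python
import numpy as np

def cluster_balance(labels, group_a, group_b):
    """Minimum over groups and clusters of (share in cluster) / (share overall).

    Hypothetical helper: assumes each sample is in exactly one of the two groups.
    """
    labels = np.asarray(labels)
    groups = [np.asarray(group_a, dtype=bool), np.asarray(group_b, dtype=bool)]
    n_total = sum(g.sum() for g in groups)
    ratios = []
    for c in np.unique(labels):
        in_c = labels == c
        n_c = sum((g & in_c).sum() for g in groups)  # cluster size
        for g in groups:
            share_in_cluster = (g & in_c).sum() / n_c
            share_overall = g.sum() / n_total
            ratios.append(share_in_cluster / share_overall)
    return float(min(ratios))

labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
group_a = np.array([True, True, True, False, True, False, False, False])
group_b = ~group_a

balance = cluster_balance(labels, group_a, group_b)
print(balance)  # 0.5: each group is half as present in one cluster as it is overall
```

Here each group makes up half of the data but only a quarter of one of the clusters, which gives the ratio 0.25 / 0.5.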
Minimum Cluster Ratio: Given a clustering and a protected attribute, the minimum cluster ratio is the minimum over all clusters of the ratio of the number of group_a members to the number of group_b members.
A value of 1 is desired, which occurs when all clusters are perfectly balanced. Low values imply the existence of clusters where group_a has fewer members than group_b.
\[\text{Minimum Cluster Ratio} = \min_{c} \frac{N_{a,c}}{N_{b,c}}\]
where \(N_{g,c}\) is the number of members of group \(g\) in cluster \(c\).
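A minimal sketch of this definition is shown below. The name `min_cluster_ratio` and the boolean-mask inputs are assumptions for the example, and so is the choice to treat a cluster with no group_b members as having an infinite ratio.

```python
import numpy as np

def min_cluster_ratio(labels, group_a, group_b):
    """Minimum over clusters of (# group_a members) / (# group_b members)."""
    labels = np.asarray(labels)
    group_a = np.asarray(group_a, dtype=bool)
    group_b = np.asarray(group_b, dtype=bool)
    ratios = []
    for c in np.unique(labels):
        in_c = labels == c
        n_a = int((group_a & in_c).sum())
        n_b = int((group_b & in_c).sum())
        # Assumption: a cluster without group_b members counts as an infinite ratio
        ratios.append(n_a / n_b if n_b > 0 else float("inf"))
    return min(ratios)

labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
group_a = np.array([True, True, False, False, True, True, False, False, False, False])
group_b = ~group_a

ratio = min_cluster_ratio(labels, group_a, group_b)
print(ratio)  # 0.5: cluster 1 holds two group_a members and four group_b members
```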
Cluster Distribution Total Variation: This metric computes the distribution of group a and group b across clusters. It then outputs the total variation distance between these distributions.
A value of 0 is desired. That indicates that both groups are distributed similarly amongst the clusters. The metric ranges between 0 and 1, with higher values indicating the groups are distributed in very different ways.
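\[\text{Total Variation} = \frac{1}{2} \sum_{c} \left| cluster_{dist_{a}}(c) - cluster_{dist_{b}}(c) \right|\]
where \(cluster_{dist_{g}}\) is the distribution of group \(g\) across clusters.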
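The computation described above can be sketched as follows, using the usual definition of total variation distance between discrete distributions (half the sum of absolute differences). The helper names `cluster_distribution` and `cluster_dist_tv` and the boolean-mask inputs are hypothetical.

```python
import numpy as np

def cluster_distribution(labels, group, clusters):
    """Fraction of the group's members that falls in each cluster."""
    in_group = np.asarray(labels)[np.asarray(group, dtype=bool)]
    counts = np.array([(in_group == c).sum() for c in clusters], dtype=float)
    return counts / counts.sum()

def cluster_dist_tv(labels, group_a, group_b):
    """Total variation distance between the two groups' cluster distributions."""
    clusters = np.unique(labels)
    p_a = cluster_distribution(labels, group_a, clusters)
    p_b = cluster_distribution(labels, group_b, clusters)
    return float(0.5 * np.abs(p_a - p_b).sum())

labels = np.array([0, 0, 1, 1, 0, 1, 1, 1])
group_a = np.array([True, True, True, True, False, False, False, False])
group_b = ~group_a

tv = cluster_dist_tv(labels, group_a, group_b)
print(tv)  # 0.25: group_a is split (0.5, 0.5), group_b is split (0.25, 0.75)
```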
Cluster Distribution KL Div: This metric computes the distribution of group_a and group_b membership across the clusters. It then returns the KL divergence from the distribution of group_a to the distribution of group_b.
A value of 0 is desired. That indicates that both groups are distributed similarly amongst the clusters. Higher values indicate the distributions of both groups amongst the clusters differ more.
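\[D_{KL}\left(cluster_{dist_{a}} \,\|\, cluster_{dist_{b}}\right) = \sum_{c} cluster_{dist_{a}}(c) \log \frac{cluster_{dist_{a}}(c)}{cluster_{dist_{b}}(c)}\]
where \(cluster_{dist_{g}}\) is the distribution of group \(g\) across clusters.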
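A hedged sketch of this metric follows. It assumes the divergence is taken with the group_a distribution as the first argument and uses the natural logarithm; both choices, as well as the helper names `cluster_distribution` and `cluster_dist_kl`, are assumptions made for illustration.

```python
import numpy as np

def cluster_distribution(labels, group, clusters):
    """Fraction of the group's members that falls in each cluster."""
    in_group = np.asarray(labels)[np.asarray(group, dtype=bool)]
    counts = np.array([(in_group == c).sum() for c in clusters], dtype=float)
    return counts / counts.sum()

def cluster_dist_kl(labels, group_a, group_b):
    """KL divergence of group_a's cluster distribution relative to group_b's.

    Assumed direction: sum over clusters of p_a * log(p_a / p_b), natural log.
    """
    clusters = np.unique(labels)
    p_a = cluster_distribution(labels, group_a, clusters)
    p_b = cluster_distribution(labels, group_b, clusters)
    support = p_a > 0  # clusters with p_a = 0 contribute nothing to the sum
    with np.errstate(divide="ignore"):
        # If p_b is 0 where p_a > 0, the result is infinite
        return float(np.sum(p_a[support] * np.log(p_a[support] / p_b[support])))

labels = np.array([0, 0, 1, 1, 0, 1, 1, 1])
group_a = np.array([True, True, True, True, False, False, False, False])
group_b = ~group_a

kl = cluster_dist_kl(labels, group_a, group_b)
print(kl)  # small positive value; 0 would mean identical distributions
```

Unlike the total variation distance, this quantity is not symmetric: swapping group_a and group_b generally changes the result.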