12 June 2022

I've seen posts discussing clustering on labelled datasets. Below is an approach I'm considering. In short, I would cluster "base data" and "XAI-calculated" data separately to see whether the XAI-calculated data discriminates better. Do the steps below make sense? My specific questions are listed after the steps.

Besides whether the overall flow below makes sense, my key question is the very last one at the bottom.

This last question asks which statistical test is appropriate for my hypothesis that XAI (Shapley) values discriminate the data better than the base data alone. The metrics I would use are summary metrics based on clustering. Would I still be able to perform a statistical test if the metrics are aggregate rather than observation-specific?
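One possible way around the aggregate-metric issue, sketched here under assumptions rather than as a settled answer: if the whole clustering procedure is repeated on several resamples (bootstrap samples or cross-validation folds), each repetition yields one metric value per representation, and those paired values can be compared. The sketch below uses a sign-flip permutation test as a non-parametric stand-in for a paired t-test. The function name `paired_permutation_test` and the score lists are hypothetical; the numbers are made-up placeholders, not results.

```python
import random
from statistics import mean


def paired_permutation_test(a, b, n_permutations=10000, seed=0):
    """Two-sided sign-flip permutation test on paired differences a[i] - b[i].

    Returns (mean difference, approximate p-value).
    """
    if len(a) != len(b):
        raise ValueError("a and b must have the same length (paired values)")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return mean(diffs), (extreme + 1) / (n_permutations + 1)


# Placeholder values only: one homogeneity score per resample, for each representation.
shap_scores = [0.62, 0.58, 0.65, 0.60, 0.63, 0.59, 0.64, 0.61, 0.66, 0.57]
base_scores = [0.41, 0.44, 0.39, 0.45, 0.40, 0.43, 0.38, 0.42, 0.37, 0.46]

mean_diff, p_value = paired_permutation_test(shap_scores, base_scores)
print(f"mean difference = {mean_diff:.3f}, p = {p_value:.4f}")
```

Resamples drawn from the same dataset are not independent, so a p-value obtained this way should be read as approximate.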

BACKGROUND OF PROCESS:

Note: raw dataset is labelled, i.e., includes a binary target outcome variable

  1. Apply a predictive machine learning model, e.g., random forest, XGBoost, or some other ensemble algorithm
  2. Apply PCA (principal component analysis) - on the model or the raw data? - to determine the top features for clustering later
  3. Calculate an XAI (explainable AI) metric on the predictive model
  4. Cluster (k-means) the XAI metrics based on the top two features identified by a Shapley summary plot - is k-means clustering OK even though the raw dataset is labelled?
  5. Cluster the raw data values from the PCA in step 2 above
  6. Compare the clustering results from steps 4 and 5 using clustering metrics (completeness, homogeneity, etc.)
  7. Apply a statistical inference test such as a t-test to see whether the differences between the XAI-generated results and the base data results are significant - this part is a bit foggy to me; I'm not sure it can be done here.

My questions and problems related to the above are the following:

    • When I apply PCA, my understanding is that it is applied to the raw data, not the predicted output. Is that right?
    • I assume it is OK to apply (unsupervised) k-means clustering to a labelled dataset if I am comparing two clusterings: one on the top two features identified by Shapley values (the XAI metric), and one on the base data using the PCA values for the top two features?
    • To test my hypothesis as to whether Shapley values are more effective at discriminating data than base data alone, I was going to perform a t-test (or non-parametric equivalent) on the clustering metrics based on Shapley values vs. clustering metrics using base data. The base data would be without any model or XAI applied ... Does this approach make sense? The clustering metrics would be summary-level, not for each observation.
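For reference, the two metrics named in the comparison step can be computed from label entropies: homogeneity is 1 - H(C|K)/H(C) and completeness is 1 - H(K|C)/H(K), where C is the true binary label and K is the cluster assignment. The sketch below is a plain-Python illustration of those definitions; the function names are hypothetical, and in practice a library implementation would normally be used instead.

```python
import math
from collections import Counter


def _entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """Conditional entropy H(target | given), computed from joint counts."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (_, g), c in joint.items()
    )


def homogeneity_completeness(true_labels, cluster_labels):
    """Return (homogeneity, completeness) for a clustering against true labels."""
    h_c = _entropy(true_labels)
    h_k = _entropy(cluster_labels)
    # By convention a score is 1.0 when the corresponding entropy is zero.
    homogeneity = (
        1.0 if h_c == 0
        else 1.0 - _conditional_entropy(true_labels, cluster_labels) / h_c
    )
    completeness = (
        1.0 if h_k == 0
        else 1.0 - _conditional_entropy(cluster_labels, true_labels) / h_k
    )
    return homogeneity, completeness


# Toy labels for illustration only: a binary outcome and a two-cluster assignment.
toy_true = [0, 0, 0, 1, 1, 1]
toy_clusters = [0, 0, 1, 1, 1, 1]
h, c = homogeneity_completeness(toy_true, toy_clusters)
print(f"homogeneity = {h:.3f}, completeness = {c:.3f}")
```

Because both scores use the true labels only for evaluation, running k-means on labelled data is consistent with this kind of comparison: the labels never enter the clustering itself.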