Model parameter analysis
Parameter sweeps of multiple outcome metrics can be performed with the function analyze_model_parameters(). As it has a large number of parameters, we demonstrate and comment on them here.
A typical use case of analyze_model_parameters() is the following:
In [1]: from gemmr.sample_analysis import *
In [2]: results = analyze_model_parameters(
   ...:     model='cca',
   ...:     estr='CCA',
   ...:     n_rep=10,
   ...:     n_perm=10,
   ...:     n_Sigmas=1,
   ...:     n_test=1000,
   ...:     pxs=(2,),
   ...:     rs=(0.9,),
   ...:     axPlusay_range=(0, 0),
   ...:     n_per_ftrs='auto',
   ...:     addons=[
   ...:         addon.weights_true_cossim,
   ...:         addon.test_scores,
   ...:         addon.scores_true_spearman,
   ...:         addon.loadings_true_pearson,
   ...:         addon.remove_weights_loadings,
   ...:         addon.remove_test_scores
   ...:     ],
   ...:     mk_test_statistics=addon.mk_test_statistics_scores,
   ...:     saved_perm_features=['between_assocs'],
   ...:     postprocessors=[
   ...:         postproc.power,
   ...:         postproc.remove_between_assocs_perm
   ...:     ],
   ...:     random_state=42,
   ...:     show_progress=True
   ...: )
In [3]: results
Out[3]:
<xarray.Dataset>
Dimensions: (Sigma_id: 1, mode: 1, n_per_ftr: 3, px: 1, r: 1, rep: 10, test_sample: 1000, x_feature: 2, x_orig_feature: 2, y_feature: 2, y_orig_feature: 2)
Coordinates:
* test_sample (test_sample) int64 0 1 2 ... 997 998 999
* y_orig_feature (y_orig_feature) int64 0 1
* r (r) float64 0.9
* Sigma_id (Sigma_id) int64 0
* n_per_ftr (n_per_ftr) int64 3 4 8
* x_orig_feature (x_orig_feature) int64 0 1
* x_feature (x_feature) int64 0 1
* y_feature (y_feature) int64 0 1
* px (px) int64 2
Dimensions without coordinates: mode, rep
Data variables:
between_assocs (px, r, Sigma_id, n_per_ftr, rep) float64 0.8429 ... 0.9005
between_covs_sample (px, r, Sigma_id, n_per_ftr, rep) float64 0.6283 ... 0.879
between_corrs_sample (px, r, Sigma_id, n_per_ftr, rep) float64 0.8429 ... 0.9005
x_weights_true_cossim (px, r, Sigma_id, n_per_ftr, rep) float64 0.9998 ... 0.9694
y_weights_true_cossim (px, r, Sigma_id, n_per_ftr, rep) float64 0.9626 ... 0.9999
x_test_scores_true_spearman (px, r, Sigma_id, n_per_ftr, rep) float64 0.9997 ... 0.9694
y_test_scores_true_spearman (px, r, Sigma_id, n_per_ftr, rep) float64 0.9563 ... 0.9999
x_test_loadings_true_pearson (px, r, Sigma_id, n_per_ftr, rep, mode) float64 1.0 ... 1.0
y_test_loadings_true_pearson (px, r, Sigma_id, n_per_ftr, rep, mode) float64 1.0 ... 1.0
x_test_crossloadings_true_pearson (px, r, Sigma_id, n_per_ftr, rep, mode) float64 1.0 ... 1.0
y_test_crossloadings_true_pearson (px, r, Sigma_id, n_per_ftr, rep, mode) float64 1.0 ... 1.0
between_assocs_true (px, r, Sigma_id, mode) float64 0.9
x_weights_true (px, r, Sigma_id, x_feature, mode) float64 -0.9404 0.34
y_weights_true (px, r, Sigma_id, y_feature, mode) float64 -0.9585 0.285
ax (px, r, Sigma_id) float64 0.0
ay (px, r, Sigma_id) float64 0.0
latent_expl_var_ratios_x (px, r, Sigma_id, mode) float64 0.5
latent_expl_var_ratios_y (px, r, Sigma_id, mode) float64 0.5
weight_selection_algorithm (px, r, Sigma_id) <U4 'qr__'
x_loadings_true (px, r, Sigma_id, x_feature, mode) float64 -0.9404 0.34
x_crossloadings_true (px, r, Sigma_id, x_feature, mode) float64 -0.8464 0.306
y_loadings_true (px, r, Sigma_id, y_feature, mode) float64 -0.9585 0.285
y_crossloadings_true (px, r, Sigma_id, y_feature, mode) float64 -0.8627 0.2565
x_test_scores_true (px, r, Sigma_id, test_sample) float64 -0.02386 ... -2.778
y_test_scores_true (px, r, Sigma_id, test_sample) float64 -0.8795 ... -2.494
py (px) int64 2
power (px, r, Sigma_id, n_per_ftr) float64 0.0 ... 0.0
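The n_per_ftr coordinate in the output above counts samples per feature rather than absolute sample sizes. Assuming, as the description of n_per_ftrs below suggests, that the total number of features is px + py, the absolute sample sizes can be recovered with a small sketch like the following. The helper name sample_sizes is hypothetical and not part of gemmr.

```python
def sample_sizes(n_per_ftrs, px, py):
    """Convert samples-per-feature values into absolute sample sizes.

    Assumption: n = n_per_ftr * (px + py), i.e. samples per feature times
    the total number of features in X and Y.
    """
    return [n_per_ftr * (px + py) for n_per_ftr in n_per_ftrs]


# Values read off the output above: n_per_ftr = 3, 4, 8 with px = py = 2
print(sample_sizes([3, 4, 8], px=2, py=2))  # [12, 16, 32]
```

Under this assumption, the sweep shown above covers datasets with 12, 16 and 32 samples.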
Here’s what all the parameters are for:
model indicates whether synthetic data for 'cca' or 'pls' should be generated.
estr specifies the estimator with which the synthetic datasets are analyzed. This can be None or 'auto', in which case an estimator corresponding to the model is used. It can also be given more specifically as 'CCA', 'PLS' or 'SparseCCA', or as an instance of an estimator class.
n_rep is the number of synthetic datasets drawn from each normal distribution.
n_perm is the number of times the rows of the \(Y\) dataset are permuted. For each permutation the resulting dataset is analyzed in exactly the same way as the unpermuted dataset. As the permutations destroy associations between \(X\) and \(Y\), null distributions of quantities can be obtained in this way. Specifically, the permutations are required to calculate statistical power. We suggest using at least 1000 permutations. Note also that the total computational cost depends heavily on n_perm.
n_Sigmas specifies how many normal distributions (more specifically: joint covariance matrices) are set up. If more than 1 is used, they differ in the within-set variance spectrum (see axPlusay_range) and in the direction of the true weight vectors relative to the principal component axes. For CCA, when axPlusay_range=(0, 0), given the number of features and true correlation, all covariance matrices are identical, so that a small number for n_Sigmas should be sufficient to explore random fluctuations.
n_test is the sample size used for a separate test dataset drawn from the same normal distribution as the synthetic dataset analyzed. Each draw from the normal distributions results in data for different “subjects”, i.e. the rows of the generated data matrices have independent identities across repetitions. Some add-on functions that intend to compare generated data across repetitions therefore use a common test dataset.
pxs is an iterable specifying the number of features for dataset \(X\). The number of features for dataset \(Y\) is assumed to be identical by default, but see the argument py of ccapwr.sample_analysis.analyzers.analyze_model_parameters().
rs is an iterable specifying the assumed true correlations between datasets.
axPlusay_range is a tuple specifying the minimum and maximum value for \(a_x+a_y\). \(a_x\) and \(a_y\) are, respectively, the decay constants for the power laws describing the within-set variance spectrum for datasets \(X\) and \(Y\). Each time a covariance matrix is set up, values for \(a_x+a_y\) are drawn uniformly within this range. For CCA use axPlusay_range=(0, 0).
n_per_ftrs is an iterable giving the number of samples per total number of features (i.e. the number of features in \(X\) plus the number of features in \(Y\)) to use. It can also be set to 'auto', in which case a crude experience-based heuristic is used to choose the set of numbers.
addons is a list of add-on functions that allow arbitrary analyses to be run on each synthetic dataset after it has been fitted. A number of such functions are provided in module addon.
mk_test_statistics is a callable object providing statistics of the test dataset that are made available to all add-on functions.
saved_perm_features specifies which outcomes are saved for permuted datasets. Each permuted dataset is analyzed in exactly the same way as the unpermuted dataset, but if only a subset of the outcomes is of interest, these can be specified here.
postprocessors is a list of functions that are called after the loop over all other parameters has finished. For example, statistical power can be calculated with this mechanism. A number of such functions are provided in module postproc.
random_state must be set to distinct values if analyze_model_parameters is called multiple times to explore the variability across covariance matrices.
show_progress shows progress bars for the loops over parameters if set to True.
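To make the role of n_perm and n_rep more concrete, here is a minimal, generic sketch of a permutation-based power estimate. It is not gemmr's implementation: it assumes a single feature in each of \(X\) and \(Y\), so that the absolute Pearson correlation stands in for the between-set association, and it assumes the common p-value definition (1 + number of permuted statistics at least as large as the observed one) / (1 + n_perm). All function names are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def draw_dataset(n, r, rng):
    """Draw n pairs (x, y) of standard normal variables with correlation r."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        xs.append(x)
        ys.append(r * x + (1 - r ** 2) ** 0.5 * rng.gauss(0, 1))
    return xs, ys


def permutation_p_value(xs, ys, n_perm, rng):
    """P-value of |corr(x, y)| against a null built by permuting the rows of y."""
    observed = abs(pearson(xs, ys))
    n_extreme = 0
    for _ in range(n_perm):
        ys_perm = rng.sample(ys, len(ys))  # shuffled copy breaks the x-y pairing
        if abs(pearson(xs, ys_perm)) >= observed:
            n_extreme += 1
    return (1 + n_extreme) / (1 + n_perm)


def estimate_power(n, r, n_rep, n_perm, alpha=0.05, seed=0):
    """Fraction of n_rep synthetic datasets with a significant association."""
    rng = random.Random(seed)
    n_significant = 0
    for _ in range(n_rep):
        xs, ys = draw_dataset(n, r, rng)
        if permutation_p_value(xs, ys, n_perm, rng) < alpha:
            n_significant += 1
    return n_significant / n_rep


# With n_perm=10 the smallest attainable p-value is 1/11 > 0.05,
# so no dataset can be significant at alpha=0.05.
print(estimate_power(n=20, r=0.9, n_rep=5, n_perm=10))  # 0.0
print(estimate_power(n=30, r=0.9, n_rep=10, n_perm=99))
```

In this sketch, the resolution of the p-value is limited to 1 / (n_perm + 1), which illustrates why a small n_perm is only suitable for quick demonstrations and why a large number of permutations is advisable for actual power estimates. The inner loop over permutations also shows why the computational cost grows with n_perm.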