Model parameter analysis
Parameter sweeps of multiple outcome metrics can be performed with the
function analyze_model_parameters(). As it has a large number of
parameters, we demonstrate and comment on them here.
A typical use case of analyze_model_parameters() is the following:
In [1]: from gemmr.sample_analysis import *
In [2]: results = analyze_model_parameters(
...: model='cca',
...: estr='CCA',
...: n_rep=10,
...: n_perm=10,
...: n_Sigmas=1,
...: n_test=1000,
...: pxs=(2,),
...: rs=(0.9,),
...: powerlaw_decay=('random_sum', 0, 0),
...: n_per_ftrs='auto',
...: addons=[
...: addon.weights_true_cossim,
...: addon.test_scores,
...: addon.test_scores_true_pearson,
...: addon.loadings_true_pearson,
...: addon.remove_weights_loadings,
...: addon.remove_test_scores
...: ],
...: mk_test_statistics=addon.mk_test_statistics_scores,
...: saved_perm_features=['between_assocs'],
...: postprocessors=[
...: postproc.power,
...: postproc.remove_between_assocs_perm
...: ],
...: random_state=42,
...: show_progress=True
...: )
...:
In [3]: results
Out[3]:
<xarray.Dataset>
Dimensions: (px: 1, r: 1, Sigma_id: 1, n_per_ftr: 6,
rep: 10, x_feature: 2, y_feature: 2,
test_sample: 1000, mode: 1)
Coordinates:
* x_feature (x_feature) int64 0 1
* y_feature (y_feature) int64 0 1
* test_sample (test_sample) int64 0 1 2 3 ... 997 998 999
* n_per_ftr (n_per_ftr) int64 3 4 8 16 32 64
* Sigma_id (Sigma_id) int64 0
* r (r) float64 0.9
* px (px) int64 2
Dimensions without coordinates: rep, mode
Data variables: (12/25)
between_assocs (px, r, Sigma_id, n_per_ftr, rep) float64 0...
between_covs_sample (px, r, Sigma_id, n_per_ftr, rep) float64 0...
between_corrs_sample (px, r, Sigma_id, n_per_ftr, rep) float64 0...
x_weights_true_cossim (px, r, Sigma_id, n_per_ftr, rep) float64 1...
y_weights_true_cossim (px, r, Sigma_id, n_per_ftr, rep) float64 0...
x_test_scores_true_pearson (px, r, Sigma_id, n_per_ftr, rep) float64 1...
... ...
y_loadings_true (px, r, Sigma_id, y_feature, mode) float64 ...
y_crossloadings_true (px, r, Sigma_id, y_feature, mode) float64 ...
x_test_scores_true (px, r, Sigma_id, test_sample) float64 -1.6...
y_test_scores_true (px, r, Sigma_id, test_sample) float64 -1.4...
power (px, r, Sigma_id, n_per_ftr) float64 0.0 .....
py (px) int64 2
Attributes:
model: cca
estr: SVDCCA(calc_loadings=True, cov_out_of_bounds='raise')
powerlaw_decay: ('random_sum', 0, 0)
created: 2023-12-04 01:54:28.914602
gemmr_version: 0+unknown
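The n_per_ftr coordinate in this output is expressed in samples per feature, not in samples. As a minimal sketch, assuming the total sample size is simply n_per_ftr times the total number of features (the number of features in \(X\) plus the number in \(Y\)), the values can be converted as follows. The helper total_sample_sizes is hypothetical and not part of gemmr.

```python
def total_sample_sizes(n_per_ftrs, px, py=None):
    """Convert samples-per-feature values into total sample sizes.

    Hypothetical helper, not part of gemmr. Assumes
    n = n_per_ftr * (px + py), with py defaulting to px.
    """
    if py is None:
        py = px
    return [n_per_ftr * (px + py) for n_per_ftr in n_per_ftrs]


# Values of the n_per_ftr coordinate shown in the output, with px = py = 2
sizes = total_sample_sizes([3, 4, 8, 16, 32, 64], px=2)
print(sizes)  # [12, 16, 32, 64, 128, 256]
```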
Here’s what all the parameters are for:
model indicates whether synthetic data for 'cca' or 'pls' should be generated.
estr specifies the estimator with which the synthetic datasets are analyzed. This can be None or 'auto', in which case an estimator corresponding to the model is used. It can also be given more specifically as 'CCA', 'PLS' or 'SparseCCA', or as an instance of an estimator class.
n_rep is the number of synthetic datasets drawn from each normal distribution.
n_perm is the number of times the rows of the \(Y\) dataset are permuted. For each permutation the resulting dataset is analyzed in exactly the same way as the unpermuted dataset. As the permutations destroy associations between \(X\) and \(Y\), null distributions of quantities can be obtained in this way. Specifically, the permutations are required to calculate statistical power. We suggest using at least 1000 permutations. Note also that the total computational cost depends heavily on n_perm.
n_Sigmas specifies how many normal distributions (more specifically: joint covariance matrices) are set up. If more than 1 is used, they differ in the within-set variance spectrum (see axPlusay_range) and in the direction of the true weight vectors relative to the principal component axes. For CCA, when axPlusay_range=(0,0), given the number of features and true correlation, all covariance matrices are identical, so that a small number for n_Sigmas should be sufficient to explore random fluctuations.
n_test is the sample size used for a separate test dataset drawn from the same normal distribution as the synthetic dataset analyzed. Each draw from the normal distributions results in data for different “subjects”, i.e. the rows of the generated data matrices have independent identities across repetitions. Some add-on functions that intend to compare generated data across repetitions therefore use a common test dataset.
pxs is an iterable specifying the number of features for dataset \(X\). The number of features for dataset \(Y\) is assumed to be identical by default, but see argument py of ccapwr.sample_analysis.analyzers.analyze_model_parameters().
rs is an iterable specifying the assumed true correlations between datasets.
powerlaw_decay is a tuple specifying the minimum and maximum value for \(a_x+a_y\). \(a_x\) and \(a_y\) are, respectively, the decay constants for the power laws describing the within-set variance spectrum for datasets \(X\) and \(Y\). Each time a covariance matrix is set up, values for \(a_x+a_y\) are drawn uniformly within this range.
n_per_ftrs is an iterable giving the number of samples per total number of features (i.e. the number of features in \(X\) plus the number of features in \(Y\)) to use. It can also be set to 'auto', in which case a crude experience-based heuristic is used to choose the set of numbers.
addons is a list of add-on functions that allow running arbitrary analyses on each synthetic dataset after it has been fitted. A number of such functions are provided in module addon.
mk_test_statistics is a callable object providing statistics of the test dataset that are made available to all add-on functions.
saved_perm_features specifies which outcomes are saved for permuted datasets. Each permuted dataset is analyzed in exactly the same way as the unpermuted dataset, but if only a subset of the outcomes is of interest, these can be specified here.
postprocessors is a list of functions that are called after the loop over all other parameters has finished. For example, statistical power can be calculated with this mechanism. A number of such functions are provided in module postproc.
random_state must be set to distinct values if analyze_model_parameters is called multiple times to explore the variability across covariance matrices.
show_progress shows progress bars for the loops over parameters if set to True.
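To illustrate how n_rep and n_perm interact when estimating power, here is a minimal, self-contained sketch of permutation-based power estimation. It is not gemmr's implementation. It assumes one feature per dataset (so the absolute Pearson correlation stands in for the canonical correlation), a p-value defined as (number of permuted statistics at least as large as the observed one + 1) / (n_perm + 1), and a significance level alpha of 0.05. All function names are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def draw_dataset(rng, n, r):
    """Draw n samples (x, y) from a bivariate normal with true correlation r."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [r * xi + (1 - r ** 2) ** 0.5 * rng.gauss(0, 1) for xi in x]
    return x, y


def permutation_p_value(rng, x, y, n_perm):
    """Permute the rows of y to build a null distribution of |correlation|."""
    observed = abs(pearson(x, y))
    y_perm = list(y)
    n_exceed = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # destroys the association between x and y
        if abs(pearson(x, y_perm)) >= observed:
            n_exceed += 1
    return (n_exceed + 1) / (n_perm + 1)


def estimate_power(n, r, n_rep, n_perm, alpha=0.05, seed=42):
    """Fraction of n_rep synthetic datasets with a significant association."""
    rng = random.Random(seed)
    n_significant = 0
    for _ in range(n_rep):
        x, y = draw_dataset(rng, n, r)
        if permutation_p_value(rng, x, y, n_perm) < alpha:
            n_significant += 1
    return n_significant / n_rep


power_many_perms = estimate_power(n=50, r=0.9, n_rep=20, n_perm=99)
power_few_perms = estimate_power(n=50, r=0.9, n_rep=5, n_perm=10)
```

Under these assumptions the smallest attainable p-value is 1 / (n_perm + 1). With n_perm=10 that is 1/11, which is above 0.05, so power_few_perms is zero regardless of the strength of the association. This is one reason to use many permutations. The cost grows with n_rep times n_perm, because every permuted dataset is analyzed like the original one.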