Quickstart

I want to use CCA or PLS. How many samples are required?

Simply import gemmr

In [1]: from gemmr import *

Then, for CCA, only the number of features in each of the two datasets needs to be specified, e.g.

In [2]: cca_sample_size(100, 50)
Out[2]: {0.1: 83427, 0.3: 6562, 0.5: 2012}

The result is a dictionary whose keys indicate assumed ground-truth correlations and whose values give the corresponding sample-size estimates. For PLS, in addition to the number of features in each dataset, the power-law decay constants of the within-set principal component spectra (one per dataset) need to be specified:

In [3]: pls_sample_size(100, 50, -1.5, -.5)
Out[3]: {0.1: 236437, 0.3: 32769, 0.5: 13074}
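
To use an estimate programmatically, index the returned dictionary with the assumed ground-truth correlation (the keys are floats, as shown above):

    n_required = cca_sample_size(100, 50)[0.3]  # 6562, per the output above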

Note

Required sample sizes are calculated to obtain at least 90% power and less than 10% error in a number of other metrics. See the [gemmr] publication for more details.

More use cases and options of the sample size functions are discussed in Sample size calculation.

How can I generate synthetic data for CCA or PLS?

This functionality is provided by the module generative_model and requires two steps:

In [4]: from gemmr.generative_model import setup_model, generate_data

First, a model needs to be specified. The required parameters are:

  • the number of features in X and Y
  • the assumed true correlation between scores in X and Y
  • the power-law exponents describing the within-set principal component spectra of X and Y
In [5]: px, py = 3, 5

In [6]: r_between = 0.3

In [7]: ax, ay = -1, -.5

In [8]: Sigma = setup_model('cca', px=px, py=py, ax=ax, ay=ay, r_between=r_between)

Analogously, if a PLS model is desired, the first argument becomes 'pls'.
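
For example, the PLS analogue of the call above would be:

    Sigma_pls = setup_model('pls', px=px, py=py, ax=ax, ay=ay, r_between=r_between)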

Second, given the model, data can be drawn from the multivariate normal distribution with covariance matrix Sigma:

In [9]: X, Y = generate_data(Sigma, px, n=5000)

In [10]: X.shape, Y.shape
Out[10]: ((5000, 3), (5000, 5))
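
As a quick sanity check (not part of the gemmr API, and assuming Sigma is returned as a plain (px + py) x (px + py) NumPy array), the empirical covariance of the stacked data should approximate Sigma for large n:

    import numpy as np

    # rows of the stacked array are observations, hence rowvar=False
    Sigma_hat = np.cov(np.hstack([X, Y]), rowvar=False)
    print(np.abs(Sigma_hat - Sigma).max())  # small for n=5000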

See the API reference for generative_model.setup_model() and generative_model.generate_data() for more details.

How do the provided CCA or PLS estimators work?

Assume two data arrays X and Y are given and are to be analyzed with CCA or PLS. The provided estimators work like those in sklearn. For example, to perform a CCA:

In [11]: from gemmr.estimators import SVDCCA

In [12]: cca = SVDCCA(n_components=1)

In [13]: cca.fit(X, Y)
Out[13]: SVDCCA()

After fitting, several attributes become available. Estimated canonical correlations are stored in

In [14]: cca.corrs_
Out[14]: array([0.30308498, 0.04694188, 0.01331049])

weight (rotation) vectors in

In [15]: cca.x_rotations_
Out[15]: 
array([[0.18886894],
       [0.68264445],
       [0.70592144]])

and analogously in cca.y_rotations_. The attributes x_scores_ and y_scores_ provide the in-sample scores (assuming matplotlib.pyplot has been imported as plt):

In [16]: plt.scatter(cca.x_scores_, cca.y_scores_, s=1)
Out[16]: <matplotlib.collections.PathCollection at 0x7f82d8e3b4e0>
[Figure: scatter plot of x_scores_ vs. y_scores_]

SVDPLS works analogously, but note that it finds maximal covariances instead of correlations, and correspondingly has an attribute covs_.
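
A minimal sketch, assuming SVDPLS accepts the same constructor arguments as SVDCCA:

    from gemmr.estimators import SVDPLS

    pls = SVDPLS(n_components=1)  # n_components assumed analogous to SVDCCA
    pls.fit(X, Y)
    pls.covs_  # estimated covariances, the analogue of cca.corrs_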

For more information see the reference pages for estimators.SVDCCA and estimators.SVDPLS.

A sparse CCA estimator, based on the R-package PMA, is implemented as estimators.SparseCCA.
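
It follows the same fit pattern; a minimal sketch, with default hyperparameters assumed (and note that, being based on the R package PMA, it may require an R installation):

    from gemmr.estimators import SparseCCA

    scca = SparseCCA()  # default hyperparameters assumed; see the API reference
    scca.fit(X, Y)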

How can I investigate parameter dependencies of CCA or PLS?

This can be done with the function sample_analysis.analyze_model_parameters(). A basic use case is shown here:

In [17]: from gemmr.sample_analysis import *

In [18]: results = analyze_model_parameters(
   ....:     'cca',
   ....:     pxs=(2, 5), rs=(.3, .5), n_per_ftrs=(2, 10, 30, 100),
   ....:     n_rep=10, n_perm=1,
   ....: )
   ....: 

In [19]: results
Out[19]: 
<xarray.Dataset>
Dimensions:                     (Sigma_id: 1, mode: 1, n_per_ftr: 4, perm: 1, px: 2, r: 2, rep: 10, x_feature: 5, x_orig_feature: 5, y_feature: 5, y_orig_feature: 5)
Coordinates:
  * y_orig_feature              (y_orig_feature) int64 0 1 2 3 4
  * r                           (r) float64 0.3 0.5
  * Sigma_id                    (Sigma_id) int64 0
  * n_per_ftr                   (n_per_ftr) int64 2 10 30 100
  * x_orig_feature              (x_orig_feature) int64 0 1 2 3 4
  * x_feature                   (x_feature) int64 0 1 2 3 4
  * y_feature                   (y_feature) int64 0 1 2 3 4
  * px                          (px) int64 2 5
Dimensions without coordinates: mode, perm, rep
Data variables:
    between_assocs              (px, r, Sigma_id, n_per_ftr, rep) float64 0.5856 ... 0.4817
    between_covs_sample         (px, r, Sigma_id, n_per_ftr, rep) float64 0.2379 ... 0.3608
    between_corrs_sample        (px, r, Sigma_id, n_per_ftr, rep) float64 0.5856 ... 0.4817
    x_weights                   (px, r, Sigma_id, n_per_ftr, rep, x_feature, mode) float64 -0.7755 ... 0.5364
    y_weights                   (px, r, Sigma_id, n_per_ftr, rep, y_feature, mode) float64 -0.9897 ... -0.6565
    x_loadings                  (px, r, Sigma_id, n_per_ftr, rep, x_orig_feature, mode) float64 -0.7337 ... 0.4976
    y_loadings                  (px, r, Sigma_id, n_per_ftr, rep, y_orig_feature, mode) float64 -0.9197 ... -0.6245
    between_assocs_perm         (px, r, Sigma_id, n_per_ftr, rep, perm) float64 0.8095 ... 0.1353
    between_covs_sample_perm    (px, r, Sigma_id, n_per_ftr, rep, perm) float64 0.3596 ... 0.1024
    between_corrs_sample_perm   (px, r, Sigma_id, n_per_ftr, rep, perm) float64 0.8095 ... 0.1353
    x_weights_perm              (px, r, Sigma_id, n_per_ftr, rep, perm, x_feature, mode) float64 -0.9788 ... -0.3067
    y_weights_perm              (px, r, Sigma_id, n_per_ftr, rep, perm, y_feature, mode) float64 -0.9975 ... -0.4528
    x_loadings_perm             (px, r, Sigma_id, n_per_ftr, rep, perm, x_orig_feature, mode) float64 -0.9671 ... -0.2974
    y_loadings_perm             (px, r, Sigma_id, n_per_ftr, rep, perm, y_orig_feature, mode) float64 -0.9882 ... -0.4158
    between_assocs_true         (px, r, Sigma_id, mode) float64 0.3 0.5 0.3 0.5
    x_weights_true              (px, r, Sigma_id, x_feature, mode) float64 -0.4642 ... 0.5642
    y_weights_true              (px, r, Sigma_id, y_feature, mode) float64 -0.9942 ... -0.672
    ax                          (px, r, Sigma_id) float64 -0.6454 ... -0.2939
    ay                          (px, r, Sigma_id) float64 -0.257 ... -0.1957
    latent_expl_var_ratios_x    (px, r, Sigma_id, mode) float64 0.4374 ... 0.1808
    latent_expl_var_ratios_y    (px, r, Sigma_id, mode) float64 0.5434 ... 0.1924
    weight_selection_algorithm  (px, r, Sigma_id) <U4 'qr__' 'qr__' ... 'qr__'
    x_loadings_true             (px, r, Sigma_id, x_feature, mode) float64 -0.5482 ... 0.5354
    x_crossloadings_true        (px, r, Sigma_id, x_feature, mode) float64 -0.1645 ... 0.2677
    y_loadings_true             (px, r, Sigma_id, y_feature, mode) float64 -0.9951 ... -0.6408
    y_crossloadings_true        (px, r, Sigma_id, y_feature, mode) float64 -0.2985 ... -0.3204
    py                          (px) int64 2 5

The variable results contains a number of outcome metrics by default; further ones can be obtained through add-ons, specified via the addons keyword argument of sample_analysis.analyze_model_parameters().

Dependence of outcomes on, for example, sample size, can then be inspected:

In [20]: plt.loglog(results.n_per_ftr, results.between_assocs.sel(px=5, r=0.3, Sigma_id=0).mean('rep'), label='r=0.3')
Out[20]: [<matplotlib.lines.Line2D at 0x7f82d8e3bda0>]

In [21]: plt.loglog(results.n_per_ftr, results.between_assocs.sel(px=5, r=0.5, Sigma_id=0).mean('rep'), label='r=0.5')
Out[21]: [<matplotlib.lines.Line2D at 0x7f82d8e2f898>]

In [22]: plt.xlabel('samples per feature')
Out[22]: Text(0.5, 0, 'samples per feature')

In [23]: plt.ylabel('canonical correlation')
Out[23]: Text(0, 0.5, 'canonical correlation')

In [24]: plt.legend()
Out[24]: <matplotlib.legend.Legend at 0x7f82d8f22f98>

In [25]: plt.gcf().tight_layout()
[Figure: canonical correlation vs. samples per feature, for r=0.3 and r=0.5]

See Model parameter analysis for a more extensive example and the reference page for sample_analysis.analyzers.analyze_model_parameters() for more details.