Quickstart
I want to use CCA or PLS. How many samples are required?
Simply import gemmr
In [1]: from gemmr import *
Then, for CCA, only the number of features in each of the two datasets, as well as the power-law decay constants of the within-set principal component spectra, need to be specified, e.g.
In [2]: cca_sample_size(100, 50, -.8, -1.2)
Out[2]: {0.1: 189701, 0.3: 17211, 0.5: 5638}
The result is a dictionary with keys indicating assumed ground truth correlations and values giving the corresponding sample size estimate. For PLS, sample size can be calculated similarly:
In [3]: pls_sample_size(100, 50, -1.5, -.5)
Out[3]: {0.1: 331230, 0.3: 52919, 0.5: 22555}
Note
Required sample sizes are calculated to obtain at least 90% power and less than 10% error in a number of other metrics. See the [gemmr] publication for more details.
More use cases and options of the sample size functions are discussed in Sample size calculation.
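The returned dictionaries are ordinary Python dicts, so the estimates for different assumed correlations can be compared directly. For illustration, the values quoted above are hard-coded here rather than recomputed:

```python
# Sample-size estimates quoted above, hard-coded for illustration:
# keys are assumed ground-truth correlations, values required sample sizes
cca_n = {0.1: 189701, 0.3: 17211, 0.5: 5638}
pls_n = {0.1: 331230, 0.3: 52919, 0.5: 22555}

# The larger the assumed true correlation, the fewer samples are required
for r in sorted(cca_n):
    print(f"r_true={r}: CCA needs {cca_n[r]:7d}, PLS needs {pls_n[r]:7d}")
```

Note that for every assumed correlation the PLS estimate exceeds the CCA estimate here; whether that holds in general depends on the chosen parameters.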
How can I generate synthetic data for CCA or PLS?
The functionality is provided in the module generative_model and requires two steps.
In [4]: from gemmr.generative_model import GEMMR
First, a model needs to be specified. The required parameters are:
the number of features in X and Y
the assumed true correlation between scores in X and Y
the power-law exponents describing the within-set principal component spectra of X and Y
In [5]: px, py = 3, 5
In [6]: r_between = 0.3
In [7]: ax, ay = -1, -.5
In [8]: gm = GEMMR('cca', wx=px, wy=py, ax=ax, ay=ay, r_between=r_between)
Analogously, if a model for PLS is desired, the first argument becomes 'pls'.
Second, data can be drawn from the model distribution:
In [9]: X, Y = gm.generate_data(n=5000)
In [10]: X.shape, Y.shape
Out[10]: ((5000, 3), (5000, 5))
See the API reference for generative_model.setup_model() and generative_model.generate_data() for more details.
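To build intuition for what such a generative model does, here is a minimal numpy sketch, not gemmr's actual implementation: two latent scores with a prescribed correlation are embedded into the two feature spaces via fixed weight vectors, and isotropic noise is added (the noise level and the random weight vectors below are arbitrary choices made for this illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
px, py, r_between, n = 3, 5, 0.3, 5000

# Latent scores for X and Y with the prescribed between-set correlation
latent_cov = np.array([[1.0, r_between],
                       [r_between, 1.0]])
scores = rng.multivariate_normal(np.zeros(2), latent_cov, size=n)

# Arbitrary unit weight vectors embedding the latents in feature space
wx = rng.standard_normal(px)
wx /= np.linalg.norm(wx)
wy = rng.standard_normal(py)
wy /= np.linalg.norm(wy)

# Observed data: latent signal plus isotropic noise
X = np.outer(scores[:, 0], wx) + 0.5 * rng.standard_normal((n, px))
Y = np.outer(scores[:, 1], wy) + 0.5 * rng.standard_normal((n, py))

print(X.shape, Y.shape)  # (5000, 3) (5000, 5)
```

Unlike gemmr's model, this sketch does not impose power-law principal component spectra on the within-set covariances; it only captures the idea of a correlated latent pair driving both datasets.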
How do the provided CCA or PLS estimators work?
We assume two data arrays X and Y are given and are to be analyzed with CCA or PLS. The provided estimators work like those in sklearn. For example, to perform a CCA:
In [11]: from gemmr.estimators import SVDCCA
In [12]: cca = SVDCCA(n_components=1)
In [13]: cca.fit(X, Y)
Out[13]: SVDCCA()
After fitting, several attributes become available. Estimated canonical correlations are stored in
In [14]: cca.corrs_
Out[14]: array([0.27596667])
weight (rotation) vectors in
In [15]: cca.x_rotations_
Out[15]:
array([[-0.87806565],
[-0.11246735],
[-0.46513633]])
and analogously in cca.y_rotations_. The attributes x_scores_ and y_scores_ provide the in-sample scores, which can be plotted against each other (assuming matplotlib.pyplot is imported as plt):
In [16]: plt.scatter(cca.x_scores_, cca.y_scores_, s=1)
Out[16]: <matplotlib.collections.PathCollection at 0x7f224142edc0>
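To illustrate what an SVD-based CCA computes, here is a textbook numpy sketch, not gemmr's actual code: each dataset is whitened with the inverse Cholesky factor of its covariance, and the singular values of the whitened cross-covariance are the canonical correlations:

```python
import numpy as np

def svd_cca_sketch(X, Y, n_components=1):
    """Textbook CCA via SVD of the whitened cross-covariance."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    px = Xc.shape[1]
    # Inverse Cholesky factors whiten each dataset
    Kx = np.linalg.inv(np.linalg.cholesky(np.cov(Xc.T)))
    Ky = np.linalg.inv(np.linalg.cholesky(np.cov(Yc.T)))
    # Cross-covariance block of the joint covariance matrix
    Sxy = np.cov(Xc.T, Yc.T)[:px, px:]
    # Singular values of the whitened cross-covariance are the
    # canonical correlations; back-transform to get weight vectors
    U, s, Vt = np.linalg.svd(Kx @ Sxy @ Ky.T)
    return (s[:n_components],
            Kx.T @ U[:, :n_components],
            Ky.T @ Vt[:n_components].T)

# Toy data: one shared latent signal in the first column of each array
rng = np.random.default_rng(0)
z = rng.standard_normal(500)
X = rng.standard_normal((500, 3))
X[:, 0] += z
Y = rng.standard_normal((500, 2))
Y[:, 0] += z

corrs, x_rot, y_rot = svd_cca_sketch(X, Y)
print(corrs[0])
```

By construction, the first returned singular value equals the sample correlation between the projected score pairs.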
SVDPLS works analogously, but note that it finds maximal covariances instead of correlations and, correspondingly, has an attribute covs_.
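The difference can be sketched in the same spirit, again as the generic PLS-SVD construction rather than gemmr's code: the raw, unwhitened cross-covariance is decomposed, so the singular values are covariances rather than correlations:

```python
import numpy as np

def svd_pls_sketch(X, Y, n_components=1):
    """Generic PLS-SVD: SVD of the raw cross-covariance."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    # No whitening here, in contrast to CCA
    Sxy = Xc.T @ Yc / (len(Xc) - 1)
    U, s, Vt = np.linalg.svd(Sxy)
    # Singular values are covariances between the score pairs
    return s[:n_components], U[:, :n_components], Vt[:n_components].T

# Toy data with a shared latent signal, as in the CCA example
rng = np.random.default_rng(1)
z = rng.standard_normal(500)
X = rng.standard_normal((500, 3))
X[:, 0] += z
Y = rng.standard_normal((500, 2))
Y[:, 0] += z

covs, x_w, y_w = svd_pls_sketch(X, Y)
print(covs[0])
```

Because the within-set variances are not divided out, the leading singular value here is a covariance and can exceed 1 for high-variance data.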
For more information see the reference pages for estimators.SVDCCA and estimators.SVDPLS. A sparse CCA estimator, based on the R package PMA, is implemented as estimators.SparseCCA.
How can I investigate parameter dependencies of CCA or PLS?
This can be done with the function sample_analysis.analyze_model_parameters(). A basic use case is shown here:
In [17]: from gemmr.sample_analysis import *
In [18]: results = analyze_model_parameters(
....: 'cca',
....: pxs=(2, 5), rs=(.3, .5), n_per_ftrs=(2, 10, 30, 100),
....: n_rep=10, n_perm=1,
....: )
....:
In [19]: results
Out[19]:
<xarray.Dataset>
Dimensions: (x_feature: 5, y_feature: 5, n_per_ftr: 4,
Sigma_id: 1, r: 2, px: 2, rep: 10, mode: 1,
perm: 1)
Coordinates:
* x_feature (x_feature) int64 0 1 2 3 4
* y_feature (y_feature) int64 0 1 2 3 4
* n_per_ftr (n_per_ftr) int64 2 10 30 100
* Sigma_id (Sigma_id) int64 0
* r (r) float64 0.3 0.5
* px (px) int64 2 5
Dimensions without coordinates: rep, mode, perm
Data variables: (12/25)
between_assocs (px, r, Sigma_id, n_per_ftr, rep) float64 0.88...
between_covs_sample (px, r, Sigma_id, n_per_ftr, rep) float64 0.52...
between_corrs_sample (px, r, Sigma_id, n_per_ftr, rep) float64 0.88...
x_weights (px, r, Sigma_id, n_per_ftr, rep, x_feature, mode) float64 ...
y_weights (px, r, Sigma_id, n_per_ftr, rep, y_feature, mode) float64 ...
x_loadings (px, r, Sigma_id, n_per_ftr, rep, x_feature, mode) float64 ...
... ...
ay (px, r, Sigma_id) float64 -1.0 -1.0 -1.0 -1.0
x_loadings_true (px, r, Sigma_id, x_feature, mode) float64 0.9...
x_crossloadings_true (px, r, Sigma_id, x_feature, mode) float64 0.2...
y_loadings_true (px, r, Sigma_id, y_feature, mode) float64 0.5...
y_crossloadings_true (px, r, Sigma_id, y_feature, mode) float64 0.1...
py (px) int64 2 5
Attributes:
model: cca
estr: SVDCCA(calc_loadings=True, cov_out_of_bounds='raise')
powerlaw_decay: (-1, -1)
created: 2023-12-04 01:54:36.753119
gemmr_version: 0+unknown
The variable results contains a number of outcome metrics by default; further ones can be obtained through add-ons, specified via the keyword argument addons of sample_analysis.analyze_model_parameters().
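Since results is a plain xarray.Dataset, the usual xarray selection and reduction machinery applies. As a self-contained illustration, here is a toy Dataset with made-up values that mimics the layout above (not actual gemmr output):

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)

# Toy stand-in for `results`, with random values in place of real metrics
toy = xr.Dataset(
    {"between_assocs": (("px", "r", "n_per_ftr", "rep"),
                        rng.random((2, 2, 4, 10)))},
    coords={"px": [2, 5], "r": [0.3, 0.5], "n_per_ftr": [2, 10, 30, 100]},
)

# Select one parameter combination, then average across repetitions
curve = toy["between_assocs"].sel(px=5, r=0.3).mean("rep")
print(curve.shape)  # (4,): one value per n_per_ftr
```

The same .sel(...).mean('rep') pattern is what the plotting calls below use on the real results Dataset.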
Dependence of outcomes on, for example, sample size, can then be inspected:
In [20]: plt.loglog(results.n_per_ftr, results.between_assocs.sel(px=5, r=0.3, Sigma_id=0).mean('rep'), label='r=0.3')
Out[20]: [<matplotlib.lines.Line2D at 0x7f2241439850>]
In [21]: plt.loglog(results.n_per_ftr, results.between_assocs.sel(px=5, r=0.5, Sigma_id=0).mean('rep'), label='r=0.5')
Out[21]: [<matplotlib.lines.Line2D at 0x7f224070f850>]
In [22]: plt.xlabel('samples per feature')
Out[22]: Text(0.5, 0, 'samples per feature')
In [23]: plt.ylabel('canonical correlation')
Out[23]: Text(0, 0.5, 'canonical correlation')
In [24]: plt.legend()
Out[24]: <matplotlib.legend.Legend at 0x7f224070f610>
In [25]: plt.gcf().tight_layout()
See Model parameter analysis for a more extensive example and the reference page for sample_analysis.analyzers.analyze_model_parameters() for more details.