Sample size calculation

We demonstrate here the options available for the sample size calculation functions cca_sample_size() and pls_sample_size().

We first import these functions:

In [1]: from gemmr import cca_sample_size, pls_sample_size

The basic use case requires only the numbers of features in the two datasets and their principal component spectrum decay constants:

In [2]: cca_sample_size(5, 10, -0.8, -1.2)
Out[2]: {0.1: 12624, 0.3: 1145, 0.5: 375}

In [3]: pls_sample_size(5, 10, -0.5, -1.5)
Out[3]: {0.1: 12223, 0.3: 1952, 0.5: 832}
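The keys of the returned dictionaries are the assumed true correlations between the two datasets (0.1, 0.3, and 0.5 by default; see below for how to change them), and the values are the corresponding required sample sizes. As a minimal sketch of how the result might be used, the sample size for a given assumed correlation can be looked up directly:

# Keys are assumed true correlations, values are required sample sizes
sample_sizes = cca_sample_size(5, 10, -0.8, -1.2)
sample_sizes[0.3]  # 1145 in the run above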

Instead of feature counts and decay constants, the functions also accept data matrices. To demonstrate this, we first generate example datasets:

In [4]: from gemmr.data import generate_example_dataset

In [5]: Xcca, Ycca = generate_example_dataset('cca', px=5, py=10, ax=-0.8, ay=-1.2)

In [6]: Xpls, Ypls = generate_example_dataset('pls', px=5, py=10, ax=-.5, ay=-1.5, n=10000)
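Assuming the returned objects are plain data matrices with samples in rows and features in columns, their shapes can be checked before use:

# Assumed layout: samples in rows, features in columns
Xcca.shape, Ycca.shape  # expected: (n_samples, 5) and (n_samples, 10)
Xpls.shape, Ypls.shape  # n=10000 samples were requested above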

Then we can calculate sample sizes for these datasets as follows:

In [7]: cca_sample_size(Xcca, Ycca)
Out[7]: {0.1: 13083, 0.3: 1187, 0.5: 388}

In [8]: pls_sample_size(Xpls, Ypls)
Out[8]: {0.1: 11932, 0.3: 1906, 0.5: 812}

Note that only the data matrices are given as arguments, not the principal component decay constants that we needed above. That is because the decay constants are estimated from the data matrices. Correspondingly, the sample sizes here are similar, but not identical, to the ones we got above, as the estimated decay constants are only approximately equal to the true ones.
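The estimation happens internally. As an illustration of the underlying idea (not necessarily gemmr's exact procedure), a decay constant can be recovered by fitting a line to the principal component variance spectrum in log-log space:

import numpy as np

def estimate_decay_constant(X):
    # Illustrative sketch, not necessarily gemmr's exact procedure:
    # a power-law spectrum lambda_i ~ i**a is linear in log-log space,
    # so the slope of a least-squares line estimates the decay constant a
    X = X - X.mean(axis=0)  # center the data
    variances = np.linalg.svd(X, compute_uv=False) ** 2 / (len(X) - 1)
    ranks = np.arange(1, len(variances) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(variances), 1)
    return slope

estimate_decay_constant(Xcca)  # should be roughly -0.8, i.e. close to ax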

The assumed true correlations for which sample sizes are calculated can be specified as follows:

In [9]: cca_sample_size(Xcca, Ycca, rs=(.2, .7))
Out[9]: {0.2: 2878, 0.7: 186}

In [10]: pls_sample_size(Xpls, Ypls, rs=(.4, .6))
Out[10]: {0.4: 1179, 0.6: 599}
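Assuming rs accepts an arbitrary tuple of correlations in (0, 1), a finer grid can be requested in a single call:

import numpy as np

# Assumption: rs accepts any tuple of correlations in (0, 1)
rs_grid = tuple(np.round(np.arange(0.1, 0.9, 0.1), 1))
cca_sample_size(Xcca, Ycca, rs=rs_grid)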

It is also possible to specify the target power and error levels:

In [11]: cca_sample_size(5, 10, -0.8, -1.2, target_power=0.8, target_error=.5)
Loading data from subfolder 'tmp'
Out[11]: {0.1: 2730, 0.3: 300, 0.5: 107}
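As expected, relaxing the error target reduces the required sample sizes compared to In [2]. A sketch of how this trade-off could be tabulated, assuming target_error accepts arbitrary values in (0, 1):

# Assumption: target_error accepts any value in (0, 1)
for target_error in (0.1, 0.25, 0.5):
    print(target_error,
          cca_sample_size(5, 10, -0.8, -1.2,
                          target_power=0.8, target_error=target_error))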

Finally, the criterion on which the calculation is based can be specified. By default, the “combined” criterion is used, meaning that power, association strength error, weight error, score error, and loading error are considered at the same time, and the linear model predicts the maximum sample size across all these metrics. Alternatively, the calculation can be based on each of these metrics alone:

In [12]: cca_sample_size(5, 10, -0.8, -1.2, criterion='power')
Loading data from subfolder 'tmp'
Out[12]: {0.1: 2971, 0.3: 322, 0.5: 114}

In [13]: cca_sample_size(5, 10, -0.8, -1.2, criterion='association_strength')
Loading data from subfolder 'tmp'
Out[13]: {0.1: 7334, 0.3: 575, 0.5: 176}

In [14]: cca_sample_size(5, 10, -0.8, -1.2, criterion='weight')
Loading data from subfolder 'tmp'
Out[14]: {0.1: 11599, 0.3: 1063, 0.5: 350}

In [15]: cca_sample_size(5, 10, -0.8, -1.2, criterion='score')
Loading data from subfolder 'tmp'
Out[15]: {0.1: 4695, 0.3: 432, 0.5: 142}

In [16]: cca_sample_size(5, 10, -0.8, -1.2, criterion='loading')
Loading data from subfolder 'tmp'
Out[16]: {0.1: 4715, 0.3: 514, 0.5: 183}
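To see at a glance which metric drives the result, the criteria can be compared side by side. In this sketch, passing criterion='combined' explicitly is an assumption based on the description above; the other values are the ones demonstrated in In [12] through In [16]:

# Assumption: 'combined' is also accepted as an explicit criterion value
criteria = ('combined', 'power', 'association_strength',
            'weight', 'score', 'loading')
for criterion in criteria:
    print(criterion, cca_sample_size(5, 10, -0.8, -1.2, criterion=criterion))

In this example, the weight error requires the largest sample sizes among the individual metrics, consistent with the combined values in Out[2] lying close to (and slightly above) the weight-based ones.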