gemmr.sample_analysis.macros.analyze_subsampled_and_resampled

gemmr.sample_analysis.macros.analyze_subsampled_and_resampled(estr, X, Y, Xorig=None, Yorig=None, permutations=1000, ns=None, n_min_subsample=None, frac_max_subsample=0.5, n_subsample_ns=5, n_rep_subsample=100, n_perm_subsample=1000, n_test_subsample=0, cv=True, n_jobs=1, fit_params=None, random_state=0)

Analyzes the given data with the given estimator.

Specifically:

  • calculates the permutation-based p-value

  • analyzes the full sample, and its permutations

  • analyzes non-overlapping subsamples of the data
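As a rough illustration of the first step, a permutation-based p-value can be sketched with plain NumPy. This is not gemmr's implementation: the helper names `permutation_p_value` and `max_abs_cross_correlation` are hypothetical, and the test statistic is a simple stand-in for whatever association strength the chosen estimator reports.

```python
import numpy as np


def max_abs_cross_correlation(X, Y):
    """Stand-in statistic (an assumption, not gemmr's): the largest absolute
    Pearson correlation between any column of X and any column of Y."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Xc = Xc / np.linalg.norm(Xc, axis=0)
    Yc = Yc / np.linalg.norm(Yc, axis=0)
    return np.abs(Xc.T @ Yc).max()


def permutation_p_value(X, Y, statistic, n_perm=1000, random_state=0):
    """Shuffle the rows of Y to break the X-Y pairing, recompute the statistic,
    and report how often the permuted value reaches the observed one."""
    rng = np.random.default_rng(random_state)
    observed = statistic(X, Y)
    null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(len(Y))
        null[i] = statistic(X, Y[perm])
    # The +1 terms count the observed (unpermuted) sample as one permutation,
    # so the p-value can never be exactly zero.
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p_value


# Synthetic data with a built-in association between X[:, 0] and Y[:, 0]
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
Y = rng.normal(size=(200, 2))
Y[:, 0] += X[:, 0]

observed, p_value = permutation_p_value(
    X, Y, max_abs_cross_correlation, n_perm=200
)
```

The same idea applies when `permutations` is passed as an iterable: each element would then play the role of `perm` above.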

Parameters:
  • estr (sklearn-style estimator) – estimator used to analyze the data; must be compatible with the analyzers in ccapwr.sample_analysis.analyzers

  • X (np.ndarray (n_samples, n_X_features)) – dataset X

  • Y (np.ndarray (n_samples, n_Y_features)) – dataset Y

  • Xorig (np.ndarray (n_samples, n_X_features)) – X dataset of original variables (for calculating loadings)

  • Yorig (np.ndarray (n_samples, n_Y_features)) – Y dataset of original variables (for calculating loadings)

  • permutations (int or iterable) – used for calculating the p-value and for the whole-sample analysis. If int, gives the number of permutations used; if iterable, each element gives one set of permutation indices

  • ns (list of int) – sample sizes to which the data are subsampled. If None, calculated from n_min_subsample, frac_max_subsample and n_subsample_ns

  • n_min_subsample (None or int) – minimum number of samples to which the data are subsampled. If None, X.shape[1]+Y.shape[1]+2 is used. Ignored if ns is not None

  • frac_max_subsample (float between 0 and 1) – the maximum number of samples to which the data are subsampled is frac_max_subsample * len(X). Ignored if ns is not None

  • n_subsample_ns (int) – the list of sample sizes to which the data are subsampled is a np.logspace with this many entries. Ignored if ns is not None

  • n_rep_subsample (int) – number of times a subsampled dataset of a given size is generated

  • n_perm_subsample (int) – number of permutations for each subsampled dataset

  • n_test_subsample (int or 'auto') – number of subjects to use as test set in subsampled datasets. If 'auto', n_samples - max(ns) is used.

  • cv (bool) – if True run cross-validations

  • n_jobs (int or None) – number of parallel jobs (see joblib.Parallel)

  • fit_params (dict) – keyword-arguments for estr.fit

  • random_state (None, int or rng-instance) – random seed
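When ns is None, the parameter descriptions above imply a log-spaced grid of subsample sizes. The following is a minimal sketch of how such a grid might be derived, assuming the endpoints are the documented minimum and maximum and that sizes are rounded to integers; the helper name `default_subsample_sizes` is hypothetical, and the rounding and de-duplication details are assumptions rather than the library's confirmed behavior.

```python
import numpy as np


def default_subsample_sizes(n_samples, n_X_features, n_Y_features,
                            n_min_subsample=None, frac_max_subsample=0.5,
                            n_subsample_ns=5):
    """Hypothetical helper: build a log-spaced list of subsample sizes."""
    if n_min_subsample is None:
        # Documented default: X.shape[1] + Y.shape[1] + 2
        n_min_subsample = n_X_features + n_Y_features + 2
    # Documented maximum: frac_max_subsample * len(X)
    n_max_subsample = frac_max_subsample * n_samples
    if n_max_subsample < n_min_subsample:
        raise ValueError("maximum subsample size is below the minimum")
    ns = np.logspace(np.log10(n_min_subsample),
                     np.log10(n_max_subsample),
                     n_subsample_ns)
    # Assumption: round to whole samples and drop duplicates
    return [int(n) for n in np.unique(np.rint(ns))]


# 1000 samples, 5 X features, 3 Y features
ns = default_subsample_sizes(n_samples=1000, n_X_features=5, n_Y_features=3)

# With n_test_subsample='auto', the documented test-set size is
# n_samples - max(ns)
n_test_auto = 1000 - max(ns)
```

Passing an explicit ns list bypasses this calculation entirely, which is why n_min_subsample, frac_max_subsample and n_subsample_ns are ignored in that case.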

Returns:

results – with items:

  • full_sample : xr.Dataset (output of analyze_resampled)

    This also contains the p-value as a data variable

  • subsampled : xr.Dataset (output of analyze_subsampled)

Return type:

dict