gemmr.data.preprocessing.preproc_smith

gemmr.data.preprocessing.preproc_smith(fc, sm, feature_names=None, final_sm_deconfound=True, confounders=(), hcp_confounders=False, hcp_confounder_software_version=True, squared_confounders=False, hcp_data_dict_correct_pct_to_t=False)

Data preprocessing pipeline from Smith et al. (2015).

Parameters:
  • fc (np.ndarray, pd.DataFrame or xr.DataArray (n_samples, n_X_features)) – neuroimaging data matrix
  • sm (pd.DataFrame (n_samples, n_Y_features)) – behavioral and demographic data matrix. The names of the features to include and of the confounders must be column names of sm
  • feature_names (None or list-like) – names of features to use; each name must be a column in sm. If None, default feature names are used
  • final_sm_deconfound (bool) – if True, the subject measure data matrix is deconfounded again as the very last preprocessing step, as in Smith et al. (2015). In that case, however, the resulting columns of Y will NOT be principal component scores.
  • confounders (tuple of str) – column names in sm to be used as confounders. If some are not found, a warning is issued and processing continues without the missing ones.
  • hcp_confounders (bool) – if True, ‘Weight’, ‘Height’, ‘BPSystolic’, ‘BPDiastolic’, ‘HbA1C’ as well as the cube roots of ‘FS_BrainSeg_Vol’ and ‘FS_IntraCranial_Vol’ are included as confounders
  • hcp_confounder_software_version (bool) – if True and hcp_confounders is also True, the feature ‘fMRI_3T_ReconVrs’ (encoded as a dummy variable) is used as a confounder
  • squared_confounders (bool) – if True, the squares of all confounders (except software version, if used) are used as additional confounders
  • hcp_data_dict_correct_pct_to_t (bool) – concerns only feature_names from the HCP data dictionary. If True, a number of feature_names are replaced; see _check_feature_names().
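
The confounder options above can be pictured with a small NumPy sketch. This is not gemmr's implementation: the helper names build_confounders and deconfound, the toy values, the made-up version labels, and the choice to drop the first dummy level are all assumptions made for illustration. The sketch takes cube roots of a volume measure, dummy-encodes a software version label, optionally appends squares of the non-dummy confounders, and removes the confounders by least-squares residualization.

```python
import numpy as np

def build_confounders(continuous, volumes, version_labels, squared=False):
    """Hypothetical helper: assemble a confounder matrix (n_samples, n_confounders)."""
    continuous = np.asarray(continuous, dtype=float)
    # Volume-like measures enter as cube roots
    numeric = np.column_stack([continuous, np.cbrt(np.asarray(volumes, dtype=float))])
    if squared:
        # Squares are appended for the numeric confounders only, not the dummies
        numeric = np.column_stack([numeric, numeric ** 2])
    # Dummy-encode the categorical label, dropping the first level (assumption)
    labels = np.asarray(version_labels)
    levels = np.unique(labels)
    dummies = (labels[:, None] == levels[None, 1:]).astype(float)
    return np.column_stack([numeric, dummies])

def deconfound(data, confounders):
    """Hypothetical helper: residuals of `data` after regressing out confounders."""
    C = confounders - confounders.mean(axis=0)
    D = data - data.mean(axis=0)
    beta, *_ = np.linalg.lstsq(C, D, rcond=None)
    return D - C @ beta

# Toy data, invented for this example
rng = np.random.default_rng(0)
n = 50
weight = rng.normal(70, 10, size=(n, 1))
brain_vol = rng.normal(1.2e6, 1e5, size=(n, 1))
version = np.array(["v1", "v2"] * (n // 2))  # made-up labels

C = build_confounders(weight, brain_vol, version, squared=True)
Y = rng.normal(size=(n, 3)) + 0.05 * (weight - 70)
Y_res = deconfound(Y, C)
```

Here C has five columns (two numeric confounders, their squares, and one dummy), and the columns of Y_res are uncorrelated with every column of C up to numerical precision.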
Returns:

preprocessed_data – with items:

  • X : np.ndarray (n_samples, n_X_features)
    dataset X
  • Y : np.ndarray (n_samples, n_Y_features)
    dataset Y
  • X_whitened : np.ndarray (n_samples, n_X_features)
    whitened dataset X
  • Y_whitened : np.ndarray (n_samples, n_Y_features)
    whitened dataset Y
  • Y_raw : np.ndarray (n_samples, n_Y_features)
    unprocessed Y data comprising only the selected features (i.e. the matrix S4)
  • feature_names : list
    ordered list of feature names corresponding to the columns of Y
  • X_pc_axes : np.ndarray (n_X_features, n_components)
    X principal component axes

Return type:

dict
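
To make the relation between principal component axes and whitened data more concrete, the following is a minimal sketch of PCA whitening with NumPy. It is an illustration under stated assumptions, not the code gemmr runs: the helper name pca_whiten, the random input, and the number of retained components are hypothetical, and the exact steps behind the X_whitened and X_pc_axes items are not specified on this page.

```python
import numpy as np

def pca_whiten(data, n_components):
    """Hypothetical helper: PCA scores, whitened scores, and component axes."""
    centered = data - data.mean(axis=0)
    # Rows of Vt are the principal component axes, ordered by explained variance
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    axes = Vt[:n_components].T                       # (n_features, n_components)
    scores = centered @ axes                         # (n_samples, n_components)
    # Rescale each score column to unit sample variance
    whitened = scores / scores.std(axis=0, ddof=1)
    return scores, whitened, axes

# Toy data with correlated columns, invented for this example
rng = np.random.default_rng(1)
X_raw = rng.normal(size=(40, 6)) @ rng.normal(size=(6, 6))

scores, whitened, axes = pca_whiten(X_raw, n_components=3)
cov = np.cov(whitened, rowvar=False)
```

In this sketch the whitened columns have an identity covariance matrix, and the columns of axes are orthonormal.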

References

Smith et al., A positive-negative mode of population covariation links brain connectivity, demographics and behavior, Nature Neuroscience (2015)