gemmr.data.preprocessing.preproc_smith
- gemmr.data.preprocessing.preproc_smith(fc, sm, feature_names=None, final_sm_deconfound=True, confounders=(), hcp_confounders=False, hcp_confounder_software_version=True, squared_confounders=False, hcp_data_dict_correct_pct_to_t=False, include_N2=False, confounds_impute=0)
Data preprocessing pipeline from Smith et al. (2015).
- Parameters:
fc (np.ndarray, pd.DataFrame or xr.DataArray (n_samples, n_X_features)) – neuroimaging data matrix
sm (pd.DataFrame (n_samples, n_Y_features)) – behavioral and demographic data matrix. The names of the features to include and of the confounders must be column names of sm
feature_names (None, slice or list-like) – names of features to use, names must be columns in sm. If None, default (i.e. from Smith et al. 2015, applicable to HCP data) feature names are used
final_sm_deconfound (bool) – if True, the subject measure data matrix will be deconfounded again as a very last preprocessing step, as in Smith et al. (2015). In that case, however, the resulting columns of Y will NOT be principal component scores.
confounders (tuple of str) – column names in sm to be used as confounders. If some are not found, a warning is issued and the code will continue without the missing ones.
hcp_confounders (bool) – if True, ‘Weight’, ‘Height’, ‘BPSystolic’, ‘BPDiastolic’, ‘HbA1C’ as well as the cubic roots of ‘FS_BrainSeg_Vol’, ‘FS_IntraCranial_Vol’ are included as confounders
hcp_confounder_software_version (bool) – if True and hcp_confounders is also True, then the feature ‘fMRI_3T_ReconVrs’ (encoded as a dummy variable) is used as confounder
squared_confounders (bool) – if True, the squares of all confounders (except software version, if used) are used as additional confounders
hcp_data_dict_correct_pct_to_t (bool) – concerns only feature_names from HCP data dictionary. If True, a number of feature_names are replaced, see _check_feature_names().
include_N2 (bool) – if True, the data matrix, normalized by the absolute value of the mean of each feature, will be used as additional features, concatenating it horizontally to the z-scored data matrix
confounds_impute (None, 0 or "mice") – if 0, missing confound values are imputed with 0 (after an inverse normal transformation); if "mice", sklearn.impute.IterativeImputer is used; if None, no imputation is performed
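The confounder-related options above can be illustrated with a small NumPy sketch. It is not the gemmr implementation: the helper names (inverse_normal_transform, prepare_confounds, deconfound) are hypothetical, and the order of the steps, the tie handling in the rank transform and the use of ordinary least squares for deconfounding are assumptions made for illustration only.

```python
import numpy as np
from statistics import NormalDist


def inverse_normal_transform(col):
    """Rank-based inverse normal transform of a 1-D array; NaNs stay NaN."""
    col = np.asarray(col, dtype=float)
    out = np.full(col.shape, np.nan)
    observed = ~np.isnan(col)
    n = int(observed.sum())
    # Ranks 1..n of the observed values (ties broken by position, a simplification)
    ranks = np.argsort(np.argsort(col[observed], kind="stable"), kind="stable") + 1
    # Map ranks to quantiles strictly inside (0, 1), then to standard-normal scores
    quantiles = (ranks - 0.5) / n
    out[observed] = [NormalDist().inv_cdf(q) for q in quantiles]
    return out


def prepare_confounds(confounds, squared=False):
    """Transform each confound column, impute missing values with 0,
    and optionally append the squared columns (assumed order of steps)."""
    confounds = np.asarray(confounds, dtype=float)
    transformed = np.column_stack(
        [inverse_normal_transform(c) for c in confounds.T]
    )
    imputed = np.where(np.isnan(transformed), 0.0, transformed)
    if squared:
        imputed = np.hstack([imputed, imputed ** 2])
    return imputed


def deconfound(data, confounds):
    """Subtract the least-squares fit of the confounds from each data column."""
    beta, *_ = np.linalg.lstsq(confounds, data, rcond=None)
    return data - confounds @ beta
```

After the transform the observed values have mean close to 0, so imputing 0 amounts to placing a missing entry at the centre of the transformed distribution. The residuals returned by deconfound are orthogonal to every confound column.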
- Returns:
preprocessed_data – with items:
- X : np.ndarray (n_samples, n_X_features)
dataset X (principal component scores)
- Y : np.ndarray (n_samples, n_Y_features)
dataset Y (principal component scores)
- X_whitened : np.ndarray (n_samples, n_X_features)
whitened dataset X
- Y_whitened : np.ndarray (n_samples, n_Y_features)
whitened dataset Y
- Y_raw : np.ndarray (n_samples, n_Y_features)
unprocessed Y data comprising only the selected features (i.e. the matrix S4_raw)
- feature_names : list
ordered list of feature names corresponding to the columns of Y
- X_pc_axes : np.ndarray (n_X_features, n_components)
X principal component axes
- confounders : np.ndarray (n_samples, n_features)
confounder data matrix
- X_preproc : np.ndarray (n_samples, n_X_features)
preprocessed X data (not PC-ed)
- Y_preproc : np.ndarray (n_samples, n_Y_features)
preprocessed Y data (not PC-ed)
- Return type:
dict
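To make the distinction between principal component scores, principal component axes and whitened data concrete, here is a minimal NumPy sketch. It is an illustration under assumptions, not the code used by preproc_smith: the function pca_scores is hypothetical, and "whitened" is taken here to mean scores rescaled to unit variance, which may differ from the definition used inside gemmr.

```python
import numpy as np


def pca_scores(data, n_components):
    """Return PC scores, whitened scores and PC axes of a data matrix."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing variance
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    axes = Vt[:n_components].T                       # (n_features, n_components)
    scores = centered @ axes                         # (n_samples, n_components)
    whitened = scores / scores.std(axis=0, ddof=1)   # unit variance per column
    return scores, whitened, axes
```

In this sketch the columns of the scores are mutually uncorrelated, and the whitened scores additionally have an identity covariance matrix. The entries of the returned dict are accessed by key, e.g. preprocessed_data['X'] or preprocessed_data['X_pc_axes'].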
References
Smith et al., A positive-negative mode of population covariation links brain connectivity, demographics and behavior, Nature Neuroscience (2015)