Principal Component Analysis (PCA)

The nipalsPCA class carries out principal component analysis. It analyses a single data array and looks for systematic variance in the data using principal components (PCs). The methods of nipalsPCA are described below, together with examples of how to use the class.

class hoggorm.pca.nipalsPCA(arrX, numComp=None, Xstand=False, cvType=None)

This class carries out Principal Component Analysis using the NIPALS algorithm.
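For orientation, the NIPALS iteration can be sketched in a few lines of numpy. This is a minimal illustration of the general algorithm, not the code hoggorm runs internally; the function name nipals_pca, the tolerance, the iteration cap and the choice of starting vector are all assumptions made for this sketch.

```python
import numpy as np

def nipals_pca(X, num_comp, tol=1e-10, max_iter=500):
    """Hypothetical helper: NIPALS PCA on mean-centred data."""
    E = X - X.mean(axis=0)              # mean centre the columns
    scores, loadings = [], []
    for _ in range(num_comp):
        # assumption: start from the column with the largest variance
        t = E[:, np.argmax(E.var(axis=0))].copy()
        for _ in range(max_iter):
            p = E.T @ t / (t @ t)       # regress columns of E on t -> loadings
            p /= np.linalg.norm(p)      # scale loadings to unit length
            t_new = E @ p               # project rows of E on p -> scores
            converged = np.linalg.norm(t_new - t) < tol
            t = t_new
            if converged:
                break
        E = E - np.outer(t, p)          # deflate: remove the extracted component
        scores.append(t)
        loadings.append(p)
    return np.column_stack(scores), np.column_stack(loadings)

rng = np.random.default_rng(0)
X = rng.normal(size=(14, 5))            # made-up data, same shape as myData
T, P = nipals_pca(X, num_comp=2)        # T: objects x components, P: variables x components
```

Each component is extracted from what is left of the data after the previous ones, which is why the number of components can be chosen freely up to the rank of the data.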

Parameters:
  • arrX (numpy array) – A numpy array containing the data
  • numComp (int, optional) – An integer that defines how many components are to be computed
  • Xstand (boolean, optional) –

    Defines whether the variables in arrX are only mean centred or also standardised

    False : columns of arrX are mean centred (default)
    Xstand = False
    True : columns of arrX are mean centred and divided by their own standard deviation
    Xstand = True
  • cvType (list, optional) –

    The list defines the cross validation settings used when computing the PCA model. Note that if cvType is not provided, cross validation is not performed and cross validation results are not available. Choose the cross validation type from the following:

    loo : leave one out, a.k.a. full cross validation
    cvType = ["loo"]
    KFold : leave out one fold or segment
    cvType = ["KFold", numFolds]

    numFolds: int

    Number of folds or segments

    lolo : leave one label out
    cvType = ["lolo", labelsList]

    labelsList: list

    Sequence of labels. Must be the same length as the number of rows in arrX. Leaves out all objects that share the same label.

Returns:

A class that contains the PCA model and computational results

Return type:

class
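The effect of the Xstand parameter can be illustrated with plain numpy. This is only a sketch of the two pre-processing options on a made-up array; the use of the sample standard deviation (ddof=1) is an assumption of this example and should be checked against the hoggorm source if exact agreement matters.

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 30.0],
              [3.0, 50.0]])

# Xstand=False: subtract the column means only
X_centred = X - X.mean(axis=0)

# Xstand=True: centre, then divide each column by its standard deviation.
# ddof=1 (sample standard deviation) is an assumption of this sketch.
X_standardised = X_centred / X.std(axis=0, ddof=1)
```

Standardising puts all variables on a common scale, which is typically preferred when the columns are measured in different units.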

EXAMPLES

First import the hoggorm package, along with numpy, which is used in the examples below.

>>> import hoggorm as ho
>>> import numpy as np

Import your data into a numpy array.

>>> myData
array([[ 5.7291665,  3.416667 ,  3.175    ,  2.6166668,  6.2208333],
       [ 6.0749993,  2.7416666,  3.6333339,  3.3833334,  6.1708336],
       [ 6.1166663,  3.4916666,  3.5208333,  2.7125003,  6.1625004],
       ...,
       [ 6.3333335,  2.3166668,  4.1249995,  4.3541665,  6.7500005],
       [ 5.8250003,  4.8291669,  1.4958333,  1.0958334,  6.0999999],
       [ 5.6499996,  4.6624999,  1.9291668,  1.0749999,  6.0249996]])
>>> np.shape(myData)
(14, 5)

Examples of how to compute a PCA model using different settings for the input parameters.

>>> model = ho.nipalsPCA(arrX=myData, numComp=5, Xstand=False)
>>> model = ho.nipalsPCA(arrX=myData)
>>> model = ho.nipalsPCA(arrX=myData, numComp=3)
>>> model = ho.nipalsPCA(arrX=myData, Xstand=True)
>>> model = ho.nipalsPCA(arrX=myData, cvType=["loo"])
>>> model = ho.nipalsPCA(arrX=myData, cvType=["KFold", 4])
>>> model = ho.nipalsPCA(arrX=myData, cvType=["lolo", [1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7]])

Examples of how to extract results from the PCA model.

>>> scores = model.X_scores()
>>> loadings = model.X_loadings()
>>> cumulativeCalibratedExplainedVariance_allVariables = model.X_cumCalExplVar_indVar()

X_MSECV()

Returns an array holding MSECV across all variables in X acquired through cross validation after each computed component. First row is MSECV for zero components, second row for component 1, third row for component 2, etc.

X_MSECV_indVar()

Returns an array holding MSECV for each variable in X acquired through cross validation after each computed component. First row is MSECV for zero components, second row for component 1, etc.

X_MSEE()

Returns an array holding MSEE across all variables in X acquired through calibration after each computed component. First row is MSEE for zero components, second row for component 1, third row for component 2, etc.

X_MSEE_indVar()

Returns an array holding MSEE for each variable in array X acquired through calibration after each computed component. First row holds MSEE for zero components, second row for component 1, third row for component 2, etc.

X_PRESSCV()

Returns an array holding PRESSCV across all variables in X acquired through cross validation after each computed component. First row is PRESSCV for zero components, second row for component 1, third row for component 2, etc.

X_PRESSCV_indVar()

Returns an array holding PRESSCV for each individual variable in X acquired through cross validation after each computed component. First row is PRESSCV for zero components, second row for component 1, third row for component 2, etc.

X_PRESSE()

Returns an array holding PRESSE across all variables in X acquired through calibration after each computed component. First row is PRESSE for zero components, second row for component 1, third row for component 2, etc.

X_PRESSE_indVar()

Returns an array holding PRESSE for each individual variable in X acquired through calibration after each computed component. First row is PRESSE for zero components, second row for component 1, third row for component 2, etc.

X_RMSECV()

Returns an array holding RMSECV across all variables in X acquired through cross validation after each computed component. First row is RMSECV for zero components, second row for component 1, third row for component 2, etc.

X_RMSECV_indVar()

Returns an array holding RMSECV for each variable in X acquired through cross validation after each computed component. First row is RMSECV for zero components, second row for component 1, third row for component 2, etc.

X_RMSEE()

Returns an array holding RMSEE across all variables in X acquired through calibration after each computed component. First row is RMSEE for zero components, second row for component 1, third row for component 2, etc.

X_RMSEE_indVar()

Returns an array holding RMSEE for each variable in array X acquired through calibration after each computed component. First row holds RMSEE for zero components, second row for component 1, third row for component 2, etc.
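The PRESS, MSE and RMSE arrays above are related in a simple way, which the following numpy sketch illustrates on made-up residuals. The divisors are assumptions of this example (per-variable MSE as PRESS divided by the number of objects, and the total MSE as the average over variables); they show the usual definitions rather than a guaranteed match with hoggorm's internals.

```python
import numpy as np

# Hypothetical residual arrays (objects x variables) after 0, 1 and 2 components
residuals = [
    np.array([[2.0, -1.0], [-2.0, 1.0], [0.0, 0.0], [0.0, 0.0]]),
    np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]),
    np.zeros((4, 2)),
]
num_objects = residuals[0].shape[0]

# One row per number of components (row 0 = zero components),
# one column per variable
press_ind_var = np.array([(E ** 2).sum(axis=0) for E in residuals])
mse_ind_var = press_ind_var / num_objects        # assumption: divide by objects
rmse_ind_var = np.sqrt(mse_ind_var)

# Across all variables
press_total = press_ind_var.sum(axis=1)
mse_total = mse_ind_var.mean(axis=1)             # assumption: average over variables
rmse_total = np.sqrt(mse_total)
```

The same relations apply to the cross validation variants (PRESSCV, MSECV, RMSECV), with residuals taken from objects that were left out when the model was fitted.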

X_calExplVar()

Returns a list holding the calibrated explained variance for each component. First number in list is for component 1, second number for component 2, etc.

X_corrLoadings()

Returns an array holding correlation loadings of array X. First column holds correlation loadings for component 1, second column holds correlation loadings for component 2, etc.
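Correlation loadings are the correlations between each variable in X and each score vector. The sketch below computes them with numpy; the helper name corr_loadings is hypothetical, and scores from an SVD of the centred data stand in for NIPALS scores.

```python
import numpy as np

def corr_loadings(X, T):
    """Hypothetical helper: correlate each column of X with each score vector."""
    Xc = X - X.mean(axis=0)
    Tc = T - T.mean(axis=0)
    cross = Xc.T @ Tc                                   # variables x components
    norms = np.outer(np.linalg.norm(Xc, axis=0),
                     np.linalg.norm(Tc, axis=0))
    return cross / norms

rng = np.random.default_rng(1)
X = rng.normal(size=(14, 5))                            # made-up data
# Stand-in scores: SVD of the centred data instead of NIPALS
U, s, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
T = U * s
R = corr_loadings(X, T)                                 # rows: variables, columns: components
```

The squared correlation loadings of a variable over a set of components add up to the fraction of that variable's variance explained by those components.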

X_cumCalExplVar()

Returns a list holding the cumulative calibrated explained variance for array X after each component. First number represents zero components, second number represents component 1, etc.

X_cumCalExplVar_indVar()

Returns an array holding the cumulative calibrated explained variance for each variable in X after each component. First row represents zero components, second row represents one component, third row represents two components, etc. Columns represent variables.
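The link between cumulative and per-component explained variance can be sketched as follows. The PRESS values are made up, and both the formula (one minus the ratio of remaining to initial PRESS) and the percent scale are assumptions of this example, based on the common definition.

```python
import numpy as np

# Hypothetical total PRESS after 0, 1, 2 and 3 components
press = np.array([200.0, 80.0, 20.0, 10.0])

# Cumulative explained variance in percent; first entry is for zero components
cum_expl_var = (1.0 - press / press[0]) * 100.0

# Explained variance contributed by each component (component 1, 2, ...)
expl_var = np.diff(cum_expl_var)
```

Note that the cumulative result has one more entry than the per-component result, because it starts with the value for zero components.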

X_cumValExplVar()

Returns a list holding the cumulative validated explained variance for array X after each component.

X_cumValExplVar_indVar()

Returns an array holding the cumulative validated explained variance for each variable in X after each component. First row represents zero components, second row represents component 1, third row represents component 2, etc. Columns represent variables.

X_loadings()

Returns an array holding loadings P of array X. Rows represent variables and columns represent components. First column holds loadings for component 1, second column holds loadings for component 2, etc.

X_means()

Returns an array holding the column means of input array X.

X_predCal()

Returns a dictionary holding the predicted arrays Xhat from calibration after each computed component. Dictionary key represents order of component.

X_predVal()

Returns a dictionary holding the predicted arrays Xhat from validation after each computed component. Dictionary key represents order of component.

X_residuals()

Returns a dictionary holding arrays of residuals for array X after each computed component. Dictionary key represents order of component.
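How predicted arrays and residuals depend on the number of components can be sketched with numpy. Scores and loadings from an SVD stand in for the NIPALS results, and the assumption here is that Xhat is reported on the original scale, with the column means added back.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(14, 5))                    # made-up data
means = X.mean(axis=0)

# Stand-in scores T and loadings P from an SVD of the centred data
U, s, Vt = np.linalg.svd(X - means, full_matrices=False)
T, P = U * s, Vt.T

x_hat = {}        # key: number of components used
residuals = {}
for k in range(1, 4):
    # reconstruct from the first k components and add the means back
    x_hat[k] = T[:, :k] @ P[:, :k].T + means
    residuals[k] = X - x_hat[k]
```

Each additional component can only reduce the size of the residuals, so the entry for the highest key is the closest reconstruction of X.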

X_scores()

Returns an array holding scores T. First column holds scores for component 1, second column holds scores for component 2, etc.

X_scores_predict(Xnew, numComp=[])

Returns an array of X scores from new X data using the existing model. Rows represent objects and columns represent components.
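Conceptually, scores for new objects are obtained by pre-processing the new data with the statistics of the training data and projecting it onto the loadings. The numpy sketch below illustrates this for the mean-centred case; loadings from an SVD stand in for the NIPALS loadings, and all variable names are chosen for this example only.

```python
import numpy as np

rng = np.random.default_rng(2)
X_train = rng.normal(size=(14, 5))      # made-up training data
X_new = rng.normal(size=(3, 5))         # made-up new objects

# Stand-in model: loadings from an SVD of the centred training data
means = X_train.mean(axis=0)
U, s, Vt = np.linalg.svd(X_train - means, full_matrices=False)
P = Vt.T[:, :2]                         # variables x components

# New objects are centred with the training means, then projected
T_new = (X_new - means) @ P             # objects x components
```

If the model were fitted with Xstand=True, the new data would also have to be divided by the standard deviations of the training data before projection.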

X_valExplVar()

Returns a list holding the validated explained variance for X after each component. First number in list is for component 1, second number for component 2, third number for component 3, etc.

__init__(arrX, numComp=None, Xstand=False, cvType=None)

On initialisation, check how arrX is to be pre-processed (Xstand is either True or False). Then check whether the number of components chosen by the user is valid.

corrLoadingsEllipses()

Returns a dictionary holding coordinates of ellipses that represent 50% and 100% explained variance in the correlation loadings plot. The coordinates are stored in arrays.
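In a correlation loadings plot, the squared distance of a variable from the origin equals the fraction of its variance explained by the two plotted components, so the 50% and 100% limits are circles of radius sqrt(0.5) and 1. The sketch below generates such coordinates; the helper name circle and the dictionary keys are hypothetical and do not mirror the keys hoggorm uses.

```python
import math

def circle(radius, num_points=100):
    """Hypothetical helper: x and y coordinates of a circle around the origin."""
    angles = [2 * math.pi * i / (num_points - 1) for i in range(num_points)]
    xs = [radius * math.cos(a) for a in angles]
    ys = [radius * math.sin(a) for a in angles]
    return xs, ys

# Keys are the explained variance in percent (hypothetical naming)
ellipses = {
    50: circle(math.sqrt(0.5)),
    100: circle(1.0),
}
```

Variables that fall between the two circles have between 50% and 100% of their variance explained by the two components shown.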

cvTrainAndTestData()

Returns a list consisting of dictionaries holding training and test sets.
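The three cross validation types differ only in how the rows of arrX are grouped into test segments. The pure Python sketch below returns row indices for each split; the helper name cv_splits is hypothetical, and the way rows are assigned to folds for KFold is an assumption that may differ from what hoggorm does.

```python
def cv_splits(num_rows, cv_type):
    """Hypothetical helper: return a list of (train_rows, test_rows) pairs."""
    rows = list(range(num_rows))
    kind = cv_type[0]
    if kind == "loo":
        groups = [[i] for i in rows]                    # one row per segment
    elif kind == "KFold":
        num_folds = cv_type[1]
        # assumption: rows are dealt into folds in an interleaved fashion
        groups = [rows[f::num_folds] for f in range(num_folds)]
    elif kind == "lolo":
        labels = cv_type[1]
        if len(labels) != num_rows:
            raise ValueError("need exactly one label per row")
        # all rows sharing a label are left out together
        groups = [[i for i in rows if labels[i] == lab]
                  for lab in sorted(set(labels))]
    else:
        raise ValueError("unknown cross validation type")
    return [([i for i in rows if i not in g], g) for g in groups]

loo = cv_splits(14, ["loo"])
kfold = cv_splits(14, ["KFold", 4])
lolo = cv_splits(14, ["lolo", [1, 2, 3, 4, 5, 6, 7] * 2])
```

With the label list used here, rows 0 and 7 share label 1 and are therefore left out together in the first lolo segment.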

modelSettings()

Returns a dictionary holding the settings under which NIPALS PCA was run.