Matrix correlation coefficient methods

This module provides statistical tools for computing matrix correlation coefficients (MCCs). An MCC indicates the degree to which the multivariate data held in two arrays are correlated.

hoggorm.mat_corr_coeff.RV2coeff(dataList)

This function computes the RV2 matrix correlation coefficients between pairs of arrays. For each pair, the number and order of objects (rows) must match. The number of variables (columns) in each array may vary. The RV2 coefficient is a modified version of the RV coefficient with values -1 <= RV2 <= 1. RV2 is independent of the number of objects and the number of variables.

REF: A.K. Smilde, et al. Bioinformatics (2009) Vol 25, no 3, 401-405

Parameters:dataList (list) – A list holding an arbitrary number of numpy arrays for which the RV2 coefficient will be computed.
Returns:A numpy array holding RV2 coefficients for pairs of numpy arrays. The diagonal of the result array holds ones, since each array is compared with itself.
Return type:numpy array
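
The RV2 modification described above can be illustrated with a short numpy sketch. It follows the definition in the cited reference (the diagonals of the row cross-product matrices are removed before the correlation is taken) and is not hoggorm's own code. The name rv2_pair is hypothetical and used only for illustration.

```python
import numpy as np


def rv2_pair(X, Y):
    """Modified RV coefficient for two column-centred arrays with matching rows.

    Illustrative sketch only; rv2_pair is a hypothetical helper, not part of hoggorm.
    """
    # Row-by-row cross-product matrices, both of shape (n_objects, n_objects)
    S = X @ X.T
    T = Y @ Y.T
    # Remove the diagonals before comparing the two matrices
    S_tilde = S - np.diag(np.diag(S))
    T_tilde = T - np.diag(np.diag(T))
    # For symmetric matrices the element-wise sum equals trace(S_tilde @ T_tilde)
    num = np.sum(S_tilde * T_tilde)
    den = np.sqrt(np.sum(S_tilde ** 2) * np.sum(T_tilde ** 2))
    return num / den


rng = np.random.default_rng(0)
A = rng.random((50, 100))
A = A - A.mean(axis=0)
B = rng.random((50, 20))
B = B - B.mean(axis=0)

print(rv2_pair(A, A))  # an array compared with itself
print(rv2_pair(A, B))  # two unrelated random arrays
```

Because the diagonals are removed, the numerator can become negative, which is why RV2 ranges over -1 <= RV2 <= 1.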

Examples

>>> import hoggorm as ho
>>> import numpy as np
>>>
>>> # Generate some random data. Note that number of rows must match across arrays
>>> arr1 = np.random.rand(50, 100)
>>> arr2 = np.random.rand(50, 20)
>>> arr3 = np.random.rand(50, 500)
>>>
>>> # Center the data before computation of RV2 coefficients
>>> arr1_cent = arr1 - np.mean(arr1, axis=0)
>>> arr2_cent = arr2 - np.mean(arr2, axis=0)
>>> arr3_cent = arr3 - np.mean(arr3, axis=0)
>>>
>>> # Compute RV2 matrix correlation coefficients on mean centered data
>>> rv_results = ho.RV2coeff([arr1_cent, arr2_cent, arr3_cent])
>>> rv_results
array([[ 1.        , -0.00563174,  0.04028299],
       [-0.00563174,  1.        ,  0.08733739],
       [ 0.04028299,  0.08733739,  1.        ]])
>>>
>>> # Get RV2 for arr1_cent and arr2_cent
>>> rv_results[0, 1]
    -0.00563174
>>>
>>> # or
>>> rv_results[1, 0]
    -0.00563174
>>>
>>> # Get RV2 for arr2_cent and arr3_cent
>>> rv_results[1, 2]
    0.08733739
>>>
>>> # or
>>> rv_results[2, 1]
    0.08733739

hoggorm.mat_corr_coeff.RVcoeff(dataList)

This function computes the RV matrix correlation coefficients between pairs of arrays. For each pair, the number and order of objects (rows) must match. The number of variables (columns) in each array may vary.

REF: H. Abdi, D. Valentin; ‘The STATIS method’

Parameters:dataList (list) – A list holding numpy arrays for which the RV coefficient will be computed.
Returns:A numpy array holding RV coefficients for pairs of numpy arrays. The diagonal of the result array holds ones, since there RV is computed on identical arrays, i.e. each array in dataList against itself.
Return type:numpy array
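
As a rough illustration of what the returned array contains, the sketch below computes the standard RV coefficient, trace(S T) / sqrt(trace(S S) trace(T T)) with S = X X' and T = Y Y', for every pair in a list. It is a minimal numpy sketch under that textbook definition, not hoggorm's implementation. The names rv_pair and rv_matrix are hypothetical.

```python
import numpy as np


def rv_pair(X, Y):
    """RV coefficient for two column-centred arrays with matching rows (hypothetical helper)."""
    S = X @ X.T
    T = Y @ Y.T
    return np.trace(S @ T) / np.sqrt(np.trace(S @ S) * np.trace(T @ T))


def rv_matrix(data_list):
    """Symmetric array of pairwise RV coefficients (hypothetical helper)."""
    k = len(data_list)
    out = np.ones((k, k))  # diagonal stays one: each array compared with itself
    for i in range(k):
        for j in range(i + 1, k):
            out[i, j] = out[j, i] = rv_pair(data_list[i], data_list[j])
    return out


rng = np.random.default_rng(0)
arrays = [rng.random((50, p)) for p in (100, 20, 500)]
centred = [a - a.mean(axis=0) for a in arrays]
print(rv_matrix(centred))
```

Unlike RV2, this coefficient cannot be negative, because S and T are both positive semi-definite.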

Examples

>>> import hoggorm as ho
>>> import numpy as np
>>>
>>> # Generate some random data. Note that number of rows must match across arrays
>>> arr1 = np.random.rand(50, 100)
>>> arr2 = np.random.rand(50, 20)
>>> arr3 = np.random.rand(50, 500)
>>>
>>> # Center the data before computation of RV coefficients
>>> arr1_cent = arr1 - np.mean(arr1, axis=0)
>>> arr2_cent = arr2 - np.mean(arr2, axis=0)
>>> arr3_cent = arr3 - np.mean(arr3, axis=0)
>>>
>>> # Compute RV matrix correlation coefficients on mean centered data
>>> rv_results = ho.RVcoeff([arr1_cent, arr2_cent, arr3_cent])
>>> rv_results
array([[ 1.        ,  0.41751839,  0.77769025],
       [ 0.41751839,  1.        ,  0.51194496],
       [ 0.77769025,  0.51194496,  1.        ]])
>>>
>>> # Get RV for arr1_cent and arr2_cent
>>> rv_results[0, 1]
    0.41751838661314689
>>>
>>> # or
>>> rv_results[1, 0]
    0.41751838661314689
>>>
>>> # Get RV for arr2_cent and arr3_cent
>>> rv_results[1, 2]
    0.51194496245209853
>>>
>>> # or
>>> rv_results[2, 1]
    0.51194496245209853

class hoggorm.mat_corr_coeff.SMI(X1, X2, **kargs)

Similarity of Matrices Index (SMI)

A similarity index for comparing coupled data matrices. A two-step process starts with extraction of stable subspaces using Principal Component Analysis or some other method yielding two orthonormal bases. These bases are compared using Orthogonal Projection (OP / ordinary least squares) or Procrustes Rotation (PR). The result is a similarity measure that can be adjusted to various data sets and contexts and which includes explorative plotting and permutation based testing of matrix subspace equality.

Reference: A similarity index for comparing coupled matrices - Ulf Geir Indahl, Tormod Næs, Kristian Hovde Liland

Parameters:
  • X1 (numpy array) – first matrix to be compared.
  • X2 (numpy array) – second matrix to be compared.
  • ncomp1 (int, optional) – maximum number of subspace components from the first matrix.
  • ncomp2 (int, optional) – maximum number of subspace components from the second matrix.
  • projection (str, optional) – type of projection to apply, defaults to “Orthogonal”, alternatively “Procrustes”.
  • Scores1 (numpy array, optional) – user supplied score-matrix to replace singular value decomposition of first matrix.
  • Scores2 (numpy array, optional) – user supplied score-matrix to replace singular value decomposition of second matrix.
Returns:An SMI object containing all combinations of components.
Return type:SMI
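
The two-step process described above can be sketched for a single combination of component counts. This is a minimal numpy illustration based on a reading of the cited reference, not hoggorm's implementation: the orthogonal projection variant is taken as the squared Frobenius norm of T'U divided by min(ncomp1, ncomp2), and the Procrustes variant as the squared mean of the singular values of T'U. Both formulas, and the names orthonormal_scores and smi_sketch, are assumptions made for illustration.

```python
import numpy as np


def orthonormal_scores(X, ncomp):
    """Step 1: orthonormal basis from the first ncomp left singular vectors of centred X."""
    Xc = X - X.mean(axis=0)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :ncomp]


def smi_sketch(X1, X2, ncomp1, ncomp2, projection="Orthogonal"):
    """Step 2: compare the two bases (hypothetical helper, one component combination only)."""
    T = orthonormal_scores(X1, ncomp1)
    U = orthonormal_scores(X2, ncomp2)
    TU = T.T @ U
    if projection == "Orthogonal":
        # Assumed OP form: squared Frobenius norm scaled by the smaller dimension
        return np.sum(TU ** 2) / min(ncomp1, ncomp2)
    # Assumed PR form: squared mean of the min(ncomp1, ncomp2) singular values
    s = np.linalg.svd(TU, compute_uv=False)
    return np.mean(s) ** 2


rng = np.random.default_rng(0)
X1 = rng.random((100, 30))
X2 = rng.random((100, 40))
print(smi_sketch(X1, X2, 5, 5))
print(smi_sketch(X1, X2, 5, 5, projection="Procrustes"))
```

The SMI object documented here holds values for all combinations of components up to ncomp1 and ncomp2, whereas the sketch evaluates only the single combination it is given.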

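Examples

>>> import numpy as np
>>> from hoggorm.mat_corr_coeff import SMI
>>>
>>> # Generate random data and center the columns
>>> X1 = np.random.rand(100, 300)
>>> X1 = X1 - np.mean(X1, axis=0)
>>>
>>> # Build X2 from X1 with the third singular component removed
>>> U, s, V = np.linalg.svd(X1, full_matrices=False)
>>> X2 = np.dot(np.dot(np.delete(U, 2, 1), np.diag(np.delete(s, 2))), np.delete(V, 2, 0))
>>>
>>> # Compute SMI with orthogonal projection, Procrustes rotation and user supplied scores
>>> smiOP = SMI(X1, X2, ncomp1=10, ncomp2=10)
>>> smiPR = SMI(X1, X2, ncomp1=10, ncomp2=10, projection="Procrustes")
>>> smiCustom = SMI(X1, X2, ncomp1=10, ncomp2=10, Scores1=U)
>>>
>>> # Print SMI values and permutation based P-values
>>> print(smiOP.smi)
>>> print(smiOP.significance())
>>> print(smiPR.significance(B=100))

significance(**kargs)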

Significance estimation for Similarity of Matrices Index (SMI)

For each combination of components, significance is estimated by sampling from a null distribution of no similarity, i.e. the rows of one matrix are permuted B times and the corresponding SMI values are computed. If the vector replicates is included, replicates will be kept together through permutations.

Parameters:
  • B (int) – number of permutations, default = 10000.
  • replicates (numpy array) – integer vector of replicates (must be balanced).
Returns:An array containing P-values for all combinations of components.
Return type:numpy array
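
The permutation scheme described above can be sketched as follows. This is a generic numpy illustration, not hoggorm's implementation: it builds the null distribution for one fixed number of components, uses an orthogonal projection similarity as the statistic, and assumes that replicates labels which group each row belongs to. How hoggorm turns the null distribution into its P-values is not stated here, so the sketch stops at the null distribution. All function names are hypothetical.

```python
import numpy as np


def scores(X, ncomp):
    """Orthonormal scores: first ncomp left singular vectors of column-centred X."""
    Xc = X - X.mean(axis=0)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :ncomp]


def similarity(T, U):
    """Assumed statistic: squared Frobenius norm of T'U over the smaller dimension."""
    return np.sum((T.T @ U) ** 2) / min(T.shape[1], U.shape[1])


def row_permutation(n, rng, replicates=None):
    """Row index permutation; with replicates, whole groups are moved together."""
    if replicates is None:
        return rng.permutation(n)
    replicates = np.asarray(replicates)
    groups = [np.flatnonzero(replicates == g) for g in np.unique(replicates)]
    order = rng.permutation(len(groups))
    index = np.empty(n, dtype=int)
    # Balanced groups are required so that each group fits into another group's rows
    for target, source in zip(groups, order):
        index[target] = groups[source]
    return index


def permutation_null(X1, X2, ncomp, B=1000, replicates=None, seed=0):
    """Observed similarity and B similarities under row permutation of the second matrix."""
    rng = np.random.default_rng(seed)
    T, U = scores(X1, ncomp), scores(X2, ncomp)
    observed = similarity(T, U)
    null = np.empty(B)
    for b in range(B):
        # Permuting the rows of the scores matches permuting the rows of X2
        idx = row_permutation(U.shape[0], rng, replicates)
        null[b] = similarity(T, U[idx])
    return observed, null


rng = np.random.default_rng(1)
scale = np.array([8.0, 7.0, 6.0, 1.0, 1.0, 1.0, 1.0, 1.0])
X1 = rng.standard_normal((40, 8)) * scale
X2 = X1 + 0.01 * rng.standard_normal((40, 8))
observed, null = permutation_null(X1, X2, ncomp=3, B=200)
print(observed, null.max())
```

In this usage example the two matrices are nearly identical, so the observed similarity lies far above every value in the permutation null distribution.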