maxfuse.match_utils.get_initial_matching

maxfuse.match_utils.get_initial_matching(arr1, arr2, clust_labels1=None, clust_labels2=None, edges1=None, edges2=None, wt1=0.3, wt2=0.3, randomized_svd=True, svd_runs=1, svd_components1=None, svd_components2=None, verbose=True)[source]

Assuming the features of arr1 and arr2 are column-wise directly comparable, obtain a matching by minimizing the correlation distance between the rows of arr1 and arr2.
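As a rough illustration of matching by correlation distance, the sketch below builds a pairwise distance matrix with SciPy and solves a one-to-one assignment on it. The helper name correlation_matching and the choice of linear_sum_assignment are assumptions made for this example; the page does not say which solver get_initial_matching uses internally, and the sketch skips the smoothing and SVD steps described under Parameters.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def correlation_matching(arr1, arr2):
    """Hypothetical helper: one-to-one row matching on correlation distance."""
    # dist[i, j] = 1 - Pearson correlation between arr1[i] and arr2[j]
    dist = cdist(arr1, arr2, metric="correlation")
    # Assignment that minimizes the summed distance over matched pairs
    rows, cols = linear_sum_assignment(dist)
    vals = dist[rows, cols]
    return rows.tolist(), cols.tolist(), vals.tolist()


# Toy data: arr2 is a row-permuted, slightly noisy copy of arr1
rng = np.random.default_rng(0)
arr1 = rng.normal(size=(5, 8))
perm = np.array([2, 0, 4, 1, 3])
arr2 = arr1[perm] + 0.01 * rng.normal(size=(5, 8))

rows, cols, vals = correlation_matching(arr1, arr2)
```

Here arr2[j] is a noisy copy of arr1[perm[j]], so a good matching should pair row i of arr1 with the row of arr2 that was derived from it.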

Parameters:
  • arr1 (np.array of shape (n_samples1, n_features1)) – The first data matrix.

  • arr2 (np.array of shape (n_samples2, n_features2)) – The second data matrix.

  • clust_labels1 (None or np.array of shape (n_samples1, )) – If not None, these are the cluster labels of the rows of the first data matrix, and the smoothing of this matrix will be done via cluster centroid shrinkage.

  • clust_labels2 (None or np.array of shape (n_samples2, )) – Same as clust_labels1 but for the second data matrix.

  • edges1 (None or list of length two or three) – If not None, then each edge in the graph is (edges1[0][i], edges1[1][i]) with weight edges1[2][i] (if a third element is present), and the smoothing of this matrix will be done via graph smoothing.

  • edges2 (None or scipy.sparse.csr_matrix of shape (n_samples2, n_samples2)) – Same as edges1 but for the second data matrix.

  • wt1 (float, default=0.3) – The smoothing of the first data matrix will be wt1 * arr1 + (1-wt1) * shrinkage_targets, where the shrinkage_targets will be either the cluster centroids or the average of graph neighbors.

  • wt2 (float, default=0.3) – Same as wt1 but for the second data matrix.

  • randomized_svd (bool, default=True) – Whether to use randomized SVD.

  • svd_runs (int, default=1) – Randomized SVD gives different results from run to run, so if randomized_svd=True, perform svd_runs randomized SVDs and pick the one with the smallest Frobenius reconstruction error. If randomized_svd=False, svd_runs is forced to be 1.

  • svd_components1 (None or int) – If None, no SVD is performed; otherwise, the number of components to keep when doing SVD de-noising for the first data matrix.

  • svd_components2 (None or int) – Same as svd_components1 but for the second data matrix.

  • verbose (bool, default=True) – Whether to print the progress.
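The smoothing formula given for wt1 and wt2, and the SVD de-noising controlled by svd_components1 and svd_components2, can be sketched in plain NumPy. This is a minimal sketch under stated assumptions: the helper names centroid_shrinkage and svd_denoise are hypothetical, only the cluster-centroid variant of the shrinkage targets is shown, an exact SVD stands in for the randomized one, and the order in which the library applies the two steps is not stated on this page.

```python
import numpy as np


def centroid_shrinkage(arr, clust_labels, wt):
    """Hypothetical helper: wt * arr + (1 - wt) * centroid of each row's cluster."""
    arr = np.asarray(arr, dtype=float)
    # inverse[i] is the index of row i's cluster among the unique labels
    labels, inverse = np.unique(np.asarray(clust_labels), return_inverse=True)
    centroids = np.vstack(
        [arr[inverse == k].mean(axis=0) for k in range(len(labels))]
    )
    return wt * arr + (1 - wt) * centroids[inverse]


def svd_denoise(arr, n_components):
    """Hypothetical helper: rank-n_components reconstruction via exact SVD."""
    u, s, vt = np.linalg.svd(arr, full_matrices=False)
    return (u[:, :n_components] * s[:n_components]) @ vt[:n_components]


arr = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 10.0]])
labels = np.array([0, 0, 1])

# Rows 0 and 1 share a cluster, so both move toward their centroid [1, 1];
# row 2 is alone in its cluster and stays where it is.
smoothed = centroid_shrinkage(arr, labels, wt=0.3)
denoised = svd_denoise(smoothed, n_components=1)
```

With wt=0.3, each row keeps 30% of its own values and takes 70% from its shrinkage target, so a smaller weight means stronger smoothing.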

Returns:

matching (list of length 3) – Unpack as rows, cols, vals = matching. The i-th matched pair is (rows[i], cols[i]), and its distance is vals[i].
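The returned triple can be unpacked and filtered with ordinary Python. The values below are made up purely to show the layout, and the max_dist cutoff is a hypothetical post-processing choice, not something the function applies itself.

```python
# Hypothetical matching in the documented (rows, cols, vals) layout
matching = [[0, 1, 2], [2, 0, 1], [0.05, 0.40, 0.10]]
rows, cols, vals = matching

# One (row in arr1, row in arr2, distance) tuple per matched pair
pairs = list(zip(rows, cols, vals))

# Keep only pairs whose distance is at or below an illustrative cutoff
max_dist = 0.2
kept = [(r, c) for r, c, v in pairs if v <= max_dist]
```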