maxfuse.match_utils.get_refined_matching

maxfuse.match_utils.get_refined_matching(init_matching, arr1, arr2, randomized_svd=False, svd_runs=1, svd_components1=None, svd_components2=None, clust_labels1=None, clust_labels2=None, edges1=None, edges2=None, wt1=0.5, wt2=0.5, n_iters=3, filter_prop=0, cca_components=15, cca_max_iter=2000, verbose=True)

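Refine init_matching over n_iters iterations, using CCA embeddings of arr1 and arr2 that can optionally be SVD-denoised and smoothed.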

Parameters:
  • init_matching (list) – init_matching[0][i], init_matching[1][i] is a matched pair, and init_matching[2][i] is the distance for this pair.

  • arr1 (np.array of shape (n_samples1, n_features1)) – The first data matrix.

  • arr2 (np.array of shape (n_samples2, n_features2)) – The second data matrix.

  • randomized_svd (bool, default=False) – Whether to use randomized SVD.

  • svd_runs (int, default=1) – Randomized SVD gives a different result on each run, so if randomized_svd=True, svd_runs randomized SVDs are performed and the one with the smallest Frobenius reconstruction error is kept. If randomized_svd=False, svd_runs is forced to be 1.

  • svd_components1 (None or int) – If None, no SVD is performed. Otherwise, the number of components to keep when de-noising the first data matrix with SVD before it is fed into CCA.

  • svd_components2 (None or int) – Same as svd_components1 but for the second data matrix.

  • clust_labels1 (None or np.array of shape (n_samples1, )) – If not None, the cluster labels for the rows of the first data matrix; this matrix is then smoothed via cluster centroid shrinkage.

  • clust_labels2 (None or np.array of shape (n_samples2, )) – Same as clust_labels1 but for the second data matrix.

  • edges1 (None or list of length two or three) – If not None, each edge in the graph is (edges1[0][i], edges1[1][i]), with weight edges1[2][i] if a third entry is given; the first data matrix is then smoothed via graph smoothing.

  • edges2 (None or scipy.sparse.csr_matrix of shape (n_samples2, n_samples2)) – Same as edges1 but for the second data matrix.

  • wt1 (float, default=0.5) – The smoothing of the first data matrix will be wt1 * (cca embedding of arr1) + (1-wt1) * shrinkage_targets, where the shrinkage_targets will be either the cluster centroids or the average of graph neighbors.

  • wt2 (float, default=0.5) – Same as wt1 but for the second data matrix.

  • n_iters (int, default=3) – Number of refinement iterations.

  • filter_prop (float, default=0) – Proportion of matched pairs to discard before feeding into refinement iterations.

  • cca_components (int, default=15) – Number of CCA components.

  • cca_max_iter (int, default=2000) – Maximum number of CCA iterations.

  • verbose (bool, default=True) – Whether to print the progress.
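
The blending controlled by wt1 and wt2 can be illustrated with a small NumPy sketch of cluster centroid shrinkage. This is not the library's implementation: the helper name centroid_shrinkage and the toy arrays are hypothetical, and the sketch assumes the shrinkage target of a row is the mean of all rows that share its cluster label.

```python
import numpy as np


def centroid_shrinkage(embedding, labels, wt=0.5):
    """Hypothetical helper: blend each row with its cluster centroid.

    Computes wt * embedding + (1 - wt) * shrinkage_targets, where the
    target of a row is assumed to be the mean of its cluster.
    """
    embedding = np.asarray(embedding, dtype=float)
    labels = np.asarray(labels)
    targets = np.empty_like(embedding)
    for lab in np.unique(labels):
        mask = labels == lab
        # Every row in this cluster gets the same centroid as its target.
        targets[mask] = embedding[mask].mean(axis=0)
    return wt * embedding + (1 - wt) * targets


# Toy embedding: two rows in cluster 0, one row alone in cluster 1.
emb = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 0.0]])
labels = np.array([0, 0, 1])
smoothed = centroid_shrinkage(emb, labels, wt=0.5)
print(smoothed)
```

With wt=1 the embedding is returned unchanged, and with wt=0 every row collapses onto its centroid; a row that is alone in its cluster is its own centroid and is therefore unaffected.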

Returns:

matching (list of length 3) – rows, cols, vals = matching. Each matched pair is (rows[i], cols[i]), and its distance is vals[i].
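
The three-list layout shared by init_matching and the returned matching can be sketched in plain Python. The values below are made up, and drop_worst_pairs is a hypothetical helper, not part of maxfuse. It only illustrates what discarding a proportion of pairs could look like, under the assumption (not stated above) that the pairs with the largest distances are the ones dropped.

```python
def drop_worst_pairs(matching, filter_prop=0.0):
    """Hypothetical helper: drop a proportion of pairs from a matching.

    Assumption: the pairs with the largest distances are discarded.
    The surviving pairs keep their original relative order.
    """
    rows, cols, vals = matching
    n_keep = len(vals) - int(filter_prop * len(vals))
    # Indices of the n_keep smallest distances, restored to original order.
    keep = sorted(sorted(range(len(vals)), key=lambda i: vals[i])[:n_keep])
    return [
        [rows[i] for i in keep],
        [cols[i] for i in keep],
        [vals[i] for i in keep],
    ]


# Made-up matching: row i of arr1 is paired with row cols[i] of arr2.
matching = [[0, 1, 2, 3], [5, 3, 7, 1], [0.2, 0.9, 0.1, 0.4]]

rows, cols, vals = matching
for r, c, d in zip(rows, cols, vals):
    print(f"arr1 row {r} <-> arr2 row {c} (distance {d})")

filtered = drop_worst_pairs(matching, filter_prop=0.25)
print(filtered)
```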