{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Example of MaxFuse usage between the RNA and protein modalities" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial, we demonstrate MaxFuse integration and matching across weakly linked modalities, using RNA and protein as the example. For testing purposes, we use a CITE-seq PBMC dataset with 228 antibodies from Hao et al. (2021). We use both the protein and RNA information but __disregard the fact that they are multiome data__." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from scipy.io import mmread\n", "\n", "import matplotlib.pyplot as plt\n", "plt.rcParams[\"figure.figsize\"] = (6, 4)\n", "\n", "from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay\n", "\n", "import anndata as ad\n", "import scanpy as sc\n", "import maxfuse as mf" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Data acquisition" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Since the example data used in this tutorial exceeds the size limit for GitHub repository files, we have uploaded it to a server, from which it can be downloaded with the code below. This code only needs to be run **once** for both tutorial examples." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import requests, zipfile, io\n", "# download the example data archive\n", "r = requests.get(\"http://stat.wharton.upenn.edu/~zongming/maxfuse/data.zip\")\n", "r.raise_for_status()  # stop here if the download failed\n", "# unpack the archive from memory into the parent directory\n", "z = zipfile.ZipFile(io.BytesIO(r.content))\n", "z.extractall(\"../\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Data preprocessing" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We begin by reading in the protein and RNA measurements.\n", "\n", "Note that the two modalities in this example have *matching rows*, since CITE-seq measures proteins and RNAs simultaneously.\n", "We will ignore this fact and treat the two modalities as if they were measured separately.\n", "\n", "MaxFuse reads its input as ```adata``` (AnnData) objects. In this tutorial we read in the original RNA and protein counts, where each row is a cell and each column is a feature, and then convert them into ```adata``` objects." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# read in protein data\n", "protein = pd.read_csv(\"../data/citeseq_pbmc/pro.csv\") # 10k cells (protein)\n", "# convert to AnnData\n", "protein_adata = ad.AnnData(\n", "    protein.to_numpy(), dtype=np.float32\n", ")\n", "protein_adata.var_names = protein.columns" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# read in RNA data\n", "rna = mmread(\"../data/citeseq_pbmc/rna.txt\") # RNA counts as a sparse matrix, 10k cells (RNA)\n", "rna_names = pd.read_csv('../data/citeseq_pbmc/citeseq_rna_names.csv')['names'].to_numpy()\n", "# convert to AnnData\n", "rna_adata = ad.AnnData(\n", "    rna.tocsr(), dtype=np.float32\n", ")\n", "rna_adata.var_names = rna_names" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "**Optional**: metadata for the cells. 
Here we use it only to **evaluate the integration results**; MaxFuse itself does not require this information." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# read in cell type labels\n", "metadata = pd.read_csv('../data/citeseq_pbmc/meta.csv')\n", "labels_l1 = metadata['celltype.l1'].to_numpy()\n", "labels_l2 = metadata['celltype.l2'].to_numpy()\n", "\n", "protein_adata.obs['celltype.l1'] = labels_l1\n", "protein_adata.obs['celltype.l2'] = labels_l2\n", "rna_adata.obs['celltype.l1'] = labels_l1\n", "rna_adata.obs['celltype.l2'] = labels_l2" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Here we are integrating protein and RNA data, and the names of proteins (antibodies) often differ from the names of their corresponding genes. \n", "\n", "These \"weakly linked\" features are used during initialization: we construct two arrays, `rna_shared` and `protein_shared`, whose columns are matched, and use them to obtain the initial matching. \n", "\n", "To make the feature correspondence straightforward to construct, we prepared a ```.csv``` file containing most antibody names (as seen in CITE-seq, CODEX, etc.) and their corresponding gene names:\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|   | Protein name | RNA name |
|---|---|---|
| 0 | CD80 | CD80 |
| 1 | CD86 | CD86 |
| 2 | CD274 | CD274 |
| 3 | CD273 | PDCD1LG2 |
| 4 | CD275 | ICOSLG |
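
As a rough illustration of how a correspondence table like the one above could be turned into two column-matched arrays, the following sketch keeps only the pairs whose protein and gene are both measured, then selects the columns in the same order on each side. It is not the MaxFuse implementation: `shared_feature_arrays` is a hypothetical helper, the inputs are plain pandas DataFrames rather than ```adata``` objects, the toy values are made up, and exact name matching is assumed.

```python
import pandas as pd


def shared_feature_arrays(rna, protein, correspondence):
    # Hypothetical helper: keep only pairs whose protein and gene are both measured.
    pairs = [
        (rna_name, protein_name)
        for protein_name, rna_name in zip(
            correspondence['Protein name'], correspondence['RNA name']
        )
        if protein_name in protein.columns and rna_name in rna.columns
    ]
    rna_cols = [r for r, _ in pairs]
    protein_cols = [p for _, p in pairs]
    # Column j of rna_shared corresponds to column j of protein_shared.
    rna_shared = rna[rna_cols].to_numpy()
    protein_shared = protein[protein_cols].to_numpy()
    return rna_shared, protein_shared, pairs


# The five name pairs shown in the table above.
correspondence = pd.DataFrame({
    'Protein name': ['CD80', 'CD86', 'CD274', 'CD273', 'CD275'],
    'RNA name': ['CD80', 'CD86', 'CD274', 'PDCD1LG2', 'ICOSLG'],
})

# Toy measurements (made up): rows are cells, columns are features.
rna = pd.DataFrame([[1, 0, 5], [0, 2, 7]], columns=['CD80', 'PDCD1LG2', 'ACTB'])
protein = pd.DataFrame(
    [[3.0, 1.5, 0.2], [4.0, 0.5, 0.1]], columns=['CD273', 'CD80', 'CD86']
)

rna_shared, protein_shared, pairs = shared_feature_arrays(rna, protein, correspondence)
print(pairs)
print(rna_shared.shape, protein_shared.shape)
```

With these toy inputs, only CD80 and the CD273/PDCD1LG2 pair are measured on both sides, so each shared array has two columns.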
|   | mod1_indx | mod2_indx | score |
|---|---|---|---|
| 0 | 6424 | 0 | 0.861942 |
| 1 | 9096 | 1 | 0.803392 |
| 2 | 9198 | 5 | 0.885155 |
| 3 | 4086 | 9 | 0.827848 |
| 4 | 3400 | 11 | 0.879686 |
| ... | ... | ... | ... |
| 9995 | 9227 | 9994 | 0.516751 |
| 9996 | 7038 | 9995 | 0.524879 |
| 9997 | 1947 | 9996 | 0.573741 |
| 9998 | 1648 | 9997 | 0.549954 |
| 9999 | 2451 | 9998 | 0.757793 |
10000 rows × 3 columns
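
One way to read a matching table like the one above is to look up the cell type label on each side of every matched pair and count how often the two agree. The sketch that follows assumes that `mod1_indx` and `mod2_indx` are row positions in the two modalities (an assumption, not stated in the text here). `matching_accuracy` is a hypothetical helper rather than a MaxFuse function, and the toy labels and pairs are made up.

```python
import numpy as np
import pandas as pd


def matching_accuracy(matching, labels_mod1, labels_mod2):
    # Look up the cell type label on each side of every matched pair.
    lab1 = np.asarray(labels_mod1)[matching['mod1_indx'].to_numpy()]
    lab2 = np.asarray(labels_mod2)[matching['mod2_indx'].to_numpy()]
    # Fraction of matched pairs whose labels agree.
    accuracy = float(np.mean(lab1 == lab2))
    # Cross-tabulate the labels to see which cell types get confused.
    confusion = pd.crosstab(
        pd.Series(lab1, name='mod1_label'),
        pd.Series(lab2, name='mod2_label'),
    )
    return accuracy, confusion


# Toy data (made up): both modalities share one label array,
# as they do in this tutorial, where the cells are the same on both sides.
toy_labels = np.array(['B', 'B', 'T', 'T', 'NK'])
toy_matching = pd.DataFrame({
    'mod1_indx': [1, 0, 3, 4],
    'mod2_indx': [0, 1, 2, 3],
    'score': [0.86, 0.80, 0.88, 0.52],
})

toy_accuracy, toy_confusion = matching_accuracy(toy_matching, toy_labels, toy_labels)
print(toy_accuracy)
print(toy_confusion)
```

In the toy example, three of the four pairs share a label, so the accuracy is 0.75; the one mismatch pairs an NK cell with a T cell.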