Single-cell transcriptomics: raw data processing with kb-python

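Introduction

kallisto|bustools is a workflow for preprocessing scRNA-seq data.

Preprocessing consists of the following steps:

associating reads with their cells of origin;
deduplicating reads by their unique molecular identifiers (UMIs);
generating gene or feature counts from the reads, yielding a cell x gene matrix.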
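
The three steps above can be illustrated with a toy sketch. It is not how kallisto|bustools is implemented; it assumes each read has already been assigned to a gene, and the barcodes, UMIs, and gene names are made up for illustration.

```python
from collections import defaultdict

def count_matrix(records):
    """Collapse (barcode, umi, gene) records into a cell x gene count table.

    Each record stands for one read that has already been assigned to a gene.
    Reads sharing the same (barcode, umi, gene) triple are counted once.
    """
    unique_molecules = set(records)           # UMI deduplication
    counts = defaultdict(lambda: defaultdict(int))
    for barcode, _umi, gene in unique_molecules:
        counts[barcode][gene] += 1            # one count per unique molecule
    return {cell: dict(genes) for cell, genes in counts.items()}

# Hypothetical toy reads: (cell barcode, UMI, gene)
reads = [
    ("AAAC", "TTG", "Gapdh"),
    ("AAAC", "TTG", "Gapdh"),   # PCR duplicate of the read above
    ("AAAC", "CCA", "Gapdh"),
    ("AAAC", "GGT", "Actb"),
    ("GGGT", "TTG", "Gapdh"),
]
matrix = count_matrix(reads)
```

The duplicated read collapses into a single molecule, so the cell AAAC ends up with two Gapdh counts rather than three.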

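kallisto|bustools offers the following advantages:

Generates a cell x gene or cell x transcript equivalence class count matrix
Performs RNA velocity and snRNA-seq analyses
Quantifies data from many technologies, such as 10x, inDrops, and Drop-seq
Supports custom workflows for new technologies and protocols
Processes feature barcoding data such as CITE-seq, REAP-seq, MULTI-seq, Clicktags, and Perturb-seq
Generates QC reports
Runs fast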

Usage

kb-python is a Python package for processing scRNA-seq data; it wraps the kallisto|bustools single-cell RNA-seq workflow.

It can be installed with pip:

pip install kb-python

It can also be installed with conda:

conda install -c bioconda kb-python

Or install the development version:

pip install git+https://github.com/pachterlab/kb_python

Running kb on the command line prints output similar to the following:

usage: kb [-h] [--list] <CMD> ...

kb_python 0.28.2

positional arguments:
  <CMD>
    info      Display package and citation information
    compile   Compile `kallisto` and `bustools` binaries from source
    ref       Build a kallisto index and transcript-to-gene mapping
    count     Generate count matrices from a set of single-cell FASTQ files

optional arguments:
  -h, --help  Show this help message and exit
  --list      Display list of supported single-cell technologies

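There are four subcommands; we mainly use the last two, ref and count.

Building an index

The ref subcommand builds a transcriptome index with kallisto. You only need to pass in the reference genome FASTA file and the genome annotation (GTF) file.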

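The most important parameter is --workflow, which selects the type of index to build.

There are four main workflow types. standard is the default; nac is used for RNA velocity analysis; kite is mainly used for feature barcode sequencing, such as 10x Feature Barcode technology, which measures features like cell-surface proteins alongside gene expression.

The fourth type, custom, lets you build a reference index tailored to your own needs.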

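Quantification

The count subcommand generates count matrices from a set of single-cell FASTQ files. Its most important parameter is -x, which specifies the single-cell technology.

The technologies supported by kb-python can be listed with the following command: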

kb --list

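In the output, the numbers in the last three columns are positions, given as: file index (0 means R1), start position, end position. None means the whole read in that file.

For example, 10XV2 is version 2 of the 10x chemistry: the barcode and UMI are both in R1, where the first 16 bases are the barcode and the following 10 bases are the UMI, and R2 is entirely cDNA.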
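
To make the (file index, start, end) convention concrete, here is a minimal sketch that slices a read pair according to the 10XV2 layout described above. The TENX_V2 dictionary and split_read helper are illustrative names, not part of kb-python, and the sequences are invented.

```python
# Assumed layout for 10xv2, written as (file index, start, end) triples
# following the description above; None means "no bound on that side".
TENX_V2 = {
    "barcode": (0, 0, 16),
    "umi": (0, 16, 26),
    "cdna": (1, None, None),
}

def split_read(layout, reads):
    """Slice each feature out of a tuple of reads (R1, R2, ...)."""
    return {
        name: reads[file_idx][start:end]
        for name, (file_idx, start, end) in layout.items()
    }

r1 = "ACGTACGTACGTACGT" + "TTTTGGGGCC"   # 16-base barcode + 10-base UMI
r2 = "GATTACAGATTACA"                     # cDNA (hypothetical sequence)
parts = split_read(TENX_V2, (r1, r2))
```

Because Python slices accept None as an open bound, the same helper covers both the bounded barcode/UMI fields and the whole-read cDNA field.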

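Hands-on examples

10x scRNA-seq

We use the data from the previous post as an example and process the raw data with kb-python.

Build the reference index using the standard workflow: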

kb ref -i mm10.standard.idx -g t2g.txt \
    -f1 cdna.fa -f2 intron.fa \
    --workflow standard \
    mm10.fa \
    mm10.refGene.gtf

Quantification. Multiple FASTQ files must be listed in sorted order, with R1 before R2 for each lane:

kb count -i mm10.standard.idx \
    -g t2g.txt -x 10xv2 \
    -o output -t 2 --workflow standard \
    --h5ad --cellranger --filter bustools \
    GSM4812353_S1_L001_R1_001.fastq.gz \
    GSM4812353_S1_L001_R2_001.fastq.gz \
    ... \
    GSM4812353_S1_L002_R1_001.fastq.gz \
    GSM4812353_S1_L002_R2_001.fastq.gz
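
If the file list is assembled programmatically, the ordering rule can be enforced with a small helper. This is a sketch that assumes the Illumina-style naming seen above (`_L<lane>_R<read>_001.fastq.gz`); order_fastqs is a hypothetical name, not a kb-python function.

```python
import re

# Assumed naming: <sample>_S<n>_L<lane>_R<read>_<chunk>.fastq.gz
PATTERN = re.compile(r"_L(\d+)_R([12])_\d+\.fastq\.gz$")

def order_fastqs(paths):
    """Sort FASTQ paths by lane, with R1 before R2 inside each lane."""
    def key(path):
        match = PATTERN.search(path)
        if match is None:
            raise ValueError(f"unexpected FASTQ name: {path}")
        lane, read = match.groups()
        return (int(lane), int(read))
    return sorted(paths, key=key)

files = [
    "GSM4812353_S1_L002_R2_001.fastq.gz",
    "GSM4812353_S1_L001_R2_001.fastq.gz",
    "GSM4812353_S1_L002_R1_001.fastq.gz",
    "GSM4812353_S1_L001_R1_001.fastq.gz",
]
ordered = order_fastqs(files)
```

The resulting list can be appended to the kb count command in order. Files that do not match the assumed pattern raise an error instead of being silently misplaced.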

The output directory is structured roughly as follows:

├── 10x_version2_whitelist.txt
├── counts_filtered
│   ├── adata.h5ad
│   ├── cells_x_genes.barcodes.txt
│   ├── cells_x_genes.genes.names.txt
│   ├── cells_x_genes.genes.txt
│   └── cells_x_genes.mtx
├── counts_unfiltered
│   ├── adata.h5ad
│   ├── cellranger
│   │   ├── barcodes.tsv
│   │   ├── genes.tsv
│   │   └── matrix.mtx
│   ├── cells_x_genes.barcodes.txt
│   ├── cells_x_genes.genes.names.txt
│   ├── cells_x_genes.genes.txt
│   └── cells_x_genes.mtx
├── filter_barcodes.txt
├── inspect.json
├── kb_info.json
├── matrix.ec
├── output.bus
├── output.filtered.bus
├── output.unfiltered.bus
├── run_info.json
└── transcripts.txt

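counts_filtered/adata.h5ad can serve as the starting point for downstream analysis.

Because we set the --cellranger flag, the output directory also contains results in the Cell Ranger layout.

I do not recommend generating a report with --report; it raised many errors when I tried it.

10x snRNA-seq

We use the 10x dataset "1k Brain Nuclei from an E18 Mouse" as an example.

We will generate two matrices, one for spliced transcripts and one for unspliced transcripts, and then add them together to obtain the total nuclear transcript counts.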
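
The summation step is an element-wise addition of two count matrices with matching cells and genes. A toy sketch with sparse counts stored as dictionaries follows; the barcodes, genes, and values are invented, and a real analysis would more likely add the matrices with a sparse-matrix library.

```python
from collections import Counter

# Hypothetical sparse counts keyed by (cell barcode, gene)
spliced = Counter({("AAAC", "Gapdh"): 3, ("GGGT", "Actb"): 1})
unspliced = Counter({("AAAC", "Gapdh"): 2, ("AAAC", "Actb"): 4})

# Counter addition sums entries that share a key and keeps the rest.
total = spliced + unspliced
```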

First, download the data:

wget https://caltech.box.com/shared/static/j337aflq9ublmwaripkepob41mr23216.txt -O checksums.txt
wget https://caltech.box.com/shared/static/2j8shgwmalzcjawuow51678a8yssvdef.gz -O nuclei_900_S1_L001_R1_001.fastq.gz
wget https://caltech.box.com/shared/static/k2yydqlz2jtckw1shk5h536mxn47nm9n.gz -O nuclei_900_S1_L001_R2_001.fastq.gz
wget https://caltech.box.com/shared/static/tlqdm0w3tvy8ogyktsz7ahggwurc6kkj.gz -O nuclei_900_S1_L002_R1_001.fastq.gz
wget https://caltech.box.com/shared/static/gqrvkqllr9d7zq4e3yfrng9kgfbejowe.gz -O nuclei_900_S1_L002_R2_001.fastq.gz

Verify that the downloaded files are intact:

md5sum -c checksums.txt --ignore-missing
# nuclei_900_S1_L001_R1_001.fastq.gz: OK
# nuclei_900_S1_L001_R2_001.fastq.gz: OK
# nuclei_900_S1_L002_R1_001.fastq.gz: OK
# nuclei_900_S1_L002_R2_001.fastq.gz: OK
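
On systems without md5sum, roughly the same check can be done with Python's standard library. The sketch below assumes the checksum file uses the usual md5sum layout of one `<md5>  <filename>` pair per line; md5_of and verify are illustrative helper names.

```python
import hashlib
from pathlib import Path

def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(checksum_file):
    """Check each '<md5>  <filename>' line; skip files that are missing."""
    results = {}
    base = Path(checksum_file).parent
    for line in Path(checksum_file).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")       # '*' marks binary mode in md5sum
        target = base / name
        if target.exists():                   # mirrors --ignore-missing
            results[name] = md5_of(target) == expected
    return results
```

Calling verify("checksums.txt") returns a dictionary mapping each file that was found to True or False.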

Download the reference genome files; if you have already downloaded them, use your local copies instead:

wget ftp://ftp.ensembl.org/pub/release-98/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
wget ftp://ftp.ensembl.org/pub/release-98/gtf/mus_musculus/Mus_musculus.GRCm38.98.gtf.gz

Build the mouse cDNA and intron index using the nac workflow, which also allows RNA velocity analysis later on:

kb ref -i index.idx -g t2g.txt \
    -f1 cdna.fa -f2 intron.fa \
    -c1 cdna_t2c.txt -c2 intron_t2c.txt \
    --workflow nac \
    Mus_musculus.GRCm38.dna.primary_assembly.fa.gz \
    Mus_musculus.GRCm38.98.gtf.gz

Quantification:

kb count -i index.idx \
    -g t2g.txt -c1 cdna_t2c.txt \
    -c2 intron_t2c.txt -x 10xv2 \
    -o output -t 2 --workflow nac --h5ad \
    nuclei_900_S1_L001_R1_001.fastq.gz \
    nuclei_900_S1_L001_R2_001.fastq.gz \
    nuclei_900_S1_L002_R1_001.fastq.gz \
    nuclei_900_S1_L002_R2_001.fastq.gz

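10x Feature Barcode

We use the Kallisto Indexing and Tag Extraction (KITE) workflow to preprocess and analyze the 10x Genomics pbmc_1k_protein_v3 feature barcoding dataset.

Feature barcoding assays build on scRNA-seq: they measure gene expression together with cell-surface protein levels in a large number of single cells, recording the cellular data as short DNA sequences.

The KITE workflow generates a mismatch map containing every feature barcode sequence used in the experiment along with all of its single-base mismatches.

The mismatch map is used to produce a transcript-to-gene mapping file (.t2g) and a FASTA file, which serve as input to kallisto.

Once an index has been built with kallisto index, kallisto|bustools can efficiently search the sequencing data for the sequences in the mismatch map.

Download the data: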

wget -q https://caltech.box.com/shared/static/asmj4nu90ydhsrk3pm7aaxu00cnnfige.txt -O checksums.txt
wget -q https://caltech.box.com/shared/static/mp2vr3p6dztdyatuag8ir3cektmrztg8.gz -O pbmc_1k_protein_v3_antibody_S2_L001_R1_001.fastq.gz
wget -q https://caltech.box.com/shared/static/f3payi1za7mn0jfai7vm10sy3yqwgpqh.gz -O pbmc_1k_protein_v3_antibody_S2_L001_R2_001.fastq.gz
wget -q https://caltech.box.com/shared/static/e112bbczh9o1rl6gfin36bqp0ga7uvdy.gz -O pbmc_1k_protein_v3_antibody_S2_L002_R1_001.fastq.gz
wget -q https://caltech.box.com/shared/static/3ve2axc8dr8v5nnrhmynrdgpqj6xg42k.gz -O pbmc_1k_protein_v3_antibody_S2_L002_R2_001.fastq.gz

Check the integrity of the data:

md5sum -c checksums.txt --ignore-missing

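Building the mismatch index

kb can generate a FASTA file containing all feature barcode variants at a Hamming distance of less than 2, and build a kallisto index of these sequences.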
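
To show what "all variants at a Hamming distance of less than 2" means, here is a minimal sketch that enumerates a barcode together with its single-base mismatches and renders them as FASTA. It is not kb-python's implementation: the record naming scheme is hypothetical, and the sketch ignores details such as collisions between variants of different barcodes.

```python
def mismatch_variants(barcode, alphabet="ACGT"):
    """Return the barcode plus every sequence at Hamming distance 1 from it."""
    variants = [barcode]
    for i, original in enumerate(barcode):
        for base in alphabet:
            if base != original:
                variants.append(barcode[:i] + base + barcode[i + 1:])
    return variants

def mismatch_fasta(features):
    """Render a {name: barcode} dict as FASTA text, one record per variant."""
    lines = []
    for name, barcode in features.items():
        for n, seq in enumerate(mismatch_variants(barcode)):
            lines.append(f">{name}-{n}")   # hypothetical record naming scheme
            lines.append(seq)
    return "\n".join(lines) + "\n"

variants = mismatch_variants("ACG")
```

A barcode of length n yields 1 + 3n sequences, so a 15-base feature barcode expands to 46 entries.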

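To do this, we first need to prepare a TSV file whose first column contains the feature barcode sequences and whose second column contains the feature barcode names.

First, download the feature reference file provided by 10x Genomics: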

wget -q http://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_protein_v3/pbmc_1k_protein_v3_feature_ref.csv

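Use pandas to convert it into the input format that kb-python accepts:

import pandas as pd

# Load the 10x feature reference (one row per feature barcode).
df = pd.read_csv('pbmc_1k_protein_v3_feature_ref.csv')
# Write sequence and name as a headerless, tab-separated two-column file.
df[['sequence', 'id']].to_csv('features.tsv',
        index=False, header=False, sep='\t')

Build the index: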

kb ref -i mismatch.idx -f1 mismatch.fa -g t2g.txt --workflow kite features.tsv

Quantification:

kb count --h5ad -i mismatch.idx \
    -g t2g.txt -x 10xv3 --workflow kite -t 8 \
    pbmc_1k_protein_v3_antibody_S2_L001_R1_001.fastq.gz \
    pbmc_1k_protein_v3_antibody_S2_L001_R2_001.fastq.gz \
    pbmc_1k_protein_v3_antibody_S2_L002_R1_001.fastq.gz \
    pbmc_1k_protein_v3_antibody_S2_L002_R2_001.fastq.gz

Drop-seq

The GSE178612 dataset studies the interaction between FoxM1 and Rb in mouse breast cancer.

We pick one sample, GSM5394388, and download its raw data:

prefetch SRR14872449

Convert the SRA file to FASTQ:

fastq-dump SRR14872449/SRR14872449.sra --split-files --gzip -O SRR14872449

Quantify expression using the mouse index built earlier:

kb count -i mm10.standard.idx \
    -g t2g.txt -x DROPSEQ --workflow standard -t 8 \
    --h5ad --cellranger --filter bustools \
    SRR14872449/SRR14872449_1.fastq.gz \
    SRR14872449/SRR14872449_2.fastq.gz

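inDrops

The GSE111672 dataset contains single-cell RNA-seq and spatial transcriptomics data from six primary pancreatic cancer tissues.

We pick an arbitrary sample, GSM3036909, and download its raw data: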

prefetch SRR6825055

Convert the SRA file to FASTQ:

fastq-dump SRR6825055/SRR6825055.sra --split-files --gzip -O SRR6825055

Here R1 is 35 bases long and R2 is 51 bases long, which corresponds to version 2 of inDrops.
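
Read lengths like these can be checked directly from the gzipped FASTQ files. The following is a small sketch using only the standard library; read_lengths is a hypothetical helper, and sampling the first 1000 reads is an arbitrary choice.

```python
import gzip
from collections import Counter
from itertools import islice

def read_lengths(fastq_gz, n_reads=1000):
    """Tally sequence lengths over the first n_reads records of a FASTQ file."""
    lengths = Counter()
    with gzip.open(fastq_gz, "rt") as handle:
        # A FASTQ record spans 4 lines; the sequence is the 2nd line.
        for i, line in enumerate(islice(handle, 4 * n_reads)):
            if i % 4 == 1:
                lengths[len(line.rstrip("\n"))] += 1
    return lengths
```

For example, read_lengths("SRR6825055/SRR6825055_1.fastq.gz") returns a Counter mapping each observed read length to the number of sampled reads with that length.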

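Differences between the versions:

v1: the original design, where R2 is the cDNA sequence and R1 is the metadata (UMI and barcode).
v2: the inverse of v1, with the contents of R1 and R2 swapped.
v3: redesigned in summer 2016; requires manual demultiplexing. R1 is the cDNA sequence, R2 contains the first half of the gel barcode, R3 contains the library index, and R4 contains the second half of the gel barcode, the UMI, and part of the polyA tail.

Download the prebuilt human reference index: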

mkdir human
kb ref -d human \
    -i human/kb_ref.idx \
    -g human/t2g.txt

Quantify expression:

kb count -i human/kb_ref.idx \
    -g human/t2g.txt -x INDROPSV2 \
    --workflow standard -t 8 \
    --h5ad --cellranger --filter bustools \
    SRR6825055/SRR6825055_1.fastq.gz \
    SRR6825055/SRR6825055_2.fastq.gz

SMART-seq2

From GSE99951, a dataset on the maturation of astrocytes in human cortical spheroids, we download sample GSM2665701 for analysis:

prefetch SRR5676730

Convert the SRA file to FASTQ:

fastq-dump SRR5676730/SRR5676730.sra --split-files --gzip -O SRR5676730

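Quantify expression. The --filter option is not supported for this technology, and the --parity option is required:

kb count -i human/kb_ref.idx \
    -g human/t2g.txt -x SMARTSEQ2 -t 8 \
    --parity paired --workflow standard \
    --h5ad --cellranger \
    SRR5676730/SRR5676730_1.fastq.gz \
    SRR5676730/SRR5676730_2.fastq.gz

Other sequencing technologies follow essentially the same patterns as the examples above and are left for you to explore.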