数据分析：基于STAR+FeatureCounts的RNA-seq分析全流程流程

请添加图片描述

流程主要包含两部分组成：

第一部分：二代测序数据的Raw data的fastq文件转换成Gene Count或者Features Counts表（行是Features，列是样本名）；
第二部分：对counts 表进行统计分析，并对其生物学意义或者临床意义进行解读。

Installating Software

分析流程涉及到众多的软件以及R包等，为了更方便配置该环境，建议使用anaconda软件安装。anaconda是包管理工具，可以将软件作为其包进行安装管理，并且可以设置多个环境，方便不同依赖环境的软件在同一台机器安装。安装anaconda方法见网上教程。

流程所需要软件：Use conda search software before conda install

conda install -c bioconda fastqc --yes
conda install -c bioconda trim-galore --yes
conda install -c sortmerna --yes
conda install -c bioconda star --yes
conda install -c bioconda subread --yes
conda install -c bioconda multiqc --yes
conda install -c bioconda r-base

流程所需要的R包：

# method1
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

RNA_seq_packages <- function(){
metr_pkgs <- c("DESeq2", "ggplot2", "clusterProfiler", "biomaRt", "ReactomePA", "DOSE", 
               "KEGG.db", "KEGG.db", "org.Mm.eg.db", "org.Hs.eg.db", "pheatmap",
               "genefilter","RColorBrewer", "GO.db", "topGO","dplyr","gage","ggsci")
list_installed <- installed.packages()
new_pkgs <- subset(metr_pkgs, !(metr_pkgs %in% list_installed[, "Package"]))
if(length(new_pkgs)!=0){if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
        BiocManager::install(new_pkgs)
        print(c(new_pkgs, " packages added..."))
    }
if((length(new_pkgs)<1)){
        print("No new packages added...")
    }
}
RNA_seq_packages()

# method2
install.packages("pacman")
pacman::p_load(c("DESeq2", "ggplot2", "clusterProfiler", "biomaRt", "ReactomePA", "DOSE", 
               "KEGG.db", "KEGG.db", "org.Mm.eg.db", "org.Hs.eg.db", "pheatmap",
               "genefilter","RColorBrewer", "GO.db", "topGO","dplyr","gage","ggsci"))

The Folder Structure

── RNA_seq_dir/
  │   └── annotation/               <- Genome annotation file (.GTF/.GFF)
  │  
  │   └── genome/                   <- Host genome file (.FASTA)
  │  
  │   └── raw_data/                 <- Location of input  RNAseq data
  │  
  │   └── output/                   <- Data generated during processing steps
  │       ├── 01.fastqc/            <- Main alignment files for each sample
  │       ├── 02.trim/              <-  Log from running STAR alignment step
  │       ├── 03.sortrRNA/          <- STAR alignment counts output (for comparison with featureCounts)
  │           ├── aligned/          <-  Sequences that aligned to rRNA databases (rRNA contaminated)
  │           ├── filtered/         <-  Sequences with rRNA sequences removed  (rRNA-free)
  │           ├── logs/             <- logs from running SortMeRNA
  │       ├── 04.aligment/          <- Main alignment files for each sample
  │           ├── aligned_bam/      <-  Alignment files generated from STAR (.BAM)
  │           ├── aligned_logs/     <- Log from running STAR alignment step
  │       ├── 05.genecount/         <- Summarized gene counts across all samples
  │       ├── 06.multiQC/           <- Overall report of logs for each step
  │  
  │   └── sortmerna_db/             <- Folder to store the rRNA databases for SortMeRNA
  │       ├── index/                <- indexed versions of the rRNA sequences for faster alignment
  │       ├── rRNA_databases/       <- rRNA sequences from bacteria, archea and eukaryotes
  │  
  │   └── star_index/               <-  Folder to store the indexed genome files from STAR

Download Host Genome

下载基因组以及基因组注释信息，前者是fasta格式，后者是GTF或者GFF等格式，两者的版本要是同一版本。UCSC、ENSEMBL、NCBI和GENCODE提供了多个下载网址。注意每个网址对同一物种基因组的命名，它会反映出版本不同。下载压缩文件gz后，可以使用gunzip解压。

GENCODE:

# Download genome fasta file into the genome/ folder
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_mouse/release_M12/GRCm38.p5.genome.fa.gz

# Download annotation file into the annotation/ folder
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_mouse/release_M12/gencode.vM12.annotation.gtf.gz

ENSEMBL:

# download genome 
wget ftp://ftp.ensembl.org/pub/release-100/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary.fa.gz
# download annotation file
wget ftp://ftp.ensembl.org/pub/release-100/gtf/mus_musculus/Mus_musculus.GRCm38.100.gtf.gz

The Procedures of Analysis pipeline

所需软件安装完成后，可以通过which命令查看是否已经export在环境中。分析流程主要包含11步，其中1-6步是fastq数据质控以及注释；7-12步是简单的统计分析；后续会扩展其他分析。

Step1: Quality Control by Fastqc

fastqc能够对二代测序数据的原始数据fastq进行质控检测，它能从多方面反映出fastq数据的质量情况（如碱基质量分布和GC含量等）

fastqc -o results/01.fastqc/ --noextract  raw_data/sampleid.fq.gz

Step2: Removing Low Quality Sequences with Trim_Galore

Trim_Galore是一款控制reads质量并且能够移除接头的软件。在第一步了解fastq质量后，设定过滤fastq碱基质量的阈值，对fastq数据进行质控。Trim_Galore包含了cutadapt和fastqc，前者目的是移除接头，或者是获取reads质量情况再根据阈值过滤低质量的reads。

trim_galore \
--quality 20 \
--fastqc \
--length 25 \
--output_dir results/02.trim/ \
input/sample.fastq

Step3: Removing rRNA Sequences with SortMeRNA

SortMeRNA是一款在宏基因组和宏转录组领域内对二代测序数据进行过滤、比对和OTU聚类的软件，其核心算法是根据种子序列快速比对敏感序列，该软件的目的是过滤宏转录组数据的核糖体DNA序列。在使用该软件前，需要下载核糖体DNA序列（fasta格式）并对DNA序列进行建立比对索引。(在测试过程，发现耗时很久，并且存在会对db报错，暂时没有解决)

# download the rRNA DNA sequences
wget https://github.com/biocore/sortmerna/tree/master/data/rRNA_databases/*fasta

# build index 
STAR \
	--runThreadN 6 \
	--runMode genomeGenerate \
	--genomeDir index/silva-euk-28s-id98 \
	--genomeFastaFiles ./silva-euk-28s-id98.fasta
# remove rRNA sequence 
rm -rf /home/sortmerna/run/  # debug for exist db
sortmerna \
	--ref /home/database/sortmerna_db/silva-bac-16s-id90.fasta \
	--reads ./result/02.trim/TAC_NC04_RNA_b1.r1_val_1.fq.gz \
	--reads ./result/02.trim/TAC_NC04_RNA_b1.r2_val_2.fq.gz \
	--aligned ./result/03.sortrRNA/TAC_NC04_aligned \
	--other ./result/03.sortrRNA/TAC_NC04_filtered \
	--fastx

Step4: Aligning to Genome with STAR-aligner

STAR(Spliced Transcripts Alignment to a Reference)是基于Sequential maximum mappable seed search方法将RNA_seq数据比对到参考基因组上的软件。

# create genome index for alignment
mkdir genome_index_star
STAR \
	--runThreadN 30 \
	--runMode genomeGenerate \
	--genomeDir genome_index_star \
	--genomeFastaFiles Mus_musculus.GRCm38.dna_sm.primary_assembly.fa \
	--sjdbGTFfile ./genome_annotation/Mus_musculus.GRCm38.100.gtf \
	--sjdbOverhang 99

# run aligning 
STAR \
	--genomeDir /home/database/Mus_musculus.GRCm38_release100/genome_index_star \
	--readFilesIn ./result/03.sortrRNA/HF07_filtered.fq \
	--runThreadN 10 \
	--outSAMtype BAM SortedByCoordinate \
	--outFileNamePrefix ./result/04.aligment_v2/HF07 \
	--quantMode GeneCounts #\
	#--readFilesCommand zcat        # for gz files

Step5: Summarizing Gene Counts with featureCounts (subread)

subread是一个包含多个高效处理二代测序数据的软件的包，其中featureCounts能够从比对结果SAM/BAM和注释信息文件（annotation files GTF）获取genomic features（genes, exons,promoter,gene bodies, genomic bins和chromosomal locations）等。

# bam/sam files
dirlist=$(ls -t ./result/04.aligment/*.bam | tr '\n' ' ')
# obtain the features
featureCounts \
-a /home/database/Mus_musculus.GRCm38_release100/genome_annotation/Mus_musculus.GRCm38.100.gtf -o /home/project/Heart_failure/Assay/00.RNA_seq/result/05.genecount/final_count_v2.tsv \
-t exon \
-g gene_name \
--primary \
-T 10 \
$dirlist

Step6: Generating analysis report with multiqc

MultiQC可以综合多个软件的日志文件最后形成一个统一的报告文件，方便后续审视结果。

multiqc ./result/ --outdir result/06.multiQC

Step7: Importing Gene Counts into R/Rstudio

在将数据导入R前，需要了解不同数据库对基因ID的编码方式。大部分基因都有自己的5种类型ID，特定的基因如miRNA在miRBase中有自己的ID (NCBI的entrez ID及gene symbol，Ensembl的gene ID，UCSC的gene ID，KEGG的gene ID)。ID之间的关系参考bioDBnet网址信息。

请添加图片描述

Entrez id: Entrez是NCBI是美国国立生物技术信息中心(National Center for BiotechnologyInformation)的检索系统。NCBI的Gene数据库包含了不同物种的基因信息，其中每一个基因都被编制一个唯一的识别号ID（因此不同生物或者同属不同种的生物间的同源基因编号也不相同）, 这个ID就叫做Entrez ID，就是基因身份证啦。目前通用的是Entrez id，也就是gene id。

Entrez的第一个版本由NCBI于1991年在CD -ROM上发布，当时核酸序列来自GenBank和Protein Data Bank（PDB）数据库，蛋白序列来自GenBank、Protein Information Resource (PIR)、SWISS-PROT、PDB以及Protein Research Foundation (PRF)数据库，还从MEDLINE数据库（现在是PubMed）整合了文献摘要。

Gene：基因序列注释+检索，目前共有61118个人类的记录，68389个小鼠的记录（含有功能基因、假基因、预测基因等）
Gene symbol: HUGO Gene Symbol(也叫做HGNC Symbol，即基因符号)是HGNC组织对基因进行命名描述的一个缩写标识符。
Ensembl id: 由英国的Sanger研究所以及欧洲生物信息学研究所（EMBI-EBI）联合共同协作开发的一套基因信息标记系统。 PS: 不同物种的基因ID是不同的。
HGNC id: HGNC（人类基因命名委员会）由美国国家人类基因组研究所（NHGRI）和 Wellcome Trust（英国）共同资助，其中的每个基因只有一个批准的基因symbol。HGNC ID是HGNC数据库分配的基因编号，每一个标准的Symbol都有对应的HGNC ID 。我们可以用这个编号，在HGNC数据库中搜索相关的基因。例如：HGNC:11998。
Gene Name: Gene Name是经过HGNC批准的全基因名称;对应于上面批准的符号（Gene Symbol）。例如TP53对应的Gene Name就是：tumor protein p53。
UCSC id: UCSC的ID以uc开头，BRCA1对应的就是uc002ict.4。

基因ID转换：

常常在Entrez gene ID与Ensembl gene ID以及gene ID与gene symbol之间转换，常用的工具就是
clusterProfiler：使用方法：`bitr(geneID, fromType, toType, OrgDb, drop = TRUE)``

``geneID是输入的基因ID，fromType是输入的ID类型，toType是输出的ID类型，OrgDb注释的db文件，drop`表示是否剔除NA数据
biomaRt：
mouse_mart <- useMart(host="www.ensembl.org",biomart="ENSEMBL_MART_ENSEMBL", 
                        dataset="mmusculus_gene_ensembl")
mouse_ensemble_gene_id <- gene_count_format$Geneid
mouse_gene_all <- getBM(attributes=c('ensembl_gene_id', 'entrezgene_id', "hgnc_symbol",
                                       'external_gene_name', 'mgi_symbol', 
                                       'transcript_biotype', 'description'),
                              filters='ensembl_gene_id',
                              values=mouse_ensemble_gene_id, mart=mouse_mart)

读入数据

### load packages
library(dplyr)
library(tibble)
library(data.table)
library(DESeq2)
library(stringr)
### load data 
prof <- fread("final_count_v2.tsv")
phen <- read.csv("phenotype_20200629.csv")
### curation data 
gene_count_format <- prof %>% dplyr::select(c("Geneid", ends_with("bam"))) %>% 
      dplyr::rename_at(vars(ends_with("bam")), funs(str_replace(., "./result/04.aligment/", ""))) %>%
      dplyr::rename_at(vars(ends_with("bam")), funs(str_replace(., "Aligned.sortedByCoord.out.bam", "")))

处理数据：Differential Expression Gene by DESeq2 packages

Deseq2fun <- function(metaData, countData, 
                      group_col=c("TAC", "TAC_NC"),
                      occurence=0.5, ncount=10){
  # metaData <- phen
  # countData <- gene_count_format
  # group_col <- c("TAC", "TAC_NC")
  # occurence <- 0.5
  # ncount <- 10
  
  # get overlap 
  sid <- intersect(metaData$SampleID, colnames(countData))
  meta <- metaData %>% filter(SampleID%in%sid) %>% 
        filter(Group%in%group_col) %>%
        mutate(Group=factor(Group, levels = group_col)) %>%
        column_to_rownames("SampleID") 
  
  # quality control
  count <- countData %>% column_to_rownames("Geneid") %>% 
    dplyr::select(rownames(meta)) %>% 
    rownames_to_column("Type") %>% 
    filter(apply(dplyr::select(., -one_of("Type")), 1, function(x){sum(x != 0)/length(x)}) > occurence) %>%
            data.frame(.) %>% 
    column_to_rownames("Type")
  count <- count[rowSums(count) > ncount, ]
  
  # judge no row of profile filter
	if (nrow(count) == 0) {
		stop("No row of profile to be choosed\n")
	}
  
  # determine the right order between profile and phenotype 
  for(i in 1:ncol(count)){ 
    if (!(colnames(count) == rownames(meta))[i]) {
      stop(paste0(i, " Wrong"))
    }
  }
  
  dds <- DESeqDataSetFromMatrix(countData=count, 
                              colData=meta,
                              design=~Group)
  
  dds <- DESeq(dds)
  res <- results(dds, pAdjustMethod = "fdr", alpha = 0.05) %>% na.omit()
  res <- res[order(res$padj), ]
  return(list(dds=dds,results=res))
}

TAC_dge <- Deseq2fun(phen, gene_count_format)
TAC_dge_result <- TAC_dge$results 
TAC_dge_dds <- TAC_dge$dds
summary(TAC_dge_result)

Step8: Annotate Gene Symbols

除了上述两种ID转换方法，还存在其他ID转化方法。该方法结合DESeq2结果文件获取其他ID。使用org.Mm.eg.db包的mapIDs函数。

library(org.Mm.eg.db) 

# Add gene full name
TAC_dge_result$description <- mapIds(x = org.Mm.eg.db,
                              keys = row.names(TAC_dge_result),
                              column = "GENENAME",
                              keytype = "SYMBOL",
                              multiVals = "first")

# Add gene symbol
TAC_dge_result$symbol <- row.names(TAC_dge_result)

# Add ENTREZ ID
TAC_dge_result$entrez <- mapIds(x = org.Mm.eg.db,
                         keys = row.names(TAC_dge_result),
                         column = "ENTREZID",
                         keytype = "SYMBOL",
                         multiVals = "first")

# Add ENSEMBL
TAC_dge_result$ensembl <- mapIds(x = org.Mm.eg.db,
                          keys = row.names(TAC_dge_result),
                          column = "ENSEMBL",
                          keytype = "SYMBOL",
                          multiVals = "first")

# Subset for only significant genes (q < 0.05)
TAC_dge_sig <- subset(TAC_dge_result, padj < 0.05)

### output
# Write normalized gene counts to a .tsv file
write.table(x = as.data.frame(counts(TAC_dge_dds), normalized = T), 
            file = '../../Result/Profile/normalized_counts.tsv', 
            sep = '\t', 
            quote = F,
            col.names = NA)

# Write significant normalized gene counts to a .tsv file
write.table(x = counts(TAC_dge_dds[row.names(TAC_dge_sig)], normalized = T),
            file = '../../Result/Profile/normalized_counts_significant.tsv',
            sep = '\t',
            quote = F,
            col.names = NA)

# Write the annotated results table to a .tsv file
write.table(x = as.data.frame(TAC_dge_result), 
            file = "../../Result/Profile/results_gene_annotated.tsv", 
            sep = '\t', 
            quote = F,
            col.names = NA)

# Write significant annotated results table to a .tsv file
write.table(x = as.data.frame(TAC_dge_sig), 
            file = "../../Result/Profile/results_gene_annotated_significant.tsv", 
            sep = '\t', 
            quote = F,
            col.names = NA)

Step9: Plotting Gene Expression Data

从整体上看不同组的样本基因表达是否呈现自我成簇情况，这需要用到降维方法，通常适合转录组数据的降维方法有PCA和Rtsne等，这里使用PCA方法。后续会扩展其他方法。

library(ggplot2)
# Convert all samples to rlog
ddsMat_rlog <- rlog(TAC_dge_dds, blind = FALSE)

# Plot PCA by column variable
plotPCA(ddsMat_rlog, intgroup = "Group", ntop = 500) +
  geom_point(label=colnames(TAC_dge_dds), size = 5) + # Increase point size
  geom_hline(yintercept = 0, linetype = 2) + 
  geom_vline(xintercept = 0, linetype = 2) + 
  scale_y_continuous(limits = c(-20, 20)) + # change limits to fix figure dimensions
  ggtitle(label = "Principal Component Analysis (PCA)", 
          subtitle = "Top 500 most variable genes") +
  theme_bw() # remove default ggplot2 theme

基因表达情况还可以使用热图展示。

# Gather 30 significant genes and make matrix
mat <- assay(ddsMat_rlog[row.names(TAC_dge_sig)])[1:40, ]

# Choose which column variables you want to annotate the columns by.
annotation_col = data.frame(
  Group = factor(colData(ddsMat_rlog)$Group), 
  row.names = rownames(colData(ddsMat_rlog))
)

# Specify colors you want to annotate the columns by.
ann_colors = list(
  Group = c(TAC_NC = "lightblue", TAC = "darkorange")
)

library(pheatmap)
library(RColorBrewer)
# Make Heatmap with pheatmap function.
pheatmap(mat = mat, 
         color = colorRampPalette(brewer.pal(9, "YlOrBr"))(255), 
         scale = "row", # Scale genes to Z-score (how many standard deviations)
         annotation_col = annotation_col, # Add multiple annotations to the samples
         annotation_colors = ann_colors,# Change the default colors of the annotations
         fontsize = 10, # Make fonts smaller
         cellwidth = 10, # Make the cells wider
         show_rownames = T,
         show_colnames = T)

火山图能反应组间基因表达情况，通常分成3组：up-regulated, down-regulated 和 nonsignificant

# Gather Log-fold change and FDR-corrected pvalues from DESeq2 results
## - Change pvalues to -log10 (1.3 = 0.05)
data <- data.frame(gene = row.names(TAC_dge_result),
                   pval = -log10(TAC_dge_result$padj), 
                   lfc = TAC_dge_result$log2FoldChange) 

# Remove any rows that have NA as an entry
data <- na.omit(data)

# Color the points which are up or down
## If fold-change > 0 and pvalue > 1.3 (Increased significant)
## If fold-change < 0 and pvalue > 1.3 (Decreased significant)
data <- mutate(data, color = case_when(data$lfc > 0 & data$pval > 1.3 ~ "Increased",
                                       data$lfc < 0 & data$pval > 1.3 ~ "Decreased",
                                       data$pval < 1.3 ~ "nonsignificant"))

# Make a basic ggplot2 object with x-y values
ggplot(data, aes(x = lfc, y = pval, color = color)) + 
  ggtitle(label = "Volcano Plot", subtitle = "Colored by fold-change direction") +
  geom_point(size = 2.5, alpha = 0.8, na.rm = T) +
  scale_color_manual(name = "Directionality",
                     values = c(Increased = "#008B00", Decreased = "#CD4F39", nonsignificant = "darkgray")) +
  theme_bw(base_size = 14) + # change overall theme
  theme(legend.position = "right") + # change the legend
  xlab(expression(log[2]("TAC_NC" / "TAC"))) + # Change X-Axis label
  ylab(expression(-log[10]("adjusted p-value"))) + # Change Y-Axis label
  geom_hline(yintercept = 1.3, colour = "darkgrey") + # Add p-adj value cutoff line
  scale_y_continuous(trans = "log1p") # Scale yaxis due to large p-values

Step10: Finding Pathways from Differential Expressed Genes

通路富集分析（pathway enrichment analysis）是根据单个基因变化产生总体结论的好方法。有时个体基因变化或大或小，均难以解释。但是通过分析基因涉及的代谢途径，我们可以在更高层次上解释处理因素对基因的影响。常用富集分析的R包clusterProfiler。

# Remove any genes that do not have any entrez identifiers
results_sig_entrez <- subset(TAC_dge_sig, is.na(entrez) == FALSE)

# Create a matrix of gene log2 fold changes
gene_matrix <- results_sig_entrez$log2FoldChange

# Add the entrezID's as names for each logFC entry
names(gene_matrix) <- results_sig_entrez$entrez

# KEGG
library(clusterProfiler)
kegg_enrich <- enrichKEGG(gene = names(gene_matrix),
                          organism = 'mouse',
                          pvalueCutoff = 0.05, 
                          qvalueCutoff = 0.10)

# Plot results
barplot(kegg_enrich, 
        drop = TRUE, 
        showCategory = 10, 
        title = "KEGG Enrichment Pathways",
        font.size = 8)

# GO 
go_enrich <- enrichGO(gene = names(gene_matrix),
                      OrgDb = 'org.Mm.eg.db', 
                      readable = T,
                      ont = "BP",
                      pvalueCutoff = 0.05, 
                      qvalueCutoff = 0.10)

# Plot results
barplot(go_enrich, 
        drop = TRUE, 
        showCategory = 10, 
        title = "GO Biological Pathways",
        font.size = 8)

Step11: Plotting KEGG Pathways

Pathview是一个软件包，可以使用KEGG标识符和对发现明显不同的基因进行覆盖倍数变化。 Pathview还可以与在KEGG数据库中找到的其他生物一起工作，并且可以绘制出特定生物的任何KEGG途径。

library(pathview)
pathview(gene.data = gene_matrix, 
         pathway.id = "04070", 
         species = "mouse")

Step12: Single Sample Gene-Set Enrichment Analysis

Single-sample GSEA (ssGSEA), an extension of Gene Set Enrichment Analysis (GSEA), calculates separate enrichment scores for each pairing of a sample and gene set. Each ssGSEA enrichment score represents the degree to which the genes in a particular gene set are coordinately up- or down-regulated within a sample.

ssGSEA富集分数表示通路的基因在样本中高低表达的程度，可以替代表达值。

# load packages
library(dplyr)
library(tibble)
library(data.table)
library(GSEABase)
library(GSVA)

# load data 
prof <- fread("../../Result/Profile/final_gene_hgnc.profile")
phen <- read.csv("../../Result/Phenotype/Heart_failure_phenotype_20200629.csv")
gene_set <- qusage::read.gmt("../../Result/GeneSetdb/msigdb.v7.1.symbols_v2.gmt") # msigdb: geneset

# Build ExpressionSet object 
get_expr_Set <- function(assay, meta){
  
  require(convert)
  colnames(assay)[1] <- "name"
  assay <- assay[rowSums(assay[, -1]) > 0, ]
  dat_assay <- setDT(assay)[, lapply(.SD, mean), keyby = name] %>% 
    column_to_rownames("name") 
  #if(length(subtype)>0){
  #  dat_meta <- meta %>% filter(Group%in%subtype)
  #}else{
    dat_meta <- meta
  #}
  sid <- intersect(dat_meta$SampleID, colnames(dat_assay))
  dat_meta.cln <- dat_meta %>% filter(SampleID%in%sid) %>%
    column_to_rownames("SampleID")
  dat_assay.cln <- dat_assay %>% rownames_to_column("tmp") %>%
    dplyr::select(c(tmp, rownames(dat_meta.cln))) %>%
    column_to_rownames("tmp")
  #dat_meta.cln$SampleID == colnames(dat_assay.cln)
  exprs <- as.matrix(dat_assay.cln)
  adf <-  new("AnnotatedDataFrame", data=dat_meta.cln)
  experimentData <- new("MIAME",
        name="Jin Chao", lab="Gong gdl Lab",
        contact="dong_ming@grmh-gdl.cn",
        title="Heart-failure Experiment",
        abstract="The gene ExpressionSet",
        url="www.grmh-gdl.cn",
        other=list(notes="Created from text files"))
  expressionSet <- new("ExpressionSet", exprs=exprs,
                       phenoData=adf, 
                       experimentData=experimentData)
  
  return(expressionSet)
}
gene_expr_set <- get_expr_Set(prof, phen)

# choose gene_set 
gene_set_KEGG <- gene_set[grep("^KEGG", names(gene_set))]

# ssgsea by GSVA packages
heartfailure_ssgsea <- gsva(gene_expr_set, 
                               gene_set_KEGG,
                               method="ssgsea", 
                               min.sz=5, 
                               max.sz=500,
                               kcdf="Poisson")

exprs(heartfailure_ssgsea) %>% t() %>% data.frame() %>% rownames_to_column("tmp") %>% arrange(tmp) %>% column_to_rownames("tmp") %>% t() %>% data.frame() -> dat2

# heatmap 
library(pheatmap)
library(RColorBrewer)
pheatmap(mat = dat2[c(1:20), ], 
         color = colorRampPalette(brewer.pal(9, "YlOrBr"))(255), 
         scale = "row", # Scale genes to Z-score (how many standard deviations)
         # annotation_col = annotation_col, # Add multiple annotations to the samples
         # annotation_colors = ann_colors,# Change the default colors of the annotations
         cluster_row = FALSE, 
         #cluster_cols = FALSE,
         fontsize = 10, # Make fonts smaller
         cellwidth = 15, # Make the cells wider
         show_colnames = T)