Question

Tutorial:Bulk RNA-seq: Different Expression Analysis with DEseq2

1

Entering edit mode

3 months ago

Julia Ma ▴ 120

Content taken verbatim from: https://omicverse.readthedocs.io/en/latest/Tutorials-bulk/t_deseq2/

An important task of bulk rna-seq analysis is the different expression , which we can perform with omicverse. For different expression analysis, ov change the gene_id to gene_name of matrix first.

Now we can use PyDEseq2 to perform DESeq2 analysis like R

Paper: PyDESeq2: a python package for bulk RNA-seq differential expression analysis

Code: https://github.com/owkin/PyDESeq2

Colab_Reproducibility: https://colab.research.google.com/drive/1fZS-v0zdIYkXrEoIAM1X5kPoZVfVvY5h?usp=sharing

import omicverse as ov
ov.utils.ov_plot_set()

/Users/fernandozeng/miniforge3/envs/scbasset/lib/python3.8/site-packages/phate/__init__.py
```py


Note that this dataset has not been processed in any way and is only exported by featureCounts, and Sequence alignment was performed from the genome file of CRCm39

```py
data=ov.utils.read('https://raw.githubusercontent.com/Starlitnightly/Pyomic/master/sample/counts.txt',index_col=0,header=1)
#replace the columns `.bam` to `` 
data.columns=[i.split('/')[-1].replace('.bam','') for i in data.columns]
data.head()

	1--1	1--2	2--1	2--2	3--1	3--2	4--1	4--2	4-3	4-4	Blank-1	Blank-2
Geneid
---	---	---	---	---	---	---	---	---	---	---	---	---
ENSMUSG00000102628	0	0	0	0	5	0	0	0	0	0	0	9
ENSMUSG00000100595	0	0	0	0	0	0	0	0	0	0	0	0
ENSMUSG00000097426	5	0	0	0	0	0	0	1	0	0	0	0
ENSMUSG00000104478	0	0	0	0	0	0	0	0	0	0	0	0
ENSMUSG00000104385	0	0	0	0	0	0	0	0	0	0	0	0

ID mapping

We performed the gene_id mapping by the mapping pair file GRCm39 downloaded before.

ov.utils.download_geneid_annotation_pair()

data=ov.bulk.Matrix_ID_mapping(data,'genesets/pair_GRCm39.tsv')
data.head()

	1--1	1--2	2--1	2--2	3--1	3--2	4--1	4--2	4-3	4-4	Blank-1	Blank-2
Gm14845	115	116	84	86	133	170	130	105	91	127	124	99
Hdc	97	579	123	172	571	119	106	28	217	156	2	51
H2bu2	60	59	58	22	71	73	75	138	55	38	18	53
Gm6693	0	0	0	0	0	0	0	0	0	0	0	0
Rnd3	2423	2289	1996	1750	2304	2669	2952	2109	2030	2026	875	2555

Different expression analysis with ov

We can do differential expression analysis very simply by ov, simply by providing an expression matrix. To run DEG, we simply need to:

Read the raw count by featureCount or any other qualify methods.
Create an ov DEseq object.

dds=ov.bulk.pyDEG(data)

We notes that the gene_name mapping before exist some duplicates, we will process the duplicate indexes to retain only the highest expressed genes

dds.drop_duplicates_index()
print('... drop_duplicates_index success')

... drop_duplicates_index success

Now we can calculate the different expression gene from matrix, we need to input the treatment and control groups

treatment_groups=['4-3','4-4']
control_groups=['1--1','1--2']
result=dds.deg_analysis(treatment_groups,control_groups,method='DEseq2')


Fitting size factors...
... done in 0.00 seconds.

Fitting dispersions...
... done in 1.59 seconds.

Fitting dispersion trend curve...
... done in 2.82 seconds.

logres_prior=1.1538905878789707, sigma_prior=0.25
Fitting MAP dispersions...
... done in 1.57 seconds.

Fitting LFCs...
... done in 1.27 seconds.

Refitting 0 outliers.

Running Wald tests...
... done in 1.33 seconds.

Log2 fold change & Wald test p-value: condition Treatment vs Control

	baseMean	log2FoldChange	lfcSE	stat	pvalue	padj
Gm14845	111.727600	-0.049168	0.470660	-0.104467	0.916799	0.975241
Hdc	258.120455	-0.809097	1.116541	-0.724646	0.468669	0.789482
H2bu2	52.656807	-0.323968	0.652995	-0.496127	0.619805	0.877166
Gm6693	0.000000	NaN	NaN	NaN	NaN	NaN
Rnd3	2180.318184	-0.183828	0.190533	-0.964809	0.334641	0.690369
...	...	...	...	...	...	...
Gm18244	0.000000	NaN	NaN	NaN	NaN	NaN
Gm50317	0.000000	NaN	NaN	NaN	NaN	NaN
Olfr516	0.000000	NaN	NaN	NaN	NaN	NaN
Gm37042	0.000000	NaN	NaN	NaN	NaN	NaN
Prelid1	3335.335908	-0.032464	0.190413	-0.170493	0.864622	0.958729

54504 rows × 6 columns

One important thing is that we do not filter out low expression genes when processing DEGs, and in future versions I will consider building in the corresponding processing.

print(result.shape)
result=result.loc[result['log2(BaseMean)']>1]
print(result.shape)

(54504, 14)
(23377, 14)

We also need to set the threshold of Foldchange, we prepare a method named foldchange_set to finish. This function automatically calculates the appropriate threshold based on the log2FC distribution, but you can also enter it manually.

# -1 means automatically calculates
dds.foldchange_set(fc_threshold=-1,
                   pval_threshold=0.05,
                   logp_max=10)

... Fold change threshold: 1.6248531033651643

Visualize the DEG result and specific genes

To visualize the DEG result, we use plot_volcano to do it. This fuction can visualize the gene interested or high different expression genes. There are some parameters you need to input:

title: The title of volcano
figsize: The size of figure
plot_genes: The genes you interested
plot_genes_num: If you don't have interested genes, you can auto plot it.

dds.plot_volcano(title='DEG Analysis',figsize=(4,4),
                 plot_genes_num=8,plot_genes_fontsize=12,)

<Axes: title={'center': 'DEG Analysis'}, xlabel='$log_{2}FC$', ylabel='$-log_{10}(qvalue)$'>

enter image description here

To visualize the specific genes, we only need to use the dds.plot_boxplot function to finish it.

dds.plot_boxplot(genes=['Ckap2','Lef1'],treatment_groups=treatment_groups,
                control_groups=control_groups,figsize=(2,3),fontsize=12,
                 legend_bbox=(2,0.55))

(<Figure size 160x240 with 1 Axes>,
 <Axes: title={'center': 'Gene Expression'}>)

enter image description here

dds.plot_boxplot(genes=['Ckap2'],treatment_groups=treatment_groups,
                control_groups=control_groups,figsize=(2,3),fontsize=12,
                 legend_bbox=(2,0.55))



(<Figure size 160x240 with 1 Axes>,
 <Axes: title={'center': 'Gene Expression'}>)

enter image description here

Pathway enrichment analysis by Pyomic

Here we use the gseapy package, which included the GSEA analysis and Enrichment. We have optimised the output of the package and given some better looking graph drawing functions

Similarly, we need to download the pathway/genesets first. Five genesets we prepare previously, you can use Pyomic.utils.download_pathway_database() to download automatically. Besides, you can download the pathway you interested from enrichr: https://maayanlab.cloud/Enrichr/#libraries

ov.utils.download_pathway_database()

......Pathway Geneset download start: GO_Biological_Process_2021
......Loading dataset from genesets/GO_Biological_Process_2021.txt
......Pathway Geneset download start: GO_Cellular_Component_2021
......Loading dataset from genesets/GO_Cellular_Component_2021.txt
......Pathway Geneset download start: GO_Molecular_Function_2021
......Loading dataset from genesets/GO_Molecular_Function_2021.txt
......Pathway Geneset download start: WikiPathway_2021_Human
......Loading dataset from genesets/WikiPathway_2021_Human.txt
......Pathway Geneset download start: WikiPathways_2019_Mouse
......Loading dataset from genesets/WikiPathways_2019_Mouse.txt
......Pathway Geneset download start: Reactome_2022
......Loading dataset from genesets/Reactome_2022.txt
......Pathway Geneset download finished!
......Other Genesets can be dowload in `https://maayanlab.cloud/Enrichr/#libraries`

pathway_dict=ov.utils.geneset_prepare('genesets/WikiPathways_2019_Mouse.txt',organism='Mouse')

To perform the GSEA analysis, we need to ranking the genes at first. Using dds.ranking2gsea can obtain a ranking gene's matrix sorted by -log10(padj).

$Metric=\frac{-log_{10}(padj)}{sign(log2FC)}$

rnk=dds.ranking2gsea()

We used ov.bulk.pyGSEA to construct a GSEA object to perform enrichment.

gsea_obj=ov.bulk.pyGSEA(rnk,pathway_dict)
enrich_res=gsea_obj.enrichment()
2023-05-18 03:12:10,455 Input gene rankings contains NA values(gene name and ranking value), drop them all!

The results are stored in the enrich_res attribute.

gsea_obj.enrich_res.head()

	es	nes	pval	fdr	geneset_size	matched_size	genes	ledge_genes	logp	logc	num	fraction	Term	P-value
Term
---	---	---	---	---	---	---	---	---	---	---	---	---	---	---
Complement and Coagulation Cascades WP449	0.732116	2.140070	0.0	0.000000	62	56	Cfd;Masp1;F2r;C4b;Hc;Cfh;F7;F12;Pros1;Serping1...	Cfd;Masp1;F2r;C4b;Hc;Cfh;F7;F12;Pros1;Serping1...	9.210340	2.140070	56	0.903226	Complement and Coagulation Cascades WP449	0.000000
Matrix Metalloproteinases WP441	0.879498	2.397240	0.0	0.000000	29	27	Mmp11;Mmp14;Mmp3;Mmp12;Timp4;Timp1;Mmp28;Mmp9;...	Mmp11;Mmp14;Mmp3;Mmp12;Timp4;Timp1;Mmp28;Mmp9;...	9.210340	2.397240	27	0.931034	Matrix Metalloproteinases WP441	0.000000
TYROBP Causal Network WP3625	0.786372	2.358131	0.0	0.000000	58	57	Itgax;Itgb2;Rgs1;Gpx1;Lhfpl2;Tcirg1;Cxcl16;Cd3...	Itgax;Itgb2;Rgs1;Gpx1;Lhfpl2;Tcirg1;Cxcl16;Cd3...	9.210340	2.358131	57	0.982759	TYROBP Causal Network WP3625	0.000000
PPAR signaling pathway WP2316	0.681572	2.074737	0.0	0.003011	81	69	Hmgcs2;Pck1;Slc27a1;Scd3;Acox3;Acsbg1;Scd1;Ang...	Hmgcs2;Pck1;Slc27a1;Scd3;Acox3;Acsbg1;Scd1;Ang...	5.772960	2.074737	69	0.851852	PPAR signaling pathway WP2316	0.003011
Metapathway biotransformation WP1251	0.643937	1.991519	0.0	0.012042	141	120	Cyp26b1;Cyp2e1;Fmo2;Gpx1;Cyp4b1;Cyp11a1;Mgst2;...	Cyp26b1;Cyp2e1;Fmo2;Gpx1;Cyp4b1;Cyp11a1;Mgst2;...	4.411073	1.991519	120	0.851064	Metapathway biotransformation WP1251	0.012042

To visualize the enrichment, we use plot_enrichment to do.

num: The number of enriched terms to plot. Default is 10.
node_size: A list of integers defining the size of nodes in the plot. Default is [5,10,15].
cax_loc: The location of the colorbar on the plot. Default is 2.
cax_fontsize: The fontsize of the colorbar label. Default is 12.
fig_title: The title of the plot. Default is an empty string.
fig_xlabel: The label of the x-axis. Default is 'Fractions of genes'.
figsize: The size of the plot. Default is (2,4).
cmap: The colormap to use for the plot. Default is 'YlGnBu'.

gsea_obj.plot_enrichment(num=10,node_size=[10,20,30],
                        cax_loc=2.5,cax_fontsize=12,
                        fig_title='Wiki Pathway Enrichment',fig_xlabel='Fractions of genes',
                        figsize=(2,4),cmap='YlGnBu',
                        text_knock=2,text_maxsize=30)



<Axes: title={'center': 'Wiki Pathway Enrichment'}, xlabel='Fractions of genes'>

enter image description here

Not only the basic analysis, pyGSEA also help us to visualize the term with Ranked and Enrichment Score.

We can select the number of term to plot, which stored in gsea_obj.enrich_res.index, the 0 is Complement and Coagulation Cascades WP449 and the 1 is Matrix Metalloproteinases WP441

gsea_obj.enrich_res.index[:5]

Index(['Complement and Coagulation Cascades WP449',
       'Matrix Metalloproteinases WP441', 'TYROBP Causal Network WP3625',
       'PPAR signaling pathway WP2316',
       'Metapathway biotransformation WP1251'],
      dtype='object', name='Term')

We can set the gene_set_title to change the title of GSEA plot

fig=gsea_obj.plot_gsea(term_num=1,
                  gene_set_title='Matrix Metalloproteinases',
                  figsize=(3,4),
                  cmap='RdBu_r',
                  title_fontsize=14,
                  title_y=0.95)

enter image description here

python pyDESeq2 RNA-seq • 567 views

ADD COMMENT • link updated 3 months ago by Ram 43k • written 3 months ago by Julia Ma ▴ 120

1

Entering edit mode

Also worth noting this is a re-implementation of DESeq2 in python rather than a wrapper around the R package. The authors themselves admit there may be some drift between the values and results returned from the two.

ADD REPLY • link 3 months ago by jared.andrews07 ★ 16k

0

Entering edit mode

Use better tags. bioinformatics is a bad tag because every your post is not about the field itself, it's just part of the field. It's like adding english as a tag because the post is in English.

Use something like differential-expression and RNA-seq as tags.

ADD REPLY • link 3 months ago by Ram 43k

0

Entering edit mode

Might be worth putting python in the titles, as that is what seperates this from a million other DESeq tutorials.

ADD REPLY • link 3 months ago by i.sudbery 19k