Tutorial:bam2rpkm by bedtools
Entering edit mode
3.5 years ago
jmzeng1314 ▴ 110

bam2rpkm by bedtools

From alignment files (BAM format), multiple tools (htseq-counts, featureCounts) can count the number of reads located in a specific genomic regions(genes/exons/introns/ur/promoters/enhancers) based on the annotation files.

there are always the case others need RPKM values, instead of raw read counts, and you probably don't want to search public tools again and again.

I will show you a easy way to use bedtools to solve that problem.

I had to mention that PPKM value is not a good way to do downstream analysis: "Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples". Wagner GP, Kin K, Lynch VJ. _Theory Biosci._ 2012. PubMed

Firstly create the coordinate of all of the exons

Download the file: CCDS.20160908.txt , you can choose the versions and species you need.

cat CCDS.20160908.txt |perl -alne '{/\[(.*?)\]/;next unless $1;$gene=$F[2];$exons=$1;$exons=~s/\s//g;$exons=~s/-/\t/g;print "$F[0]\t$_\t$gene" foreach split/,/,$exons;}'|sort -u |bedtools sort -i >exon_probe.hg38.gene.bed
head exon_probe.hg38.gene.bed

The results from the script above would be :

chr1    69090   70007   OR4F5
chr1    450739  451677  OR4F29
chr1    685715  686653  OR4F16
chr1    925941  926012  SAMD11
chr1    930154  930335  SAMD11
chr1    931038  931088  SAMD11
chr1    935771  935895  SAMD11
chr1    939039  939128  SAMD11
chr1    939274  939459  SAMD11
chr1    941143  941305  SAMD11

calculate RPKM value from bam and gtf in batch

The formula of RPKM like below: C = Number of reads mapped to a gene N = Total mapped reads in the experiment L = exon length in base-pairs for a gene Equation = RPKM = (10^9 * C)/(N * L)

So the script is :

for bam in  /home/project/*.bam
file=$(basename $bam )
echo $sample
export total_reads=$(samtools idxstats  $bam|awk -F '\t' '{s+=$3}END{print s}')
echo The number of reads is $total_reads
bedtools multicov -bams  $bam -bed $bed |perl -alne '{$len=$F[2]-$F[1];if($len <1 ){print "$.\t$F[4]\t0" }else{$rpkm=(1000000000*$F[4]/($len* $ENV{total_reads}));print "$.\t$F[4]\t$rpkm"}}' >$sample.rpkm.txt

You may notice that the output should be 3 columns, first column is just the number of rows, second column is the number of reads for each exon, the last column is the RPKM value.

head blood.rpkm.txt
1   3538    40.5143632023362
2   1756    19.6581297588076
3   1793    20.0723386432472
4   102 15.0855916359498
5   75  4.35114155895529
6   24  5.04036238189381
7   97  8.21430025275033
8   132 15.5741534272
9   81  4.59762784834908
10  66  4.27808535500246

The reason why I write this script is to solve a small bug in conifer(COpy Number Inference From Exome Reads), I will explain it later.

RNA-Seq bam rpkm bedtools Tutorial • 2.7k views
Entering edit mode

Hi, Thank you for this. How can I get RPKM values per gene (name/symbol)? Thank you. Pradeep

Entering edit mode
4 months ago
ATpoint 48k

This tutorial is really an overcomplication for a simple problem. One should just use featureCounts or htseq to get a matrix of raw counts and then load the matrix into R followed by using standard functions such as edgeR::rpkm/cpm, DESeq2::fpkm/vst/rlog to get properly-normalized counts. If using the edgeR function from a DGEList (the edgeR format) after running calcNormFactors one even gets the benefit of TMM-normalized counts which is strongly recommended.

Code suggestion for an edgeR workflow (beyond the manual) are here: Basic normalization, batch correction and visualization of RNA-seq data

Please check the manuals of the mentioned tools for details. On top of this post here A: ATAC-seq sample normalization (quantil normalization) there are some videos that explain why normalization via edgeR or DESeq2 is preferred over naive approaches.


Login before adding your answer.

Traffic: 2277 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6