Tutorial: bam2rpkm by bedtools
gravatar for jmzeng1314
3.0 years ago by
jmzeng1314100 wrote:

bam2rpkm by bedtools

From alignment files (BAM format), multiple tools (htseq-counts, featureCounts) can count the number of reads located in a specific genomic regions(genes/exons/introns/ur/promoters/enhancers) based on the annotation files.

there are always the case others need RPKM values, instead of raw read counts, and you probably don't want to search public tools again and again.

I will show you a easy way to use bedtools to solve that problem.

I had to mention that PPKM value is not a good way to do downstream analysis: "Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples". Wagner GP, Kin K, Lynch VJ. _Theory Biosci._ 2012. PubMed

Firstly create the coordinate of all of the exons

Download the file: CCDS.20160908.txt , you can choose the versions and species you need.

cat CCDS.20160908.txt |perl -alne '{/\[(.*?)\]/;next unless $1;$gene=$F[2];$exons=$1;$exons=~s/\s//g;$exons=~s/-/\t/g;print "$F[0]\t$_\t$gene" foreach split/,/,$exons;}'|sort -u |bedtools sort -i >exon_probe.hg38.gene.bed
head exon_probe.hg38.gene.bed

The results from the script above would be :

chr1    69090   70007   OR4F5
chr1    450739  451677  OR4F29
chr1    685715  686653  OR4F16
chr1    925941  926012  SAMD11
chr1    930154  930335  SAMD11
chr1    931038  931088  SAMD11
chr1    935771  935895  SAMD11
chr1    939039  939128  SAMD11
chr1    939274  939459  SAMD11
chr1    941143  941305  SAMD11

calculate RPKM value from bam and gtf in batch

The formula of RPKM like below: C = Number of reads mapped to a gene N = Total mapped reads in the experiment L = exon length in base-pairs for a gene Equation = RPKM = (10^9 * C)/(N * L)

So the script is :

for bam in  /home/project/*.bam
file=$(basename $bam )
echo $sample
export total_reads=$(samtools idxstats  $bam|awk -F '\t' '{s+=$3}END{print s}')
echo The number of reads is $total_reads
bedtools multicov -bams  $bam -bed $bed |perl -alne '{$len=$F[2]-$F[1];if($len <1 ){print "$.\t$F[4]\t0" }else{$rpkm=(1000000000*$F[4]/($len* $ENV{total_reads}));print "$.\t$F[4]\t$rpkm"}}' >$sample.rpkm.txt

You may notice that the output should be 3 columns, first column is just the number of rows, second column is the number of reads for each exon, the last column is the RPKM value.

head blood.rpkm.txt
1   3538    40.5143632023362
2   1756    19.6581297588076
3   1793    20.0723386432472
4   102 15.0855916359498
5   75  4.35114155895529
6   24  5.04036238189381
7   97  8.21430025275033
8   132 15.5741534272
9   81  4.59762784834908
10  66  4.27808535500246

The reason why I write this script is to solve a small bug in conifer(COpy Number Inference From Exome Reads), I will explain it later.

rna-seq bam tutorial rpkm bedtools • 2.2k views
ADD COMMENTlink written 3.0 years ago by jmzeng1314100
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1778 users visited in the last hour