Dear Biostars,
I am trying to calculate the fraction of RNA-Seq reads map to a customised dataset of genomic locations. I am using the following steps. Could you please answer whether this is correct or not? Also do you think is there any better or fatser way to do this?
Data: BAM file (Paired-end, unstranded, poly(A)+ RNA-seq) mapped to hg19,
Step1: Obtain total number of unique reads mapped to customised genomic locations
#(samtools view -H file1_mapped.bam; samtools view file1_mapped.bam | grep NH:i:1) | samtools view -Sb - | bedtools coverage -abam - -b ../customised.bed6 -counts -split > file1.uniquelymapped.counts
Step2:
awk '{ sum+=$7} END {print sum}' file1.uniquelymapped.counts
2639265
Step3: Obtain total number of mapped reads
samtools flagstat file1_mapped.bam
84428534 + 4669383 in total (QC-passed reads + QC-failed reads)
34553453 + 0 duplicates
80610721 + 1971670 mapped (95.48%:42.23%)
84428534 + 4669383 paired in sequencing
42352168 + 2345534 read1
42076366 + 2323849 read2
77864131 + 1152188 properly paired (92.22%:24.68%)
77864131 + 1152188 with itself and mate mapped
2746590 + 819482 singletons (3.25%:17.55%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
Step4: Fraction of RNA-Seq reads mapped =(Uniquely mapped reads/Total number of mapped reads)
=2639265/84428534 = 0.031
Does this work ?
I guess not. Think some of your piping is wrong but testing it will tell you immediately.
htseq-count can work on a sam file (by default) as well as on a bam file (specify
-f bam
). You don't need the-f 1
... Depending on how you made your .gtf, you'll have to specify the -t and -i flag of htseq-countbut htseq-count doesn't have -f option?
Yes it does. Perhaps your version is outdated.