Should Samtools Pileup Be Performed On Uniquely Mapped Reads Or All The Reads?
4
6
Entering edit mode
10.5 years ago
Doctoroots ▴ 790

im currently trying to infer the read depth in an area of interest from a bwa alignment sam file on an illumina run.

i am using the samtools pileup tool to get the per base coverage, and it seems to be working great. the problem is that i seem to get different read coverage when i perform the pileup on the original bwa sam file (that includes unmapped and non uniquely mapped reads) and when i perform the pileup on the same sam file only filtered for uniquely mapped reads.

the coverage is slightly higher when i use the unfiltered sam file, but if it includes non relevant reads, maybe i should stick to the uniquely mapped reads.

samtools pileup short read • 15k views
5
Entering edit mode
10.5 years ago
Ian 5.7k

I would work out coverage from the mapped read. Although not necessarily uniquely mapped reads, for example BFAST can output the best uniquely scoring alignment from a read that maps to multiple places, or all equally best scoring alignments of a read.

I found using BedTools useful in determining coverage. The genomeCoverageBed tools can input ordered BAM files via the -ibam flag. The -d flag reports coverage at each base.

1
Entering edit mode

a very nice tool, looks promising.

4
Entering edit mode
10.5 years ago

As the previous posts elude to: think of the biological question and calculate what is needed based on that.

For example I do the following (mainly using my own perl script)

• non-normalised coverage per base
• coverage per base where each sequence counts towards 1/n per base
• where n = the number of times the sequence maps in the genome.
• remove identical sequences to remove PCR amplification error
• combination of the previous two points

Each technique has problems: e.g. when looking at repeats, are all repeats in the assembly and hence can normalisation be trusted. Or are all short reads coming from one repeat only and the rest are non-expressing. ... you get the idea. Looking at combination of the above allows me to build up a picture of expression: especially in context with the rest of the reads in a browser such as IGB later.

1
Entering edit mode

@Alastair - I agree with your considerations and approach. Does your script work from a BAM/SAM file? Would you be able to post this script somewhere?

0
Entering edit mode

I use sam2interval on galaxy and it works on the output of that. I'll try and comment it up so it can be used by others. I'll post back here once its done (probably next week)

0
Entering edit mode

2
Entering edit mode
10.5 years ago

Most publications I have seen so far (for RNA-seq data) discarded non-uniquely mapped reads. That could become a problem mainly for duplicated regions which then will have little to no coverage. So if you see large regions with very low coverage, you could also need to use the unfiltered data again. Depends a bit on your application, e.g. I am not sure if I would filter for copy-number variation data, too.

0
Entering edit mode

is that true? most RNA-Seq papers i've seen do "probabilistic" alignment of reads. because, for example, you can't be sure which transcript a particular read came from but don't want to discard it. cufflinks, splicemap, and supersplat all do some sort of statistical model to assign ambiguously mapped reads.

0
Entering edit mode

if the reads are paired, with the help of mapped insert size, one could tell if they are PCR duplicates. It would be probabilistically very unlikely. I would be interested to know how much of this could be said for single end reads and do people still mark duplicates in SE reads as well?

0
Entering edit mode

Michael, On the other hand, when you want to determine copy number variation, aren't you interested in "unique reads" (=> all fragments per molecule counted once)?

2
Entering edit mode
10.5 years ago
lh3 32k

SAMtools pileup discards unmapped reads, secondary alignments and duplicates. It uses non-unique reads. To show unique mappings only:

samtools view -uq1 aln.bam | samtools pileup -


or

samtools mpileup -q1 aln.bam


For RNA-seq, neither using unique reads only nor using all reads is optimal. I know sophisticated tools redistribute non-unique mappings, or even model read counting to get better estimate.

0
Entering edit mode

That is a great way to pick unique alinments usually. But how to interpret the coexists of MAPQ 0 and XT:A:U in one alignment of bwa output?