How Can I Get All Coordinates With A Coverage Higher Than 5 Reads?
3
0
Entering edit mode
9.3 years ago
sanchezcavani ▴ 220

I have a SAM format file. Would anyone tell me how I should get all coordinates with coverage higher than 5 reads within a specific region (e.g. from chr1 1 to chr 1000)?

coverage • 6.8k views
4
Entering edit mode
9.3 years ago

If you have SAM, first convert to BAM:

samtools view -Sb data.sam > data.bam
samtools sort data.bam data.sorted


Then use the bedtools genomecov tool to create BEDGRAPH output:

bedtools genomecov -ibam data.sorted.bam -bg > data.bedgraph


If you want to isolate solely those regions with coverage > 5:

awk '$4 > 5' data.bedgraph > data.gt5.bedgraph  ADD COMMENT 2 Entering edit mode 9.3 years ago I believe the command "samtools depth" will return a count at every location. It's too much output data, so youll need some kind of streaming parse and filter. I have one in perl somewhere that keeps track of the regions and outputs locations only when thresholds are crossed, collapsing the bed. Others have mentioned bedtools, which might do that for you too. ADD COMMENT 0 Entering edit mode You can pass samtools depth a region and do a quick filtering with awk: samtools depth -r chr1:1-1000 in.bam | awk '$3 > 5'

0
Entering edit mode

Thanks so much!

1
Entering edit mode
9.3 years ago

1) First create a small bam file containing reads of your region of interest.

samtools view -bh input.sorted.bam chr1:100,000-200,000 (region of interest) > regionofinterest.bam

2) Then use older version of samtool (that has a deprecated pileup function) to create a pileup file.

samtools pileup -vcf ref.fa regionofinterest.bam > regionofinterest.pileup

3) You can then write a very simple script that will go through the pileup format and count the number of reads for a coordinate and output one with greater than 5 reads.

There are one-liner solution available with bedtools but I want you to understand basic workflows in NGS analysis.

0
Entering edit mode

Here is a similar post. Check for bedtools usage in this post tools to calculate average coverage for a bam file? if you dont want to follow my advice.

0
Entering edit mode

The fourth column of mpileup output is the read depth, so there's no need to mess with old versions just to get pileup.

The undocumented samtools depth function is even better than that, since it saves time by skipping the variant calling steps (see the answer by karl.stamm).