How Can I Get All Coordinates With A Coverage Higher Than 5 Reads?
3
0
Entering edit mode
9.3 years ago
sanchezcavani ▴ 220

I have a SAM format file. Would anyone tell me how I should get all coordinates with coverage higher than 5 reads within a specific region (e.g. from chr1 1 to chr 1000)?

coverage • 6.8k views
ADD COMMENT
4
Entering edit mode
9.3 years ago

If you have SAM, first convert to BAM:

samtools view -Sb data.sam > data.bam
samtools sort data.bam data.sorted

Then use the bedtools genomecov tool to create BEDGRAPH output:

bedtools genomecov -ibam data.sorted.bam -bg > data.bedgraph

If you want to isolate solely those regions with coverage > 5:

awk '$4 > 5' data.bedgraph > data.gt5.bedgraph
ADD COMMENT
2
Entering edit mode
9.3 years ago

I believe the command "samtools depth" will return a count at every location. It's too much output data, so youll need some kind of streaming parse and filter. I have one in perl somewhere that keeps track of the regions and outputs locations only when thresholds are crossed, collapsing the bed. Others have mentioned bedtools, which might do that for you too.

ADD COMMENT
0
Entering edit mode

You can pass samtools depth a region and do a quick filtering with awk:

samtools depth -r chr1:1-1000 in.bam | awk '$3 > 5'
ADD REPLY
0
Entering edit mode

Thanks so much!

ADD REPLY
1
Entering edit mode
9.3 years ago

1) First create a small bam file containing reads of your region of interest.

samtools view -bh input.sorted.bam chr1:100,000-200,000 (region of interest) > regionofinterest.bam

2) Then use older version of samtool (that has a deprecated pileup function) to create a pileup file.

samtools pileup -vcf ref.fa regionofinterest.bam > regionofinterest.pileup

3) You can then write a very simple script that will go through the pileup format and count the number of reads for a coordinate and output one with greater than 5 reads.

Read more about pileup format here: http://en.wikipedia.org/wiki/Pileup_format

There are one-liner solution available with bedtools but I want you to understand basic workflows in NGS analysis.

ADD COMMENT
0
Entering edit mode

Here is a similar post. Check for bedtools usage in this post tools to calculate average coverage for a bam file? if you dont want to follow my advice.

ADD REPLY
0
Entering edit mode

The fourth column of mpileup output is the read depth, so there's no need to mess with old versions just to get pileup.

The undocumented samtools depth function is even better than that, since it saves time by skipping the variant calling steps (see the answer by karl.stamm).

ADD REPLY

Login before adding your answer.

Traffic: 2007 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6