Filter query coverage
1
0
Entering edit mode
2.2 years ago

Hi,

Could you please provide me with a tool that I could run after bowtie2 that could filter queries from bam files with for example 80% of short reads coverage (meaning short reads cover 80% of the query)?

alignment • 1.1k views
ADD COMMENT
0
Entering edit mode

meaning short reads cover 80% of the query

what is "query" in that context ?

ADD REPLY
0
Entering edit mode

I am aligning the short reads against the CARD database using Bowtie2. After filtering the mapped reads, sorting and indexing, I ran samtools idxstats. This gives me the queries with their abundances. But some of queries were not totally covered by shortreads. Therefore, I want to exclude the queries with low coverage. For example, keep only queries with at least 80% coverage.

ADD REPLY
0
Entering edit mode
2.2 years ago
seidel 11k

You could use R and the GenomicRanges library, if you know R and the principles associated with this library. You can import a BAM file, convert it to a GRanges object, and call coverage(), which will create an RLE vector of coverage across the genome. Then import a BED file (or text file) of coordinates for which you'd like assess coverage, and use those coordinates to access your coverage object to query how many bases have coverage greater than zero, and divide this by the length of your query region to get a percentage. This answer assumes you know R, and are comfortable with the GenomicRanges library, and how to access RLE (Run Length Encoded) vectors, and even within this context, there are several different ways to approach it.

Given that each alignment target is a unit of interest, here's an example showing the percent of each target covered with just a few lines of code:

library(GenomicAlignments)

# import aligned reads to a GRanges object
gr <- as(readGAlignments("out.bam"), "GRanges")

# calculate coverage
cov <- coverage(gr)

# calculate percentage of bases with coverage greater than zero
coveragePercent <- lapply(cov, function(x) 100 * sum(x > 0)/length(x))

# view the results
unlist(coveragePercent)

Results:

CLX001 CLX002 CLX003 CLX004 CLX005 CLX006 CLX007 
41.2   0.0    37.9   69.0   74.7   81.1   0.0 
CLX008 CLX009 CLX010 CLX011 CLX012 CLX013 CLX014
0.0    51.9   60.7   0.0    36.3   55.9   95.9 

You could also use a command-line approach using bedtools and the genomecov option to input a BAM file, report the depth at each genome position, and query how many zeros you get within a query region. Something like:

genomeCoverageBed -d -ibam out.bam

would give you per base coverage for every base, and you'd have to count how many zeros you get for each alignment target using a script or some fancy awk or other parsing tricks.

ADD COMMENT
0
Entering edit mode

Thanks. I will mention my situation in a bit more details.

I am aligning the short reads against the CARD database using Bowtie2. After filtering the mapped reads, sorting and indexing, I ran samtools idxstats. This gives me the queries with their abundances. But some of queries were not totally covered by shortreads. Therefore, I want to exclude the queries with low coverage. For example, keep only queries with at least 80% coverage.

ADD REPLY
0
Entering edit mode

By CARD database I take it you mean: The Comprehensive Antibiotic Resistance Database, and not the Trading Card Database, right? If so, then if each thing you're aligning to is a query target (i.e. each alignment target is an antibiotic resistance gene) then you can query percentage coverage using the example in the answer above. If the alignment targets were chromosomes, and you wanted to query a subsection, a little more parsing would be required.

ADD REPLY

Login before adding your answer.

Traffic: 2574 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6