Question

Finding Regions Of Interest With Minimum Coverage

1

Entering edit mode

10.9 years ago

vilborg.anna ▴ 30

Hi,

I have a bam file of all my accepted hits (tophat output) and an gtf file with my genes of interest for which I am trying to find potential antisense transcripts. I would like to create a list - preferably one that can be visualized in a genome browser - that shows all genes that have antisense reads in the accepted hits.bam file provided that there are more than a certain number of reads mapping to that gene. I can easily use bedtools intersect to get all genes that have antisense reads mapping to them, but this gives me way too many hits. Does anyone have a suggestion for how to set a coverage cutoff?

Thank you! /Anna

bam coverage rna-seq • 2.6k views

ADD COMMENT • link updated 10.4 years ago by Biostar 20 • written 10.9 years ago by vilborg.anna ▴ 30

1

Entering edit mode

Do you have RNAseq data form a strand specific protocol? Otherwise the strand of the alignment is not informative.

ADD REPLY • link 10.9 years ago by Michael 54k

0

Entering edit mode

Yes I do, I have data from HiSeq runs with strand specific libraries. I can definitely find antisense reads using bedtools intersect, as I mentioned, but I get too many low abundant hits that are probably just noise, so I need to refine my search for the few potentially good hits.

ADD REPLY • link 10.9 years ago by vilborg.anna ▴ 30

score 2 · Answer 1 · 2013-06-21

Just in case anyone is curious - I got some great input on this and found a way to solve it, at least well enough for my purposes.

I first did bedtools coverage with my list of genes of interest and the bam file with mapped reads using the -S setting to get things antisense, and writing the output to a gtf file. The default output for bedtools coverage is as follows: After each entry in B (genes of interest in my case), reports: 1) The number of features in A (the bam file, in my case) that overlapped the B interval. 2) The number of bases in B that had non-zero coverage. 3) The length of the entry in B. 4) The fraction of bases in B that had non-zero coverage.

I used this column 1 (column 13 in my dataset) as an indicator of how much coverage each of my genes had. This is definitely not perfect since a longer gene with low coverage can get a higher score than a short gene with high coverage, but in my case my genes are pretty uniform in length, so I don't think this will cause me a great problem.

I sorted my gtf file using parameters -n -k 13 -r to sort the file from largest value in column 13 to smallest. I then used head -X where X is the number of lines above which I have data above the cutoff that I want.