Finding Regions Of Interest With Minimum Coverage
1
1
Entering edit mode
10.9 years ago
vilborg.anna ▴ 30

Hi,

I have a bam file of all my accepted hits (tophat output) and an gtf file with my genes of interest for which I am trying to find potential antisense transcripts. I would like to create a list - preferably one that can be visualized in a genome browser - that shows all genes that have antisense reads in the accepted hits.bam file provided that there are more than a certain number of reads mapping to that gene. I can easily use bedtools intersect to get all genes that have antisense reads mapping to them, but this gives me way too many hits. Does anyone have a suggestion for how to set a coverage cutoff?

Thank you! /Anna

bam coverage rna-seq • 2.6k views
ADD COMMENT
1
Entering edit mode

Do you have RNAseq data form a strand specific protocol? Otherwise the strand of the alignment is not informative.

ADD REPLY
0
Entering edit mode

Yes I do, I have data from HiSeq runs with strand specific libraries. I can definitely find antisense reads using bedtools intersect, as I mentioned, but I get too many low abundant hits that are probably just noise, so I need to refine my search for the few potentially good hits.

ADD REPLY
2
Entering edit mode
10.8 years ago
vilborg.anna ▴ 30

Just in case anyone is curious - I got some great input on this and found a way to solve it, at least well enough for my purposes.

I first did bedtools coverage with my list of genes of interest and the bam file with mapped reads using the -S setting to get things antisense, and writing the output to a gtf file. The default output for bedtools coverage is as follows: After each entry in B (genes of interest in my case), reports: 1) The number of features in A (the bam file, in my case) that overlapped the B interval. 2) The number of bases in B that had non-zero coverage. 3) The length of the entry in B. 4) The fraction of bases in B that had non-zero coverage.

I used this column 1 (column 13 in my dataset) as an indicator of how much coverage each of my genes had. This is definitely not perfect since a longer gene with low coverage can get a higher score than a short gene with high coverage, but in my case my genes are pretty uniform in length, so I don't think this will cause me a great problem.

I sorted my gtf file using parameters -n -k 13 -r to sort the file from largest value in column 13 to smallest. I then used head -X where X is the number of lines above which I have data above the cutoff that I want.

ADD COMMENT

Login before adding your answer.

Traffic: 2712 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6