Question: Intron-aware coverage for paired-end RNA-seq data.
0
gravatar for Carlo Yague
2.3 years ago by
Carlo Yague3.6k
Belgium
Carlo Yague3.6k wrote:

Hi guys,

For visualization purpose (like in a genome browser), I would like to get the per-base genome-wide coverage of paired-end RNA-seq data. I thought this would be easy, turns out it is not. Here is a picture showing what I want to do :

image

And here is where I came so far :

  • I start with mapped paired-end reads (accepted_hits.bam)
  • Then I split my file into two strand specific files according to this tutorial.
  • Then I use bedtools bamtobed with the -bedpe option to get a bed file that summarizes my paired reads. From there I can easily retrieve the start and end positions of the fragments and calculate coverage (see code below). Hurray. However I loose information on the introns...

And it is where I'm stuck. I thought of using bedtools bamtobed with the -split option along with -bedpe, but those two options seems not compatible. A possible solution would be to extract the halfs of the reads spanning introns that are not linked to the mate pair, save them as singleton and add them later to the coverage, but I have no idea how to do that. Any suggestion, even for a completely different method ? Thank you !

samtools sort -no <(samtools view -b -f 2 fwd.bam) temp | bamToBed -i stdin -bedpe | cut -f 1,2,6 | sort -k 1,1 > fwd_merged_mates.bed
bedtools genomecov -bga -i fwd_merged_mates.bed -g chrom_summary  > WT1_fwd.bg

 

rna-seq bam processing • 1.4k views
ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by Carlo Yague3.6k

Your goal is get genome coverage from only exons ( and stranded ) ?

ADD REPLYlink written 2.3 years ago by geek_y8.1k

No, I want genome coverage for the whole genome. But if a read if split by an intron (mapping with tophat allows it), I need to take that into account when I compute the coverage, i.e, I can't just count +1 between the start and the end of a fragment if one of the read forming the fragment is split by an intron... Hope this clarify my question :)

ADD REPLYlink written 2.3 years ago by Carlo Yague3.6k

I guess you simply needs number of reads per gene/Exon. You could use HTSeq-count for counting the number of reads per gene, which takes care of split reads without double counting them. If you need exon level counts, DEXSeq provides some scripts that will flatten your GTF file and can be used with HTSeq-count to get exon level information.

ADD REPLYlink written 2.3 years ago by geek_y8.1k

No, I thank you for your interest but this is not what I want. Instead I want to generate coverage tracks that cover the whole genome for visualization in a genome browser. We could call it "stranded per base count" if u want, in contrast with exon/gene count. I edited the question to make this point clear.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by Carlo Yague3.6k

for me is still unclear what you are trying to do and frankly this: 

"A possible solution would be to extract the halfs of the reads spanning introns that are not linked to the mate pair, save them as singleton and add them later to the coverage"

sounds a bad idea. There is a rule that says: if it is hard to explain it is probably a bad idea.

INHO If all you want is coverage then first compute the whole genome coverage ignoring genes for each strand then slice and dice that file with bedtools.

ADD REPLYlink written 2.3 years ago by Istvan Albert ♦♦ 75k

Ok there seems to be some misunderstanding. When I say "intron", I'm not talking about actual intron, but about reads that are mapped discontinuously by tophat (altough the two concepts are linked). And indeed, all I want is the whole coverage ignoring genes. But because of those discontinuity, I find it hard to compute it :

  • With single end reads and no intron, its easy, you just take the start and end of the reads and count +1 within the interval.
  • With single reads and introns, you split intron-containing reads into two reads and procede as before.
  • With paired end and no intron, its also easy, you just take the start of the read 1, the end of the read 2 and count +1 within the interval.
  • But I'm confused on what to do when you have both paired end and introns to take into account.
ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by Carlo Yague3.6k

There isn't really a good solution for the case you mention, since you can't know with certainty that there's no splicing going on between your mates (yes, you can just assume that the GTF defines all legitimate splicing events, but that is an unjustifiable assumption in many situations).

ADD REPLYlink written 2.3 years ago by Devon Ryan74k

Very nice point.

Currently I'm trying samtools depth. I need little more testing, but it looks like it does what I want.

ADD REPLYlink written 2.3 years ago by Carlo Yague3.6k
1
gravatar for Mike
2.3 years ago by
Mike30
Canada
Mike30 wrote:

It sounds like what you want to do is convert an aligned .sam file to a wiggle/bedgraph file, is this correct?  The STAR RNA alignment program has an option to do this but I haven't used it.

ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by Mike30

Exactly ! I'll look into STAR RNA. Will update.
 

ADD REPLYlink written 2.3 years ago by Carlo Yague3.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1709 users visited in the last hour