Question

Do you have abnormally long introns in your RNA-Seq alignment? How do you get rid of them?

1

9.0 years ago

cyril-cros ▴ 950

I am doing some RNA-Seq alignments (paired-end, 75bp, unstranded, good depth), focusing on olfactory genes. These genes are often close together and have similar sequences (with many pseudogenes), so I often get erroneous alignments with long introns. I am using STAR with the option --alignIntronMax 25000 (the default is much larger).

I am doing de novo assembly afterwards, to map some unclear UTRs. Badly aligned reads can make two close genes appear merged as a single gene.

I decided to plot the closest distance between any two olfactory genes using bedtools; I am also including an IGV screenshot showing my problem: http://imgur.com/1pvmcGd,sz91xhm#0

I see that few olfactory genes are closer together than 7000bp, so I am using that as the new limit on intron size. I can always fall back on my previous alignment with --alignIntronMax 25000.
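For anyone who wants to reproduce the distance check without bedtools, here is a minimal Python sketch of the same idea: measure the gap between neighbouring genes on each chromosome and see how many fall under a candidate limit. The function names and the coordinates in the usage example are made up for illustration, and the sketch assumes the gene intervals are not nested inside one another.

```python
from collections import defaultdict


def adjacent_gaps(genes):
    """Return the gaps (in bp) between neighbouring genes on each chromosome.

    genes: iterable of (chrom, start, end) tuples. Overlapping neighbours
    are reported as a gap of 0.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in genes:
        by_chrom[chrom].append((start, end))

    gaps = []
    for intervals in by_chrom.values():
        intervals.sort()  # order genes by start coordinate
        for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:]):
            gaps.append(max(0, next_start - prev_end))
    return gaps


def fraction_below(gaps, limit):
    """Fraction of gaps strictly smaller than the candidate intron limit."""
    return sum(g < limit for g in gaps) / len(gaps) if gaps else 0.0


# Hypothetical coordinates, for illustration only.
example_genes = [
    ("chr1", 0, 1000),
    ("chr1", 4000, 5000),
    ("chr1", 30000, 31000),
    ("chr2", 0, 500),
]
example_gaps = adjacent_gaps(example_genes)
```

Calling `fraction_below(example_gaps, 7000)` then gives the share of neighbouring gene pairs that an intron limit of 7000bp could still bridge.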

Do you have similar problems, and how do you resolve them? I would like to discard the most dubious paired-end reads.

alignment RNA-Seq STAR • 3.6k views

ADD COMMENT • link updated 15 months ago by Ram 43k • written 9.0 years ago by cyril-cros ▴ 950

0

I'm not sure that it's the intron length that's the problem, since there actually are long introns. The problem you're running into is due to olfactory receptors being very similar and clustered, so any disagreement with the reference sequence results in aberrant fusion genes. This might be a case where the tophat2 (or perhaps hisat, I've used it but can't say I'm familiar enough with it yet) method might actually be preferable.

ADD REPLY • link updated 15 months ago by Ram 43k • written 9.0 years ago by Devon Ryan 104k

0

Looking at the tophat2 documentation, I get this option.

--read-realign-edit-dist:

Some of the reads spanning multiple exons may be mapped incorrectly as a contiguous alignment to the genome even though the correct alignment should be a spliced one - this can happen in the presence of processed pseudogenes that are rarely (if at all) transcribed or expressed.

STAR is, however, much faster. It also has an option, --alignMatesGapMax, which might be of use to me.

ADD REPLY • link updated 15 months ago by Ram 43k • written 9.0 years ago by cyril-cros ▴ 950

Ram · Answer 1 · 2015-05-06

1

9.0 years ago

cyril-cros ▴ 950

Running a lower --alignIntronMax helped, but I am losing some 'real' introns from what I can see. There is a trade-off here, which is hard to solve. Will try again...

ADD COMMENT • link updated 15 months ago by Ram 43k • written 9.0 years ago by cyril-cros ▴ 950

0

Yeah, you have a rather tricky case. One possible method for you might be to simply align against the transcriptome, since you could then avoid some of these issues. That's often a method of last resort, but depending on what your biological question is it might be helpful.

ADD REPLY • link 9.0 years ago by Devon Ryan 104k

Ram · Answer 2 · 2015-05-07

0

9.0 years ago

h.mon 35k

You could start with a small --alignIntronMax, then remove from your fastq the reads where both mates of the pair mapped. With the reduced dataset, map again with a larger --alignIntronMax. Wash, rinse, repeat until satisfied.
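The "remove the mapped reads" step could look roughly like the Python sketch below. It assumes you have already collected the names of the pairs where both mates mapped (for example from the alignment output of the previous round, not shown here) into a set; `parse_fastq`, `read_name`, and `drop_mapped` are illustrative helper names, not part of STAR or any library. You would run it once on each mate file with the same set of names so the two files stay in sync.

```python
from typing import Iterable, Iterator, List, Set, Tuple

# header, sequence, separator line, qualities
FastqRecord = Tuple[str, str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield 4-line FASTQ records from an iterable of lines."""
    block: List[str] = []
    for line in lines:
        block.append(line.rstrip("\n"))
        if len(block) == 4:
            yield (block[0], block[1], block[2], block[3])
            block = []


def read_name(header: str) -> str:
    """Strip the leading '@', any trailing comment, and a /1 or /2 suffix."""
    name = header[1:].split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def drop_mapped(lines: Iterable[str], mapped_names: Set[str]) -> List[FastqRecord]:
    """Keep only the records whose read name is NOT in mapped_names."""
    return [
        rec for rec in parse_fastq(lines)
        if read_name(rec[0]) not in mapped_names
    ]


# Tiny made-up example: r1 mapped in the previous round, r2 did not.
example_fastq = [
    "@r1/1\n", "ACGT\n", "+\n", "IIII\n",
    "@r2/1\n", "TTTT\n", "+\n", "IIII\n",
]
remaining = drop_mapped(example_fastq, {"r1"})
```

The records left in `remaining` are the ones you would write back out and feed to the next round with a larger --alignIntronMax.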

ADD COMMENT • link updated 15 months ago by Ram 43k • written 9.0 years ago by h.mon 35k