I am doing some RNA-Seq alignments (paired-end, 75bp, unstranded, good depth), focusing on olfactory genes. These genes are often close together and have similar sequences (lots of pseudogenes), so I often get reads that are erroneously aligned across long introns. I am using STAR, setting the option --alignIntronMax 25000 (default is much larger).
I am doing de novo assembly afterwards, to map some unclear UTRs. Badly aligned reads can make two close genes appear merged as a single gene.
I decided to plot the closest distance between any two olfactory genes using bedtools; I am also including an IGV screenshot showing my problem: http://imgur.com/1pvmcGd,sz91xhm#0
I see there are few olfactory genes closer together than 7000bp, so I am using that as a new limit on the intron size. I can always fall back on my previous alignment with --alignIntronMax 25000.
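For anyone who wants to reproduce the distance check without bedtools, here is a minimal Python sketch of the same idea. It assumes the genes are available as BED-style (chrom, start, end, name) tuples with half-open coordinates; the function names and the example coordinates are made up for illustration, and the distances may differ by one base from what bedtools reports.

```python
from collections import defaultdict


def interval_gap(a_start, a_end, b_start, b_end):
    """Number of bases between two half-open intervals (0 if they overlap or touch)."""
    return max(0, max(a_start, b_start) - min(a_end, b_end))


def nearest_gene_distances(genes):
    """Map each gene name to the distance to its closest neighbour on the same chromosome.

    genes: iterable of (chrom, start, end, name).
    Genes that are alone on their chromosome are left out of the result.
    Brute force per chromosome, which is fine for a few thousand genes.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, name in genes:
        by_chrom[chrom].append((start, end, name))

    distances = {}
    for intervals in by_chrom.values():
        for i, (start, end, name) in enumerate(intervals):
            gaps = [
                interval_gap(start, end, other_start, other_end)
                for j, (other_start, other_end, _) in enumerate(intervals)
                if j != i
            ]
            if gaps:
                distances[name] = min(gaps)
    return distances


def fraction_closer_than(distances, threshold):
    """Fraction of genes whose nearest neighbour is closer than threshold bases."""
    if not distances:
        return 0.0
    return sum(d < threshold for d in distances.values()) / len(distances)


# Hypothetical coordinates, not real olfactory receptor loci.
example_genes = [
    ("chr1", 100, 200, "A"),
    ("chr1", 5200, 6000, "B"),
    ("chr1", 30000, 31000, "C"),
    ("chr2", 0, 50, "D"),
]
example_distances = nearest_gene_distances(example_genes)
```

Plotting a histogram of the values in `example_distances`, or calling `fraction_closer_than(example_distances, 7000)`, gives a rough idea of how many genes a given intron limit would put at risk of being merged.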
Do you have similar problems, and how do you resolve them? I would like to ditch the most dubious paired-end reads.
I'm not sure that it's the intron length that's the problem, since there actually are long introns. The problem you're running into is due to olfactory receptors being very similar and clustered, so any disagreement with the reference sequence results in aberrant fusion genes. This might be a case where the tophat2 (or perhaps hisat, I've used it but can't say I'm familiar enough with it yet) method might actually be preferable.
Looking at the tophat2 documentation, I get this option.
STAR is, however, much faster. It also has an option, --alignMatesGapMax, which might be of use to me.
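If re-aligning is not convenient, the dubious pairs could also be dropped after the fact. Below is a rough Python sketch, assuming the alignments are available as uncompressed SAM text: it looks for N operations in the CIGAR string (column 6), which mark the skipped reference region of a spliced alignment, and removes every record of a read name where any record spans an intron longer than a chosen limit. The function names and the 7000bp default are my own choices for illustration, not anything provided by STAR.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def longest_intron(cigar):
    """Length of the longest N (skipped region) operation in a CIGAR string, or 0."""
    return max(
        (int(length) for length, op in CIGAR_OP.findall(cigar) if op == "N"),
        default=0,
    )


def filter_sam_lines(lines, max_intron=7000):
    """Drop all records of any read name that has an intron longer than max_intron.

    lines: iterable of SAM lines (header lines start with '@' and are kept).
    Both mates share a read name, so the whole pair is removed together.
    """
    lines = list(lines)
    bad_names = set()
    for line in lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if longest_intron(fields[5]) > max_intron:
            bad_names.add(fields[0])
    return [
        line
        for line in lines
        if line.startswith("@") or line.split("\t")[0] not in bad_names
    ]


def _sam_record(name, flag, pos, cigar):
    """Build a minimal, made-up SAM record for the example below."""
    return "\t".join(
        [name, str(flag), "chr1", str(pos), "255", cigar, "=", "1", "0", "*", "*"]
    )


# Made-up records: pair r1 has one mate spliced across 8000bp, pair r2 is unspliced.
example_sam = [
    "@HD\tVN:1.4",
    _sam_record("r1", 99, 1000, "30M8000N45M"),
    _sam_record("r1", 147, 9100, "75M"),
    _sam_record("r2", 99, 2000, "75M"),
    _sam_record("r2", 147, 2150, "75M"),
]
example_kept = filter_sam_lines(example_sam, max_intron=7000)
```

This only looks at the gap inside each read, not at the distance between the two mates, so it complements rather than replaces a limit on the mate gap.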