Question: Do you have abnormally long introns in your RNA Seq alignment? How do you get rid of them?
cyril-cros (France) wrote, 4.0 years ago:

I am doing some RNA-Seq alignments (paired-end, 75bp, unstranded, good depth), focusing on olfactory genes. These genes are often close together and have similar sequences (lots of pseudogenes), so I often get erroneous alignments with long introns. I am using STAR with the option --alignIntronMax 25000 (the default is much larger).

I am doing de novo assembly afterwards, to map some unclear UTRs. Badly aligned reads can make two close genes appear merged as a single gene.

I decided to plot the closest distance between any two olfactory genes using bedtools; I am also including an IGV screenshot showing my problem: http://imgur.com/1pvmcGd,sz91xhm#0

I see there are few olfactory genes closer than 7000bp to each other, so I am using that as the new limit on intron size. I can always fall back on my previous alignment with --alignIntronMax 25000.
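The distance check described above can be sketched in plain Python rather than bedtools. This is only an illustration under stated assumptions: the function names (`adjacent_gaps`, `fraction_below`) and the gene coordinates are made up, and intervals are taken to be BED-style (chrom, start, end).

```python
from collections import defaultdict

def adjacent_gaps(genes):
    """Gaps (in bp) between consecutive genes on each chromosome; 0 if they overlap."""
    by_chrom = defaultdict(list)
    for chrom, start, end in genes:
        by_chrom[chrom].append((start, end))
    gaps = []
    for chrom in sorted(by_chrom):
        intervals = sorted(by_chrom[chrom])
        furthest_end = intervals[0][1]  # running max handles nested genes
        for start, end in intervals[1:]:
            gaps.append(max(0, start - furthest_end))
            furthest_end = max(furthest_end, end)
    return gaps

def fraction_below(gaps, limit):
    """Share of gene-to-gene gaps that are shorter than a candidate intron limit."""
    return sum(g < limit for g in gaps) / len(gaps) if gaps else 0.0

# Hypothetical coordinates, for illustration only.
genes = [
    ("chr7", 1000, 2000),
    ("chr7", 9500, 10500),
    ("chr7", 40000, 41000),
    ("chr9", 500, 1500),
    ("chr9", 3000, 4000),
]
gaps = adjacent_gaps(genes)
print(gaps, fraction_below(gaps, 7000))
```

The fraction of gaps below a candidate limit gives a rough idea of how many neighbouring gene pairs could still be bridged by a spurious intron of that size.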

Do you have similar problems, and how do you resolve them? I would like to ditch the most dubious paired-end reads.
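One way to drop dubious pairs after alignment, instead of realigning, might be to look at the spliced gaps (`N` operations) in each CIGAR string. The sketch below assumes uncompressed SAM text as input; the helper names and the example records are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def longest_intron(cigar):
    """Length of the longest N (skipped region) in a CIGAR string, 0 if none."""
    return max((int(n) for n, op in CIGAR_OP.findall(cigar) if op == "N"), default=0)

def filter_pairs(sam_lines, max_intron):
    """Keep headers and every read pair in which no record has an intron > max_intron."""
    lines = list(sam_lines)
    bad_names = set()
    for line in lines:
        if line.startswith("@"):
            continue
        fields = line.split("\t")
        if longest_intron(fields[5]) > max_intron:
            bad_names.add(fields[0])  # drop both mates of the pair
    return [l for l in lines if l.startswith("@") or l.split("\t")[0] not in bad_names]

def sam_record(name, flag, pos, cigar):
    # Minimal made-up SAM record; only name, flag and CIGAR matter here.
    return "\t".join([name, str(flag), "chr7", str(pos), "255", cigar, "=", "0", "0", "*", "*"])

sam = [
    "@HD\tVN:1.6",
    sam_record("r1", 99, 100, "50M8000N25M"),
    sam_record("r1", 147, 8300, "75M"),
    sam_record("r2", 99, 500, "40M500N35M"),
    sam_record("r2", 147, 1200, "75M"),
]
kept = filter_pairs(sam, 7000)
print([l.split("\t")[0] for l in kept])
```

Filtering after the fact keeps the original alignment intact, so the threshold can be changed without rerunning the aligner.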

rna-seq star alignment • 1.6k views
modified 4.0 years ago by h.mon • written 4.0 years ago by cyril-cros

I'm not sure that it's the intron length that's the problem, since there actually are long introns. The problem you're running into is due to olfactory receptors being very similar and clustered, so any disagreement with the reference sequence results in aberrant fusion genes. This might be a case where the tophat2 (or perhaps hisat, I've used it but can't say I'm familiar enough with it yet) method might actually be preferable.

written 4.0 years ago by Devon Ryan

Looking at the tophat2 documentation, I found this option:

--read-realign-edit-dist:

Some of the reads spanning multiple exons may be mapped incorrectly as a contiguous alignment to the genome even though the correct alignment should be a spliced one - this can happen in the presence of processed pseudogenes that are rarely (if at all) transcribed or expressed. 

STAR is, however, much faster. It also has an option, --alignMatesGapMax, which might be of use to me.
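For reference, a STAR call that sets both limits could be assembled as below. This is a sketch, not a tested pipeline: the paths, prefix, and thread count are placeholders, and using the same 7000 value for --alignMatesGapMax is an assumption rather than a recommendation from this thread.

```python
import shlex

def star_command(genome_dir, mate1, mate2, intron_max, mates_gap_max, prefix, threads=4):
    """Build the argument list for a paired-end STAR run with intron and mate-gap limits."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", mate1, mate2,
        "--alignIntronMax", str(intron_max),
        "--alignMatesGapMax", str(mates_gap_max),
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", prefix,
    ]

# Placeholder file names; replace with real paths before running.
cmd = star_command("star_index/", "sample_R1.fastq", "sample_R2.fastq", 7000, 7000, "intron7k_")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.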

written 4.0 years ago by cyril-cros

cyril-cros (France) wrote, 4.0 years ago:

Running with a lower --alignIntronMax helped, but from what I can see I am losing some 'real' introns. There is a trade-off here, which is hard to solve. Will try again...

written 4.0 years ago by cyril-cros

Yeah, you have a rather tricky case. One possible method for you might be to simply align against the transcriptome, since you could then avoid some of these issues. That's often a method of last resort, but depending on what your biological question is it might be helpful.

written 4.0 years ago by Devon Ryan

h.mon (Brazil) wrote, 4.0 years ago:

You could start with a small --alignIntronMax, then remove from your fastq the reads where both mates mapped. With the reduced dataset, map again with a larger --alignIntronMax. Wash, rinse, repeat until satisfied.
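The "remove the reads where both mates mapped" step of one such round could look roughly like this. It is a sketch under assumptions: SAM text and FASTQ lines are already in memory, read names may carry a `/1` or `/2` suffix, and the helper names and example records are invented for illustration.

```python
def fully_mapped_names(sam_lines):
    """Names of paired reads whose records show both the read and its mate as mapped."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.split("\t")
        flag = int(fields[1])
        paired = flag & 0x1
        read_unmapped = flag & 0x4
        mate_unmapped = flag & 0x8
        if paired and not read_unmapped and not mate_unmapped:
            names.add(fields[0])
    return names

def read_name(header):
    """Read name from a FASTQ header line, without '@', comments, or /1 /2 suffix."""
    name = header[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name

def drop_reads(fastq_lines, names):
    """Return the FASTQ records (4 lines each) whose read name is not in names."""
    kept = []
    for i in range(0, len(fastq_lines), 4):
        record = fastq_lines[i:i + 4]
        if read_name(record[0]) not in names:
            kept.extend(record)
    return kept

# Made-up example: r1 maps as a pair, r2 has an unmapped mate.
sam = [
    "r1\t99\tchr7\t100\t255\t75M",
    "r1\t147\tchr7\t300\t255\t75M",
    "r2\t73\tchr7\t900\t255\t75M",
    "r2\t133\tchr7\t900\t0\t*",
]
fastq_mate1 = ["@r1/1", "ACGT", "+", "IIII", "@r2/1", "TTGA", "+", "IIII"]
remaining = drop_reads(fastq_mate1, fully_mapped_names(sam))
print(remaining)
```

The same set of names would be applied to the mate 2 file so that both FASTQ files stay in sync before the next round with a larger --alignIntronMax.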

modified 4.0 years ago • written 4.0 years ago by h.mon