Rescuing Orphaned Reads With Local Alignment Within A Radius Of Mapped Mate
12.8 years ago
Abhi ★ 1.6k

PS: This message has also been cross-posted on SEQanswers. I wanted to reach more bioinformatics people, so I am posting it here too.

Problem:

We have a dataset from a library with variable biological insert size, because we are sequencing the 5' and 3' ends of transcripts. As a result, the distance between the mates (<--- --->) depends on the length of the transcript. For the initial mapping I am using Mosaik, which I believe does a better job with variable-insert mate-pair data.

After mapping we still see 40% orphaned reads, where one read maps and the other doesn't. Is there a way to do a local realignment for these orphaned reads and attempt to map the unmapped mate within a given radius of the mapped mate?

Is there anything already out there?

Thanks! -Abhi


What is your alignment rate if you turn off pairing?


@Jeremy : The alignment rate for read 1 and read 2 independently is > 80%. It is the pairing that is causing problems.


@All : Is there any way to get an email as soon as an update is posted to a question I am interested in? I currently get an email, but only a day later, which doesn't help.


Your realignment should be your unpaired alignment. I would suggest loading the subset of read names whose mates are unmapped (samtools view -bf 0x0004 reads.bam) and then using those to examine where the mates align naturally.
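A minimal sketch of that comparison, assuming the unpaired alignments of read 1 and read 2 have already been parsed into plain dictionaries keyed by read name. The function names, the dictionary layout, and the radius value are hypothetical and only for illustration; nothing here comes from Mosaik or samtools.

```python
def mate_distances(read1_hits, read2_hits):
    """Yield (name, distance) for pairs whose two ends both align,
    independently, to the same chromosome.

    Each argument maps read name -> (chromosome, leftmost position),
    an assumed layout for hits taken from unpaired alignments.
    """
    for name, (chrom1, pos1) in read1_hits.items():
        hit2 = read2_hits.get(name)
        if hit2 is None:
            continue  # the other end did not align at all
        chrom2, pos2 = hit2
        if chrom1 == chrom2:
            yield name, abs(pos2 - pos1)


def within_radius(read1_hits, read2_hits, radius):
    """Return sorted names of pairs whose ends lie within `radius` bases."""
    return sorted(
        name
        for name, dist in mate_distances(read1_hits, read2_hits)
        if dist <= radius
    )


# Toy data: only pair "a" has both ends on the same chromosome and close by.
read1 = {"a": ("chr1", 100), "b": ("chr1", 500), "c": ("chr2", 10), "d": ("chr1", 5)}
read2 = {"a": ("chr1", 2100), "b": ("chr1", 900000), "c": ("chr1", 20)}
print(within_radius(read1, read2, radius=5000))
```

Pairs that pass this filter were aligned fine as single reads, so for them the pairing constraints, rather than the read sequences, are probably what produced the orphan.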

12.8 years ago

If I understand correctly, you are sequencing transcripts? If that is the case, it is quite possible that the orphaned reads are due to the lack of alignment across intron-exon boundaries. If I recall, Mosaik is not designed to align RNA-seq reads. Have you considered using an RNA-seq aligner such as GSNAP or TopHat?


@Sean : Sorry I could not reply earlier. We are sequencing only the 5' and 3' ends of transcripts and not the full transcripts. TopHat doesn't work because, after linker removal, the reads are of variable length depending on where the linker is found.


GSNAP will happily work with any length reads.


The latest version of TopHat also works with varied read lengths.


+1 for Sean's response. Keep in mind that for many organisms exon 1 is short, thus putting you in Sean's scenario, while the last exon is often long. Long and short are of course relative, and what counts as short also depends on your read length.

12.8 years ago

See the answer by Sean Davis if you are mapping exons on a genomic reference.

For finding large deletions and insertions, or other types of translocations, you could try Pindel. Its pattern growth algorithm does exactly what you are looking for.

Coincidentally, we found that GSNAP also works fine on DNA to find large deletions.
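For reference, the basic idea in the question (anchor on the mapped mate, then search nearby for the unmapped one) can be sketched with a plain Smith-Waterman local alignment over a window of the reference. This is not Pindel's pattern growth algorithm and not how GSNAP works; it is only a toy illustration. The function names, scoring values, and `min_score` threshold are assumptions, and the sketch ignores read orientation (no reverse complement) and splicing.

```python
def smith_waterman(read, window, match=2, mismatch=-1, gap=-2):
    """Return (best_score, end_offset) of the best local alignment of
    `read` against `window`; end_offset is exclusive, within the window."""
    best_score, best_end = 0, 0
    prev = [0] * (len(window) + 1)
    for i in range(1, len(read) + 1):
        curr = [0] * (len(window) + 1)
        for j in range(1, len(window) + 1):
            step = match if read[i - 1] == window[j - 1] else mismatch
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, prev[j - 1] + step, prev[j] + gap, curr[j - 1] + gap)
            if curr[j] > best_score:
                best_score, best_end = curr[j], j
        prev = curr
    return best_score, best_end


def rescue_mate(read, reference, anchor_pos, radius, min_score):
    """Align `read` only within `radius` bases of the mapped mate at
    `anchor_pos`. Return (score, end position in reference) or None."""
    lo = max(0, anchor_pos - radius)
    hi = min(len(reference), anchor_pos + radius)
    score, end = smith_waterman(read, reference[lo:hi])
    if score < min_score:
        return None
    return score, lo + end


# Toy reference with one unique 10-base segment flanked by poly-T.
reference = "TTTTTTTTTT" + "ACGTACGGAC" + "TTTTTTTTTT"
print(rescue_mate("ACGTACGGAC", reference, anchor_pos=12, radius=15, min_score=15))
```

The cost grows with read length times window length, so on real data the radius would need to stay modest, or a dedicated aligner restricted to the region would be the more practical choice.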

12.4 years ago

A common problem with gene expression studies and sequencing (ESTs, RNA-Seq, etc.) is contaminating genomic DNA. Some mRNA preps are excellent and some are poor. There is always some amount of genomic DNA, or unspliced or incompletely spliced messages, in the mix. These could be a (partial) source of the orphaned reads. So, you can check whether any orphans align to a contiguous segment of the genome and, if so, whether any of that alignment falls within an intron. In some cases a retained intron is a legitimate splice variant, but this is rare, and would not be expected without first seeing matches to the known gene models. Thus, too many orphans mapping to introns is likely a sign of issues with the mRNA prep.
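A rough sketch of that check, assuming orphan alignments and intron coordinates are already available as half-open intervals. The function names and data layout are hypothetical; in practice the introns would be derived from a gene model annotation.

```python
def overlaps_intron(aln_start, aln_end, introns):
    """True if the half-open alignment [aln_start, aln_end) overlaps
    any half-open (start, end) intron interval."""
    return any(aln_start < i_end and i_start < aln_end for i_start, i_end in introns)


def intronic_fraction(alignments, introns_by_chrom):
    """Fraction of (chrom, start, end) alignments touching an intron."""
    if not alignments:
        return 0.0
    hits = sum(
        1
        for chrom, start, end in alignments
        if overlaps_intron(start, end, introns_by_chrom.get(chrom, []))
    )
    return hits / len(alignments)


# Toy data: two of the four orphan alignments touch a chr1 intron.
introns = {"chr1": [(100, 200), (500, 600)]}
orphans = [("chr1", 50, 90), ("chr1", 150, 186), ("chr1", 590, 626), ("chr2", 150, 186)]
print(intronic_fraction(orphans, introns))
```

What fraction counts as "too many" is a judgment call that depends on the prep and the annotation; this only produces the number to look at.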
