Question

Divide Reads Into Little Pieces

0

Entering edit mode

11.9 years ago

Nicolas Rosewick 11k

Hi,

Is it possible to divide reads into little pieces ?

Here's what I want to do:

Align the reads (paired-end 76bp
With the unmapped reads (a bam file or fastq : divide each read into little pieces (25 bp
Realign these little reads

Is it a tool to do that ?

Thanks,

N.

read • 1.6k views

ADD COMMENT • link updated 11.9 years ago by Andreas ★ 2.5k • written 11.9 years ago by Nicolas Rosewick 11k

1

Entering edit mode

What's your reason for doing this? I can think of many cases where this is going to be a bad idea, and produce alignments that are just wrong.

ADD REPLY • link 11.9 years ago by Chris Miller 22k

0

Entering edit mode

I've to align my reads against a very little sequence (~20kb) . And I wonder how the aligner manage the extremity of this sequence ? Are the reads aligned with the same accuracy at the extremity of the sequence as in the middle ?

ADD REPLY • link 11.9 years ago by Nicolas Rosewick 11k

1

Entering edit mode

As a suggestion, why don't you assemble all your reads (maybe as ESTs if this is the case), and then BLAST the contigs against your ~20kb sequence? I'm not sure if this makes sense for your biological question, but it makes much more sense from the algorithmical one.

ADD REPLY • link 11.9 years ago by Leonor Palmeira 3.9k

0

Entering edit mode

I work on viral genomes which are much smaller (12kb) and have no trouble whatsoever with aligning short reads against it.

ADD REPLY • link 11.9 years ago by Andreas ★ 2.5k

0

Entering edit mode

[Deleted comment.]

ADD REPLY • link 11.9 years ago by Leonor Palmeira 3.9k

0

Entering edit mode

Unless you specifically know of a reason to worry about those first 100 bases or so, dealing with it is generally not worth your time. If you decide to do remapping of sub-reads for any reason, be aware of the possibility that your 25-bp fragments may erroneously map elsewhere in the genome. (The smaller the sequences, the more locations it will map.)

ADD REPLY • link 11.9 years ago by Chris Miller 22k

score 2 · Answer 1 · 2012-06-08

You can get the ID of unmapped reads from your sam/bam file and grep them from your main fastq (with -A3 grep option) or fasta file (with -A1) or just take the sequence along with read ID's of unmapped reads from sam/bam file and make a new fasta file (may be using awk with sequence storing in next line). For fasta file I used -

perl -nwle '
     $i=0; 
     for my $add (<>=~/.{25}/g) { 
         printf "%s%s\n%s\n", $_, $i++ ? "#$i":"", $add; 
     }' inputfile.fasta

This will cut your reads in fasta file and have "#" with every part -eg -

if you have this read ID-

>ZQMK36301EDYQE

then you wil get

>ZQMK36301EDYQE

>ZQMK36301EDYQE#2

>ZQMK36301EDYQE#3

So this way you can keep the track of parent read ID.

Now you can realign these reads.

score 0 · Answer 2 · 2012-06-08

I'm pretty sure there is no tool to do this, but it should be fairly straight-forward to implement if you know a bit of scripting. As a first step you could align your reads with e.g. BWA. Then you extract the reads that didn't map based on the corresponding samtools flag. Those you then chop up and realign. That's where the trouble starts though: Do you only want the first 25bp or all until read-length? Do you want overlapping reads? And most importantly, why do you want to do that? Are you planning to get the most out of your reads? If so, incrementally shorten reads and trying to realign might be a better idea, but it still sounds like overkill.

Andreas