Divide Reads Into Little Pieces
2
0
Entering edit mode
11.9 years ago

Hi,

Is it possible to divide reads into little pieces ?

Here's what I want to do:

  1. Align the reads (paired-end 76bp
  2. With the unmapped reads (a bam file or fastq : divide each read into little pieces (25 bp
  3. Realign these little reads

Is it a tool to do that ?

Thanks,

N.

read • 1.6k views
ADD COMMENT
1
Entering edit mode

What's your reason for doing this? I can think of many cases where this is going to be a bad idea, and produce alignments that are just wrong.

ADD REPLY
0
Entering edit mode

I've to align my reads against a very little sequence (~20kb) . And I wonder how the aligner manage the extremity of this sequence ? Are the reads aligned with the same accuracy at the extremity of the sequence as in the middle ?

ADD REPLY
1
Entering edit mode

As a suggestion, why don't you assemble all your reads (maybe as ESTs if this is the case), and then BLAST the contigs against your ~20kb sequence? I'm not sure if this makes sense for your biological question, but it makes much more sense from the algorithmical one.

ADD REPLY
0
Entering edit mode

I work on viral genomes which are much smaller (12kb) and have no trouble whatsoever with aligning short reads against it.

ADD REPLY
0
Entering edit mode

[Deleted comment.]

ADD REPLY
0
Entering edit mode

Unless you specifically know of a reason to worry about those first 100 bases or so, dealing with it is generally not worth your time. If you decide to do remapping of sub-reads for any reason, be aware of the possibility that your 25-bp fragments may erroneously map elsewhere in the genome. (The smaller the sequences, the more locations it will map.)

ADD REPLY
2
Entering edit mode
11.9 years ago
Vikas Bansal ★ 2.4k

You can get the ID of unmapped reads from your sam/bam file and grep them from your main fastq (with -A3 grep option) or fasta file (with -A1) or just take the sequence along with read ID's of unmapped reads from sam/bam file and make a new fasta file (may be using awk with sequence storing in next line). For fasta file I used -

perl -nwle '
     $i=0; 
     for my $add (<>=~/.{25}/g) { 
         printf "%s%s\n%s\n", $_, $i++ ? "#$i":"", $add; 
     }' inputfile.fasta

This will cut your reads in fasta file and have "#" with every part -eg -

if you have this read ID-

>ZQMK36301EDYQE

then you wil get

>ZQMK36301EDYQE

>ZQMK36301EDYQE#2

>ZQMK36301EDYQE#3

So this way you can keep the track of parent read ID.

Now you can realign these reads.

ADD COMMENT
0
Entering edit mode
11.9 years ago
Andreas ★ 2.5k

I'm pretty sure there is no tool to do this, but it should be fairly straight-forward to implement if you know a bit of scripting. As a first step you could align your reads with e.g. BWA. Then you extract the reads that didn't map based on the corresponding samtools flag. Those you then chop up and realign. That's where the trouble starts though: Do you only want the first 25bp or all until read-length? Do you want overlapping reads? And most importantly, why do you want to do that? Are you planning to get the most out of your reads? If so, incrementally shorten reads and trying to realign might be a better idea, but it still sounds like overkill.

Andreas

ADD COMMENT

Login before adding your answer.

Traffic: 1662 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6