How To Efficiently Pair Paired-End Sequencing Reads?
12.9 years ago by Gangcai ▴ 230

Dear all, I currently have two files of aligned data in BED format, each containing one end of the paired-end sequencing results, say s11.bed and s22.bed. The two files contain different numbers of reads; some reads can be paired and some cannot. My usual procedure is to concatenate the two files, sort the combined file (with the Linux sort command), and then scan it line by line. Because paired read IDs end up next to each other in the sorted file (e.g. DBV2SVN164712082000517676201 and DBV2SVN164712082000517676202), I only need to remember the previous line's ID to do the pairing. However, as the alignment files grow larger and larger, sorting becomes quite time- and memory-consuming. Does anybody here have a better way to do this kind of thing? Thanks in advance!
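
For reference, here is a minimal sketch of this sort-then-scan approach (assuming the read ID sits in BED column 4 and that mates share an ID apart from the trailing 1/2):

    cat s11.bed s22.bed | sort -k4,4 | awk '
        { id = substr($4, 1, length($4) - 1) }        # drop the trailing mate digit
        id == prev_id { print prev_line "\t" $0 }     # mates end up on adjacent lines
        { prev_id = id; prev_line = $0 }
    ' > paired.tsv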

paired next-gen sequencing sequence algorithm retrieval • 4.2k views
By the way, the genome is the yeast genome, so even if I split the file into sub-files by chromosome, each sub-file is still very large.

12.9 years ago by Leszek 4.2k

Why do you use BED for that purpose, anyway? Wouldn't it be easier with SAM/BAM? Samtools offers fast sorting, easy retrieval of properly paired reads, and you can store your reads as compressed .bam (drastically reduced file size, yet still fast access, since the file is indexed). Here is a summary of my pipeline (I work with 10-20 Mb genomes as well):


A bit of code for sorting & compressing to BAM:

    echo "Converting to BAM, sorting and indexing..."
    samtools view -but indexed_reference.fa.fai sample.sam -o sample.unsorted.bam
    samtools sort sample.unsorted.bam sample
    samtools index sample.bam

    echo "Cleaning-up..."
    rm sample.unsorted.bam sample.sam

Then you can retrieve the properly paired reads with samtools by specifying the appropriate flag (see the manual), i.e. samtools view sample.bam -f 147 or -f 99 (here is a nice bitwise flag explanation), or use samtools mpileup, which summarizes the number of reads and all read calls for each base of the reference, taking into account only properly paired reads.
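
For instance, a rough sketch (file names follow the example above; the flag values are the ones mentioned):

    # keep only alignments with the "properly paired" bit (0x2) set
    samtools view -b -f 2 sample.bam > sample.paired.bam
    # or pull out the two proper-pair orientations mentioned above
    samtools view sample.bam -f 99  > first_mate_fwd.sam
    samtools view sample.bam -f 147 > second_mate_rev.sam
    # per-base summary; by default mpileup skips anomalous (not properly paired) read pairs
    samtools mpileup -f indexed_reference.fa sample.bam > sample.pileup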

Finally, you can visualise the alignments in the .bam with IGV.

Hope that helps.

12.9 years ago by Nibua ▴ 70

If I were you, I would use associative structures (hash tables in Perl or a map in C++, for example). It is easy to build in limits to keep memory usage down. I can give you the code for this if you are interested, but I need to know the exact format of your input files.
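
For example, a minimal sketch of that idea using awk's associative arrays, assuming (as in the question) BED input with the read ID in column 4 and mates differing only in a trailing 1/2 (adjust the column for other formats):

    # load the IDs (and lines) of the first file into a hash, then stream the
    # second file and print a pair whenever its stripped ID is already in the hash
    awk 'BEGIN { OFS = "\t" }
         NR == FNR { seen[substr($4, 1, length($4) - 1)] = $0; next }
         { id = substr($4, 1, length($4) - 1);
           if (id in seen) print seen[id], $0 }' s11.bed s22.bed > paired.tsv

Memory grows only with the file loaded first, so pass the smaller of the two as the first argument.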

Thanks for your reply. The input files are in BLAT PSL format. One more question: do you mean putting the IDs into a hash table directly? If so, that will still consume a lot of memory for tens of millions of IDs.

12.9 years ago by Ketil 4.1k

I don't think you can do much better than sorting, which takes O(n log n) time and linear space. If you sort the files individually (probably using 'sort -k 14'), you can perhaps use 'join' to merge the two files by query sequence, so that each line contains the hits for both ends?
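
A minimal sketch of that sort-and-join idea, assuming BED input with the read ID in column 4 and mates differing only in a trailing 1/2 (change the key column for other formats):

    # prepend a join key (the ID without its mate digit), sort on it, then join the two files
    awk 'BEGIN{OFS="\t"} {print substr($4, 1, length($4) - 1), $0}' s11.bed | LC_ALL=C sort -k1,1 > s11.keyed
    awk 'BEGIN{OFS="\t"} {print substr($4, 1, length($4) - 1), $0}' s22.bed | LC_ALL=C sort -k1,1 > s22.keyed
    LC_ALL=C join -t "$(printf '\t')" s11.keyed s22.keyed > paired.tsv   # one line per pair, both ends side by side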
