Question

FASTA filtering & sorting for SAGE [String-overlap Assembly of GEnomes]


8.6 years ago

adhb • 0

Hi all,

I have a single FASTA file of paired-end reads intended for mitochondrial de novo assembly with SAGE [String-overlap Assembly of GEnomes, not Serial Analysis of Gene Expression]. I have already run it through the error-correction software RACER, but some lingering format issues need to be cleared up before I can run SAGE.

Unix/perl solutions preferred.

(1) Remove all reads that aren't 90 bases long (discard or write into new file)

(2) Remove unpaired reads - i.e., remove those reads for which the ID does not exactly match any other ID in the file (discard or write into new file)

(3) Reorder reads alphabetically so the forward and reverse reads are interleaved

Sorry to post a multi-part problem, but I think it's a set of simple tasks for which I can't find leads in other posts. Help with one or more of the tasks would be greatly appreciated.

next-gen Assembly genome • 1.5k views

ADD COMMENT • link updated 19 months ago by Ram 43k • written 8.6 years ago by adhb • 0

score 0 · Answer 1 · 2015-09-09


8.6 years ago

h.mon 35k

Use programs that are aware of paired reads and respect the pairing, so you do not have to worry about (2) and (3). My current Swiss Army knife is BBTools: reformat.sh should do everything you want.
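If you would rather script the three steps yourself, here is a minimal sketch in Python (not the Perl asked for, so treat it as an outline to port). It assumes that both mates carry exactly the same ID, that the whole header line is the ID, and that the forward read comes before the reverse read in the input; none of that is confirmed by the question beyond the "exactly matching ID" wording. The function names and the rule of dropping any ID that does not occur exactly twice are my own choices for illustration.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) tuples; wrapped sequences are joined."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_pair_sort(records, read_len=90):
    """Apply steps (1)-(3) and return a list of (id, sequence) tuples."""
    by_id = defaultdict(list)
    for header, seq in records:
        # (1) keep only reads of exactly read_len bases
        if len(seq) == read_len:
            by_id[header].append(seq)
    kept = []
    # (3) sort by ID so both mates of a pair end up adjacent
    for read_id in sorted(by_id):
        seqs = by_id[read_id]
        # (2) assumption: a valid pair is an ID seen exactly twice
        if len(seqs) == 2:
            kept.extend((read_id, s) for s in seqs)
    return kept


def process_file(in_path, out_path, read_len=90):
    """Read a FASTA file, filter it, and write the interleaved result."""
    with open(in_path) as src:
        kept = filter_pair_sort(read_fasta(src), read_len)
    with open(out_path, "w") as dst:
        for read_id, seq in kept:
            dst.write(f">{read_id}\n{seq}\n")
```

Calling `process_file("reads.fasta", "reads.filtered.fasta")` (hypothetical file names) would write the filtered, interleaved reads. Because the length filter runs first, a read whose mate is not 90 bases long is dropped as unpaired too. Everything is held in memory, which should be acceptable for a mitochondrial data set but may not scale to a whole-genome run.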
