Question: Split fasta/fastq into individual fasta files (ubuntu)
2.4 years ago by
omer.k40
omer.k40 wrote:

Hi, I have several fastq files from a MinION sequencing run, each containing about 4,000 reads, for a total of close to 20,000 reads. I first consolidated these fastq files into one fastq file. When googling for commands to split it into individual fasta files, I found that the input should be one large fasta file (but let me know if that isn't necessary). Therefore I used a command to create a fasta file from that fastq file. The separator there is, obviously, '>'.

Now I'd like to split this file into individual fasta files (again, I have close to 20,000 reads in this file). I found the following command and tried it:

 awk '/^>/{s=++d".fasta"} {print > s}' file.fa

But it failed after successfully writing around 1,000 files.

Can anyone suggest a quick command that I can use to extract single-read fasta files from a huge fastq/fasta file?

linux script minion ngs • 1.3k views
modified 2.4 years ago by Pierre Lindenbaum126k • written 2.4 years ago by omer.k40

Do you want to split the file or just convert the large file to fasta format? Are you asking for each read to be put into its own file?

written 2.4 years ago by genomax78k

Indeed: split one huge fasta/fastq file into multiple, individual fasta files.

written 2.4 years ago by omer.k40

In that case, either use the solution posted by @Manu below or use faSplit from Jim Kent's utilities.

written 2.4 years ago by genomax78k

Of course, it should not be a nightmare to write a script that takes a fastq file as input and delivers per-read fasta files as output. One issue you may encounter, depending on your filesystem I guess, is managing the number of inodes. One idea would be to spread the fasta files across different folders.
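A minimal sketch of such a script, not taken from the thread. It assumes strict four-line FASTQ records (no wrapped sequence lines), and the names demo_fastq, reads.fastq, per_read and the per_dir bucket size are made up for illustration. Each read goes to its own fasta file, files are grouped 1000 per subfolder, and each file is closed as soon as it has been written:

```shell
# Tiny made-up FASTQ with two reads, just so the sketch runs end to end.
mkdir -p demo_fastq
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n' > demo_fastq/reads.fastq

# Assumption: every record is exactly 4 lines (header, sequence, '+', qualities).
awk -v outdir=demo_fastq/per_read -v per_dir=1000 '
  NR % 4 == 1 {
      n++
      # Bucket reads into numbered subfolders: 0000, 0001, ...
      dir = sprintf("%s/%04d", outdir, int((n - 1) / per_dir))
      if ((n - 1) % per_dir == 0) system("mkdir -p " dir)
      out = sprintf("%s/%d.fasta", dir, n)
      printf(">%s\n", substr($0, 2)) > out   # replace the leading @ with >
  }
  NR % 4 == 2 {
      print > out                            # the sequence line
      close(out)                             # release the file before the next read
  }
' demo_fastq/reads.fastq
```

With real data, point the awk command at the consolidated fastq file and choose a per_dir value that suits the filesystem.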

If you start with a fasta file, you may want to take a look at my answer here: A: Pyfasta Split By Header
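For the fasta starting point, the awk one-liner from the question can probably be kept with one change. The thread does not say why it stopped after roughly 1,000 files; a plausible cause (an assumption, not confirmed here) is that awk keeps every output file open and runs into the per-process open-file limit, which `ulimit -n` reports and which is often 1024. A minimal sketch that closes each file before opening the next, using made-up demo data and a hypothetical demo_fasta folder:

```shell
# Tiny made-up FASTA with two records; seq1 has a wrapped sequence.
mkdir -p demo_fasta
printf '>seq1\nACGT\nTTAA\n>seq2\nGGCC\n' > demo_fasta/file.fa

awk -v outdir=demo_fasta '
  /^>/ {
      if (s != "") close(s)              # close the previous record file
      s = outdir "/" (++d) ".fasta"      # 1.fasta, 2.fasta, ...
  }
  { print > s }                          # header and sequence lines go to the current file
' demo_fasta/file.fa
```

Only one output file is open at any time, so the number of records should no longer matter for the open-file limit.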

modified 2.4 years ago • written 2.4 years ago by Manu Prestat3.9k
2.4 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum126k wrote:

Use awk to linearize the input fasta, split to break it into chunks, then loop over the created files to convert them back to fasta:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' ~/nsp3.fasta |
  split -l 1000 - splitseq &&
  ls splitseq* | grep -v fasta |
  while read F ; do tr "\t" "\n" < "${F}" > "${F}.fasta" && rm "${F}" ; done
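To make the three steps easier to follow, here is the same idea run step by step on a tiny made-up input, with a chunk size of 2 instead of 1000. The demo_split folder and the file names are hypothetical. Note that `split -l` sets how many sequences land in each output file, so `-l 1000` gives 1000 records per file rather than one file per read:

```shell
# Made-up input: three records, the first with a wrapped sequence.
mkdir -p demo_split
printf '>a\nAC\nGT\n>b\nGG\n>c\nTT\n' > demo_split/in.fasta

# Step 1: linearize. Each record becomes one line: header, TAB, joined sequence.
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' \
    demo_split/in.fasta > demo_split/linear.tsv

# Step 2: one line is one record, so split -l N puts N records in each chunk.
split -l 2 demo_split/linear.tsv demo_split/chunk_

# Step 3: turn each TAB back into a newline to get fasta again.
for f in demo_split/chunk_*; do
    case "$f" in *.fasta) continue ;; esac   # skip files converted by an earlier run
    tr '\t' '\n' < "$f" > "$f.fasta" && rm "$f"
done
```

A side effect of the linearize step is that wrapped sequences come out on a single line in the resulting files.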
modified 2.4 years ago • written 2.4 years ago by Pierre Lindenbaum126k