Question: How to safely split a FASTA file containing multiple concatenated sequences
akhst7 (United States) wrote, 6.2 years ago:

Hi 

I have a FASTA file that contains multiple concatenated human virus reference sequences from NCBI. Each sequence has the usual NCBI header (starting with 'gi'), as follows:

  • >gi|109390382|ref|NC_008188.1| Human papillomavirus type 103, complete genome
  • >gi|109390389|ref|NC_008189.1| Human papillomavirus type 101, complete genome
  • >gi|110645916|ref|NC_001401.2| Adeno-associated virus - 2, complete genome
  • >gi|134133206|ref|NC_009225.1| Torque teno midi virus 1, complete genome
  • >gi|134288556|ref|NC_009238.1| KI polyomavirus Stockholm 60, complete genome
  • >gi|139424470|ref|NC_009334.1| Human herpesvirus 4, complete genome
  • >gi|139472801|ref|NC_009333.1| Human herpesvirus 8, complete genome
  • >gi|148724565|ref|NC_009539.1| WU Polyomavirus, complete genome
  • >gi|155573622|ref|NC_006273.2| Human herpesvirus 5 strain Merlin, complete genome
  • >gi|165973999|ref|NC_010277.1| Merkel cell polyomavirus, complete genome
  • >gi|167600365|ref|NC_010329.1| Human papillomavirus type 88, complete genome

The file is about 3.2 MB, and I'd like to split it into two or more smaller files without cutting a virus sequence in two at the end of a file. Is there an easy or clever way to accomplish this?

 

Thanks in advance.

genome • 2.6k views

Thanks for the posts. Are there any scripts using sed/awk, which may or may not be a simpler solution than the others?

(comment by akhst7, written 6.2 years ago)
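
Not sed/awk, but the idea behind such a script is small enough to sketch in Python: read the file record by record (a record starts at each '>' header line) and only ever begin a new output file at a header, so no sequence is cut. The function names, the `.partN` output naming and the size-balancing rule are illustrative assumptions, not what faSplit, bioawk or pyfasta do internally.

```python
from pathlib import Path


def read_records(lines):
    """Yield each FASTA record as a list of lines, beginning at its '>' header."""
    record = []
    for line in lines:
        if line.startswith(">") and record:
            yield record
            record = []
        record.append(line)
    if record:
        yield record


def split_fasta(path, n_parts):
    """Split `path` into at most n_parts files, cutting only at header lines.

    Records keep their original order; each record goes to the part in which
    its first character falls when the file is divided evenly by size.
    Returns the list of output paths that were written.
    """
    path = Path(path)
    with path.open() as handle:
        records = list(read_records(handle))  # whole file in memory; fine for a few MB
    if not records:
        return []

    total = sum(len(line) for rec in records for line in rec)
    parts = [[] for _ in range(n_parts)]
    offset = 0
    for rec in records:
        index = min(n_parts - 1, offset * n_parts // total)
        parts[index].extend(rec)
        offset += sum(len(line) for line in rec)

    out_paths = []
    for i, lines in enumerate(parts):
        if not lines:
            continue  # a very long record can leave a part empty
        # Hypothetical naming scheme: input.fasta -> input.part0.fasta, ...
        out = path.with_name(f"{path.stem}.part{i}{path.suffix}")
        out.write_text("".join(lines))
        out_paths.append(out)
    return out_paths
```

Calling `split_fasta("viruses.fasta", 2)` (file name hypothetical) would write files such as `viruses.part0.fasta` and `viruses.part1.fasta` next to the input. Because parts are balanced by where each record starts, the outputs are only roughly equal in size when individual genomes differ a lot in length.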
Vivek (Denmark) wrote, 6.2 years ago:

faSplit from Jim Kent's resources is a suitable tool for the job.

http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/

Caddymob (United States) wrote, 6.2 years ago:

Check out bioawk from Heng Li: https://github.com/lh3/bioawk - and the great tutorial from Vince Buffalo https://github.com/vsbuffalo/bioawk-tutorial

Giovanni M Dall'Olio (London, UK) wrote, 6.2 years ago:

Try pyfasta:

pyfasta split -n 2 original.fasta
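
Whichever tool does the splitting, a quick sanity check that no record was broken may be worthwhile. The sketch below is a hedged illustration in Python: the function names are made up, and it assumes the splitter copies header lines unchanged (it does not assume the record order is preserved). It confirms that every part starts with a '>' header and that the parts together hold the same headers and the same number of sequence characters as the original.

```python
def summarize(path):
    """Return (sorted headers, sequence character count, starts_with_header)."""
    headers, residues, first = [], 0, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if first is None:
                first = line
            if line.startswith(">"):
                headers.append(line)
            else:
                residues += len(line)
    starts_ok = first is not None and first.startswith(">")
    return sorted(headers), residues, starts_ok


def check_split(original, parts):
    """Return True if `parts` together contain exactly the records of `original`.

    An empty part, or a part whose first non-blank line is not a header,
    makes the check fail.
    """
    all_headers, total, _ = summarize(original)
    part_headers, part_total = [], 0
    for part in parts:
        headers, residues, starts_ok = summarize(part)
        if not starts_ok:
            return False
        part_headers.extend(headers)
        part_total += residues
    return sorted(part_headers) == all_headers and part_total == total
```

For example, `check_split("original.fasta", ["part_a.fasta", "part_b.fasta"])` (part names are placeholders; use whatever names the splitter produced) returns `False` if a sequence was cut in the middle or a record went missing.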