How to split fasta into seperate files by chromosome (in the header)
3
3
Entering edit mode
5.5 years ago
ethanagbaker ▴ 30

I have 1000 fasta files that have simulated reads, and I want to split each of these 1000 files into separate files (one per chromosome) as I need this for some further analysis. The header lines look like this >chr1:startpos:endpos.

I wrote this code (https://gist.github.com/ethanagbaker/6e40c58127b7ca8b9242) in Python, and it works, but it is very slow (ie it has been running for more than 24 hours on a cluster). This seems like it should be a really quick thing to do. Is there a better way to be doing this?

Thanks!

fasta split python genome sequence • 11k views
ADD COMMENT
2
Entering edit mode
ADD REPLY
0
Entering edit mode

faSplit from Jim Kent @UCSC. Link is for Linux but OS X version available along with source.

ADD REPLY
10
Entering edit mode
5.5 years ago
$ pip install pyfaidx
$ faidx -x sequences.fa

Or you can use the Fasta class and write your own script to do the same thing. Your code is slow because it is opening a bunch of files in a loop, and then opening (the same files?) and reading them completely to get the sequences. Indexed file access will be orders of magnitude faster.

ADD COMMENT
3
Entering edit mode
5.5 years ago

There are three reasons why your Python code works very slow:

  1. You unnecessarily open and close each outut file million of times. Opening/closing files is very expensive. Most likely, you'll reduce the running time greatly if you open all the files for writing at the beginning. I suggest that you keep these files in a dictionary (d = {(name1, chr1): open(pwd+name+"_"+chr+".fa", 'w'), ...}. Then, in your forloop where you iterate over fasta records, put a variable output = d[(name, chrom)] and then write a record to that file by output.write(record.format('fasta')).
  2. You're calling SeqIO.write(...) many times instead of once. Instead, use: output.write(record.format('fasta')).
  3. Parsing FASTA using Biopython checks for some input errors and this might take some time. It does more than you need in your case.
ADD COMMENT
0
Entering edit mode
5.5 years ago
novice ★ 1.0k

Hi Ethan,

Try this:

#!/usr/bin/perl

while (<>) {
  open $fh, '>>', $1 if /\A>(.+?):/; 
  print $fh $_;
}

Use with a glob to include all your reads, like:

$ perl chr_split.pl reads*.fasta
ADD COMMENT

Login before adding your answer.

Traffic: 1248 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6