Question

Breaking down the reference genome chromosome-wise

0

Entering edit mode

9.5 years ago

Dataman ▴ 380

I am trying to run controlFREEC (a CNA tool) on some .bam files that I have downloaded from TCGA. These .bam files are created using the HG19_Broad_Variant reference genome (which is a .fasta file containing all the chromosomes from UCSC HG19 primary assembly plus EPV).

In order to run the tool, I need to break down the .fasta file into files containing only the sequence for one chromosome.

I am about to use the following command in unix, however, I am not sure if this would work and of the consequences of doing so. Any suggestion would be much appreciated!

awk '/>/{n++}{print >"chr" n ".fa" }' Homo_sapiens_assembly19_Broad_Variant.fasta

P.S:

I have downloaded the reference genome from: ftp://ftp.ncbi.nlm.nih.gov/sra/reports/Assembly/GRCh37-HG19_Broad_variant/Homo_sapiens_assembly19.fasta
I could not find this reference genome in chromosome-wise format anywhere.

reference_genome fasta • 5.7k views

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by Dataman ▴ 380

0

Entering edit mode

Isn't this question the same as https://www.biostars.org/p/105388/?

ADD REPLY • link 9.5 years ago by dariober 14k

1

Entering edit mode

showing similar posts dynamically while typing the question would be good to avoid repeated questions, like stackoverflow.

ADD REPLY • link 9.5 years ago by GouthamAtla 12k

3

Entering edit mode

9.5 years ago

Jonathan Dursi ▴ 270

This question is correctly answered, but I just want to make a performance point about file I/O.

Closing and opening files is much more expensive than even large writes, particularly in the middle of repeated writes. It takes a lot of time both for what it causes - flushing of all buffers (io library and file system) and fsyncs - but also for what it prevents, which is buffering and write coallecense. It's often advantageous if you can move such operations outside of tight loops - particularly if you're going to be rewriting to the same file later (which isn't the case in this particular use, but often can be). The difference in performance can be seen on your laptop, but the results are even more dire on a large shared file system as one might see on a cluster. So consider the following toy example:

#!/usr/bin/env python

import os
import os.path

filename="test.out"
teststring="This is a test of file I/O.\n"
nwrites=100

def with_closes():
    if os.path.isfile(filename):
        os.remove(filename)
    for writen in xrange(nwrites):
        of=open(filename,"a")
        of.write(teststring)
        of.close()

def without_closes():
    if os.path.isfile(filename):
        os.remove(filename)
    of=open(filename,"a")
    for writen in xrange(nwrites):
        of.write(teststring)
    of.close()

if __name__ == '__main__':
    import timeit
    print("With opens/closes interleaved:")
    print(timeit.timeit("__main__.with_closes()", 
            setup="import __main__",
            number=50))

    print("Without opens/closes interleaved:")
    print(timeit.timeit("__main__.without_closes()", 
            setup="import __main__",
            number=50))

Running this gives:

$ ./timings.py
With opens/closes interleaved:
9.02220511436
Without opens/closes interleaved:
0.398212909698

That is, putting closes in there slowed things down by nearly a factor of 20. (It's actually a bit worse than that, because we're including the file deletion in the timings). This is worst for frequent small writes to the same file.

For an assembled reference genome, since there are a comparitively small number of large sequences, and each write is going to be unique (that is, once we write to a file and close it, we're done with it), it's not as big an issue, but still, every little bit helps. Since there are guaranteed to be a modest number of files written to, we can just open them as they're needed and close them when everything's done:

#!/usr/bin/env python
from Bio import SeqIO
import sys

def filelookup(chrom, file_dict):
    if not chrom in file_dict:
        file_dict[chrom] = open("%s.fa" % (rec.id), "w")

    return file_dict[chrom]

def closefiles(file_dict):
    for open_file in file_dict.values():
        open_file.close()

if(len(sys.argv) != 2) :
    sys.exit("Usage: %s file.fa" % sys.argv[0])

files={}
f=SeqIO.parse(sys.argv[1], "fasta")

for rec in f :
    of=filelookup(rec.id, files);
    SeqIO.write(rec, of, "fasta")

closefiles(files)

Running gives

$ time ./open-close.py Homo_sapiens_assembly19.fasta

real    1m49.739s
user    1m8.400s
sys    0m4.212s

$ time ./cache-files.py Homo_sapiens_assembly19.fasta

real    1m19.610s
user    1m7.556s
sys    0m3.324s

So even in this case (with few large writes, and only one write per file) we save about 30% of the time (and note that user+sys time now comes pretty close to equalling real time; the remainder is time spent waiting for I/O).

If this only is done once, of course, it doesn't really matter, but if it becomes a regular part of a pipeline it starts to add up. And in the more general case where one's pulling a file into separate components where there _will_ be multiple outputs to the same file, it can make a much more substantial difference.

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by Jonathan Dursi ▴ 270

0

Entering edit mode

While nice, your timing comparisons are irrelevant. The IDs are unique within these files; there will be no appending (there's a reason I used a mode of w).

ADD REPLY • link 9.5 years ago by Devon Ryan 104k

1

Entering edit mode

This is certainly true for the OPs case; I just want to make sure people coming to this question in the future under different circumstances know that this can be an issue. And the timing for parsing even this case is noticably and repeatedly faster, but for somewhat more subtle reasons (and it might easily not make any difference on a laptop, say).

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by Jonathan Dursi ▴ 270

0

Entering edit mode

Right, you might want to specify at the beginning that this is only relevant in cases where you have duplicate IDs within a multifasta file (this would be fairly atypical). That usually only occurs if someone mucked up construction of the input file (in which case, that's the real problem).

ADD REPLY • link 9.5 years ago by Devon Ryan 104k

1

Entering edit mode

Changes made; let me know if you have other suggestions.

ADD REPLY • link 9.5 years ago by Jonathan Dursi ▴ 270

2

Entering edit mode

Cool. Of course if performance were really critical then one also wouldn't use python :)

ADD REPLY • link 9.5 years ago by Devon Ryan 104k

0

Entering edit mode

There is that :)

ADD REPLY • link 9.5 years ago by Jonathan Dursi ▴ 270

0

Entering edit mode

I'll make the appropriate updates.

ADD REPLY • link 9.5 years ago by Jonathan Dursi ▴ 270

0

Entering edit mode

6.5 years ago

re.ardekani • 0

You can do this using an AWK one liner

awk '{ r = match($1, "^>"); if (r != 0) {filename= substr( $1, 2, length($1))".fasta"; print filename} else {print $1 >> filename} }' INPUT_GENOME_FILE

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 6.5 years ago by re.ardekani • 0

score 7 · Accepted Answer · 2014-11-03

As RamRS suggested, this is simple with biopython (or even bioperl):

#!/usr/bin/env python
from Bio import SeqIO
import sys

#This would be nicer with argparse
if(len(sys.argv) != 2) :
    sys.exit("Usage: %s file.fa" % sys.argv[0])

f=SeqIO.parse(sys.argv[1], "fasta") #We should ensure this exists
for rec in f :
    of=open("%s.fa" % (rec.id), "w")
    SeqIO.write(rec, of, "fasta")
    of.close()
f.close()

score 2 · Accepted Answer · 2014-11-03

2

Entering edit mode

9.5 years ago

GouthamAtla 12k

You can use this script.

https://github.com/gouthamatla/fasta_File_Manipulation/blob/master/SplitFastaFile.py

It will split the mult-fasta file into individual files with the same name. You need to edit the script to to change the input fasta file.

ADD COMMENT • link 9.5 years ago by GouthamAtla 12k

0

Entering edit mode

Hi,

Quick suggestion: I'd change the "data.fa" to sys.argv[1] in your script. That way, you can supply the fasta file at runtime instead of editing your script each time you run it.

Also, using BioPython should help you parse the input file and write the output more efficiently.

ADD REPLY • link 9.5 years ago by Ram 43k

1

Entering edit mode

Thanks for the suggestion.Done. Bioperl/pythonare definitely useful. But if you do not have them installed, this quick and dirty script will do the job.

ADD REPLY • link 9.5 years ago by GouthamAtla 12k

2

Entering edit mode

Check out Devon's answer. Using BioPerl and BioPython ensure edge cases are covered and also give your code a cleaner feel. Quick and dirty does not do great when you're writing stuff you plan on using 3 months down the line.

Also, as a Bioinformatics person (or a Biologist that dabbles in computing), if you do not have BioPython/BioPerl installed, you're probably gonna face a LOT of issues.

ADD REPLY • link 9.5 years ago by Ram 43k