Question: Breaking down the reference genome chromosome-wise
0
Dataman • 310 (Finland) wrote 5.0 years ago:

I am trying to run controlFREEC (a copy-number analysis tool) on some .bam files that I have downloaded from TCGA. These .bam files were created using the HG19_Broad_Variant reference genome (a .fasta file containing all the chromosomes from the UCSC HG19 primary assembly plus EBV).

In order to run the tool, I need to break the .fasta file down into separate files, each containing the sequence of a single chromosome.

I am about to use the following command in Unix; however, I am not sure whether it would work or what the consequences of doing so would be. Any suggestion would be much appreciated!

    awk '/>/{n++}{print >"chr" n ".fa" }' Homo_sapiens_assembly19_Broad_Variant.fasta 


P.S.:

1. I have downloaded the reference genome from: 

    ftp://ftp.ncbi.nlm.nih.gov/sra/reports/Assembly/GRCh37-HG19_Broad_variant/Homo_sapiens_assembly19.fasta

2. I could not find this reference genome in chromosome-wise format anywhere.

reference_genome fasta • 3.4k views
modified 2.1 years ago by re.ardekani • 0 • written 5.0 years ago by Dataman • 310

Isn't this question the same as https://www.biostars.org/p/105388/?

written 5.0 years ago by dariober • 10k
1

Showing similar posts dynamically while the question is being typed, like Stack Overflow does, would be a good way to avoid repeated questions.

written 5.0 years ago by geek_y • 10.0k
7
Devon Ryan • 92k (Freiburg, Germany) wrote 5.0 years ago:

As RamRS suggested, this is simple with Biopython (or even BioPerl):

#!/usr/bin/env python
from Bio import SeqIO
import sys

# This would be nicer with argparse
if len(sys.argv) != 2:
    sys.exit("Usage: %s file.fa" % sys.argv[0])

# We should ensure the input file exists before parsing it
for rec in SeqIO.parse(sys.argv[1], "fasta"):
    # Write each record to its own file, named after the record ID
    of = open("%s.fa" % rec.id, "w")
    SeqIO.write(rec, of, "fasta")
    of.close()
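
To address the comment about verifying the input, a minimal sketch of the kind of guard one might add near the top (the exact check and message are my own, not part of the original script):

import os.path

# Hypothetical addition: fail early if the input path is not an existing file
if not os.path.isfile(sys.argv[1]):
    sys.exit("Error: %s is not a file" % sys.argv[1])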
written 5.0 years ago by Devon Ryan • 92k
2
geek_y • 10.0k (Barcelona) wrote 5.0 years ago:

You can use this script.

https://github.com/gouthamatla/fasta_File_Manipulation/blob/master/SplitFastaFile.py

It will split the multi-FASTA file into individual files, each named after its sequence ID. You need to edit the script to change the input FASTA file.

written 5.0 years ago by geek_y • 10.0k

Hi,

Quick suggestion: I'd change the "data.fa" to sys.argv[1] in your script. That way, you can supply the fasta file at runtime instead of editing your script each time you run it.

Also, using BioPython should help you parse the input file and write the output more efficiently.
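
For illustration, the suggested change might look like this (a minimal sketch; the variable name is hypothetical, since the linked script may use a different one):

import sys

# Before (hypothetical): input_fasta = "data.fa"  - edited by hand for every run
# After: take the input FASTA path from the command line
input_fasta = sys.argv[1]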

modified 5.0 years ago • written 5.0 years ago by RamRS • 24k
1

Thanks for the suggestion. Done. BioPerl/Biopython are definitely useful, but if you do not have them installed, this quick-and-dirty script will do the job.

modified 5.0 years ago • written 5.0 years ago by geek_y • 10.0k
2

Check out Devon's answer. Using BioPerl or Biopython ensures edge cases are covered and also gives your code a cleaner feel. Quick and dirty does not hold up well when you're writing stuff you plan on using 3 months down the line.

Also, as a bioinformatics person (or a biologist who dabbles in computing), if you do not have Biopython/BioPerl installed, you're probably going to face a LOT of issues.

written 5.0 years ago by RamRS • 24k
3
Jonathan Dursi • 270 (Toronto) wrote 5.0 years ago:

This question is correctly answered, but I just want to make a performance point about file I/O.

Closing and opening files is much more expensive than even large writes, particularly in the middle of repeated writes. It costs time both for what it causes - flushing of all buffers (I/O library and file system) and fsyncs - and for what it prevents, namely buffering and write coalescing. It's often advantageous to move such operations outside of tight loops, particularly if you're going to be writing to the same file again later (which isn't the case in this particular use, but often can be). The difference in performance can be seen on your laptop, and the results are even more dire on a large shared file system such as you might find on a cluster. Consider the following toy example:

#!/usr/bin/env python

import os
import os.path

filename = "test.out"
teststring = "This is a test of file I/O.\n"
nwrites = 100

def with_closes():
    # Open and close the file around every single write
    if os.path.isfile(filename):
        os.remove(filename)
    for writen in range(nwrites):
        of = open(filename, "a")
        of.write(teststring)
        of.close()

def without_closes():
    # Open the file once and keep it open across all writes
    if os.path.isfile(filename):
        os.remove(filename)
    of = open(filename, "a")
    for writen in range(nwrites):
        of.write(teststring)
    of.close()

if __name__ == '__main__':
    import timeit
    print("With opens/closes interleaved:")
    print(timeit.timeit("__main__.with_closes()",
            setup="import __main__",
            number=50))

    print("Without opens/closes interleaved:")
    print(timeit.timeit("__main__.without_closes()",
            setup="import __main__",
            number=50))

Running this gives:

$ ./timings.py
With opens/closes interleaved:
9.02220511436
Without opens/closes interleaved:
0.398212909698

That is, putting the closes in there slowed things down by more than a factor of 20. (It's actually a bit worse than that, because we're including the file deletion in the timings.) The effect is worst for frequent small writes to the same file.

For an assembled reference genome, since there are a comparatively small number of large sequences, and each write is going to be unique (that is, once we write to a file and close it, we're done with it), it's not as big an issue - but still, every little bit helps. Since there are guaranteed to be a modest number of files written to, we can just open them as they're needed and close them all when everything's done:

#!/usr/bin/env python
from Bio import SeqIO
import sys

def filelookup(chrom, file_dict):
    # Open the output file for this chromosome on first use and cache the handle
    if chrom not in file_dict:
        file_dict[chrom] = open("%s.fa" % chrom, "w")

    return file_dict[chrom]

def closefiles(file_dict):
    # Close all cached handles once everything has been written
    for open_file in file_dict.values():
        open_file.close()

if len(sys.argv) != 2:
    sys.exit("Usage: %s file.fa" % sys.argv[0])

files = {}
f = SeqIO.parse(sys.argv[1], "fasta")

for rec in f:
    of = filelookup(rec.id, files)
    SeqIO.write(rec, of, "fasta")

closefiles(files)

Running the two versions gives:

$ time ./open-close.py Homo_sapiens_assembly19.fasta

real    1m49.739s
user    1m8.400s
sys    0m4.212s

$ time ./cache-files.py Homo_sapiens_assembly19.fasta

real    1m19.610s
user    1m7.556s
sys    0m3.324s

So even in this case (with few large writes, and only one write per file) we save about 30% of the time (and note that user+sys time now comes pretty close to equalling real time; the remainder is time spent waiting for I/O). 

If this is only done once, of course, it doesn't really matter, but if it becomes a regular part of a pipeline it starts to add up. And in the more general case, where one's pulling a file into separate components and there _will_ be multiple outputs to the same file, it can make a much more substantial difference.

modified 5.0 years ago • written 5.0 years ago by Jonathan Dursi • 270

While nice, your timing comparisons are irrelevant. The IDs are unique within these files; there will be no appending (there's a reason I used a mode of w).

modified 5.0 years ago • written 5.0 years ago by Devon Ryan • 92k
1

This is certainly true for the OP's case; I just want to make sure that people coming to this question in the future, under different circumstances, know that this can be an issue. And the timing for parsing even this case is noticeably and repeatably faster, but for somewhat more subtle reasons (and it might easily not make any difference on a laptop, say).

modified 5.0 years ago • written 5.0 years ago by Jonathan Dursi • 270

Right, you might want to specify at the beginning that this is only relevant in cases where you have duplicate IDs within a multifasta file (this would be fairly atypical). That usually only occurs if someone mucked up construction of the input file (in which case, that's the real problem).
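
If you want to check whether a multi-FASTA has duplicate IDs before splitting it, here is a minimal sketch with Biopython (the script and its usage are my own illustration):

#!/usr/bin/env python
from collections import Counter
from Bio import SeqIO
import sys

# Count how many times each record ID appears in the input FASTA
counts = Counter(rec.id for rec in SeqIO.parse(sys.argv[1], "fasta"))
dups = [name for name, n in counts.items() if n > 1]
if dups:
    sys.exit("Duplicate IDs found: %s" % ", ".join(dups))
print("All record IDs are unique.")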

written 5.0 years ago by Devon Ryan • 92k
1

Changes made; let me know if you have other suggestions.

written 5.0 years ago by Jonathan Dursi • 270
2

Cool. Of course if performance were really critical then one also wouldn't use python :)

written 5.0 years ago by Devon Ryan • 92k

There is that :)

written 5.0 years ago by Jonathan Dursi • 270

I'll make the appropriate updates.

modified 5.0 years ago • written 5.0 years ago by Jonathan Dursi • 270
0
re.ardekani • 0 wrote 2.1 years ago:

You can do this with an awk one-liner:

awk '/^>/ { filename = substr($1, 2) ".fasta" } { print > filename }' INPUT_GENOME_FILE

This writes each record, header line included, to a file named after the first word of the header (minus the leading ">"). Note that awk's ">" truncates each output file the first time it is used within a run and appends thereafter.

written 2.1 years ago by re.ardekani • 0