Question: Is it possible to read a colon-delimited description from a FASTA file?
bfeeny (United States) wrote 4.6 years ago:

I am working with human_g1k_v37.fasta which is found on the 1000genomes site, specifically: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/

I have parsed this into files of individual chromosomes.


The header for chromosome 1 looks like so:

>1 dna:chromosome chromosome:GRCh37:1:1:249250621:1

So I have an index of 1, with the rest of the data as part of the description.

My understanding is that this header can be explained as follows:

coord_system_name = chromosome
coord_system_version = GRCh37
seq_region.name = 1
seq_region.start = 1
seq_region.length = 249250621
seq_region.strand = 1

My question is: is there anything in Biopython that can read these values in? I am simply identifying the file I am reading as a file of type "fasta". Must I parse this out manually by splitting on colons, or do functions already exist in Biopython that can do this for me?

Here is an example of the code I use to read in this file:
 

from Bio import SeqIO

def read_fasta_file(filename, file_format="fasta"):
    # Plain "r" replaces the old "rU" mode, which Python 3.11 removed.
    with open(filename, "r") as handle:
        for record in SeqIO.parse(handle, file_format):
            print("ID %s" % record.id)
            print("Sequence length %i" % len(record))
            print("Sequence desc %s" % record.description)
            # record.seq.alphabet is omitted: Biopython 1.78 removed alphabets.

 

Tags: biopython, 1000genomes, fasta

Devon Ryan (Freiburg, Germany) wrote 4.6 years ago:

The question is what you actually want to get from the description. There's no standard formatting for it, it can be free text (though it's structured in this case), so it won't be automagically parsed.
 

written 4.6 years ago by Devon Ryan

Mainly to ensure it's GRCh37, and to know which chromosome it is. I have seen multiple file types store "annotations", sort of like key/value pairs. I figured there might be some tools in Biopython to help me get this information from any FASTA file.

 
written 4.6 years ago by bfeeny

You'll have to just parse the description with a regex. If you read through the README file that describes what you downloaded, you'll see that it's mostly GRCh37, with the MT sequence changed.

written 4.6 years ago by Devon Ryan
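
The regex approach suggested above could look roughly like the following. This is only a sketch: it assumes every header follows the same layout as the chromosome 1 line in the question, the field names are taken from the poster's own reading of that header, and `HEADER_RE` and `parse_description` are hypothetical names introduced here, not Biopython functions. The input string is what Biopython would hand back as `record.description`.

```python
import re

# Matches the structured part of a description such as
#   dna:chromosome chromosome:GRCh37:1:1:249250621:1
# Field names follow the poster's interpretation and are assumptions.
# (With start = 1, the "length" field and an end coordinate coincide.)
HEADER_RE = re.compile(
    r"(?P<coord_system>\w+):(?P<version>[^:\s]+):(?P<name>[^:\s]+):"
    r"(?P<start>\d+):(?P<length>\d+):(?P<strand>-?\d+)"
)

def parse_description(description):
    """Return a dict of header fields, or None if the layout is absent."""
    match = HEADER_RE.search(description)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields from strings to integers.
    for key in ("start", "length", "strand"):
        fields[key] = int(fields[key])
    return fields

desc = "1 dna:chromosome chromosome:GRCh37:1:1:249250621:1"
info = parse_description(desc)
print(info)
```

Checking `info["version"] == "GRCh37"` then covers the "ensure it's GRCh37" case, and a `None` result signals a header that does not carry this structure at all.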

Thanks. I want my program to be able to read a FASTA file and identify which chromosome(s) are in it, so it can do proper sequencing to the reference chromosome(s). Obviously my reference chromosome has this colon-delimited header. Would a typical FASTA header, if there is such a thing, have a fairly standard way to identify which chromosome is being passed in? I guess I could assume that anything my program uses is human, read just the index, and ignore the rest of the header entirely. Does that sound right?

written 4.6 years ago by bfeeny

The chromosome name is what follows the ">", so chromosome 1 in your case. The remainder of what you showed is typically not present. Don't expect anything other than a chromosome name. There is no general way to tell from a fasta file what organism it came from or what version it is.

written 4.6 years ago by Devon Ryan
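
To illustrate the point about the chromosome name, here is a small sketch that takes only the first token after ">" and ignores everything else on the header line. That token is the same value Biopython exposes as `record.id`. The function name `chromosome_names` and the example sequences are made up for illustration.

```python
def chromosome_names(lines):
    """Yield the first whitespace-delimited token of each FASTA header line."""
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            # An empty header (just ">") yields an empty name.
            yield tokens[0] if tokens else ""

# Hypothetical input: one header with a description, one without.
example = [
    ">1 dna:chromosome chromosome:GRCh37:1:1:249250621:1",
    "NNNNNNNNNN",
    ">chr2",
    "ACGTACGT",
]
names = list(chromosome_names(example))
print(names)
```

Note that the two headers name their chromosomes differently ("1" versus "chr2"), so a program comparing names across files may need to normalize them first.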

Devon, if you post your basic reply as an answer, I will mark it as answered. Thank you for your help.

written 4.6 years ago by bfeeny

I just moved this stack of comments to an answer.

written 4.6 years ago by Devon Ryan