I am working with human_g1k_v37.fasta, which is available from the 1000 Genomes FTP site: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/
I have split it into separate files, one per chromosome.
The header for chromosome 1 looks like this:
>1 dna:chromosome chromosome:GRCh37:1:1:249250621:1
So the record ID is 1, and the rest of the header ends up in the description.
My understanding is that the last part of the header breaks down as follows:
coord_system_name = chromosome
coord_system_version = GRCh37
seq_region.name = 1
seq_region.start = 1
seq_region.length = 249250621
seq_region.strand = 1
My question is: does Biopython offer any way to read these values in? At the moment I only tell it that the file is of type "fasta". Do I have to parse the description myself by splitting on the colons, or does Biopython already have functions that can do this for me?
Here is an example of the code I use to read in this file:
    from Bio import SeqIO

    def read_fasta_file(filename):
        # Print the ID, length, description and alphabet of every record.
        with open(filename) as handle:
            for record in SeqIO.parse(handle, "fasta"):
                print("ID %s" % record.id)
                print("Sequence length %i" % len(record))
                print("Sequence desc %s" % record.description)
                # Note: Seq.alphabet only exists in Biopython releases before 1.78.
                print("Sequence alphabet %s" % record.seq.alphabet)
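For reference, the manual approach I would like to avoid looks roughly like this. It is only a minimal sketch: `parse_ensembl_header` is a hypothetical helper name of my own, not a Biopython function, and the dictionary keys simply follow my reading of the header fields above, which may not be the official names.

```python
def parse_ensembl_header(description):
    """Split a header such as
    '1 dna:chromosome chromosome:GRCh37:1:1:249250621:1'
    into its colon-separated location fields (hypothetical helper)."""
    # The last whitespace-separated token carries the location string.
    location = description.split()[-1]
    # Assumes exactly six colon-separated fields, as in the header above.
    coord_system, version, name, start, length, strand = location.split(":")
    return {
        "coord_system_name": coord_system,
        "coord_system_version": version,
        "seq_region_name": name,
        "seq_region_start": int(start),
        # Fifth field, labelled "length" per my understanding above.
        "seq_region_length": int(length),
        "seq_region_strand": int(strand),
    }


fields = parse_ensembl_header("1 dna:chromosome chromosome:GRCh37:1:1:249250621:1")
print(fields["coord_system_version"], fields["seq_region_length"])
```

In the loop above I would presumably call it as `parse_ensembl_header(record.description)`, since the description is a plain string.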