Parsing coordinates of a RefSeq cds_from_genomic.fna file
1
0
Entering edit mode
3.5 years ago

Hi there,

I'm trying to parse this fna file (found here) into a bed file.

I get that the fna coordinats are 1-based and inclusive. So by joining the following coordinates:

>lcl|NC_029649.1_cds_XP_015810653.1_3 [db_xref=GeneID:107382813] [protein=ras-related protein Rab-33A-like] [protein_id=XP_015810653.1] [location=join(70623..70994,88563..88703,95063..95143,95215..95439)] [gbkey=CDS]
ATGACCAAGGATTCCCCGGAGGAGAGCCGCGCTGCGAGCGGAGGAGGAGGAAAAATCAGGAGAAACCGAGCCGACGACAA
TGTGACCATCCTGACATCCTCCATGGACTTCCACAGAGCCAGCAGGAGCCGAGCCAGCAGCGGCACTGATGCCGCCCCTA
GCGTCACCTCCTCCGTGGACCTGAGCACCTCCTCCCTGGAGATGAGCATCCAGACGCGGATCTTTAAGATCATCGTCATC
GGGGACTCCAACGTGGGGAAGACTTGCCTAACCTTCCGCTTCACCGGGGGAAGCTTCCCTGACAAGACCGAGGCCACGAT
CGGAGTGGATTTCAGGGAGAAGGCGGTGGAGATAGAAGGAGAAACCATTAAGGTGCAGGTGTGGGACACAGCAGGCCAGG
AACGCTTTCGGAAGTCCATGGTGGAACACTACTACCGGAATGTCCATGCTGTGGTCTTCGTGTACGATGTTACTAAGATG
GCTTCCTTCCGCAACCTGCAGACATGGATAGAGGAGTGTAACGGCCATCGGGTGTCTGCGTCTGTGCCTCGGGTTCTTGT
GGGAAACAAGTGTGACCTCGTGGATCAAATACAGGTGCCTTCCAACATGGCGCTGAAGTTCGCCGACGCCCACAACATGC
TGCTGTTCGAGACGTCCGCGAAGGACCCAAAGGAGACCCAGAATGTGGATTCCATCTTCATGTCGCTGGCCTGCCGCCTG
AAAGCCCAGAAGTCTCTGCTCTACAGAGACGTGGAGCGGGAGGACGGGAGGGTCAGGCTCTCACAGGAGACTGAAACCAA
AAGTAACTGTCCCTGTTGA

I can extract an identical sequence.

But lines like this fail me:

>lcl|NC_029649.1_cds_XP_015810238.1_10 [db_xref=GeneID:107382541] [protein=m7GpppX diphosphatase] [frame=2] [partial=5'] [exception=annotated by transcript or proteomic data] [protein_id=XP_015810238.1] [location=join(<612582..612786,613676..613826,613918..614063,614140..614253,614642..614752,614833..615099)] [gbkey=CDS]
GTGAGAGAAGGCGCTACAAACATGGCGGACGCTGTAGAAAGTGTTTATAGAGAAAAAGACGAGTTCTGTCAGGAGGCCAA
GAGATCGAAACCCGCCGACAGGGACGGACCAGAGTCTGAGAGTGAAAATATTTTAGCTGGATTTAAAACACCCAACGTGT
TGAGCGATTCTGCCCGGGAGAAAATCATCTTCATCCATGGAAAGATTGCAGATCAGGATGCTGTGGTCATCCTGGAGAAG
ACCCCCATCAGAGAAGATACGCTCCCTGAGCTCTTCAGTTGTTCCTCGCTCAGACTGGAGACGAGAAACGACATCTACGG
CTCCTATCGTCTCCAGGCCCCGCCCCACCTAAATGAGATTAAGACCACCGTTATTTATCCAGCCACAGAAAAGCACGTCA
AGAAGTATCAGCGTCAGGAGAACTTCCTGGTGGAGGAGACGGGAGAGGACTACGAGTCCATCACGCTGCCTTATATTCAG
CAGCAGAGTTTGAGCCTGCAGTGGGTTTACAACATCCTGGACAAGAAGGCTGAGGCCGACCGCGTCGTTTATGAAGACCC
AGACCCAAAGCTCGGCTTTGTCCTTCTCCCTGATTTAAAGTGGAACCAAAAGCAGGTGGACGACTTGTATCTGATTGCCG
TTGCTCATCAGAGAGACGTCAGAAGTCTTCGTGACCTGACGTCAGAGCACCTACCTTTGCTGCAGAACATCTTCCAGAAA
GGAAAGGAAGCCATCCTGCAGCGCTACAACCTTCCAAGCAGCAAGCTGAGGGTCTACCTGCACTACCAGCCCTCCTACTA
CCATCTCCACGTCCACTTCACCAAGTTGGGCTACGAGGCACCGGGCTGCGGCGTGGAGCGAGCCCACCTTCTGTCAGACG
TCATCCAGAACCTCCAGGCCAACCCGCAGTTCTACAAAACCCGAACAATGTACTTCCCTCTGAGGGCCGACGACGGGCTG
CTCGACAGGTTCAGAGAGGCGGGCAGGATGTGA

If (hypothetically of course) I ignore the left angle bracket preceding the starting base in the first coordinates, <612582..612786, I get an extracted sequence which is one base longer then the reported one.

So maybe due to frame=2 in the header, it means that the ORF starts one base downstream, at 612583.

But then, there are also cases when frame=3, or when there's a right angle bracket following the ending base, or when there are angle brackets but no annotation of frame different than 1.

So, how should coordinates and angle brackets be read in a fna file?

Thanks in advance.

fna sequence assembly • 1.3k views
ADD COMMENT
1
Entering edit mode

Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.
code_formatting

ADD REPLY
0
Entering edit mode

An educated guess is that brackets are telling you that either start or stop codons are missing. Other than that, the second sequence has expected length and translates into the annotated protein. I think maybe you try ignoring the brackets and see if that creates frame-shifting in your codons. If not, chances are that brackets are only telling you that the expected codons are missing at either end.

ADD REPLY
0
Entering edit mode

Pretty educated guess :)

Found today this documentation in the README.txt file in the above-mentioned folder:

*_cds_from_genomic.fna.gz & *_rna_from_genomic.fna.gz
-----------------------------------------------------
FASTA sequences of individual features annotated on the genomic records. The 
sequences are based solely on the genome sequence and annotated feature at a
particular location. They may differ from the product sequences found in the 
*_rna.fna.gz and *_protein.faa.gz files which may be based on transcript or 
other data sources and include mismatches, indels, or additional sequence not 
found at a particular genomic location.

...

For CDS features that begin in frame 2 or 3, the first 1 or 2 bp of sequence
are trimmed from the CDS FASTA so that it always begins with the first complete
codon. The location and frame qualifiers are left unaltered; consequently, the 
length of the ranges in the location string may be 1-2 bp longer than the FASTA 
sequence.

For RefSeq assemblies annotated by NCBI's Eukaryotic Genome Annotation 
Pipeline, a gene may have a frameshifting indel(s) in the genome that is 
thought to result from a genome sequencing error; in these cases, the gene is 
still considered to be protein-coding and annotated with mRNA and CDS features, 
but the genome sequence won't translate correctly downstream from the 
frameshift. To compensate, the FASTA sequence of the genomic CDS and RNA 
features is modified with 1-2 bp gaps (aka "micro-introns") in order to 
restore the predicted reading frame. This modification is reflected by 1-2 bp 
micro-introns in the location qualifier. An equivalent modification is also
made in the *_genomic.gff.gz file. A protein-coding gene may also be annotated
with a CDS feature containing an in-frame stop codon that is translated as a
selenocysteine, subject to stop-codon readthrough, or thought to result from a
genome sequencing error; in these cases, a transl_except qualifier is provided
indicating the genomic location of the stop codon and its proposed translation.
For more details, see the section on "Annotation accommodations for putative 
assembly errors" in:
ftp://ftp.ncbi.nlm.nih.gov/genomes/README_GFF3.txt

...
ADD REPLY
0
Entering edit mode
3.5 years ago
GenoMax 141k

You may want to get the feature table file for this genome instead and then create your BED from the information in that file. An excerpt from file below.

# feature   class   assembly    assembly_unit   seq_type    chromosome  genomic_accession   start   end strand  product_accession   non-redundant_refseq    related_accession   name    symbol  GeneID  locus_tag   feature_interval_length product_length  attributes
gene    protein_coding  GCF_001465895.1 Primary Assembly    chromosome  sgr01   NC_029649.1 4898    59338   +                   LOC107382895    107382895       54441       partial
mRNA        GCF_001465895.1 Primary Assembly    chromosome  sgr01   NC_029649.1 4898    59338   +   XM_015955279.1      XP_015810765.1  matrix-remodeling-associated protein 5-like, transcript variant X1  LOC107382895    107382895       8396    8517    partial
mRNA        GCF_001465895.1 Primary Assembly    chromosome  sgr01   NC_029649.1 4898    59338   +   XM_015955357.1      XP_015810843.1  matrix-remodeling-associated protein 5-like, transcript variant X2  LOC107382895    107382895       8366    8487    partial
CDS with_protein    GCF_001465895.1 Primary Assembly    chromosome  sgr01   NC_029649.1 21904   58498   +   XP_015810765.1      XM_015955279.1  matrix-remodeling-associated protein 5-like isoform X1  LOC107382895    107382895       7350    2449    
CDS with_protein    GCF_001465895.1 Primary Assembly    chromosome  sgr01   NC_029649.1 21904   58498   +   XP_015810843.1      XM_015955357.1  matrix-remodeling-associated protein 5-like isoform X2  LOC107382895    107382895       7320    2439    
gene    protein_coding  GCF_001465895.1 Primary Assembly    chromosome  sgr01   NC_029649.1 70489   96597   +                   LOC107382813    107382813       26109       
mRNA        GCF_001465895.1 Primary Assembly    chromosome  sgr01   NC_029649.1 70489   96597   +   XM_015955167.1      XP_015810653.1  ras-related protein Rab-33A-like    LOC107382813    107382813       2111    2111    
CDS with_protein    GCF_001465895.1 Primary Assembly    chromosome  sgr01   NC_029649.1 70623   95439   +   XP_015810653.1      XM_015955167.1  ras-related protein Rab-33A-like    LOC107382813    107382813       819 272 
gene    protein_coding  GCF_001465895.1 Primary Assembly    chromosome  sgr01   NC_029649.1 75549   78242   -                   LOC107382745    107382745       2694

Something like:

$ cat GCF_001465895.1_Nfu_20140520_feature_table.txt | grep "^CDS" |awk -F '\t' '{OFS="\t"}{print $11,$8,$9,$14}' | head -4
XP_015810765.1  21904   58498   matrix-remodeling-associated protein 5-like isoform X1
XP_015810843.1  21904   58498   matrix-remodeling-associated protein 5-like isoform X2
XP_015810653.1  70623   95439   ras-related protein Rab-33A-like
XP_015810537.1  76308   78189   THAP domain-containing protein 4-like
ADD COMMENT
0
Entering edit mode

Thanks!
Do you think that by extracting each RNA feature (e.g. mRNA or ncRNA) the same approach could be used in order to define the Transcriptome?
The same 5 terms - misc_RNA, mRNA, ncRNA, rRNA, tRNA - appear both in feature_table.txt and in rna_from_genomic.fna.

ADD REPLY

Login before adding your answer.

Traffic: 2555 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6