Question: map refseq to identical genbank
0
9 weeks ago by
cmo60
United States
cmo60 wrote:

Some databases provide GenBank coordinates; others provide RefSeq coordinates. I am looking for a table of pairwise associations between identical GenBank and RefSeq records.

The ultimate goal is: if I have e.g. a BED file with tens of thousands of annotations across thousands of GenBank genomes, I would like to replace the GenBank accession ID with the identical RefSeq accession ID on each line of the BED file, provided such a corresponding RefSeq ID exists.

Yes, RefSeq is a curated subset of GenBank whose records have been copied, so the RefSeq and GenBank records are technically distinct. However, the RefSeq genome pages on NCBI provide a link to "Identical Genbank Sequence". For example, the RefSeq genome page for E. coli MG1655 provides such a link.

genbank refseq assembly ncbi • 147 views
modified 7 weeks ago • written 9 weeks ago by cmo60
3
7 weeks ago by
cmo60
United States
cmo60 wrote:

This information is in the /ASSEMBLY_REPORTS/ directory on the Genomes FTP site:

ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/

I contacted NCBI, and NLM Support (nlm-support@nlm.nih.gov) provided the following answer:

As you will learn in the README file, columns 18 and 19 in the assembly_summary files will give you such pairing:

Column 18: "gbrs_paired_asm" GenBank/RefSeq paired assembly: the accession.version of the GenBank assembly that is paired to the given RefSeq assembly, or vice-versa. "na" is reported if the assembly is unpaired.

Column 19: "paired_asm_comp" Paired assembly comparison: whether the paired GenBank & RefSeq assemblies are identical or different. Values:
identical - GenBank and RefSeq assemblies are identical
different - GenBank and RefSeq assemblies are not identical
na - not applicable since the assembly is unpaired

And it actually works.
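
A minimal sketch of how columns 18 and 19 could be turned into a lookup table with awk. The function name identical_pairs is made up for illustration, and the sketch assumes the assembly_summary file is tab-separated with header lines starting with "#"; check both against the README before relying on it.

```shell
# Print "accession<TAB>paired accession" for every assembly whose paired
# GenBank/RefSeq counterpart is flagged "identical".
# Assumed layout: column 1 = assembly_accession, 18 = gbrs_paired_asm,
# 19 = paired_asm_comp; lines starting with "#" are headers.
identical_pairs() {
    awk -F'\t' '!/^#/ && $19 == "identical" && $18 != "na" { print $1 "\t" $18 }' "$@"
}

# Example (file downloaded from the ASSEMBLY_REPORTS directory):
#   identical_pairs assembly_summary_genbank.txt > gca_to_gcf.tsv
```

Note that this pairs assembly accessions (GCA_/GCF_), not the individual sequence accessions that usually appear in column 1 of a BED file.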

written 7 weeks ago by cmo60
2
9 weeks ago by
genomax75k
United States
genomax75k wrote:

Using Entrez Direct (EDirect):

$ esearch -db nuccore -query "NC_000913" | efetch -format docsum | xtract -pattern DocumentSummary -element Caption Title AssemblyAcc
NC_000913       Escherichia coli str. K-12 substr. MG1655, complete genome      U00096
written 9 weeks ago by genomax75k

good idea, but this will not scale to thousands of accessions, as indicated in the question.

written 9 weeks ago by cmo60
1
9 weeks ago by
ctseto250
ctseto250 wrote:

FastANI GenBank vs RefSeq? Though I imagine NCBI has the structured relationships encoded somewhere, which would save a bunch of computer cycles.

written 9 weeks ago by ctseto250

yes, i imagined the relationship is encoded somewhere. interesting idea, though.

written 9 weeks ago by cmo60

Looks like NCBI already did this?

wget https://ftp.ncbi.nih.gov/genomes/ASSEMBLY_REPORTS/ANI_report_bacteria.txt

head ANI_report_bacteria.txt

genbank-accession     refseq-accession        annot-date      taxid   species-taxid   organism-name   species-name    assembly-name   ANI-species-name        ANI-type-assembly       ANI-type-category       Typestrain-ANI  ANI-QCoverage   ANI-SCoverage   ANI-status      Submitted-species-name  Submitted-type-assembly Submitted-type-category Submitted-ANI   Submitted-QCoverage     Submitted-SCoverage     contig-count    genome-length   contig-N50      contig-L50      species-asm-count       species-avg-cds-count

GCA_000006625.1 GCF_000006625.1 2017/04/06      273119  134821  Ureaplasma parvum serovar 3 str. ATCC 700970    Ureaplasma parvum       ASM662v1        Ureaplasma parvum       GCA_000019345.1 type    99.9918 99.99   99.99   species-match   Ureaplasma parvum serovar 3 str. ATCC 700970    GCA_000019345.1 type    99.9918 99.99   99.99   1.00    751719.00       751719  1       13      590.308

Looks like column1 and column2 are GenBank and RefSeq, might float your boat?

wget https://ftp.ncbi.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt

head assembly_summary_genbank.txt

See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
assembly_accession    bioproject      biosample       wgs_master      refseq_category taxid   species_taxid   organism_name   infraspecific_name      isolate version_status  assembly_level  release_type    genome_rep   seq_rel_date    asm_name        submitter       gbrs_paired_asm paired_asm_comp ftp_path        excluded_from_refseq    relation_to_type_material
GCA_000001215.4 PRJNA13812      SAMN02803731            reference genome        7227    7227    Drosophila melanogaster                 latest  Chromosome      Major   Full    2014/08/01      Release 6 plus ISO1 MT       The FlyBase Consortium/Berkeley Drosophila Genome Project/Celera Genomics       *GCF_000001215.4* identical       ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/215/GCA_000001215.4_Release_6_plus_ISO1_MT

GCA_000001405.28        PRJNA31257                      reference genome        9606    9606    Homo sapiens                    latest  Chromosome      Patch   Full    2019/02/28      GRCh38.p13      Genome Reference Consortium  *GCF_000001405.39*        different       ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.28_GRCh38.p13
modified 9 weeks ago by genomax75k • written 9 weeks ago by ctseto250
1
9 weeks ago by
vkkodali1.5k
United States
vkkodali1.5k wrote:

For the "Identical Genbank Sequence" link, you can use edirect as shown below:

elink -db nucleotide -id NC_000913.3 -target nucleotide -name nuccore_nuccore_rsgb | efetch -format acc
written 9 weeks ago by vkkodali1.5k

good idea, but this will not scale to thousands of accessions, as indicated in the question.

written 9 weeks ago by cmo60
1

This should be fine for a few thousand accessions. What is the scale here? Tens of thousands? Hundreds of thousands? And scope? Bacteria only, higher eukaryotes, etc?

The NCBI Genomes FTP path has an assembly_report.txt file for each RefSeq assembly that contains RefSeq and GenBank mapping. It may make more sense to download all of the assembly_report.txt files first from FTP, concatenate them and make your own mapping database locally.
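
A rough sketch of that local-mapping idea, including the BED rewrite from the question. The function names build_seq_map and remap_bed are hypothetical. The sketch assumes the usual assembly_report.txt layout (tab-separated, "#" header lines, column 5 = GenBank-Accn, column 6 = Relationship with "=" meaning identical, column 7 = RefSeq-Accn) and strips a trailing carriage return in case the reports use Windows line endings; verify the layout against a downloaded report first.

```shell
# Build a GenBank -> RefSeq sequence map from one or more (concatenated)
# assembly_report.txt files. Only rows whose Relationship is "=" are kept.
build_seq_map() {
    awk -F'\t' '{ sub(/\r$/, "") }
        !/^#/ && $6 == "=" && $5 != "na" && $7 != "na" { print $5 "\t" $7 }' "$@"
}

# Replace column 1 of a BED file with the mapped RefSeq accession when one
# exists; lines without a mapping are printed unchanged.
# usage: remap_bed map.tsv input.bed
remap_bed() {
    awk -F'\t' -v OFS='\t' '
        FILENAME == ARGV[1] { map[$1] = $2; next }   # first file: load the map
        ($1 in map)         { $1 = map[$1] }         # BED line with a known accession
                            { print }' "$1" "$2"
}

# Example:
#   cat */*_assembly_report.txt | build_seq_map > gb_to_refseq.tsv
#   remap_bed gb_to_refseq.tsv annotations.bed > annotations.refseq.bed
```

The map uses versioned accessions as they appear in the reports (for example U00096.3), so a BED file that omits the version suffix would need the suffix stripped on both sides before the lookup.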

written 9 weeks ago by vkkodali1.5k