How to obtain the chromosome from an accession number?
6.0 years ago
eidriangm ▴ 60

Hello Community.

My problem is the following: I have some BED files whose genomic regions are annotated by chromosome (chr__ start end ... ...), and I want to use the NCBI GFF3 to extract information, but that file is annotated with accession.version numbers. Bedtools requires both files to use the same sequence nomenclature, so I need to convert the accessions to chr-based names.

So far I know that the number in "NC_"-prefixed accession IDs specifies the chromosome (i.e. NC_000001.11: chr1, NC_000002.12: chr2, ..., NC_000023.11: chrX, NC_000024.10: chrY, NC_012920.1: chrM). However, how can I find out the chromosome for accessions prefixed with NW_ or NT_?
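The NC_ rule described above can be sketched in Python as follows. This is only an illustration of the pattern listed in the question for the human reference; the function name `nc_to_chr` is hypothetical, and the sketch deliberately returns `None` for NT_/NW_ accessions, since their chromosome cannot be read from the number.

```python
import re

# Special cases for the human reference, as listed in the question.
_SEX_CHROMOSOMES = {23: "chrX", 24: "chrY"}
_MITOCHONDRION = "NC_012920"


def nc_to_chr(accession):
    """Return a chr-style name for a human NC_ accession, or None."""
    match = re.fullmatch(r"(NC_\d+)(?:\.\d+)?", accession)
    if match is None:
        # NT_/NW_ (and anything else) cannot be decoded from the number.
        return None
    base = match.group(1)
    if base == _MITOCHONDRION:
        return "chrM"
    number = int(base[3:])  # "NC_000001" -> 1
    if 1 <= number <= 22:
        return f"chr{number}"
    return _SEX_CHROMOSOMES.get(number)


for acc in ("NC_000001.11", "NC_000023.11", "NC_012920.1", "NT_113949.2"):
    print(acc, nc_to_chr(acc))
```

The version suffix is ignored on purpose, so the same lookup keeps working when an accession is bumped to a newer version.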

Some NT_/NW_ accessions are alternative assemblies of an NC_ sequence, and the information they contain is "the same", listed a few lines below that NC_ entry. Others are not, and they contain genes of interest that I could be losing when using bedtools, e.g. https://www.ncbi.nlm.nih.gov/gene/3806. Some do not have a known location, but that gene is known to be on chromosome 19 and I cannot deduce this from its accession number.

Is there a way of getting the chromosome from the accession number? Or shall I extract the info from another annotation file?

Thanks

genome refseq chromosome accession ncbi • 8.0k views

Have you tried potential way(s) of linking chromosomes to accession number mentioned in this post: How to get the chromosome numbers from RefSeq accession IDs ?


I saw it, but none of the links provided there work, and the awk + sed answer only applies to NC_ accessions (which I already have under control). Thanks anyway.


You may want to give some example data and the expected output.


Well, that is already given in the question: the gene with Entrez ID 3806 is annotated on accession NT_113949, and I want to obtain its chromosome, which is 19. I could look for more examples, but the idea is basically that: given an accession number prefixed with NT_ or NW_, obtain its chromosome if it is known.

4.8 years ago

You can find the chromosomes for the alternative accession numbers (NT_... / NW_...) in this directory.
Download the files with the following names:
1. alts_accessions_GRCh38.p12
2. chr_NC_gi
3. chr_accessions_GRCh38.p12
4. unplaced_accessions_GRCh38.p12
5. unlocalized_accessions_GRCh38.p12

Once you download them, you might be prompted for a 'Keychain Access' password. The workaround I found is to convert the downloaded file to '.txt' format, after which you can view what's inside it.

An extract from the file is given below:

Chromosome  RefSeq Accession.version
1           NW_012132914.1
1           NW_015495298.1
9           NW_009646201.1
10          NW_011332692.1
11          NW_015148966.1

Reference: This article.
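A table like the extract above can be turned into an accession-to-chromosome lookup with a few lines of Python. This is a minimal sketch that assumes the first two whitespace-separated columns are the chromosome and the RefSeq accession.version, as in the extract; the real files may carry extra columns or a header starting with '#', which the sketch skips. The function name `load_accession_table` is hypothetical.

```python
# Sample rows copied from the extract in this answer.
SAMPLE = """\
Chromosome RefSeq Accession.version
1 NW_012132914.1
1 NW_015495298.1
9 NW_009646201.1
10 NW_011332692.1
11 NW_015148966.1
"""


def load_accession_table(lines):
    """Map RefSeq accession.version -> chr-prefixed chromosome name."""
    mapping = {}
    for line in lines:
        line = line.strip()
        # Skip blanks and the header (with or without a leading '#').
        if not line or line.startswith("#") or line.startswith("Chromosome"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        chromosome, accession = fields[0], fields[1]
        mapping[accession] = "chr" + chromosome
    return mapping


table = load_accession_table(SAMPLE.splitlines())
print(table["NW_009646201.1"])
```

With a downloaded file, the same function could be fed an open file handle instead of `SAMPLE.splitlines()`.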

4.8 years ago
Solowars ▴ 70

Perhaps you could do it in R, using the rentrez package. Take a look here.

I'm doing something similar: you can pass those identifiers to the entrez_summary function and ask for a summary, which should include the chromosome number/name.

Let me know if you need some more help.

Cheers,

4.5 years ago
vkkodali_ncbi ★ 3.7k

An assembly_report.txt file accompanies NCBI RefSeq genome assemblies. You can download it in either of two ways: from the NCBI Assembly portal, by searching for the genome of interest and picking the Assembly structure report from the big blue downloads button menu, or from the NCBI genomes FTP path for the assembly of interest.

For example, you can find the human assembly_report.txt file here: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_report.txt

This file has the following columns:

                # Sequence-Name [  1]: 1
                  Sequence-Role [  2]: assembled-molecule
              Assigned-Molecule [  3]: 1
Assigned-Molecule-Location/Type [  4]: Chromosome
                   GenBank-Accn [  5]: CM000663.2
                   Relationship [  6]: =
                    RefSeq-Accn [  7]: NC_000001.11
                  Assembly-Unit [  8]: Primary Assembly
                Sequence-Length [  9]: 248956422
                UCSC-style-name [ 10]: chr1

You can use the data in columns 7 and 10 to map acc.ver to UCSC-style chromosome names. If you don't want to bother with coming up with all of the relevant logic and just need to quickly convert the seq-ids in an NCBI RefSeq GFF3 file, you can use my script cthreepo (https://github.com/vkkodali/cthreepo) for this purpose.
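For anyone who prefers to write the logic by hand, the column 7 to column 10 mapping could look roughly like the Python sketch below. It is not how cthreepo is implemented; it assumes the report is tab-separated with '#'-prefixed comment lines and that missing values appear as "na". The report row is the chr1 record shown above, the GFF3 line is a made-up example, and the function names are hypothetical.

```python
# The one data row shown in this answer, rebuilt as a tab-separated line.
REPORT_LINES = [
    "# Sequence-Name\tSequence-Role\tAssigned-Molecule\t(header abbreviated)",
    "\t".join([
        "1", "assembled-molecule", "1", "Chromosome", "CM000663.2",
        "=", "NC_000001.11", "Primary Assembly", "248956422", "chr1",
    ]),
]


def load_report(lines):
    """Map RefSeq-Accn (column 7) -> UCSC-style-name (column 10)."""
    mapping = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # comment/header or blank line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 10:
            continue
        refseq, ucsc = cols[6], cols[9]
        # Assumption: "na" marks a missing accession or UCSC name.
        if refseq != "na" and ucsc != "na":
            mapping[refseq] = ucsc
    return mapping


def rename_seqid(gff_line, mapping):
    """Swap the seqid (first GFF3 column) when a mapping exists."""
    if gff_line.startswith("#"):
        return gff_line  # directives and comments are left untouched
    cols = gff_line.split("\t")
    cols[0] = mapping.get(cols[0], cols[0])
    return "\t".join(cols)


mapping = load_report(REPORT_LINES)
# Made-up GFF3 feature line, for illustration only.
example = "NC_000001.11\tRefSeq\tgene\t100\t200\t.\t+\t.\tID=example_gene"
print(rename_seqid(example, mapping))
```

Sequences without a mapping keep their original accession, so they can be spotted and handled afterwards rather than silently dropped.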
