How to change chromosome names in assembly fasta downloaded from NCBI?
19 months ago
magnolia ▴ 20

Hi,

Assemblies downloaded from NCBI (for example GCF_000001405.25, which is GRCh37.p13) use the RefSeq-Accn (NC_000001.10, NC_000002.11, NC_000003.11) as chromosome names.

I want to replace these names with either the Sequence-Name (1, 2, 3) or the UCSC-style-name (chr1, chr2, chr3).

Is there a reliable method to do this?

Thank you in advance!

grch38 ncbi grch37 fasta • 2.1k views

Sort both files (the NCBI file and chrom-change.tsv) on the chromosome name and use join.


Thank you. I'm not sure how to apply this to a FASTA file, but I'll try to figure it out.


Ha, it's a FASTA file; I thought it was a TSV file. In that case you could use sed -f pattern.txt < in.fa, with pattern.txt containing:

s/^>ABC/^>X1/
s/^>AB/^>X2/
s/^>A/^>X3/

This is wonderful, thank you so much! By the way, the second ^ in each expression was inserted literally into the new name, so I removed it (for example, s/^>ABC/>X1/).

A somewhat related question: is there a source other than NCBI where I can download GRCh37.p13 with 'normal' chromosome names?

The reason I'm looking for the latest version is that the PAR regions are missing from chromosome Y in previous versions.
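For anyone who would rather not hand-write the sed patterns, here is a minimal Python sketch of the same renaming. It assumes you have also downloaded the NCBI assembly report for the assembly (the tab-separated file whose column header line starts with "# Sequence-Name" and includes the RefSeq-Accn and UCSC-style-name columns). The function names load_name_map and rename_headers are made up for illustration. Because whole header tokens are looked up in a dictionary, the prefix-ordering issue that the sed patterns have to work around (ABC before AB before A) does not arise.

```python
def load_name_map(report_lines, target="UCSC-style-name"):
    """Build {RefSeq-Accn: target-column value} from assembly report lines.

    Assumption: the last '#' line before the data rows holds the
    tab-separated column names, as in NCBI *_assembly_report.txt files.
    """
    header = None
    mapping = {}
    for line in report_lines:
        line = line.rstrip("\r\n")
        if line.startswith("#"):
            # Each comment line overwrites the previous one, so the last
            # comment line (the column header) is the one that is kept.
            header = line.lstrip("# ").split("\t")
            continue
        if not line or header is None:
            continue
        row = dict(zip(header, line.split("\t")))
        old, new = row.get("RefSeq-Accn"), row.get(target)
        # The report uses "na" where a sequence has no name of that kind.
        if old and new and old != "na" and new != "na":
            mapping[old] = new
    return mapping


def rename_headers(fasta_lines, mapping):
    """Yield FASTA lines with the first header token replaced via mapping.

    Headers whose name is not in the mapping are passed through unchanged.
    """
    for line in fasta_lines:
        if line.startswith(">"):
            name, sep, rest = line[1:].rstrip("\r\n").partition(" ")
            line = ">" + mapping.get(name, name) + sep + rest + "\n"
        yield line


# Usage sketch (file names are placeholders, not real paths):
# with open("assembly_report.txt") as rep:
#     name_map = load_name_map(rep)          # or target="Sequence-Name"
# with open("in.fa") as src, open("out.fa", "w") as dst:
#     dst.writelines(rename_headers(src, name_map))
```

Passing target="Sequence-Name" gives the 1, 2, 3 style instead of chr1, chr2, chr3. It is worth checking the output headers afterwards (for example with grep '>' out.fa), since any sequence without a UCSC-style name in the report keeps its NC_ accession.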


Not from NCBI, since they always use the NC_* nomenclature. You can download the assembly from UCSC, which should have the chr names. Take a look at the notes on that page to understand the slight differences in the mitochondrial genome in the UCSC assembly.

https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/


Thank you. Too bad that I can't get the latest GRCh37 from anywhere else.


That is the latest release at UCSC. It is in https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/latest/

In 2020 we added a few additional sequences, new sequences from GRC patch
release GRCh37.p13 (GCA_000001405.14) plus the revised Cambridge Reference
Sequence (rCRS) mitochondrial sequence. These can be found in the subdirectory
"p13.plusMT/" or its alias "latest/".  See the section "Patches" below.  Most
users looking at this text are looking for the file "latest/hg19.fa.gz".

Thank you! Yes, it actually looks like they have what I'm looking for. I'm just not sure whether I can use the 'hg' assemblies for everything. If I have a text file containing chromosome, position, and genotype that was derived from a BAM mapped to GRCh37, is it safe to use it with the UCSC assemblies?


Should be. Patches never change chromosome coordinates; the coordinates remain stable within each major genome release.
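If you want a quick sanity check before mixing the two, one option is to compare sequence lengths between the reference the BAM was mapped to and the UCSC FASTA. The sketch below assumes you have a .fai index for each file (for instance from samtools faidx), where the first two tab-separated columns are the sequence name and its length. The function names and the renaming rule are hypothetical. Equal lengths are only a necessary condition, not proof that the sequences are identical, and the mitochondrial sequence is the place to look first given the notes on the UCSC page.

```python
def read_fai_lengths(fai_lines):
    """Return {sequence name: length} from the lines of a .fai index."""
    lengths = {}
    for line in fai_lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) >= 2:
            lengths[fields[0]] = int(fields[1])
    return lengths


def compare_lengths(lengths_a, lengths_b, rename):
    """List sequences of A whose renamed counterpart in B is missing or differs.

    rename is a caller-supplied function mapping a name used in A to the
    name expected in B (an assumption you have to provide yourself).
    Each result is (name_in_a, name_in_b, length_in_a, length_in_b or None).
    """
    problems = []
    for name_a, len_a in lengths_a.items():
        name_b = rename(name_a)
        len_b = lengths_b.get(name_b)
        if len_b != len_a:
            problems.append((name_a, name_b, len_a, len_b))
    return problems


# Usage sketch (file names are placeholders):
# with open("GRCh37.fa.fai") as a, open("hg19.fa.fai") as b:
#     issues = compare_lengths(read_fai_lengths(a), read_fai_lengths(b),
#                              rename=lambda n: "chr" + n)
# for issue in issues:
#     print(issue)
```

Any chromosome reported here needs a closer look before positions from the GRCh37-mapped file are used against the UCSC sequence.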


Great to hear! Thanks a lot for your help.
