10 weeks ago
magnolia ▴ 20

Hi,

Assemblies downloaded from NCBI (for example GCF_000001405.25, which is GRCh37.p13) use the RefSeq-Accn (NC_000001.10, NC_000002.11, NC_000003.11) as chromosome names.

I want to replace these names with the Sequence-Name (1, 2, 3) and UCSC-style-name (chr1, chr2, chr3) forms.

Is there a reliable method to do this?

grch38 ncbi grch37 fasta • 458 views
Sort both files (the NCBI file and chrom-change.tsv) on chromosome name and use join.
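For a tab-separated file, the sort-and-join idea could look roughly like the sketch below. The file names (toy_calls.tsv, toy_map.tsv) and their column layouts are made up for illustration: the first file is assumed to have the RefSeq accession in column 1, and the second is a two-column mapping from accession to new name.

```shell
# Toy input (hypothetical): accession, position, genotype.
printf 'NC_000002.11\t200\tA/G\nNC_000001.10\t100\tC/T\n' > toy_calls.tsv
# Toy mapping (hypothetical): accession, new chromosome name.
printf 'NC_000001.10\tchr1\nNC_000002.11\tchr2\n' > toy_map.tsv

tab=$(printf '\t')

# join needs both inputs sorted on the join field, in the same collation order.
LC_ALL=C sort -t "$tab" -k1,1 toy_calls.tsv > toy_calls.sorted.tsv
LC_ALL=C sort -t "$tab" -k1,1 toy_map.tsv > toy_map.sorted.tsv

# Output the new name (file 2, field 2) followed by the remaining columns of file 1.
LC_ALL=C join -t "$tab" -o 2.2,1.2,1.3 \
    toy_calls.sorted.tsv toy_map.sorted.tsv > toy_calls.renamed.tsv
```

Rows whose accession is missing from the mapping are dropped by join by default, so it is worth comparing line counts before and after.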

Thank you. I'm not sure how to apply this to a FASTA file, but I'll try to figure it out.

Ha, it's a FASTA file; I thought it was a TSV file. In that case you could use sed -f pattern.txt < in.fa, with pattern.txt containing:

s/^>ABC/^>X1/
s/^>AB/^>X2/
s/^>A/^>X3/

This is wonderful, thank you so much! By the way, the second ^ (the one on the replacement side) ended up in the new name as a literal character, so I removed it.
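Following up on the ^ issue: in a sed s command the replacement side is literal text, so the anchor belongs only in the pattern. Here is a minimal sketch of generating the sed script from a two-column mapping instead of typing it by hand. The file names and the toy FASTA headers are hypothetical; the mapping is assumed to be tab-separated with the RefSeq accession in column 1 and the new name in column 2.

```shell
# Toy mapping (hypothetical): accession, new name.
printf 'NC_000001.10\tchr1\nNC_000002.11\tchr2\n' > toy_chrom-change.tsv
# Toy FASTA (hypothetical) with NCBI-style headers.
printf '>NC_000001.10 Homo sapiens chromosome 1\nACGT\n>NC_000002.11 Homo sapiens chromosome 2\nTTGA\n' > toy_in.fa

# Build one substitution per mapping row.
# - The dot in the accession is wrapped in [.] so it matches only a literal dot.
# - "( |$)" requires the accession to end at a space or at end of line, so a
#   shorter accession cannot match the prefix of a longer one.
# - "\1" puts the matched space (if any) back after the new name.
awk -F'\t' '{
    old = $1
    gsub(/\./, "[.]", old)
    printf "s/^>%s( |$)/>%s\\1/\n", old, $2
}' toy_chrom-change.tsv > toy_rename.sed

# -E enables extended regular expressions, which the generated script relies on.
sed -E -f toy_rename.sed toy_in.fa > toy_out.fa
```

Because each pattern is anchored at both ends of the name, the order of the rules no longer matters, unlike the ABC / AB / A example above, where the longest prefix has to come first.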

Kinda related question: is there a source other than NCBI where I can download GRCh37.p13 with 'normal' chromosome names?

The reason I'm looking for the latest version is that PAR regions are missing on chromosome Y in previous versions.

Not from NCBI, since they always use the NC_* nomenclature. You can download the assembly from UCSC, which should have the chr names. Take a look at the notes on the page to understand the slight differences in the mitochondrial genome in the UCSC assembly.

Thank you. Too bad that I can't get the latest GRCh37 from anywhere else.

In 2020 we added a few additional sequences, new sequences from GRC patch
release GRCh37.p13 (GCA_000001405.14) plus the revised Cambridge Reference
Sequence (rCRS) mitochondrial sequence. These can be found in the subdirectory
"p13.plusMT/" or its alias "latest/".  See the section "Patches" below.  Most
users looking at this text are looking for the file "latest/hg19.fa.gz".

Thank you! Yeah, actually it seems like they have the things I'm looking for. I'm just not sure whether I can use the 'hg' assemblies for everything. If I have a txt file containing chromosome, position and genotype that comes from a BAM mapped to GRCh37, is it safe to use with the UCSC assemblies?

Should be. Patches never change chromosome co-ordinates. They remain stable for each major genome release.
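Since the coordinates stay the same, only the chromosome names in such a file would need adjusting. A rough sketch, assuming a tab-separated file with the chromosome in column 1 and GRCh37-style names (1, 2, ..., X, Y, MT); the file names are made up. The mitochondrial rows are deliberately left untouched because of the mitochondrial differences mentioned above, so they need to be checked separately.

```shell
# Toy input (hypothetical): chromosome, position, genotype.
printf '1\t100\tC/T\nX\t5\tA/A\nMT\t10\tG/G\n' > toy_calls.txt

# Prefix "chr" to autosomes and X/Y only; every other row is passed through as-is.
awk 'BEGIN { FS = OFS = "\t" }
     $1 ~ /^([0-9]+|X|Y)$/ { $1 = "chr" $1 }
     { print }' toy_calls.txt > toy_calls.ucsc.txt
```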

Great to hear! Thanks a lot for your help.