Human dna reference file with no prefix 'chr'
3
9
Entering edit mode
8.1 years ago
mangfu100 ▴ 780

Hi.

I have usually been using hg19 reference file from ucsc.

However, ucsc has a prefix 'chr' in its files.

Of course, I am able to exclude 'chr' prefix by script programming, but I don't want to.

Instead, I need another human reference file which does not contain 'chr' prefix itself.

Do any another reference files exist not ucsc?

next-gen-sequencing • 23k views
0
Entering edit mode

The GATK bundle has what you need

0
Entering edit mode

Is there a reason why you wanna bypass something as simple as a sed 's/chr//g' ucsc.hg19.fa >ucsc.hg19.nochr.fa?

1
Entering edit mode

I think that just simply excluding of 'chr' would make any problem later for processing.

and I also though that users must match the reference file that was used for mapping program such as bwa with others variation calling program.

19
Entering edit mode
8.1 years ago

It's never as simple as "remove chr-prefix with a script". hg19 is UCSC's variant of the official GRCh37 assembly. Early releases of GRCh37 like GRCh37-lite, did not use chr-prefixes, but newer releases like GRCh37.p13 adopted the chr-prefix, and use a newer mitochondrial (MT) sequence than hg19 does. Note also how chrM in hg19 is named MT in GRCh37. And all the unplaced contigs have very different names. So simply removing the chr-prefix in hg19 does not make it GRCh37. It makes it a wholly other chromosome naming convention, which is the last thing we need right now, since everyone has agreed to use chr-prefixes in GRCh38/hg38, except Ensembl.

Update (Nov 4, 2016): Here is a UCSC Chain mapping UCSC's hg19 to Ensembl's GRCh37.p13 (no chr-prefix), compatible with tools like CrossMap, Remap, or liftOver. Notice how all chromosomes/contigs except chrM only require renaming. Users of vcf2maf with hg19 VCFs as input, can pass this into the --remap-chain argument.

0
Entering edit mode

It's never as simple as "remove chr-prefix with a script". That doesn't make sense, because sometimes it really is that simple. There have been plenty of occasions where a simple find/replace is all I've needed to do to get a BED file working in tandem with a BAM file in a downstream-analysis step (after ensuring I was indeed working from the same reference).

My pedantry aside, the info you provided was very useful.

0
Entering edit mode

Agreed... it's often simple to just "get something working", night before the lab meeting. ;) But it's a whole other ballgame when you're trying to develop robust bioinformatics pipelines that can handle every pedantic detail.

15
Entering edit mode
8.1 years ago
Dan D 7.3k

Why would you go through all the trouble of finding/downloading/cat'ing another reference? Getting rid of that chr designation is dead simple.

For FASTA files:

cat hg19.fa | sed 's/>chr/>/g' > hg19_new.fa


For BED files:

cat myhg19Genes.bed sed 's/^chr//' > myhg19Genes_new.bed


For BAM files:

samtools view -h hg19Alignments.bam | sed 's/chr//g' | samtools view -Shb - -o hg19Alignments_new.bam

2
Entering edit mode

For BAM files, it might be faster to modify the header with sed, write that to a file, and then use samtools reheader (there's no need to then parse everything to/from sam). Of course, that doesn't fit as nicely into a single command as your solution :)

1
Entering edit mode

Modifying only the header is also safer, so that you don't risk truncating base quality values that happen to form the word chr (very unlikely, but still possible). But even in the header, you risk changing comment lines in the @CO tag, or file names or commands in the @PG tag, containing words like "chrom", "chromosome", or "jesus christ! why is this so complicated?!"

1
Entering edit mode
3.2 years ago
Max Masnick ▴ 10

For hg38/GRCh38, the answer appears to be to use the Ensembl reference sequence. This does not have the chr prefix.

The GATK reference sequence from their Resource Bundle has the chr prefix.

Note that if you are using the Ensembl reference sequence with GATK, you'll need to do some extra steps to generate .dict and .fai files. These are available for the GATK Resource Bundle FASTAs, but not for Ensembl.

Do not try to use sed or other find/replace tools to fix this problem yourself (unless you are extremely bandwidth-constrained or no option to download a properly-formatted reference FASTA). Cyriac Kandoth is absolutely right that this is a can of worms in many cases, and just not worth it when a reference file without chr is readily available.

1
Entering edit mode

Just a follow up to this question- I'm trying to transition over to GRCh38 (for everything in my pipeline; align, exome variant call, vep annotation, rna-seq, etc). I'm stuck at the GATK resource bundle (which I use for the BQSR), as it seems they only have hg38 resource bundle released (https://console.cloud.google.com/storage/browser/broad-references/hg38/v0), but this is following the ucsc format (with the "chr" prefix), as you've mentioned. From https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use, Heng suggested to use a no-alt set with hard mask for the PARs. Is this still recommended, and does the ensembl have versions of this (along with possibly also decoys)? I also see that 1000 genomes have a reference as well, that has the PARs and the decoys, but they also have the chr prefix.