Question: Human dna reference file with no prefix 'chr'
8
gravatar for mangfu100
5.1 years ago by
mangfu100720
Korea, Republic Of
mangfu100720 wrote:

Hi.

I have usually been using hg19 reference file from ucsc.

However, ucsc has a prefix 'chr' in its files.

Of course, I am able to exclude 'chr' prefix by script programming, but I don't want to.

Instead, I need another human reference file which does not contain 'chr' prefix itself.

Do any another reference files exist not ucsc?

 

sequencing next-gen • 13k views
ADD COMMENTlink modified 9 weeks ago by Max Masnick10 • written 5.1 years ago by mangfu100720

The GATK bundle has what you need

ADD REPLYlink modified 9 weeks ago by RamRS25k • written 5.1 years ago by Zev.Kronenberg11k

Is there a reason why you wanna bypass something as simple as a sed 's/chr//g' ucsc.hg19.fa >ucsc.hg19.nochr.fa?

ADD REPLYlink modified 9 weeks ago • written 5.1 years ago by RamRS25k
1

I think that just simply excluding of 'chr' would make any problem later for processing.

and I also though that users must match the reference file that was used for mapping program such as bwa with others variation calling program.

ADD REPLYlink modified 9 weeks ago by RamRS25k • written 5.1 years ago by mangfu100720
16
gravatar for Cyriac Kandoth
5.1 years ago by
Cyriac Kandoth5.4k
Memorial Sloan Kettering, New York, USA
Cyriac Kandoth5.4k wrote:

It's never as simple as "remove chr-prefix with a script". hg19 is UCSC's variant of the official GRCh37 assembly. Early releases of GRCh37 like GRCh37-lite, did not use chr-prefixes, but newer releases like GRCh37.p13 adopted the chr-prefix, and use a newer mitochondrial (MT) sequence than hg19 does. Note also how chrM in hg19 is named MT in GRCh37. And all the unplaced contigs have very different names. So simply removing the chr-prefix in hg19 does not make it GRCh37. It makes it a wholly other chromosome naming convention, which is the last thing we need right now.

Update (Nov 4, 2016): Here is a UCSC Chain mapping UCSC's hg19 to Ensembl's GRCh37.p13 (no chr-prefix), compatible with tools like CrossMap, Remap, or liftOver. Notice how all chromosomes/contigs except chrM only require renaming. Users of vcf2maf with hg19 VCFs as input, can pass this into the --remap-chain argument.

ADD COMMENTlink modified 22 months ago • written 5.1 years ago by Cyriac Kandoth5.4k

It's never as simple as "remove chr-prefix with a script". That doesn't make sense, because sometimes it really is that simple. There have been plenty of occasions where a simple find/replace is all I've needed to do to get a BED file working in tandem with a BAM file in a downstream-analysis step (after ensuring I was indeed working from the same reference).

My pedantry aside, the info you provided was very useful.

ADD REPLYlink modified 9 weeks ago by RamRS25k • written 5.1 years ago by Dan D7.0k

Agreed... it's often simple to just "get something working", night before the lab meeting. ;) But it's a whole other ballgame when you're trying to develop robust bioinformatics pipelines that can handle every pedantic detail.

ADD REPLYlink modified 9 weeks ago by RamRS25k • written 5.1 years ago by Cyriac Kandoth5.4k
14
gravatar for Dan D
5.1 years ago by
Dan D7.0k
Tennessee
Dan D7.0k wrote:

Why would you go through all the trouble of finding/downloading/cat'ing another reference? Getting rid of that chr designation is dead simple.

For FASTA files:

cat hg19.fa | sed 's/>chr/>/g' > hg19_new.fa

For BED files:

cat myhg19Genes.bed sed 's/^chr//' > myhg19Genes_new.bed

For BAM files:

samtools view -h hg19Alignments.bam | sed 's/chr//g' | samtools view -Shb - -o hg19Alignments_new.bam
ADD COMMENTlink modified 9 weeks ago by RamRS25k • written 5.1 years ago by Dan D7.0k
2

For BAM files, it might be faster to modify the header with sed, write that to a file, and then use samtools reheader (there's no need to then parse everything to/from sam). Of course, that doesn't fit as nicely into a single command as your solution :)

ADD REPLYlink modified 9 weeks ago by RamRS25k • written 5.1 years ago by Devon Ryan93k
1

Modifying only the header is also safer, so that you don't risk truncating base quality values that happen to form the word chr (very unlikely, but still possible). But even in the header, you risk changing comment lines in the @CO tag, or file names or commands in the @PG tag, containing words like "chrom", "chromosome", or "jesus christ! why is this so complicated?!"

ADD REPLYlink modified 9 weeks ago by RamRS25k • written 5.1 years ago by Cyriac Kandoth5.4k
1
gravatar for Max Masnick
9 weeks ago by
Max Masnick10
Max Masnick10 wrote:

For hg38/GRCh38, the answer appears to be to use the Ensembl reference sequence. This does not have the chr prefix.

The GATK reference sequence from their Resource Bundle has the chr prefix.

Note that if you are using the Ensembl reference sequence with GATK, you'll need to do some extra steps to generate .dict and .fai files. These are available for the GATK Resource Bundle FASTAs, but not for Ensembl.

Do not try to use sed or other find/replace tools to fix this problem yourself (unless you are extremely bandwidth-constrained or no option to download a properly-formatted reference FASTA). Cyriac Kandoth is absolutely right that this is a can of worms in many cases, and just not worth it when a reference file without chr is readily available.

ADD COMMENTlink written 9 weeks ago by Max Masnick10

Just a follow up to this question- I'm trying to transition over to GRCh38 (for everything in my pipeline; align, exome variant call, vep annotation, rna-seq, etc). I'm stuck at the GATK resource bundle (which I use for the BQSR), as it seems they only have hg38 resource bundle released (https://console.cloud.google.com/storage/browser/broad-references/hg38/v0), but this is following the ucsc format (with the "chr" prefix), as you've mentioned. From https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use, Heng suggested to use a no-alt set with hard mask for the PARs. Is this still recommended, and does the ensembl have versions of this (along with possibly also decoys)? I also see that 1000 genomes have a reference as well, that has the PARs and the decoys, but they also have the chr prefix.

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by leptonsonic0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1855 users visited in the last hour