Where can I download the GRCh38-lite.fa and all_sequences.fa files for the hg38 version?
4.8 years ago
a.james ▴ 240

Dear All,

I would like to download the GRCh38-lite.fa and all_sequences.fa files for the hg38 version to run an analysis pipeline on whole exome sequencing data.

I have used the following files for the hg19 version, but I cannot find equivalents for the hg38 version.

from here

Could someone help me with finding the files? Thank you.

exome next-gen alignment • 3.5k views

What are the chromosomes inside those two files? Is the lite file the primary assembly?

Can you copy the output of:

grep "^>" GRCh38-lite.fa
grep "^>" all_sequence.fa

Sorry, my question is about where I can find GRCh38-lite.fa to download. I have not downloaded it yet. I only have GRCh37-lite.fa, so if you are asking me to grep GRCh37-lite.fa, this is what it looks like:

What are the chromosomes inside those two files? DNA chromosome.
Is the lite file the primary assembly? Yes.

Description from the README of GRCh37-lite.fa:

GRCh37-lite is a subset of the full GRCh37 human genome assembly (assembly accession GCA_000001405.1) plus the human mitochondrial genome reference sequence (the "rCRS") from Mitomap.org. This set of sequences excludes all the alternate loci scaffolds of the full GRCh37 assembly, and has the pseudo-autosomal regions (PARs) on chromosome Y masked with Ns. This haploid representation of the genome is provided as a convenience for use in alignment pipelines that cannot handle the multiple placements expected in the PARs and in regions of the genome that are represented by the alternate loci.

The headers look like this:

>1 CM000663.1 Homo sapiens chromosome 1, GRCh37 primary reference assembly
>2 CM000664.1 Homo sapiens chromosome 2, GRCh37 primary reference assembly
>3 CM000665.1 Homo sapiens chromosome 3, GRCh37 primary reference assembly

And the output of grep "^>" all_sequences.fa | head looks like the following:

>1 dna:chromosome chromosome:GRCh37:1:1:249250621:1
>2 dna:chromosome chromosome:GRCh37:2:1:243199373:1
>3 dna:chromosome chromosome:GRCh37:3:1:198022430:1

Yes, I meant GRCh37-lite.fa, sorry.

I want the list of all entries in these files, so please just remove the head from your command.

If you only want the primary assembly, you can take this one:

ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/GRCh38.primary_assembly.genome.fa.gz

From: https://www.gencodegenes.org/human/


Thanks for the links. It is a big file. Do you want me to copy all lines here?


There are more than 5000 lines. Could you please tell me where I should post or upload them?


5000 lines would mean 5000 chromosomes, alternate loci, unplaced contigs... That is a lot. I do not understand the difference between your two files.

Could you try this:

grep -c "^>" GRCh37-lite.fa
grep -c "^>" all_sequence.fa
grep "^>" GRCh37-lite.fa | tail -10
grep "^>" all_sequence.fa | tail -10
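
The same checks can be wrapped in a small helper so they run the same way on every file. This is only a minimal sketch, assuming plain or gzip-compressed FASTA input; the function names read_fasta and fasta_summary are made up for illustration.

```bash
# Hypothetical helper: print a FASTA file to stdout,
# decompressing it first if the file name ends in .gz.
read_fasta() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac
}

# Hypothetical helper: print the number of sequences in a FASTA file,
# followed by its last N header lines (N defaults to 10).
fasta_summary() {
    local fasta="$1" n="${2:-10}" headers
    # Read the file once and keep only the header lines.
    headers="$(read_fasta "$fasta" | grep '^>' || true)"
    printf 'sequences: %s\n' "$(printf '%s\n' "$headers" | grep -c '^>' || true)"
    printf '%s\n' "$headers" | tail -n "$n"
}
```

For example, fasta_summary GRCh37-lite.fa would print the sequence count on the first line and the last ten headers after it.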

grep -c "^>" GRCh37-lite.fa

84

grep -c "^>" all_sequence.fa

123

grep "^>" GRCh37-lite.fa | tail -10

>GL000224.1 Homo sapiens unplaced genomic contig, GRCh37 reference primary assembly
>GL000223.1 Homo sapiens unplaced genomic contig, GRCh37 reference primary assembly
>GL000195.1 Homo sapiens chromosome 7 unlocalized genomic contig, GRCh37 reference primary assembly
>GL000212.1 Homo sapiens unplaced genomic contig, GRCh37 reference primary assembly
>GL000222.1 Homo sapiens unplaced genomic contig, GRCh37 reference primary assembly
>GL000200.1 Homo sapiens chromosome 9 unlocalized genomic contig, GRCh37 reference primary assembly
>GL000193.1 Homo sapiens chromosome 4 unlocalized genomic contig, GRCh37 reference primary assembly
>GL000194.1 Homo sapiens chromosome 4 unlocalized genomic contig, GRCh37 reference primary assembly
>GL000225.1 Homo sapiens unplaced genomic contig, GRCh37 reference primary assembly
>GL000192.1 Homo sapiens chromosome 1 unlocalized genomic contig, GRCh37 reference primary assembly

grep "^>" all_sequences.fa | tail -10

>gi|82503188|ref|NC_007605.1| Human herpesvirus 4 type 1, complete genome 
>gi|9626053|ref|NC_001355.1| Human papillomavirus type 6b, complete genome  
>gi|9626069|ref|NC_001357.1| Human papillomavirus - 18, complete genome  
>gi|9627305|ref|NC_001583.1| Human papillomavirus type 26, complete genome  
>gi|9627377|ref|NC_001593.1| Human papillomavirus type 53, complete genome  
>gi|9628437|ref|NC_001676.1| Human papillomavirus 54, complete genome  
>gi|9628574|ref|NC_001694.1| Human papillomavirus - 61, complete genome  
>gi|9628642|ref|NC_001699.1| JC polyomavirus, complete genome 
>gi|9629378|ref|NC_001806.1| Human herpesvirus 1, complete genome 
>gi|62006071|dbj|AP007264.1| HBV genotype G DNA, complete genome, isolate: HB-JI444GF

You told me in your last message that there were 5000 lines in your output, but I only see counts of 84 and 123 here. 84 entries should be 24 chromosomes plus some unplaced/unlocalized contigs, so GRCh37-lite.fa corresponds to what is called the primary assembly file for GRCh38. The link I sent you will work.

I do not know what is inside all_sequences.fa. Where did you download it from? It seems you have some viruses in there.
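
The breakdown of the 84 entries can be checked directly from the headers. A minimal sketch, assuming the header wording shown earlier in this thread ("unlocalized genomic contig", "unplaced genomic contig") and an uncompressed file; the function name classify_headers is made up, and everything that matches neither phrase (chromosomes, mitochondrion, anything else) is lumped into "other".

```bash
# Hypothetical helper: tally FASTA headers by category, based on the
# wording used in the GRCh37-lite headers quoted in this thread.
classify_headers() {
    awk '
        /^>/ && /unlocalized genomic contig/ { n["unlocalized"]++; next }
        /^>/ && /unplaced genomic contig/    { n["unplaced"]++;    next }
        /^>/                                 { n["other"]++ }
        END { for (k in n) print k "\t" n[k] }
    ' "$1" | sort
}
```

Running classify_headers GRCh37-lite.fa prints one tab-separated line per category, which makes it easy to see how many of the entries are full chromosomes and how many are extra contigs.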


The README looks like this:

GRCh37-lite-+-HPV_Redux-build consists of GRCh37-lite appended with a nonredundant subset of common human papilloma viruses.  Included are HPV types 1 (NC_001356.1), 2 (NC_001352.1), 4 (NC_001457.1), 5 (NC_001531.1), 6 (NC_001355.1), 7 (NC_001595.1), 9 (NC_001596.1), 10 (NC_001576.1), 16 (NC_001526.2), 18 (NC_001357.1), 26 (NC_001583.1), 31 (J04353.1), 32 (NC_001586.1), 33 (M12732.1), 34 (NC_001587.1), 41 (NC_001354.1), 45 (EF202167.1), 48 (NC_001690.1), 49 (NC_001591.1), 50 (NC_001691.1), 53 (NC_001593.1), 54 (NC_001676.1), 60 (NC_001693.1), 61 (NC_001694.1), 63 (NC_001458.1), 88 (NC_010329.1), 90 (NC_004104.1), 92 (NC_004500.1), 96 (NC_005134.2), 101 (NC_008189.1), 103 (NC_008188.1), 108 (NC_012213.1), 109 (NC_012485.1), 112 (NC_012486.1), 116 (NC_013035.1), 121 (NC_014185.1), 128 (NC_014952.1), 129 (NC_014953.1), 131 (NC_014954.1), 132 (NC_014955.1), 134 (NC_014956.1), and 148 (NC_014835.1).

I downloaded it from:

ftp://ftp.ncbi.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/special_requests/

4.8 years ago
GenoMax 141k

What you are looking for is the "Primary" assembly file.

Primary assembly contains all toplevel sequence regions excluding haplotypes and patches. This file is best used for performing sequence similarity searches where patch and haplotype sequences would confuse analysis.

You can find those sequences here at Ensembl (large download) or NCBI (large download).

NCBI sequence contains the following:

GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

A gzipped file that contains FASTA format sequences for the following:
1. chromosomes from the GRCh38 Primary Assembly unit.
   Note: the two PAR regions on chrY have been hard-masked with Ns. The chromosome Y sequence provided therefore has the same coordinates as the GenBank sequence but it is not identical to the GenBank sequence. Similarly, duplicate copies of centromeric arrays and WGS on chromosomes 5, 14, 19, 21 & 22 have been hard-masked with Ns (locations of the unmasked copies are given below).
2. mitochondrial genome from the GRCh38 non-nuclear assembly unit.
3. unlocalized scaffolds from the GRCh38 Primary Assembly unit.
4. unplaced scaffolds from the GRCh38 Primary Assembly unit.
5. Epstein-Barr virus (EBV) sequence
   Note: The EBV sequence is not part of the genome assembly but is included in the analysis set as a sink for alignment of reads that are often present in sequencing samples.
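
One way to see the hard-masking in a downloaded file is to count the N bases in a single sequence. A minimal sketch; the function name count_masked is made up, and the sequence name passed in (chrY in the usage below) is an assumption that should be checked against the actual headers with grep "^>" first.

```bash
# Hypothetical helper: report total length and number of N/n bases
# for one named sequence in a plain or gzip-compressed FASTA file.
count_masked() {
    local fasta="$1" name="$2"
    case "$fasta" in
        *.gz) gzip -dc "$fasta" ;;
        *)    cat "$fasta" ;;
    esac | awk -v name="$name" '
        /^>/ {
            # The sequence name is the first word after ">".
            keep = (substr($1, 2) == name)
            next
        }
        keep {
            total  += length($0)
            # gsub returns the number of N/n characters it removed.
            masked += gsub(/[Nn]/, "")
        }
        END { printf "%s\tlength=%d\tN=%d\n", name, total, masked }
    '
}
```

Usage would be something like count_masked GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz chrY. The N count includes ordinary assembly gaps as well as the masked regions, so it is a sanity check rather than an exact measure of the masking.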

No, I am not looking for the primary assembly. I am looking for the genome-lite assembly file: the equivalent of GRCh37-lite.fa for the hg38 version, and the equivalent of all_sequences.fa for the hg38 version.


Based on what you wrote above:

GRCh37-lite is a subset of the full GRCh37 human genome assembly (assembly accession GCA_000001405.1) plus the human mitochondrial genome reference sequence (the "rCRS") from Mitomap.org. This set of sequences excludes all the alternate loci scaffolds of the full GRCh37 assembly, and has the pseudo-autosomal regions (PARs) on chromosome Y masked with Ns. This haploid representation of the genome is provided as a convenience for use in alignment pipelines that cannot handle the multiple placements expected in the PARs and in regions of the genome that are represented by the alternate loci.

the primary assembly file described in my answer is the file you are looking for.

If you need "all_sequences", i.e. including alt haplotypes, then you should get the full sequence file from NCBI.

Note: Do not be put off by the fact that the hg38 files are not called lite or full. If you need those other viral sequences in the new file, then append them to the hg38 reference.
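
Appending is a plain concatenation, after which the combined file has to be re-indexed for whatever aligner the pipeline uses. A minimal sketch, assuming uncompressed FASTA inputs; the function name append_sequences is made up, and the duplicate-name check is an extra safety step, not something required by the files themselves.

```bash
# Hypothetical helper: write reference + extra sequences to a new FASTA,
# refusing to continue if any sequence name would appear twice.
append_sequences() {
    local ref="$1" extra="$2" out="$3" dups
    # Sequence name = first word of each header line.
    dups="$(awk '/^>/ { print substr($1, 2) }' "$ref" "$extra" | sort | uniq -d)"
    if [ -n "$dups" ]; then
        echo "duplicate sequence names:" >&2
        echo "$dups" >&2
        return 1
    fi
    # "awk 1" copies every line and adds a final newline if one is missing,
    # so the first header of the second file never joins the previous line.
    awk 1 "$ref" "$extra" > "$out"
}
```

For example, append_sequences hg38_primary.fa viral.fa hg38_plus_viral.fa, where all three file names are placeholders.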


Thank you for the description. I will try this file. @genomax, may I know where I could get the viral sequences?


You can get them from the accession numbers listed below or from your GRCh37-lite file.

GRCh37-lite-+-HPV_Redux-build consists of GRCh37-lite appended with a nonredundant subset of common human papilloma viruses. Included are HPV types 1 (NC_001356.1), 2 (NC_001352.1), 4 (NC_001457.1), 5 (NC_001531.1), 6 (NC_001355.1), 7 (NC_001595.1), 9 (NC_001596.1), 10 (NC_001576.1), 16 (NC_001526.2), 18 (NC_001357.1), 26 (NC_001583.1), 31 (J04353.1), 32 (NC_001586.1), 33 (M12732.1), 34 (NC_001587.1), 41 (NC_001354.1), 45 (EF202167.1), 48 (NC_001690.1), 49 (NC_001591.1), 50 (NC_001691.1), 53 (NC_001593.1), 54 (NC_001676.1), 60 (NC_001693.1), 61 (NC_001694.1), 63 (NC_001458.1), 88 (NC_010329.1), 90 (NC_004104.1), 92 (NC_004500.1), 96 (NC_005134.2), 101 (NC_008189.1), 103 (NC_008188.1), 108 (NC_012213.1), 109 (NC_012485.1), 112 (NC_012486.1), 116 (NC_013035.1), 121 (NC_014185.1), 128 (NC_014952.1), 129 (NC_014953.1), 131 (NC_014954.1), 132 (NC_014955.1), 134 (NC_014956.1), and 148 (NC_014835.1).
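
Pulling the viral records out of an existing FASTA by accession could look roughly like this. A minimal sketch, assuming an uncompressed FASTA whose headers contain the accession as a separate token (as in the gi|...|ref|NC_001357.1| headers shown earlier) and a text file with one accession per line; the function name extract_by_accession and the file names are made up.

```bash
# Hypothetical helper: print every FASTA record whose header contains
# one of the accessions listed (one per line) in a text file.
extract_by_accession() {
    local fasta="$1" acc_list="$2"
    awk '
        # First file: load the accession list.
        NR == FNR { if ($1 != "") acc[$1] = 1; next }
        # Header line: split on "|", spaces and tabs, then look for
        # an exact match with any listed accession.
        /^>/ {
            keep = 0
            n = split(substr($0, 2), parts, "[| \t]+")
            for (i = 1; i <= n; i++) if (parts[i] in acc) { keep = 1; break }
        }
        # Print header and sequence lines of the records being kept.
        keep
    ' "$acc_list" "$fasta"
}
```

For example, extract_by_accession all_sequences.fa hpv_accessions.txt > viral.fa. Any accession that is not present in the old file would produce no output and would have to be fetched from NCBI by accession instead.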
