Question

Heading format for GRCh38 reference file.

5.4 years ago

marongiu.luigi ▴ 710

Dear all,

I have downloaded the GRCh38 human reference genome in FASTA format from ftp.ncbi.nlm.nih.gov, following the suggestions of this post and this one. I now need to generate a fusion genome, so I need to get the headers right. The GRCh38 headers have this form:

>chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38
>chr2  AC:CM000664.2  gi:568336022  LN:242193529  rl:Chromosome  M5:f98db672eb0993dcfdabafe2a882905c  AS:GRCh38
>chr3  AC:CM000665.2  gi:568336021  LN:198295559  rl:Chromosome  M5:76635a41ea913a405ded820447d067b0  AS:GRCh38
[...]
>chrUn_GL000218v1  AC:GL000218.1  gi:224183305  LN:161147  rl:unplaced  M5:1d708b54644c26c7e01c2dad5426d38c  AS:GRCh38
>chrEBV  AC:AJ507799.2  gi:86261677  LN:171823  rl:decoy  M5:6743bd63b3ff2b5b8985d8933c53290a  SP:Human_herpesvirus_4  tp:circular

May I ask what the format of the header is? What do the individual fields refer to?

Thank you

genome alignment fasta header • 3.1k views

updated 5.4 years ago by ATpoint 81k • written 5.4 years ago by marongiu.luigi ▴ 710

score 4 · Accepted Answer · 2018-11-21

>chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38

AC: accession number https://www.ncbi.nlm.nih.gov/nuccore/CM000663.2

gi: NCBI global identifier https://www.ncbi.nlm.nih.gov/nuccore/568336023

LN: length

rl: Assigned-Molecule-Location/Type (chromosome, mitochondrial...)

M5: MD5 checksum https://en.wikipedia.org/wiki/Md5sum

AS: assembly
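If you want to pull these fields out programmatically, here is a minimal Python sketch. It assumes every field after the sequence name is a whitespace-separated `TAG:value` token, as in the headers quoted above; the function name `parse_fasta_header` is just an illustrative choice, not part of any library.

```python
def parse_fasta_header(line):
    """Split a GRCh38-style FASTA header into (name, {tag: value}).

    Assumes the layout shown in the question: '>name' followed by
    whitespace-separated TAG:value tokens.
    """
    tokens = line.strip().lstrip(">").split()
    name = tokens[0]
    fields = {}
    for token in tokens[1:]:
        # Split on the first colon only, so values may contain colons.
        tag, _, value = token.partition(":")
        fields[tag] = value
    return name, fields


header = (
    ">chr1  AC:CM000663.2  gi:568336023  LN:248956422  "
    "rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38"
)
name, fields = parse_fasta_header(header)
print(name, fields["AC"], int(fields["LN"]), fields["AS"])
```

All values come back as strings, so convert `LN` with `int()` if you need the length as a number.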

score 4 · Accepted Answer · 2018-11-21

AC: Locus identifier / accession number at GenBank & NCBI

gi: The GI number (sometimes written in lower case, "gi") is simply a series of digits assigned consecutively to each sequence record processed by NCBI. It bears no resemblance to the accession number of the sequence record.

LN: Length of the contig

rl: role in the assembly (chromosome, unplaced, EBV decoy)

M5: MD5 checksum of the sequence (integrity check)

AS: Assembly version

SP: Species

tp: probably type; indicates whether the sequence is linear or circular

Anyway, to keep things compact, I always delete everything after the first whitespace before building a genome.fa or an index.
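For what it's worth, that trimming step could look roughly like this in Python. This is only a sketch: `simplify_headers` is a name I made up, and the file names in the usage comment are placeholders for your own paths.

```python
def simplify_headers(lines):
    """Yield FASTA lines with each header cut at the first whitespace.

    Sequence lines are passed through unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            # Keep only the first whitespace-delimited token, e.g. '>chr1'.
            yield line.split(None, 1)[0] + "\n"
        else:
            yield line


# Example usage (placeholder file names):
# with open("GRCh38.fa") as src, open("GRCh38.simple.fa", "w") as dst:
#     dst.writelines(simplify_headers(src))

demo = [
    ">chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome\n",
    "NNNNACGT\n",
]
print("".join(simplify_headers(demo)), end="")
```

Because it streams line by line, it should cope with a full genome without loading the file into memory. Note that it assumes the sequence name follows the `>` directly, with no space in between.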