Heading format for GRCh38 reference file.
5.4 years ago

Dear all,

I have downloaded the GRCh38 human reference genome in FASTA format from ftp.ncbi.nlm.nih.gov, following the suggestions of this post and this one. I now need to generate a fusion genome, so I need to get the headers right. The GRCh38 headers have this form:

>chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38
>chr2  AC:CM000664.2  gi:568336022  LN:242193529  rl:Chromosome  M5:f98db672eb0993dcfdabafe2a882905c  AS:GRCh38
>chr3  AC:CM000665.2  gi:568336021  LN:198295559  rl:Chromosome  M5:76635a41ea913a405ded820447d067b0  AS:GRCh38
[...]
>chrUn_GL000218v1  AC:GL000218.1  gi:224183305  LN:161147  rl:unplaced  M5:1d708b54644c26c7e01c2dad5426d38c  AS:GRCh38
>chrEBV  AC:AJ507799.2  gi:86261677  LN:171823  rl:decoy  M5:6743bd63b3ff2b5b8985d8933c53290a  SP:Human_herpesvirus_4  tp:circular

May I ask what the format of the header is? What do the individual fields refer to?

Thank you

5.4 years ago
>chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38

AC: accession number https://www.ncbi.nlm.nih.gov/nuccore/CM000663.2

gi: NCBI GI number (GenInfo Identifier) https://www.ncbi.nlm.nih.gov/nuccore/568336023

LN: length

rl: Assigned-Molecule-Location/Type (chromosome, mitochondrial...)

M5: MD5 checksum https://en.wikipedia.org/wiki/Md5sum

AS: assembly
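
As a rough illustration of the field layout described above, here is a minimal Python sketch that splits one of these headers into the sequence name and its KEY:value fields. The function name `parse_fasta_header` is made up for this example, and the sketch assumes every field after the name is a whitespace-separated `KEY:value` token, as in the headers quoted in the question.

```python
def parse_fasta_header(header: str) -> tuple[str, dict[str, str]]:
    """Split a FASTA header into the sequence name and its KEY:value fields.

    Assumes the layout shown in the question: '>name' followed by
    whitespace-separated KEY:value tokens (hypothetical helper, not a library API).
    """
    tokens = header.lstrip(">").split()
    name = tokens[0]
    fields = {}
    for token in tokens[1:]:
        # Split on the first colon only, in case a value contains a colon.
        key, _, value = token.partition(":")
        fields[key] = value
    return name, fields


header = (
    ">chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  "
    "M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38"
)
name, fields = parse_fasta_header(header)
print(name, fields["AC"], int(fields["LN"]))
```

All values come back as strings, so numeric fields such as LN need an explicit `int()` conversion.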

5.4 years ago
ATpoint 81k
AC: Locus identifier / Accession number at Genbank & NCBI

gi: GI number (sometimes written in lower case, "gi") is simply a series of digits that are assigned consecutively to each sequence record processed by NCBI. The GI number bears no resemblance to the Accession number of the sequence record.

LN: Length of the contig

rl: role in the assembly (chromosome, unplaced, EBV decoy)

M5: MD5 checksum of the sequence (integrity check)

AS: Assembly version

SP: Species

tp: probably type, indicates if linear or circular

Anyway, to keep things compact, I always delete everything after the first whitespace in each header before building a genome.fa or index.
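
One possible way to do that trimming is sketched below in Python. The helper names `strip_header` and `simplify_fasta` and the file names in the usage note are hypothetical, and the sketch assumes the sequence name follows the `>` directly with no space in between, as in the headers above.

```python
def strip_header(line: str) -> str:
    """Keep only '>name' on header lines; return sequence lines unchanged."""
    if line.startswith(">"):
        parts = line.split(None, 1)  # split at the first run of whitespace
        return (parts[0] if parts else ">") + "\n"
    return line


def simplify_fasta(src_path: str, dst_path: str) -> None:
    """Copy a FASTA file, trimming each header to its first token."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:  # streams line by line, so large genomes are fine
            dst.write(strip_header(line))
```

Calling something like `simplify_fasta("GRCh38.fa", "genome.fa")` (placeholder file names) would turn `>chr1  AC:CM000663.2 ...` into `>chr1` while leaving the sequence lines untouched.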
