Question: Heading format for GRCh38 reference file.
2.3 years ago by
Germany, Mannheim, UMM
marongiu.luigi520 wrote:

Dear all,

I have downloaded the GRCh38 human reference genome in FASTA format from ftp.ncbi.nlm.nih.gov, following the suggestions of this post and this one. I now need to generate a fusion genome, so I need to get the headers right. The headers of the GRCh38 file have this form:

>chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38
>chr2  AC:CM000664.2  gi:568336022  LN:242193529  rl:Chromosome  M5:f98db672eb0993dcfdabafe2a882905c  AS:GRCh38
>chr3  AC:CM000665.2  gi:568336021  LN:198295559  rl:Chromosome  M5:76635a41ea913a405ded820447d067b0  AS:GRCh38
[...]
>chrUn_GL000218v1  AC:GL000218.1  gi:224183305  LN:161147  rl:unplaced  M5:1d708b54644c26c7e01c2dad5426d38c  AS:GRCh38
>chrEBV  AC:AJ507799.2  gi:86261677  LN:171823  rl:decoy  M5:6743bd63b3ff2b5b8985d8933c53290a  SP:Human_herpesvirus_4  tp:circular

May I ask what the format of the header is, and what the individual fields refer to?

Thank you

alignment fasta header genome • 1.2k views
2.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum134k wrote:
>chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38

AC: accession number https://www.ncbi.nlm.nih.gov/nuccore/CM000663.2

gi: NCBI GenInfo Identifier (GI number) https://www.ncbi.nlm.nih.gov/nuccore/568336023

LN: length

rl: Assigned-Molecule-Location/Type (chromosome, mitochondrial...)

M5: MD5 checksum of the sequence https://en.wikipedia.org/wiki/Md5sum

AS: assembly
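
As a rough illustration of the M5 field, here is a minimal Python sketch that computes an MD5 digest over a sequence. It assumes the checksum follows the SAM-style M5 convention (sequence upper-cased, whitespace removed), which I have not verified for this particular file; the function name `sequence_md5` and the toy sequence are made up for the example.

```python
import hashlib

def sequence_md5(sequence_lines):
    """Return the MD5 hex digest of a sequence given as an iterable of lines.

    Assumption: whitespace is dropped and bases are upper-cased before
    hashing, as in the SAM M5 convention.
    """
    digest = hashlib.md5()
    for line in sequence_lines:
        cleaned = "".join(line.split()).upper()
        digest.update(cleaned.encode("ascii"))
    return digest.hexdigest()

# Toy sequence split over two FASTA lines (not real GRCh38 data).
toy_sequence = ["acgt\n", "ACGT\n"]
print(sequence_md5(toy_sequence))
```

If the assumption holds, the digest computed over all sequence lines of a record should match the value after `M5:` in that record's header.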

2.3 years ago by
ATpoint46k wrote:
AC: Locus identifier / accession number at GenBank & NCBI

gi: GI number (sometimes written in lower case, "gi") is simply a series of digits that are assigned consecutively to each sequence record processed by NCBI. The GI number bears no resemblance to the Accession number of the sequence record.

LN: Length of the contig

rl: role in the assembly (chromosome, unplaced, EBV decoy)

M5: MD5 checksum of the sequence (integrity check)

AS: Assembly version

SP: Species

tp: probably type (topology), indicating whether the sequence is linear or circular
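
Since every field after the sequence name is a `key:value` pair separated by whitespace, a header can be split into a small dictionary. The following is only a sketch; `parse_header` is a name I made up, and it assumes no value contains whitespace.

```python
def parse_header(line):
    """Split a FASTA header into (sequence name, dict of key:value tags)."""
    fields = line.strip().lstrip(">").split()
    name = fields[0]
    tags = {}
    for field in fields[1:]:
        # Split on the first colon only, in case a value contains colons.
        key, _, value = field.partition(":")
        tags[key] = value
    return name, tags

header = (">chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  "
          "M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38")
name, tags = parse_header(header)
print(name, tags["AC"], int(tags["LN"]), tags["rl"])
```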

Anyway, to keep things compact, I always delete everything after the first whitespace in each header before building a genome.fa or an index.
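
One way to do that trimming, sketched in Python (any tool that cuts the header at the first whitespace would do equally well; `strip_headers` is a hypothetical helper name, and the in-memory demo stands in for real files):

```python
import io

def strip_headers(src, dst):
    """Copy FASTA lines from src to dst, keeping only the first
    whitespace-delimited token of each header line."""
    for line in src:
        if line.startswith(">"):
            line = line.split(None, 1)[0] + "\n"
        dst.write(line)

# In-memory demo with a shortened toy record.
fasta_in = io.StringIO(">chr1  AC:CM000663.2  gi:568336023  AS:GRCh38\nACGT\n")
fasta_out = io.StringIO()
strip_headers(fasta_in, fasta_out)
print(fasta_out.getvalue(), end="")
```

For real data, pass file objects opened for reading and writing instead of the `StringIO` buffers.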
