ANNOVAR. How to prepare the database with RefSeq.hg38.gff?
Entering edit mode
9 months ago
avrora5901 ▴ 10


I study the Annovar program and try to make my own base for further variant annotation. I've downloaded the data from NCBI in .gff, .gtf and .fna formats. Next I filtered the data and gained the reference sequence for hg38 version. Also I've downloaded the refGene.txt file from the UCSC. However, the gtfToGenePred and gff3ToGenePred didn't help as well as the fasta file .fna to perform this command correctly

perl --format refGene --seqfile RefSeq2.hg38.gtf refGene.txt --out refGeneMrna.fa

The result in all cases was many "WARNING: Cannot identify sequence for rna-NR..." and finally "NOTICE: Finished writting FASTA for 0 genomic regions to refGeneMrna.fa"

Does anybody know how to fix this or have any idea that could help? Maybe I missed something? Maybe Annovar is working only with databases which had been included in it?

Thanks much everybody who will try to help.

annovar database .gff .fa file formats • 634 views
Entering edit mode
7 months ago
desouzareis.r ▴ 250


I was able to prepare the RefSeq database with the following bash script. I hope it helps!

gff3ToGenePred=gff3ToGenePred # from # from annovar
fasta=Homo_sapiens_assembly38.fasta # path to reference genome from broadinstitute bucket

## Get the UCSC style name for contigs
wget --quiet -O - \
    | grep -v "^#" \
    | tr -d $'\r' \
    | sed 's/ \+//g' \
    | awk -F $'\t' -v OFS='\t' 'BEGIN {print "#Table Info:\n#Assembly-Unit\tUCSC-style-name"}{if($7!="na" && $10!="na") print $7,$10}' \
     > GRCh38_latest_assembly_report.txt 

## Get the RefSeq file
wget --quiet -O - \
    | gzip -dc > GRCh38_latest_genomic.gff

# gff3 to GenePred format
$gff3ToGenePred -refseqHacks GRCh38_latest_genomic.gff ncbiRefGene.txt

# convert assembly unity name to UCSC style name
while IFS=$'\t' read k v; do sudo sed -e "s,\b$k\b,$v,g" -i ncbiRefGene.txt; done < GRCh38_latest_assembly_report.txt 

# keep only NM_, NR_ e YP_ (mitochrondria) and add a random first field #bin
grep "^NM_\|^NR_\|^YP_" ncbiRefGene.txt | grep -v "_alt\|_fix\|_random\|NW_\|chrUn_" >ncbiRefGeneMrna.txt
nl ncbiRefGeneMrna.txt >hg38_ncbiRefSeq.txt

# generate a transcript FASTA file using annovar script
perl $retrieve_seq_from_fasta --format refGene --seqfile $fasta hg38_ncbiRefSeq.txt --out hg38_ncbiRefSeqMrna.fa
Entering edit mode

Thank you very much. I used your script to make refseq for ncbi. I wish to ask two questions. First, How can I make enGene.txt from ensemble gff file? Actually I download gff and make genephred, but its format is different from Annovar's one. If you could help me, I would appriciate it. Second, while I'm annotating using reseq created by your pipeline, I got several warnings: WARNING: cannot find annotation for NM_001353151.1 in the genefile /media/ilyome/data/ngs-bundle/annovar/humandb//hg38_ncbiRefSeq.txt or cannot infer the transcription start site Are they big problems or can I ignore them? Thank You Masoud


Login before adding your answer.

Traffic: 1081 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6