ANNOVAR. How to prepare the database with RefSeq.hg38.gff?
1
1
Entering edit mode
10 months ago
avrora5901 ▴ 10

Hello!

I study the Annovar program and try to make my own base for further variant annotation. I've downloaded the data from NCBI in .gff, .gtf and .fna formats. Next I filtered the data and gained the reference sequence for hg38 version. Also I've downloaded the refGene.txt file from the UCSC. However, the gtfToGenePred and gff3ToGenePred didn't help as well as the fasta file .fna to perform this command correctly

perl retrieve_seq_from_fasta.pl --format refGene --seqfile RefSeq2.hg38.gtf refGene.txt --out refGeneMrna.fa

The result in all cases was many "WARNING: Cannot identify sequence for rna-NR..." and finally "NOTICE: Finished writting FASTA for 0 genomic regions to refGeneMrna.fa"

Does anybody know how to fix this or have any idea that could help? Maybe I missed something? Maybe Annovar is working only with databases which had been included in it?

Thanks much everybody who will try to help.

annovar database .gff .fa file formats • 704 views
ADD COMMENT
1
Entering edit mode
8 months ago
desouzareis.r ▴ 250

Hello!

I was able to prepare the RefSeq database with the following bash script. I hope it helps!

#!/bin/bash
gff3ToGenePred=gff3ToGenePred # from http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/
retrieve_seq_from_fasta=retrieve_seq_from_fasta.pl # from annovar
fasta=Homo_sapiens_assembly38.fasta # path to reference genome from broadinstitute bucket


## Get the UCSC style name for contigs
wget --quiet -O - ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_assembly_report.txt \
    | grep -v "^#" \
    | tr -d $'\r' \
    | sed 's/ \+//g' \
    | awk -F $'\t' -v OFS='\t' 'BEGIN {print "#Table Info:https://www.ncbi.nlm.nih.gov/genome/guide/human/\n#Assembly-Unit\tUCSC-style-name"}{if($7!="na" && $10!="na") print $7,$10}' \
     > GRCh38_latest_assembly_report.txt 


## Get the RefSeq file
wget --quiet -O - ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz \
    | gzip -dc > GRCh38_latest_genomic.gff


# gff3 to GenePred format
$gff3ToGenePred -refseqHacks GRCh38_latest_genomic.gff ncbiRefGene.txt


# convert assembly unity name to UCSC style name
while IFS=$'\t' read k v; do sudo sed -e "s,\b$k\b,$v,g" -i ncbiRefGene.txt; done < GRCh38_latest_assembly_report.txt 


# keep only NM_, NR_ e YP_ (mitochrondria) and add a random first field #bin
grep "^NM_\|^NR_\|^YP_" ncbiRefGene.txt | grep -v "_alt\|_fix\|_random\|NW_\|chrUn_" >ncbiRefGeneMrna.txt
nl ncbiRefGeneMrna.txt >hg38_ncbiRefSeq.txt

# generate a transcript FASTA file using annovar script retrieve_seq_from_fasta.pl
perl $retrieve_seq_from_fasta --format refGene --seqfile $fasta hg38_ncbiRefSeq.txt --out hg38_ncbiRefSeqMrna.fa
ADD COMMENT
0
Entering edit mode

Thank you very much. I used your script to make refseq for ncbi. I wish to ask two questions. First, How can I make enGene.txt from ensemble gff file? Actually I download gff and make genephred, but its format is different from Annovar's one. If you could help me, I would appriciate it. Second, while I'm annotating using reseq created by your pipeline, I got several warnings: WARNING: cannot find annotation for NM_001353151.1 in the genefile /media/ilyome/data/ngs-bundle/annovar/humandb//hg38_ncbiRefSeq.txt or cannot infer the transcription start site Are they big problems or can I ignore them? Thank You Masoud

ADD REPLY

Login before adding your answer.

Traffic: 1451 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6