Question

ANNOVAR. How to prepare the database with RefSeq.hg38.gff?

1

Entering edit mode

3.4 years ago

avrora5901 ▴ 10

Hello!

I study the Annovar program and try to make my own base for further variant annotation. I've downloaded the data from NCBI in .gff, .gtf and .fna formats. Next I filtered the data and gained the reference sequence for hg38 version. Also I've downloaded the refGene.txt file from the UCSC. However, the gtfToGenePred and gff3ToGenePred didn't help as well as the fasta file .fna to perform this command correctly

perl retrieve_seq_from_fasta.pl --format refGene --seqfile RefSeq2.hg38.gtf refGene.txt --out refGeneMrna.fa

The result in all cases was many "WARNING: Cannot identify sequence for rna-NR..." and finally "NOTICE: Finished writting FASTA for 0 genomic regions to refGeneMrna.fa"

Does anybody know how to fix this or have any idea that could help? Maybe I missed something? Maybe Annovar is working only with databases which had been included in it?

Thanks much everybody who will try to help.

fasta gff annovar • 2.9k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 3.4 years ago by avrora5901 ▴ 10

score 1 · Answer 1 · 2021-02-04

Hello!

I was able to prepare the RefSeq database with the following bash script. I hope it helps!

#!/bin/bash
gff3ToGenePred=gff3ToGenePred # from http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/
retrieve_seq_from_fasta=retrieve_seq_from_fasta.pl # from annovar
fasta=Homo_sapiens_assembly38.fasta # path to reference genome from broadinstitute bucket


## Get the UCSC style name for contigs
wget --quiet -O - ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_assembly_report.txt \
    | grep -v "^#" \
    | tr -d $'\r' \
    | sed 's/ \+//g' \
    | awk -F $'\t' -v OFS='\t' 'BEGIN {print "#Table Info:https://www.ncbi.nlm.nih.gov/genome/guide/human/\n#Assembly-Unit\tUCSC-style-name"}{if($7!="na" && $10!="na") print $7,$10}' \
     > GRCh38_latest_assembly_report.txt 


## Get the RefSeq file
wget --quiet -O - ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz \
    | gzip -dc > GRCh38_latest_genomic.gff


# gff3 to GenePred format
$gff3ToGenePred -refseqHacks GRCh38_latest_genomic.gff ncbiRefGene.txt


# convert assembly unity name to UCSC style name
while IFS=$'\t' read k v; do sudo sed -e "s,\b$k\b,$v,g" -i ncbiRefGene.txt; done < GRCh38_latest_assembly_report.txt 


# keep only NM_, NR_ e YP_ (mitochrondria) and add a random first field #bin
grep "^NM_\|^NR_\|^YP_" ncbiRefGene.txt | grep -v "_alt\|_fix\|_random\|NW_\|chrUn_" >ncbiRefGeneMrna.txt
nl ncbiRefGeneMrna.txt >hg38_ncbiRefSeq.txt

# generate a transcript FASTA file using annovar script retrieve_seq_from_fasta.pl
perl $retrieve_seq_from_fasta --format refGene --seqfile $fasta hg38_ncbiRefSeq.txt --out hg38_ncbiRefSeqMrna.fa

score 0 · Answer 2 · 2022-08-05

for the reason of bug "WARNING: Cannot identify sequence for rna-NR..." and finally "NOTICE: Finished writting FASTA for 0 genomic regions to refGeneMrna.fa", it was your account have no permission to write the origin fasta file. Perhaps the genome.fa was built by "root" and with the permission "-rw-r--r--", and your account was just common user. To solve the problem, you can copy the genome.fa to you path, then you will have a genome.fa with "-rwxrwxr-x". Then the bug will be solved anyway.

cp /path/to/genome.fa ./

Hope it will help you.