I am learning ANNOVAR and trying to build my own database for variant annotation. I downloaded the data from NCBI in .gff, .gtf and .fna formats, then filtered it and obtained the reference sequence for hg38. I also downloaded the refGene.txt file from UCSC. However, neither gtfToGenePred nor gff3ToGenePred helped, and neither did the .fna FASTA file: I still cannot get this command to run correctly:
perl retrieve_seq_from_fasta.pl --format refGene --seqfile RefSeq2.hg38.gtf refGene.txt --out refGeneMrna.fa
In every case the result was many warnings such as
"WARNING: Cannot identify sequence for rna-NR..." and finally
"NOTICE: Finished writting FASTA for 0 genomic regions to refGeneMrna.fa"
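One thing that may be worth checking is whether the sequence names in the FASTA file match the chromosome names in the gene table at all. NCBI genome FASTA files name sequences by accession (for example NC_000001.11), while the UCSC refGene.txt uses names like chr1. The sketch below is a hypothetical helper, not part of ANNOVAR. It assumes a tab-separated gene table with the chromosome in the third column (as in the UCSC refGene.txt, which has a leading bin column); for a plain genePred file without the bin column, chrom_col would be 1 instead.

```python
import io


def fasta_names(handle):
    """Collect the first word of every '>' header line in a FASTA stream."""
    names = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gene_table_chroms(handle, chrom_col=2):
    """Collect the values of one tab-separated column (0-based index)."""
    chroms = set()
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) > chrom_col:
            chroms.add(fields[chrom_col])
    return chroms


def missing_chroms(fasta_handle, table_handle, chrom_col=2):
    """Chromosome names used by the gene table but absent from the FASTA."""
    wanted = gene_table_chroms(table_handle, chrom_col)
    return sorted(wanted - fasta_names(fasta_handle))


# Tiny in-memory demo with made-up records (not real annotation data).
demo_fasta = io.StringIO(">NC_000001.11 Homo sapiens chromosome 1\nACGT\n>chr2\nGGCC\n")
demo_table = io.StringIO("0\tNM_A\tchr1\t+\n1\tNM_B\tchr2\t-\n")
print(missing_chroms(demo_fasta, demo_table))
```

With real files, the two handles would come from open() on the .fna file and on refGene.txt. A non-empty result would suggest that the names need to be harmonised before sequences can be looked up, although this is only one possible cause of the warnings.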
Does anybody know how to fix this, or have any idea that could help? Maybe I missed something? Or does ANNOVAR only work with the databases that are distributed with it?
Many thanks to everybody who tries to help.
Thank you very much. I used your script to build the NCBI RefSeq database. I would like to ask two questions.
First, how can I make ensGene.txt from an Ensembl GFF file? I downloaded the GFF and converted it to genePred, but its format differs from the one ANNOVAR uses. If you could help me, I would appreciate it.
Second, while annotating with the RefSeq files created by your pipeline, I got several warnings:
"WARNING: cannot find annotation for NM_001353151.1 in the genefile /media/ilyome/data/ngs-bundle/annovar/humandb//hg38_ncbiRefSeq.txt or cannot infer the transcription start site"
Are these serious problems, or can I ignore them?
Thank you,
Masoud