Hello, everyone! I am new to bioinformatics and have a background in biology and genetics, so I would really appreciate some good advice.
I am trying to retrieve the spike protein gene from an SRA file. Please let me know what I am doing wrong. If you have suggestions on how to correct my approach, I would be happy to hear them.
My idea was the following:
- If I have a reference sequence of the gene (the spike protein nucleotide sequence, in FASTA format), I should be able to find its position in the raw genomic data (from SRA) by converting the raw data to FASTA, creating a local BLAST database from it, and then running BLAST with the gene as the query against the genome.
Here is what I came up with to retrieve the gene sequence:
1) Downloaded the S surface glycoprotein sequence in FASTA format. Saved as "sequence.fasta".
2) Downloaded the genome sequencing data. Saved as "SRR17592110".
3) Converted the SRR file to FASTA by typing:
**fastq-dump --fasta 70 SRR17592110**
4) Created a local BLAST database using:
**makeblastdb -in SRR17592110.fasta -dbtype nucl**
5) Ran BLAST by typing:
**blastn -query sequence.fasta -db SRR17592110.fasta -evalue 1e-6 -num_threads 4 -out blastn_out.txt**
6) Examined blastn_out.txt.
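For anyone who wants to reproduce this, steps 3 to 5 could be wrapped in one shell function, roughly as below. It is only a sketch: it assumes the SRA Toolkit (`fastq-dump`) and BLAST+ (`makeblastdb`, `blastn`) are on the PATH, the function name `blast_gene_vs_reads` is made up for illustration, and `-outfmt 6` (tab-separated output, one hit per line) is an addition that was not in my original command.

```shell
# Sketch of steps 3-5 as one function.
# blast_gene_vs_reads is a hypothetical helper name, not part of SRA Toolkit or BLAST+.
blast_gene_vs_reads() {
    local run="$1"    # SRA run, e.g. SRR17592110
    local query="$2"  # reference gene in FASTA, e.g. sequence.fasta
    local out="$3"    # file to write the hits to

    # Step 3: convert the run to FASTA (70 bases per line); writes ${run}.fasta
    fastq-dump --fasta 70 "$run" || return 1

    # Step 4: build a nucleotide BLAST database from the reads
    makeblastdb -in "${run}.fasta" -dbtype nucl || return 1

    # Step 5: search the reads with the gene as query.
    # -outfmt 6 is an assumption added here to get tabular output.
    blastn -query "$query" -db "${run}.fasta" -evalue 1e-6 \
        -num_threads 4 -outfmt 6 -out "$out"
}
```

It would be called as `blast_gene_vs_reads SRR17592110 sequence.fasta blastn_out.tsv`, where the output file name is again just an example. Stopping at the first failed step avoids searching a missing or half-built database.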
It showed a lot of hits, but they do not correspond to the expected positions of the gene in the genome, and there are so many of them that I am struggling to interpret the output and find the gene sequence in the given genome.
Could you please give me some advice? How is this usually done when you have a raw genome file and want to find a gene for which you have a reference sequence?
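A note on reading the output, offered as a sketch rather than a definitive answer: the database here is built from individual reads, so each hit names a read, and its subject coordinates are positions within that read, not positions in the genome. The query coordinates, on the other hand, say which part of the gene a read covers. Assuming the search is rerun with `-outfmt 6` (tab-separated columns, with query start and end in columns 7 and 8), a small awk helper can report, for each query, how many reads hit it and which stretch of the gene they span. The name `summarize_hits` is hypothetical.

```shell
# summarize_hits is a hypothetical helper; its input is assumed to be
# BLAST tabular output (-outfmt 6) with the default 12 columns.
summarize_hits() {
    awk -F '\t' '
        {
            # Columns 7 and 8 are qstart and qend: positions on the query gene.
            a = $7 + 0
            b = $8 + 0
            lo = (a < b) ? a : b
            hi = (a < b) ? b : a

            hits[$1]++
            if (!($1 in min) || lo < min[$1]) min[$1] = lo
            if (!($1 in max) || hi > max[$1]) max[$1] = hi
        }
        END {
            # One line per query: id, number of hits, lowest start, highest end.
            for (q in hits)
                printf "%s\t%d\t%d\t%d\n", q, hits[q], min[q], max[q]
        }
    ' "$1"
}
```

Running `summarize_hits blastn_out.tsv` (an example file name) prints one line per query sequence. If the lowest start is near 1 and the highest end is near the gene length, reads matching both ends of the gene are present, although this summary alone does not show whether the middle is covered without gaps.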