How to retrieve a gene sequence from SRA file?
2
0
Entering edit mode
3 days ago

Hello, everyone! I am new to bioinformatics with biology-and-genetics background. So I desperately need a good advice.

I am trying to retrieve a "spike protein" gene from SRA file. Please let me know what I am doing wrong. And if you have suggestions on how to correct the idea, I would be happy to hear it.

I thought about the following idea:

• if I have a reference sequence of a gene (spike protein nucleotide sequence, in fasta), I will be able to find its position in the raw genomic info (from SRA), if I convert raw data to fasta and then create local blast database, and, finally, perform blast between gene and genome

Here is what I came up with to retrieve a gene sequence: 1) Downloaded S surface glycoprotein in fasta. Saved as "sequence.fasta".

3) Converted SRR file to fasta by typing:

 **fastq-dump --fasta 70 SRR17592110**


4) Created a local database using:

 **makeblastdb -in SRR17592110.fasta -dbtype nucl**


5) Performed blast by typing:

**blastn -query sequence.fasta -db SRR17592110.fasta -evalue 1e-6 -num_threads 4 -out blastn_out.txt**


6) Examined blastn_out.txt.

It showed me a lot of hits, but they do not correspond expected positions of the gene in the genome, and there are so many hits I am struggling to understand how to interpret it and find the gene sequence in the given genome.

Could you please give me an advice? How is it usually done when you have a raw genome file and you want to find a desired gene, reference sequence of which you have?

protein fasta sra spike gene • 336 views
2
Entering edit mode
3 days ago
GenoMax 111k

You probably should do this using a next-gen sequence aligner rather than blast. BLAST is meant to find local similarities and that is not what you want with short read data. Once you get the BAM file you can extract reads in the area of interest using samtools view.

You could also try Magic-BLAST (LINK) since you have the sequence data.

0
Entering edit mode

Thank you very much for your suggestion! I will study the options and try them out.

Could you please explain how I can get BAM file out of SRA file? This is unclear for me, to be honest

2
Entering edit mode

When you use a next-gen sequence aligner like bwa (or even Magic-BLAST) you will be able to get a SAM file after aligning the SRA data to reference genome. SAM file can then be easily converted to BAM (binary representation).

0
Entering edit mode

0
Entering edit mode

I don't mean to be intrusive, but I think I need to keep track of the process for those who may be interested in such question.

I performed the Magic-BLAST with S protein DNA sequence against my SRR fasta blast database and I struggle with interpreting the results. I know maybe there are too many questions, but why are here 6 sequences in the output? The interesting thing is 4 of them are the same, as well as the other 2 are the same too.

The command I used:

magicblast -query sequence.fasta -db SRR17592110.fasta -out magicblast_out.txt


The output:

@PG ID:magicblast   PN:magicblast   CL:magicblast -query sequence.fasta -db SRR17592110.fasta -out magicblast_out.txt
NC_045512.2:21563-25384 16  909666  1   255 2743S156M923S   *   0   0   TTATGTGTAA...CAT    *   NH:i:6  AS:i:156    NM:i:0
NC_045512.2:21563-25384 272 909688  1   255 2743S156M923S   *   0   0   TTATGTGTAA...CAT    *   NH:i:6  AS:i:156    NM:i:0
NC_045512.2:21563-25384 256 1161343 147 255 2255S156M1411S  *   0   0   ATGTTTGTTT...TAA    *   NH:i:6  AS:i:156    NM:i:0
NC_045512.2:21563-25384 256 1548699 147 255 2255S156M1411S  *   0   0   ATGTTTGTTT...TAA    *   NH:i:6  AS:i:156    NM:i:0
NC_045512.2:21563-25384 256 1639638 147 255 2255S156M1411S  *   0   0   ATGTTTGTTT...TAA    *   NH:i:6  AS:i:156    NM:i:0
NC_045512.2:21563-25384 256 1746301 147 255 2255S156M1411S  *   0   0   ATGTTTGTTT...TAA    *   NH:i:6  AS:i:156    NM:i:0


Interesting thing is the last 4 are exactly the same as my "sequence.fasta", reference DNA sequence of S protein.

"..." are added for brevity, so as to not share the whole sequence: it is a bit too long.

How do I interpret such an output? Is there any way I can think there is a gene finally found in the SRR?

2
Entering edit mode
3 days ago

The most important thing is to get the terminology right.

Do you want to assembly the S protein? Do you want to align to an existing S protein?

In the latter case build your index against the S protein and align the SRR file to it. Currently you are doing it backwards.

bwa index sequence.fa
bwa mem sequence.fa SRR.fq | samtools sort > alignment.bam

0
Entering edit mode

Thank you very much!

I will definitely try this out.

I want to find a DNA sequence of S protein in the SRR file. I thought having a reference DNA sequence of S protein would help to do so. And then I would like to retrieve this S protein sequence from SRR file, and store it for further comparisons.

P. S. I've just noticed you wrote "SRR.fq". Do you mean SRR file should be converted to fastq format in order to perform the second command?

1
Entering edit mode

And then I would like to retrieve this S protein sequence from SRR file,

There is only DNA sequence in SRA data so you will not be able to find protein sequence. You will need to either create a consensus from your BAM and then translate that to find the protein sequence.

That said this is a survey sample if I recall so this may not be the best strategy to get all possible variants present.

0
Entering edit mode

What I mean is I want to retrieve a DNA sequence of S-protein gene from SRR. So I am about to work with not translated sequence, but with the DNA sequences all along the way.

I mean that I have S-protein DNA sequence from ncbi as reference sequence, and SRR file with all the sequences inside. What I need as result is to find a DNA sequence of S-protein in this SRR file I have.