Question

data from SRA against blast databases method.

1

Entering edit mode

9.9 years ago

arronar ▴ 280

Hello.

I want to find a specific gene inside SRA files and I am wondering if i am using a correct way to align those sequences because its the first time that I am "playing" with something like that.

So, here is what I've done till now.

Download .sra files from NCBI and using fastq-dump program, convert .sra files into .fastq using:

fastq --split-spot

Then those .fastq files convert them into .fasta using:

awk 'NR % 4 == 1 || NR % 4 == 2' myfile.fastq | sed -e 's/@/>/' > myfile.fasta

Finally, convert .fasta into blast readable database using:

makeblastdb -in myfile.fasta -dbtype nucl -out myNewdb

Now having ready the Database i can run a blast using:

blastn -query query.fa -db myNewdb -task blastn -dust no -outfmt 7 -num_alignments 2 -num_descriptions 2 nucl, prot

So, is this approach right/correct? Or I have to try it somehow different?

Thanks in advance.

alignment next-gen blast • 4.9k views

ADD COMMENT • link updated 2.5 years ago by Ram 43k • written 9.9 years ago by arronar ▴ 280

score 0 · Answer 1 · 2014-05-28

That's not a great idea, you'll likely get a lot of false hits (not to mention that it seems a bit silly to blast a whole gene against a bunch of short reads). The normal process would be to extract the reads in fastq format from the SRA file and then map those to the genome containing the gene you're interested in (not that I can think of a reason to look for a single gene inside a HTS experiment, unless you're interested in variants only in it and there's no VCF file available).

Ram · Answer 2 · 2014-05-28

0

Entering edit mode

9.9 years ago

mikhail.shugay 3.5k

Have you tried the SRA-BLAST tool?

PS If you're going to do blast search directly, you should consider filtering redundant sequences from fastq and place their counts to header. Or use this family of tools: http://weizhong-lab.ucsd.edu/cd-hit/. For high over-sequencing this could speed-up a lot

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 9.9 years ago by mikhail.shugay 3.5k

score 0 · Answer 3 · 2014-05-28

In principal, your approach is right/correct, but you will need a powerfull filter to deal with the blast output and to make sense of it efficiently. How long are your query sequences? For short queries the size of a single gene I would convert the blast output into a multi sequence alignment and inspect that multi sequence alignment with a viewer like clustalx. Unfortunately I am not aware of a common tool for doing such a conversion. But you can get nearly the same result if you use bwa instead of blast and tablet as your multi sequence viewer.

Let's say, you have fully sequenced clinical bacterial isolate and you want to find out, if this isolate is carrying the antibiotic resistance gene oxa40. This gene is located on a mobil genomic element and only present in some isolates. You just download an arbitrary oxa40 sequence from Genbank (size is about 900 nt) and map all your reads to this sequence with bwa. Then you convert the output from SAM to BAM and view it with tablet. In tablet you will immediately recognize, if the oxa40 gene is present at all and also if your reads show any significant SNP with respect to the query sequence.

Downside is that bwa is much slower than blast, since blast uses an indexed database of your reads while bwa cycles through all your reads sequentially. But for bacterial genomes bwa is presumably fast enough to be useful. I have been wondering for some while if there exists a tool to convert blast output to SAM format. This should be doable in principle, but you have to cope with all the hairy details of the SAM format.