Question

querying blast database for rna seq reads

0

Entering edit mode

6.7 years ago

Varun Gupta ★ 1.3k

Hi,

I downloaded all the bacterial genomes from ncbi (ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt)

I concatenated all the fasta files into one file and created a database using makeblastdb. I have reads in fastq file which I believe has been contaminated with bacterial sequences. Since I have made the database, how can I use blastn to query this database and will this give me read counts for all the reads mapping to a particular specific bacterial species?

Thanks for the help.

Regards Varun

blast blastn RNA-Seq • 3.2k views

ADD COMMENT • link updated 6.7 years ago by GenoMax 141k • written 6.7 years ago by Varun Gupta ★ 1.3k

0

Entering edit mode

Are they paired-end fastq? Are you going to do any QC prior to looking for contamination?

You can align your reads to the reference fasta using bwa.

ADD REPLY • link 6.7 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

Hi, for that I have index my combined genome which is around 25G and has 12690 contigs. Do you think aligner would be fine indexing it and then mapping it?

ADD REPLY • link 6.7 years ago by Varun Gupta ★ 1.3k

score 0 · Answer 1 · 2017-08-16

0

Entering edit mode

6.7 years ago

GenoMax 141k

While feasible using Blast for this will be a fairly compute intensive task (and can take a significant amount of time). If you believe that your data is contaminated then I suggest that you use BBSplit from BBMap suite with known genome and collect reads that map to that genome leaving the contamination behind in a different pool.

There are more efficient ways of assigning bacterial species (e.g. kraken) if you want to continue on present path.

ADD COMMENT • link 6.7 years ago by GenoMax 141k

0

Entering edit mode

Do you think indexing and mapping with an aligner would be an option to look into, although I think the genome to be indexed is very large 25G??

ADD REPLY • link 6.7 years ago by Varun Gupta ★ 1.3k

0

Entering edit mode

Can you be specific about mapping what to what? BBMap should handle a genome that large AFAIK. You will need to have enough RAM available (for reference human genome generally needs about 30G).

ADD REPLY • link 6.7 years ago by GenoMax 141k

0

Entering edit mode

I created a reference fasta file for all the bacterial species and combined them into one fasta file. Taking this fasta file as genome, I want to map my paired-end RNA-Seq reads to this fasta file to see how many reads are coming from which bacterial species. Since I combined all the bacterial species, total size of the fasta file is 25G which is very big(bigger than human genome)

ADD REPLY • link 6.7 years ago by Varun Gupta ★ 1.3k

0

Entering edit mode

See my comment above for feasibility of using that fasta for analysis. Should be possible as long as you have enough RAM available.

Otherwise this is a question of priorities. Are you interested in moving forward with the reads that map to the genome of your interest (which I assume is not of bacterial origin) or is your interest the bacterial reads themselves. There is also the 16S microbial blast index that NCBI makes available. Presumably your data will have 16S reads. Blasting a sample of reads can provide you with some information about the bacterial contamination.

ADD REPLY • link 6.7 years ago by GenoMax 141k

0

Entering edit mode

This microbial blast index might give me some idea quickly. To use this microbial blast index, I should just use blastn with my fastq reads on this index? Thanks genomax for the help

ADD REPLY • link 6.7 years ago by Varun Gupta ★ 1.3k

0

Entering edit mode

You can convert your fastq reads to fasta using reformat.sh in=your_reads.fq out=your_reads.fa from BBMap suite and then on to blast. (use in1= in2= out1= out2= for a PE dataset).

ADD REPLY • link 6.7 years ago by GenoMax 141k