Excluding specific sequences from a Trinity assembly
1
0
Entering edit mode
15 months ago
jen ▴ 10

Hi all,

I've performed a Trinity assembly using a code similar to this one:

nohup /usr/local/bin/trinityrnaseq-v2.13.2/Trinity --seqType fq --max_memory 50G --left /usr/local/bin/trinityrnaseq-v2.13.2/sample_data/test_Trinity_Assembly/reads.left.fq.gz  --right /usr/local/bin/trinityrnaseq-v2.13.2/sample_data/test_Trinity_Assembly/reads.right.fq.gz --CPU 6 &

The RNA sample is derived from blue crab tissue; however, it is enriched for RNA viruses with a special extraction protocol. In an ideal world, we would like for our resulting sample to contain only viral RNA. Nevertheless, about 25% of the sample still consists of blue crab (host) RNA. We figured this out by aligning the reads received from the sequencing company to the blue crab reference genome. Thus, I want to remove anything that resembles blue crab RNA sequences from the trinity_out_dir.Trinity.fasta file that Trinity produces (that contains all of the contigs within the assembly). Again, I have a blue crab reference genome, and I also have information about all of the reads that mapped to the blue crab reference genome using the Bowtie2 aligner (for example, I have a sorted.bam file for the alignment).

Given what I have, is there a way to remove most if not all of the RNA sequences that map back to the blue crab? I know that the Bowtie2 --un-conc argument allows you to write pairs that did not align to a reference, but I do not know how to incorporate this same concept in Trinity. Removing ALL host RNA or at least most from my assembly is the ultimate goal. Hope this makes sense- thanks in advance.

Relatively new to bioinformatics so sorry if I don't explain this too well

Assembly RNA-seq Trinity Metagenomics • 1.0k views
ADD COMMENT
0
Entering edit mode
15 months ago
shelkmike ★ 1.2k

To remove contamination I use the following procedure:
1) Download NCBI nt and NCBI taxonomy databases.
2) Align transcripts by BLASTN to NCBI nt.
3) Look at the top five matches.
4) If most of these top five matches belong to a wrong taxon, remove the transcript. For example, you may want to remove all transcripts where most of the top five matches belong to a taxon other than Viruses (which is a superkingdom in NCBI Taxonomy).

ADD COMMENT
0
Entering edit mode

Hello, thank you so much for your response. I am now running a blastn of the sequences to NCBI nt. Here is an example of my code:

blastn -query /nas2/blue_crab_chunglab/mingli_data/RNA/trinity_out_dir.Trinity.fasta
-db /nas2/blast_db/nt -outfmt ‘6 std qlen slen sframe stitle’ -evalue 1e-6 -max_target_seqs 1 -num_threads 24 -out FLRNAtrinity_nt.txt

In an effort to reduce the tremendous amount of data that we have, I've limited the search to the #1 match. In the past, another student in my lab aligned the sequences using blastx to the nr database so I have code for that as well. She did the top 5 matches:

blastx -query /data1/minglizhao/Trinity.tmp.fasta -db /nas2/blast_db/nr -outfmt '6 std qlen slen qframe stitle' -evalue 1e-6 -max_target_seqs 5 -num_threads 24 >~/trinitynr.txt

I have an excel sheet of the blastx results (blastn is still running) and when I sort based on percent identity, my best hits are to bacterial organisms or different crab species. Given that I don't want any host RNA in there or even any other microbes that are not viruses, I'll likely want to remove these transcripts. The thing is, I don't know how to automate the process. I know that there are plenty of non-viral sequences in my assembly, and I'd like to remove contamination by starting off with host RNA (and then bacteria, protozoans, and so on) but I do not know how to incorporate that into my code. I've never used NCBI taxonomy but can try to work with it now. Any example codes will be greatly appreciated, thank you so much.

ADD REPLY
0
Entering edit mode

Blastn, as well as blastx, has an option "sskingdom". It will output the superkingdom that the match belongs to. Then, you can open the produced table in Excel, sort it by superkingdoms and make a list of sequences that don't belong to Viruses. Then, you can remove these sequences from the FASTA file as suggested at How to delete seq_id + sequence from FASTA file matching specific seq_id .
Some additional notes:
1) I recommend to use "-max_hsps 1" in addition to "-max_target_seqs 1". Otherwise, if a transcript has several matches to different regions of a single sequence in NCBI nt, all these matches will be written separately.
2) You may accelerate the alignment if instead of taking all isoforms for each transcript you take only the major isoform. For example, you can find CDSs by Transdecoder and call the isoform with the longest CDS the major.
3) Considering top 5 matches instead of the best match may increase the accuracy of contamination removal. See my recent update to the first comment at How can I remove contaminants from an assembled genome? . Sometimes, though not often, I see sequences where the first match belongs to one taxon, but the other four matches to a very different taxon. This probably means incorrect taxonomic annotations of some sequences in NCBI nt.

ADD REPLY

Login before adding your answer.

Traffic: 1403 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6