best blast strategy: read vs cluster?
0
0
Entering edit mode
3.0 years ago

Hello,

I have a series of reads coming from a WGS experiment; these reads are not human because they mapped on an additional chromosome containing virus reference genomes I added to the original human reference genome. I can tell wheter the reads belong to the same viral species because of their localization, but some of the reads do not overlap. I used clustal x to generate a consensus sequence as a proxy for a contig and then concatenated the contigs in a single cluster, as in this scheme:

REF GENE ==================================================================
CONT 1          *******************
CONT 2                                   *************
CONT 3                                                         ***********
CLUST           ...................NNNNNN.............NNNNNNNNN...........


Where = are the nucleotides of the viral gene of interest, - the nucleotides from the sequencing, * and . the nucleotide obtained from the consensus, N a spacer.

My questions is (assuming that this procedure is correct):

Is it better to run Blastn (or BlastX) on (a) the reads, (b) the contigs or (c) the cluster?

I would say (c) because it gives more data to BLAST to work upon, thus it should have less chances of failure, but I might be wrong...

Thank you.

alignment blast search strategy • 934 views
0
Entering edit mode

How does blast (in post title and tags) comes into this?

0
Entering edit mode

because I ran blastn or blastx on the sequences I obtained. If the output confirmed the mapping I kept the reads.

0
Entering edit mode

Oops, I wrote it wrongly on the post, sorry. Amended

0
Entering edit mode

Are those N's representing single bases that are missing? If so it seems odd that you don't have coverage in those areas. You may just need to sequence more.

You are also assuming that those bases are intact in your sample. They could have been deleted during whatever experiment that was done here?

0
Entering edit mode

I placed the Ns to separate the contigs one another and try and tell blast that there was a break there. I could have used - but, for instance, when testing primers I was told to separate the forward-reverse with Ns then blast; I followed the same concept...

0
Entering edit mode

Do those represent actual number of bases is what I am asking. How big a break do they represent? have you considered why they are absent in your data?

0
Entering edit mode

I used a standard 10 Ns as a separator

0
Entering edit mode

Since you have a reference (and those contigs) you must know how long the region is that is not present?

0
Entering edit mode

but then I have to go into thousands of reads to check that. What I have done was to sort the reads by starting point and to call the reads into separate contigs if the starting point of the read n+1 was greater than the end point of the read n. I then used 10 Ns to separate the contigs

0
Entering edit mode

If you have contigs as you depicted them in the diagram above then putting them into clustal or any other MSA program should show you the precise length of those gaps with respect to the reference. Unless I am not following the diagram in original post.

0
Entering edit mode

In the graph, as in reality, I used reads 1,2,3 to make the consensus (contig) 1, then read 4 for contig 2 and reads 5,6 for contig 3, then simply concatenated everything with the spacer 10xN into cluster 1. Shal I re-run making a further consensus with the reference sequence and the three contigs? Wouldn't clustal omega try and overlap all contigs? Does the number Ns really matter? Shall I use - instead of N to force opening a gap in that position?

0
Entering edit mode

I am not sure why you would use anything else but the reference to map the reads (since you have that available), if your only aim to confirm mapping of reads.

If you are seeing regions that are missing in your data (when compared to the available reference) then you need to consider the possibility that parts of the reference may have been deleted in this experiment OR you don't have enough sequence data OR those regions are hard to sequence and are under/not represented in the data.

0
Entering edit mode

Because, in mapping viral sequences, there is the risk of calling viral what is either human or bacterial, thus one needs to run multiple rounds of blast on top of BWA MEM to remove false positives. I have tried to replicate the pipeline that is reported for instance in Norman et al. Cell 2015:

Sequence Processing and Analysis Adapters and low-quality bases were trimmed, and reads were clustered at 95% identity. Unique sequences were queried against a customized viral data-base using BLASTx.

But I am not sure if I understood the procedure correctly, anyway.

0
Entering edit mode

Is it better to run Blastn (or BlastX) on (a) the reads, (b) the contigs or (c) the cluster?

I would say (c) because it gives more data to clustal to work upon, thus it should have less chances of failure, but I might be wrong..

You are referring to blastn in first sentence but then to clustal in second. What is the aim of this analysis?

0
Entering edit mode

the aim is to confirm the mapping of the reads. Apologies for the term clustal I always get confused. I only used clustal omega to generate the consensuses. All the rest is about BLAST.

0
Entering edit mode

Ok. So we are talking about blast. What is the query? Original reads?

0
Entering edit mode

No, the contigs or clusters I generated.

0
Entering edit mode

In that case you should use contigs (because that is real data you have) to blast against original reference, if you don't want to use original reads. I would not trust the consensus you made since one can't be sure of the following possibilities :

parts of the reference may have been deleted in this experiment OR you don't have enough sequence data OR those regions are hard to sequence and are under/not represented in the data.

0
Entering edit mode

The rationale of generating a consensus was exactly that of overcoming possible divergences against the reference sequence, given the high variability of viruses. Using hard clipped reads to make the consensus would be better? And are the 'N' introducing an error in the blast search? Thank you