Hello,
I have a series of reads coming from a WGS experiment; these reads are not human because they mapped on an additional chromosome containing virus reference genomes I added to the original human reference genome. I can tell wheter the reads belong to the same viral species because of their localization, but some of the reads do not overlap. I used clustal x to generate a consensus sequence as a proxy for a contig and then concatenated the contigs in a single cluster, as in this scheme:
REF GENE ==================================================================
READ 1 --------
READ 2 ---------
READ 3 -------------
READ 4 -------------
READ 5 --------
READ 6 --------
CONT 1 *******************
CONT 2 *************
CONT 3 ***********
CLUST ...................NNNNNN.............NNNNNNNNN...........
Where =
are the nucleotides of the viral gene of interest, -
the nucleotides from the sequencing, *
and .
the nucleotide obtained from the consensus, N
a spacer.
My questions is (assuming that this procedure is correct):
Is it better to run Blastn (or BlastX) on (a) the reads, (b) the contigs or (c) the cluster?
I would say (c) because it gives more data to BLAST to work upon, thus it should have less chances of failure, but I might be wrong...
Thank you.
How does blast (in post title and tags) comes into this?
because I ran blastn or blastx on the sequences I obtained. If the output confirmed the mapping I kept the reads.
Oops, I wrote it wrongly on the post, sorry. Amended
Are those N's representing single bases that are missing? If so it seems odd that you don't have coverage in those areas. You may just need to sequence more.
You are also assuming that those bases are intact in your sample. They could have been deleted during whatever experiment that was done here?
I placed the Ns to separate the contigs one another and try and tell blast that there was a break there. I could have used
-
but, for instance, when testing primers I was told to separate the forward-reverse with Ns then blast; I followed the same concept...Do those represent actual number of bases is what I am asking. How big a break do they represent? have you considered why they are absent in your data?
I used a standard 10 Ns as a separator
Since you have a reference (and those contigs) you must know how long the region is that is not present?
but then I have to go into thousands of reads to check that. What I have done was to sort the reads by starting point and to call the reads into separate contigs if the starting point of the read n+1 was greater than the end point of the read n. I then used 10 Ns to separate the contigs
If you have contigs as you depicted them in the diagram above then putting them into clustal or any other MSA program should show you the precise length of those gaps with respect to the reference. Unless I am not following the diagram in original post.
In the graph, as in reality, I used reads 1,2,3 to make the consensus (contig) 1, then read 4 for contig 2 and reads 5,6 for contig 3, then simply concatenated everything with the spacer 10xN into cluster 1. Shal I re-run making a further consensus with the reference sequence and the three contigs? Wouldn't clustal omega try and overlap all contigs? Does the number Ns really matter? Shall I use
-
instead ofN
to force opening a gap in that position?I am not sure why you would use anything else but the reference to map the reads (since you have that available), if your only aim to
confirm mapping of reads
.If you are seeing regions that are missing in your data (when compared to the available reference) then you need to consider the possibility that parts of the reference may have been deleted in this experiment OR you don't have enough sequence data OR those regions are hard to sequence and are under/not represented in the data.
Because, in mapping viral sequences, there is the risk of calling viral what is either human or bacterial, thus one needs to run multiple rounds of blast on top of BWA MEM to remove false positives. I have tried to replicate the pipeline that is reported for instance in Norman et al. Cell 2015:
But I am not sure if I understood the procedure correctly, anyway.
You are referring to
blastn
in first sentence but then toclustal
in second. What is the aim of this analysis?the aim is to confirm the mapping of the reads. Apologies for the term
clustal
I always get confused. I only usedclustal omega
to generate the consensuses. All the rest is aboutBLAST
.Ok. So we are talking about blast. What is the query? Original reads?
No, the contigs or clusters I generated.
In that case you should use contigs (because that is real data you have) to blast against original reference, if you don't want to use original reads. I would not trust the consensus you made since one can't be sure of the following possibilities :
The rationale of generating a consensus was exactly that of overcoming possible divergences against the reference sequence, given the high variability of viruses. Using hard clipped reads to make the consensus would be better? And are the 'N' introducing an error in the blast search? Thank you