How to retrieve the sample origin of contigs after assembly of pooled reads?
0
1
Entering edit mode
9.1 years ago
lqdo2000 ▴ 20

I have 2 different samples for which I sequenced the transcriptomes (Illumina): sample A and sample B.

I pooled all the single-end reads (up to 300 bp) from the 2 samples together and did a de novo assembly with CLC.

Now I am trying to retrieve from which sample the contigs (up to 3,000 bp) come from.

To do that I aligned (bowtie2):

  • reads from sample A vs indexed contigs
  • reads from sample B vs indexed contigs

For each samples independently, I filtered the respective SAM files with samtools (samtools view -F 4 <sample-specific_SAM.tab>) to obtain the IDs and then the sequences of the contigs for which reads have been mapped to.

As a control, I know that some contigs are specific to each samples (sample A has 50 unique contigs that cannot be found in sample B; sample B has 100 unique contigs that cannot be found in sample A).

However, when I looked at the sample-specific contigs, I can retrieve the 150 unique contigs (50 from A + 100 from B) in both samples.

I tried different bowtie2 parameters, but obtain nearly the same statistics every time (alignment rate; length interval; N50; N90) for both samples:

  • using contigs with / without exact duplicate sequences
  • end_to_end / local alignment
  • allowing gaps or not (to disable gaps I use --gbar <length of the longest reads>)
  • keeping / removing secondary alignments from the SAM files

I don't know what to try anymore. It would be very helpful if someone could share his/her expertise or suggest another strategy.

Thanks!

Assembly reads contigs alignment • 2.8k views
ADD COMMENT
0
Entering edit mode

Are you sure that all those contigs are 100% different from each other? If not, it is still possible for some of the reads from sample A to map onto unique contigs of sample B and vis versa due to small mismatches.

ADD REPLY
0
Entering edit mode

Yes, I also tried by removing the exact duplicate sequences in the contig dataset. But some unique sample A and sample B contigs could be similar. So, what you mean is that the contigs (or at least the read/contig matching regions) are not different enough for segregating them precisely? What about discriminating them by alignment score or other metrics?

ADD REPLY
0
Entering edit mode

Is there a way/metric to assess the similarity between 2 datasets (here transcriptomes)?

ADD REPLY
0
Entering edit mode

So if your sequence are not completely different, there will still be situations where the reads can still align to the other contig. If you want to find the similarity between two sequences, you can use the edit distance to see how similar they are

ADD REPLY

Login before adding your answer.

Traffic: 1301 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6