Question: How to retrieve the sample origin of contigs after assembly of pooled reads?
1
gravatar for lqdo2000
3.9 years ago by
lqdo200020
Australia
lqdo200020 wrote:

I have 2 different samples for which I sequenced the transcriptomes (Illumina): sample A and sample B.

I pooled all the single-end reads (up to 300 bp) from the 2 samples together and did a de novo assembly with CLC.

Now I am trying to retrieve from which sample the contigs (up to 3,000 bp) come from.

To do that I aligned (bowtie2):

- reads from sample A vs indexed contigs

- reads from sample B vs indexed contigs

For each samples independently, I filtered the respective SAM files with samtools (samtools view -F 4 <sample-specific_SAM.tab>) to obtain the IDs and then the sequences of the contigs for which reads have been mapped to.

As a control, I know that some contigs are specific to each samples (sample A has 50 unique contigs that cannot be found in sample B; sample B has 100 unique contigs that cannot be found in sample A).

However, when I looked at the sample-specific contigs, I can retrieve the 150 unique contigs (50 from A + 100 from B) in both samples.

I tried different bowtie2 parameters, but obtain nearly the same statistics every time (alignment rate; length interval; N50; N90) for both samples:

- using contigs with / without exact duplicate sequences

- end_to_end / local alignment

- allowing gaps or not (to disable gaps I use --gbar <length of the longest reads>)

- keeping / removing secondary alignments from the SAM files

I don't know what to try anymore. It would be very helpful if someone could share his/her expertise or suggest another strategy.

 

Thanks !

contigs alignment assembly reads • 1.6k views
ADD COMMENTlink modified 3.8 years ago by Biostar ♦♦ 20 • written 3.9 years ago by lqdo200020

Are you sure that all those contigs are 100% different from each other? If not, it is still possible for some of the reads from sample A to map onto unique contigs of sample B and vis versa due to small mismatches.

ADD REPLYlink written 3.9 years ago by Sam2.2k

Yes, I also tried by removing the exact duplicate sequences in the contig dataset. But some unique sample A and sample B contigs could be similar. So, what you mean is that the contigs (or at least the read/contig matching regions) are not different enough for segregating them precisely? What about discriminating them by alignment score or other metrics?

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by lqdo200020

Is there a way/metric to assess the similarity between 2 datasets (here transcriptomes)? 

ADD REPLYlink written 3.9 years ago by lqdo200020

So if your sequence are not completely different, there will still be situations where the reads can still align to the other contig. If you want to find the similarity between two sequences, you can use the edit distance to see how similar they are

ADD REPLYlink written 3.9 years ago by Sam2.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1747 users visited in the last hour