I have 2 different samples for which I sequenced the transcriptomes (Illumina): sample A and sample B.
I pooled all the single-end reads (up to 300 bp) from the 2 samples together and did a de novo assembly with CLC.
Now I am trying to determine which sample each contig (up to 3,000 bp) comes from.
To do that I aligned (bowtie2):
- reads from sample A vs indexed contigs
- reads from sample B vs indexed contigs
For each sample independently, I filtered the corresponding SAM file with samtools (samtools view -F 4 <sample-specific_SAM.tab>) to obtain the IDs, and then the sequences, of the contigs to which reads mapped.
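To make the filtering step concrete, here is a minimal Python sketch of what it amounts to. It is not the exact pipeline I ran: the function name `mapped_read_counts` and the toy read and contig names are made up for illustration. It skips records with the unmapped flag bit (0x4, the same bit that `-F 4` excludes) and counts the remaining alignments per contig. Note that a plain list of contig IDs treats a contig hit by a single read the same as one hit by thousands, so the sketch keeps the counts.

```python
from collections import Counter

UNMAPPED = 0x4     # SAM flag bit: the read is unmapped
SECONDARY = 0x100  # SAM flag bit: secondary alignment


def mapped_read_counts(sam_lines, exclude_flags=UNMAPPED):
    """Count alignments per contig, skipping records that carry any excluded flag bit."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # header lines carry no alignments
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:       # not a usable alignment record
            continue
        flag = int(fields[1])     # column 2: bitwise FLAG
        contig = fields[2]        # column 3: reference (contig) name
        if flag & exclude_flags or contig == "*":
            continue
        counts[contig] += 1
    return counts


# Toy SAM records (hypothetical names, shortened sequences).
example = [
    "@SQ\tSN:contig_1\tLN:3000",
    "read1\t0\tcontig_1\t10\t42\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "read3\t16\tcontig_2\t5\t1\t4M\t*\t0\t0\tACGT\tIIII",
    "read4\t256\tcontig_1\t90\t1\t4M\t*\t0\t0\tACGT\tIIII",
]

counts_all = mapped_read_counts(example)
counts_primary = mapped_read_counts(example, exclude_flags=UNMAPPED | SECONDARY)
```

Passing `UNMAPPED | SECONDARY` corresponds to also dropping secondary alignments, which is one of the variants listed further down.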
As a control, I know that some contigs are specific to each sample: sample A has 50 unique contigs that cannot be found in sample B, and sample B has 100 unique contigs that cannot be found in sample A.
However, when I look at the contigs recovered for each sample, I find all 150 unique contigs (50 from A + 100 from B) in both samples.
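The control check can be sketched with plain set operations. This is only an illustration under assumed names: `control_leakage` and the contig IDs below are hypothetical, and the real inputs would be the per-sample ID lists from the filtering step plus the two lists of known sample-specific contigs.

```python
def control_leakage(ids_a, ids_b, control_a, control_b):
    """Report control contigs that also receive reads from the 'wrong' sample.

    ids_a / ids_b: contig IDs with at least one mapped read from sample A / B.
    control_a / control_b: contigs known to be specific to sample A / B.
    """
    ids_a, ids_b = set(ids_a), set(ids_b)
    return {
        # A-specific controls that nevertheless have reads from sample B
        "a_controls_hit_by_b": set(control_a) & ids_b,
        # B-specific controls that nevertheless have reads from sample A
        "b_controls_hit_by_a": set(control_b) & ids_a,
    }


# Toy example with hypothetical contig IDs.
leak = control_leakage(
    ids_a={"c1", "c2", "c3"},
    ids_b={"c1", "c2", "c4"},
    control_a={"c2"},
    control_b={"c4"},
)
```

In my data both sets come back full: every one of the 50 A-specific and 100 B-specific contigs is hit by reads from the other sample.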
I tried different bowtie2 parameters, but I obtain nearly the same statistics every time (alignment rate; length interval; N50; N90) for both samples:
- using contigs with / without exact duplicate sequences
- end-to-end / local alignment
- allowing gaps or not (to disable gaps I use --gbar <length of the longest reads>)
- keeping / removing secondary alignments from the SAM files
I don't know what else to try. It would be very helpful if someone could share their expertise or suggest another strategy.