I am aligning my reads against a reference databae using soap2 using the following command line:
./soap2.21release/soap -a R1.fa -b R2.fa -D Database.fa.index -o PE_output -2 SE_output -u UM_output -m 100 -x 2000 -p 30 -v 15 -M 4 -r 1
Reagrding the -2 parameter the manual says "Output file contains mapped but unpaired reads when do PE alignment"
However, in my SE_output, I can see both paired end reads as well as single end reads.
Why these paired reads are coming in the SE output. Are these read pairs the second best hits and the best hit is included in the PE_output ??
Few lines from Se output (I have trimmed the read length and quality values to make it easy to understand, only the first column is important here)
NS500187:41:HMJGTAFXX:1:11101:2599:1091/1 GCATCTATGCGCCTAC hhhhhhhhhhhhhhhhh 1 a 98 + Gene_ID_A7_57347 278 2 C->67T40 C->40T40 98M 40C26C30
NS500187:41:HMJGTAFXX:1:11101:2599:1091/2 GCATCTATGCGCCTAC hhhhhhhhhhhhhhhhh 1 b 98 - Gene_ID_A7_57347 278 2 C->67T40 C->40T40 98M 40C26C30
NS500187:41:HMJGTAFXX:1:11101:6204:1109/2 GCTGGTGACGGTAGG hhhhhhhhhhhhhhhhh 1 b 150 - Gene_ID__935968 90 1 T->24C40 150M 24T125
NS500187:41:HMJGTAFXX:1:11101:9368:1117/1 TCTCTCGGCCGTACCC hhhhhhhhhhhhhhhhh 1 a 151 + Gene_ID_A4_85082 485 1 T->148C40 151M 148T2
NS500187:41:HMJGTAFXX:1:11101:13569:1120/2 GCCGTTACATGATGGG hhhhhhhhhhhhhhhhh 1 b 151 + MH0419_GL0057511 723 2 A->117G40 T->48C40 151M 48T68A33
NS500187:41:HMJGTAFXX:1:11101:24413:1122/1 CGATGTGATGAGAAAA hhhhhhhhhhhhhhhhh 1 a 151 - MH0366_GL0138790 243 2 A->129G40 T->31G40 151M 31T97A21
Thanks in Advance,
Regards
Ankit
Where they match, as per the first 2 reads that you've pasted, are they always reads over the exact same positions? The first two look like identical reads but just on opposite strands. Perhaps star erroneously interprets these as non-matching mate-pairs.
Hello Kevin,
Actually you are right, after you mentioned that the two paired reads looks the same, I went on to check all such cases. I observed that almost all of paired end reads in the SE_output are identical but just on opposite strands. One reason can be that the minimum insert size that I have taken is 100 and my read length is 150, so probably the pairs with insert size of around 150 will result in exactly identical pairs. Even after this, I think these pairs should have gone in the PE_output rather than SE_output. Do you have any suggestion on how could I resolve this ?
Additionally, there some cases in which the reads pairs are not identical, for e.g.,
NS500187:34:HJJY5AFXX:1:11101:24396:1417/1 TAGAAAAACTGTTAAATGAAGGGCACAAAAGTGCTCAGATAATCATTGACTTTGCAGAAGATCCAAATCAATTTTTATCTACTGTGCAAATAGGAATTACACTTATCGGTATCTTAACTGGTTCTTACGGTGGAGCTACTTTAGCTTAAC hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh 1 a 150 + Gene_ID_UA_864050 107 2 G->146T40 G->122T40 150M 122G23G3
NS500187:34:HJJY5AFXX:1:11101:24396:1417/2 TATCTAATGTGAAAATAGGAATTACACTTATCGGTATCTTAACTGGTGCTTACGGTGGAGCTACTTTAGCTGAACCATTATCAGAAGTAATTGGTCCATATGTACCATATAGCGGAACAATAAGTATTTTAGTAGTTGTTATCATTACAAC hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh 1 b 151 - V1.UC30-0_GL0125338 182 2 C->6A40 C->11A40 151M 6C4C139
These two paired reads are not identical, and are mapping on different reference genes. Again, I fail to understand the reason behind such cases in SE_Output.