Question

What are paired and unpaired reads in Trimmomatic, again?

1

5.5 years ago

Decen ▴ 20

Can someone give a definitive answer: what are the paired and unpaired reads produced by Trimmomatic? What kinds of sequences end up in the unpaired reads? Please see the linked question and my comment there. Many, many thanks! Here is my comment:

"Hi Genomax, I see your reply for this question, but I still do not understand what the unpaired reads or the unpaired.fastq file are. Based on your answer, my understanding is that for a Paired End sequencing, generally, the types of sequences in the R1 file is equal to that in the R2 file. Here, we do not care about the number of each sequence. For instance, if one sequence cannot pass the QC (set in trimmomatic)in R1 file, but this sequence pass the QC in R2 file, however, this sequence in both R1 and R2 file will be classified into unpaired reads/.fastq file, which means all the copies in R1 and R2 files also will be classified into the unpaired reads. Another understanding is that one sequence exists in both the R1 and R2 files, but one copy, in either R1 or R2, cannot pass the QC, and this copy will be classified into the unpaired fastq file/reads. (I think the second view might be right.) If so, some people also mentioned using the unpaired reads for alignment/mapping with BWA; should these under-QC sequences be processed with Trimmomatic again using stricter settings? Are they useful? Finally, what kinds of sequences are in the unpaired.fastq file? Please give some examples. My email is wangqk198738@163.com. Thanks a lot!"

next-gen • 5.5k views

ADD COMMENT • link 5.5 years ago by Decen ▴ 20

0

Thanks a lot. I am a newcomer to bioinformatics, and I think I made a mistake. The detected signal when sequenced is from a cluster but not from each copy in this cluster. By "types of sequences" I mean sequences with different content (A/T/C/G). Since PCR amplification generates many copies derived from the same parental copy, the R1/R2 files must contain sequences with identical content (e.g. 100 sequences reading "ATCTCTGGA", which is one type with a count of 100).

As I understand your explanation, during PE sequencing, if one cluster/read was sequenced and passed the QC in the R1 file, but its mate in the R2 file did not pass the QC, then both reads will be classified into the unpaired.fastq file. You also said "If it does not do that you are left with an unpaired read in R1 file." which might mean the passed sequence might be left in R1 file, the corresponding sequence in R2 will be discarded after trimmed.

I also noticed some researchers mentioned they aligned the unpaired reads, so the unpaired reads should be useful. If we think of the unpaired reads as SE sequencing reads and trim them by trimmomatic, is that right? And is it necessary to trim the unpaired reads with stricter parameters?

I am doing some research on immune repertoires.

Many thanks for your kind help! If possible, please comment on the above.

ADD REPLY • link 5.5 years ago by Decen ▴ 20

0

the detected signal when sequenced is from a cluster but not from each copy in this cluster.

Correct. Clusters that may be generating mixed signals are removed by Illumina's software during pre-processing of data. A pure cluster represents clonal amplification of one DNA fragment and thus has an identical sequence for all DNA strands.

which might mean the passed sequence might be left in R1 file, the corresponding sequence in R2 will be discarded after trimmed.

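That will only happen if you trimmed the R1/R2 read files independently. When both files are trimmed together (which is how it should be done) with trimmomatic/bbduk.sh, this will not happen: the paired output files stay in sync (Trimmomatic writes a read that survives without its mate to a separate unpaired file).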
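To make the bookkeeping concrete, here is a minimal Python sketch of how a PE-aware trimmer could sort mate pairs into outputs. It is not Trimmomatic's actual code: `split_pairs` and the `passes` filter are hypothetical names, and the simple length cutoff only stands in for real quality trimming.

```python
from typing import Callable, List, Tuple

Read = Tuple[str, str]  # (cluster/read ID, sequence); a simplified stand-in for a FASTQ record


def split_pairs(r1: List[Read], r2: List[Read],
                passes: Callable[[Read], bool]):
    """Sort mate pairs into four outputs, keeping the paired outputs in sync.

    r1[i] and r2[i] are assumed to be mates from the same cluster.
    """
    if len(r1) != len(r2):
        raise ValueError("R1 and R2 must contain the same number of reads")
    paired_1, unpaired_1, paired_2, unpaired_2 = [], [], [], []
    for fwd, rev in zip(r1, r2):
        fwd_ok, rev_ok = passes(fwd), passes(rev)
        if fwd_ok and rev_ok:
            # Both mates survive: they stay together in the paired outputs.
            paired_1.append(fwd)
            paired_2.append(rev)
        elif fwd_ok:
            # Only the forward mate survives: it becomes an unpaired read.
            unpaired_1.append(fwd)
        elif rev_ok:
            # Only the reverse mate survives.
            unpaired_2.append(rev)
        # If neither mate survives, the cluster is dropped entirely.
    return paired_1, unpaired_1, paired_2, unpaired_2


# Toy data: four clusters, filtered by a made-up minimum length of 5.
r1 = [("c1", "ACGTACGT"), ("c2", "ACG"), ("c3", "TTGACCA"), ("c4", "A")]
r2 = [("c1", "TTGCAAGC"), ("c2", "GGCATCGA"), ("c3", "TG"), ("c4", "C")]
p1, u1, p2, u2 = split_pairs(r1, r2, lambda read: len(read[1]) >= 5)
print("paired:", [r[0] for r in p1], "unpaired R1:", [r[0] for r in u1],
      "unpaired R2:", [r[0] for r in u2])
```

In this toy run only cluster c1 stays paired; c3 survives only in R1 and c2 only in R2, so each ends up as a single unpaired read, and the failed mate is discarded rather than copied anywhere.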

I also noticed some researchers mentioned they aligned the unpaired reads, so the unpaired reads should be useful. if we think of the unpaired reads as SE sequencing reads and trim them by trimmomatic,

You could do that. In general there should be few unpaired reads (unless you have a particularly bad dataset or there was a problem with the run itself). Only a select few aligners are able to accept paired and unpaired reads in the same alignment run.
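One way to check whether the unpaired reads are indeed few is to count the records in the trimmer's output files. The following is a rough Python sketch, assuming plain four-line FASTQ records; `count_fastq_records` and `unpaired_fraction` are hypothetical helper names, not part of any tool.

```python
import gzip
from pathlib import Path


def count_fastq_records(path):
    """Count reads in a FASTQ file, assuming exactly 4 lines per record."""
    path = Path(path)
    # Assumption: gzip-compressed files are recognised by a .gz suffix.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4


def unpaired_fraction(paired_1, unpaired_1, unpaired_2):
    """Fraction of surviving clusters that lost one mate during trimming."""
    n_paired = count_fastq_records(paired_1)  # one record per surviving pair
    n_unpaired = count_fastq_records(unpaired_1) + count_fastq_records(unpaired_2)
    total = n_paired + n_unpaired
    return n_unpaired / total if total else 0.0
```

Calling `unpaired_fraction` on the forward paired file and the two unpaired files gives a quick number to look at before deciding whether the unpaired reads are worth aligning separately.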

ADD REPLY • link 5.5 years ago by GenoMax 141k

0

Many thanks to GenoMax. Now I think I understand what the paired and unpaired reads/fastq files from Trimmomatic are.

ADD REPLY • link 5.5 years ago by Decen ▴ 20

2

Hi Decen, two comments:

when addressing someone's comment or answer, do not add an answer; rather, add a comment (ADD REPLY or ADD COMMENT buttons, depending on the situation).
when an answer is helpful and / or solves your problem, accept it by clicking on the green check-mark.

A final comment: this is an English-speaking forum. Although I understand your desire to help, I think long posts in Chinese are out of place. You should consider starting a blog or tweeting; I believe it would be both more helpful for Chinese speakers and less disruptive here on the forum.

ADD REPLY • link 5.5 years ago by h.mon 35k

0

Hi h.mon,

Thanks for your suggestions. I have edited my post!

ADD REPLY • link 5.5 years ago by Decen ▴ 20

score 3 · Accepted Answer · 2018-11-04

my understanding is that for a Paired End sequencing, generally, the types of sequences in the R1 file is equal to that in the R2 file.

I don't know what you mean by that. There should only be one type/format of sequences, fastq. The number of reads in the R1/R2 files will be identical when the data come off the sequencer.

Here, we do not care about the number of each sequence.

Some may not (if they are willing to accept files that are out of sync in terms of the order of R1/R2 reads). Aligners will produce odd results if you use such files.
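A quick way to see whether two files are still in sync is to compare read IDs record by record. This is a minimal Python sketch, assuming uncompressed four-line FASTQ and IDs that match between mates once any old-style `/1` or `/2` suffix is removed; `read_ids` and `first_mismatch` are hypothetical helper names.

```python
from itertools import zip_longest


def read_ids(path):
    """Yield the read ID of each FASTQ record (4 lines per record assumed)."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 0:  # header line of a record
                # Drop the leading '@' and anything after the first whitespace.
                fields = line[1:].split()
                name = fields[0] if fields else ""
                # Drop an old-style mate suffix so that mates compare equal.
                if name.endswith(("/1", "/2")):
                    name = name[:-2]
                yield name


def first_mismatch(r1_path, r2_path):
    """Return the 1-based index of the first out-of-sync record, or None."""
    pairs = zip_longest(read_ids(r1_path), read_ids(r2_path))
    for n, (id1, id2) in enumerate(pairs, start=1):
        if id1 != id2:  # also catches one file being shorter than the other
            return n
    return None
```

If `first_mismatch` returns a number rather than `None`, the files were probably filtered independently and should be re-paired or re-trimmed together before alignment.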

if one sequence cannot pass the QC (set in trimmomatic)in R1 file, but this sequence pass the QC in R2 file, however, this sequence in both R1 and R2 file will be classified into unpaired reads/.fastq file, which means all the copies in R1 and R2 files also will be classified into the unpaired reads.

That is the reason you should trim reads together and use a trimming program that is PE aware. There should be only one copy of the sequence for each cluster coordinate.