I have a few background questions about paired-end reads, and some related practical questions about handling paired-end reads with the StrainPhlAn pipeline. The data I'm using are shotgun metagenomic sequencing data from the Human Microbiome Project.
My understanding of paired-end reads is that, when sequencing, we get one read starting from one end of the fragment and one starting from the other end, the purpose being to get better coverage of the sequence.
Shouldn't there be approximately the same number of reads in both files? For most samples, I see a huge difference in file size. Typically, sample.1.fastq contains 8 million reads while sample.2.fastq has only 2 million. Why such a big difference?
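For anyone who wants to check the counts on their own files, here is a minimal sketch of how the reads in a FASTQ file can be counted. It assumes standard four-line FASTQ records (no wrapped sequence lines); the function name `count_fastq_reads` is just illustrative and is not part of any pipeline.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a FASTQ file, assuming exactly four lines per record."""
    # Assumption: files ending in .gz are gzip-compressed, everything else is plain text.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for line in handle if line.strip())
    if n_lines % 4 != 0:
        # A count that is not a multiple of four suggests a truncated or wrapped file.
        raise ValueError(f"{path}: {n_lines} non-empty lines is not a multiple of 4")
    return n_lines // 4
```

Calling `count_fastq_reads("sample.1.fastq")` and `count_fastq_reads("sample.2.fastq")` gives the two numbers to compare.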
I understand that paired-end reads are helpful for genome assembly. In my case, however, I want to use this data to identify species and strains in those metagenomic samples, and I do not have an intuition for how important it is to have the second read (are there that many errors?).
The reason for this second question is that the pipeline I am using does not handle paired-end reads. Here is what the help says: "MetaPhlAn 2 can also natively handle paired-end metagenomes (but does not use the paired-end information)"
- What does "it handles paired-end reads" mean if it does not use the paired-end information?
- If I run my command with --input_file sample.1.fastq,sample.2.fastq, it fails with an error because of orphan reads, but if I remove all of those orphan reads, I lose a lot of data (hence my question above).
- If I run the command specifying only --input_file sample.1.fastq, I'm not sure how reliable it is to use a single read and completely ignore the second one.
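To make "orphan reads" concrete, here is a rough sketch of how the number of properly paired and orphan reads in the two files can be tallied. It assumes four-line FASTQ records and that mates share a read name, optionally followed by a `/1` or `/2` suffix; that naming scheme is an assumption and may not match every dataset. The helper names are hypothetical, and all read IDs are held in memory, which may be heavy for very large files.

```python
def read_id(header):
    """Reduce a FASTQ header line to a mate-independent read name."""
    name = header[1:].split()[0]  # drop the leading '@' and any trailing comment
    if name.endswith(("/1", "/2")):  # assumed mate suffix convention
        name = name[:-2]
    return name


def fastq_ids(path):
    """Collect the normalized read names from every header line of a FASTQ file."""
    ids = set()
    with open(path) as handle:
        for i, line in enumerate(handle):
            # Header lines are lines 0, 4, 8, ... under the four-line assumption.
            if i % 4 == 0 and line.strip():
                ids.add(read_id(line.strip()))
    return ids


def pairing_summary(path1, path2):
    """Count reads present in both files and reads present in only one of them."""
    ids1 = fastq_ids(path1)
    ids2 = fastq_ids(path2)
    paired = ids1 & ids2
    return {
        "paired": len(paired),
        "orphans_in_1": len(ids1 - paired),
        "orphans_in_2": len(ids2 - paired),
    }
```

Running `pairing_summary("sample.1.fastq", "sample.2.fastq")` shows how many reads would be discarded if only complete pairs were kept.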