Hi all,
Maybe this is too basic a question, but I am really confused. I have an R1.fastq file and an R2.fastq file from paired-end RNA-seq. As far as I know, the read order in the R1 and R2 files should be the same; that is, the two reads of a pair should have the same rank in R1 and R2 respectively. However, when I count the initial numbers of reads in R1 and R2, they are different: R1 has 1878678 reads, while R2 has 1800352 reads. Does this mean that the additional reads in R1 (1878678 - 1800352 = 78326 reads) are unpaired, and that all the other reads in R1 and R2 are paired and have the same rank?
What confuses me even more is that, after trimming R1 and R2 with Trimmomatic (PE mode), the trimmed PAIRED R1 and R2 files still have different numbers of reads (R1: 1397878; R2: 1402966). Does this mean that the additional reads in R2 this time (1402966 - 1397878 = 5088 reads) are unpaired and the others are paired with R1? But Trimmomatic assigned these reads to the PAIRED output files, and the unpaired reads should already have been written to the separate unpaired FASTQ output files.
Could anyone explain this? Thank you so much.
Step one: ask the person who gave you the FASTQ files how they were filtered. The FASTQ files that came off the instrument should all be paired and in order. You might have files where some reads were purged for quality reasons while their mates were left in. Or one of the files was truncated.
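A quick way to test the truncation idea is to check whether each file's line count is a multiple of 4. The sketch below assumes plain 4-line FASTQ records (no wrapped sequence lines) and uncompressed files; the function names count_reads and is_whole_fastq are made up for illustration.

```shell
# Hypothetical helpers, assuming uncompressed FASTQ with exactly 4 lines per read.

# Print the number of complete 4-line records in a FASTQ file.
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Succeed (exit 0) only if the line count is a multiple of 4.
# A remainder suggests the file was cut off partway through a record.
is_whole_fastq() {
    awk 'END { exit (NR % 4 != 0) }' "$1"
}
```

For example, `count_reads R1.fastq` and `count_reads R2.fastq` give the two read counts, and `is_whole_fastq R2.fastq || echo "R2 looks truncated"` flags a file that ends mid-record. A file that passes this check can still have reads missing, so it only rules out one failure mode.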
This happens often; I have run into the same problem myself. What I did was:
- Trim and filter the forward and reverse reads (I used NGSQCToolkit).
- Use fastq-pair to keep only those reads that have mates in both the forward and reverse FASTQ files.
- Check what percentage of the data you lost. If a significant amount of the data is retained, proceed to the next step.
If you lose a large amount of data, contact the data provider.
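The "how much did I lose" check in the last step could look roughly like this. It is only a sketch: it assumes uncompressed 4-line FASTQ records, and percent_retained is a hypothetical helper name, not part of fastq-pair or any other tool.

```shell
# Hypothetical helper: print the percentage of reads kept after filtering.
# $1 = FASTQ before filtering, $2 = FASTQ after filtering.
# Assumes uncompressed files with exactly 4 lines per read.
percent_retained() {
    before=$(awk 'END { print int(NR / 4) }' "$1")
    after=$(awk 'END { print int(NR / 4) }' "$2")
    # Guard against an empty input file to avoid dividing by zero.
    awk -v b="$before" -v a="$after" \
        'BEGIN { if (b == 0) print "NA"; else printf "%.1f\n", 100 * a / b }'
}
```

Call it with the original file first and the paired output second. What counts as an acceptable loss depends on the experiment, so the number is something to judge rather than a pass/fail threshold.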
Awesome. I think fastq-pair is a very useful tool.
You are "fixing" something without knowing how it got broken in the first place; at least, if you do know, you didn't tell us. You didn't tell us the source of the data, and you didn't follow up on some of our questions. Again, what is the output of:
head -n1 R1.fastq
head -n1 R2.fastq
tail -n4 R1.fastq
tail -n4 R2.fastq
For all we know, you may even be treating two files from different samples as a pair. This can happen; see for example this post. So before fixing anything, try to discover how things got broken in the first place, or you may end up with some really nonsensical results downstream.
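To go one step beyond eyeballing the first and last records, here is a rough sketch that walks both files in lockstep and counts the ranks at which the read IDs disagree. It assumes uncompressed 4-line FASTQ records and that mates share the same ID once a trailing /1 or /2, or anything after the first whitespace in the header, is dropped. The name count_id_mismatches is made up for this example.

```shell
# Hypothetical helper: count ranks at which the R1 and R2 read IDs disagree.
# $1 = R1 FASTQ, $2 = R2 FASTQ. Assumes exactly 4 lines per read.
# IDs are compared after dropping the leading @, everything after the first
# whitespace, and a trailing /1 or /2.
count_id_mismatches() {
    awk -v r2="$2" '
        function clean(h) {
            sub(/^@/, "", h); sub(/[ \t].*/, "", h); sub(/\/[12]$/, "", h)
            return h
        }
        NR % 4 == 1 {
            got = (getline h2 < r2)                      # header at the same rank in R2
            for (i = 0; i < 3; i++) getline skip < r2    # skip sequence, plus line, quality
            if (got <= 0 || clean($0) != clean(h2)) n++  # missing mate or different ID
        }
        END { print n + 0 }
    ' "$1"
}
```

A result of 0 from `count_id_mismatches R1.fastq R2.fastq` means every read in R1 has a same-named mate at the same rank in R2. It does not notice extra reads at the end of R2, so compare the read counts as well. If nearly every rank mismatches from the very first read, the two files may not belong together at all; if mismatches only start partway through, reads were probably dropped from one file.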