Question: Trim Paired-end Fastq Files
yuabrahamliu wrote, 13 months ago:

Hi all,

Maybe this is too basic a question, but I am really confused. I have an R1.fastq file and an R2.fastq file from paired-end RNA-seq. As far as I know, the read order in the R1 and R2 files should be the same, i.e. the two reads of a pair should have the same rank in R1 and R2 respectively. However, when I count the initial read numbers in the R1 and R2 files, they are different: R1 has 1878678 reads, while R2 has 1800352 reads. This confuses me, because if so, does it mean that the additional reads in R1 compared to R2 (1878678 - 1800352 = 78326 reads) are unpaired, and all the other reads in R1 and R2 are paired and have the same rank?

What confuses me even more is that, after trimming R1 and R2 with Trimmomatic (PE mode), the trimmed, PAIRED R1 and R2 files still have different read numbers (R1: 1397878, R2: 1402966). So does this mean that the additional reads in R2 this time (1402966 - 1397878 = 5088 reads) are not paired, and the others are paired with R1? But Trimmomatic assigns these reads to the PAIRED result files, and the unpaired reads should already have been moved to the separate unpaired FASTQ result files.

Could anyone give some answers? Thank you so much.

modified 13 months ago by h.mon • written 13 months ago by yuabrahamliu

Be careful when downloading FASTQ files. Always prefer fastq-dump or prefetch. Do not download R1 and R2 separately via direct download. Also contact the data provider.

written 13 months ago by k.kathirvel93200

Where did you obtain the files? Did you download them from ENA / SRA? Did a sequencing facility sequence your samples? Were you given these files by a collaborator?

Did you run FastQC on them? Seems like they may have been trimmed already. Some quick and dirty sanity checks - what is the output of:

head -n1 R1.fastq
head -n1 R2.fastq
tail -n4 R1.fastq
tail -n4 R2.fastq
written 13 months ago by h.mon
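
The head/tail checks above mainly show whether a file starts with a proper header and ends on a complete record. A minimal sketch of the same idea applied to the whole file follows. The function name fastq_sanity is hypothetical, and the sketch assumes uncompressed FASTQ with exactly four lines per record (no wrapped sequence lines).

```shell
# Hypothetical helper: summarise the basic structure of one FASTQ file.
# Assumption: uncompressed input, exactly 4 lines per record.
fastq_sanity() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { bad_header++ }  # header lines start with "@"
        NR % 4 == 3 && substr($0, 1, 1) != "+" { bad_sep++ }     # separator lines start with "+"
        NR % 4 == 2 { seq_len = length($0) }                     # remember sequence length
        NR % 4 == 0 && length($0) != seq_len { bad_len++ }       # quality must match it
        END {
            printf "lines=%d complete=%s bad_headers=%d bad_separators=%d length_mismatches=%d\n",
                NR, (NR % 4 == 0 ? "yes" : "no"), bad_header, bad_sep, bad_len
        }
    ' "$1"
}
```

Running fastq_sanity R1.fastq and fastq_sanity R2.fastq should print one summary line per file. A "complete=no" or any non-zero counter would suggest truncation or corruption rather than a pairing problem.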

This may sound stupid, but can you tell us how you counted the reads? If you simply use grep with the "@" symbol, it may count not only the sequence headers but also quality lines (the fourth line of each record) that contain "@" (in Illumina Phred+33 encoding, a quality value of 31 is represented by the symbol "@"), which results in unequal PE counts.

written 13 months ago by toralmanvar810

Thank you. I used wc -l to get the total number of lines, and then divided it by 4.

written 13 months ago by yuabrahamliu
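
To illustrate the two counting approaches mentioned in this exchange, here is a small sketch. The function names are hypothetical, and both assume uncompressed FASTQ with four lines per record.

```shell
# Count reads as (number of lines) / 4.
# Assumption: uncompressed FASTQ, exactly 4 lines per record.
count_reads_by_lines() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Naive count: every line that starts with "@".
# This overcounts whenever a quality string also begins with "@"
# (Phred+33 quality 31), even though the pattern is anchored.
count_reads_by_grep() {
    grep -c '^@' "$1"
}
```

If the two functions disagree on the same file, the line-based count is the one to trust, provided the total line count is a multiple of 4.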
swbarnes2 (United States) wrote, 13 months ago:

Step one: ask the person who gave you the FASTQ files how they were filtered. The FASTQs that came off the instrument should all be paired and in order. You might have FASTQs where some reads were purged for quality reasons while their mates were left in the file. Or one file was truncated.

modified 13 months ago • written 13 months ago by swbarnes2
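
One way to see whether mates were purged from only one file is to walk both files in parallel and report the first record where the read IDs diverge. The following is a sketch only: first_mismatch is a hypothetical name, the read ID is assumed to be the header up to the first whitespace with any trailing /1 or /2 removed, and the files are assumed to be uncompressed with four lines per record.

```shell
# Hypothetical helper: find the first record where R1 and R2 read IDs differ.
# Usage: first_mismatch R1.fastq R2.fastq
first_mismatch() {
    awk -v r2="$2" '
        # Strip the leading "@" and any trailing /1 or /2 mate suffix.
        function clean(h) { sub(/^@/, "", h); sub(/\/[12]$/, "", h); return h }
        NR % 4 == 1 {
            n = (NR + 3) / 4
            if ((getline h2 < r2) <= 0) {
                printf "record %d: R2 ended before R1\n", n; found = 1; exit
            }
            # Skip the sequence, separator and quality lines of the R2 record.
            getline skip < r2; getline skip < r2; getline skip < r2
            split(h2, parts, " ")
            id1 = clean($1); id2 = clean(parts[1])
            if (id1 != id2) {
                printf "record %d: R1=%s R2=%s\n", n, id1, id2; found = 1; exit
            }
        }
        END {
            if (found) exit
            if ((getline h2 < r2) > 0) print "R1 ended before R2"
            else print "all read IDs match in order"
        }
    ' "$1"
}
```

A mismatch early in the files points to reads removed from one side; a mismatch only at the very end is more consistent with truncation.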
Dattatray Mongad (National Centre for Cell Science, Pune) wrote, 13 months ago:

This happens quite often; I have run into the same problem myself. What I did was:


  1. Trim and filter the forward and reverse reads (I used NGSQCToolkit).
  2. Use fastq-pair to keep only those reads that have mates in both the forward and reverse FASTQ files.
  3. Check what percentage of the data you lost. If a sufficient amount of data is retained, proceed to the next step.

If you lose a large amount of data, you can contact the data provider.

written 13 months ago by Dattatray Mongad
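
The "how much data did you lose" check in step 3 can be sketched as below. Both function names are hypothetical, the read counts are assumed to come from uncompressed four-line FASTQ files, and the sketch does not call fastq-pair itself; it only compares counts taken before and after re-pairing.

```shell
# Count reads as (number of lines) / 4.
# Assumption: uncompressed FASTQ, exactly 4 lines per record.
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Print the percentage of reads retained, given counts before and after.
# Usage: percent_retained BEFORE AFTER
percent_retained() {
    awk -v before="$1" -v after="$2" 'BEGIN {
        if (before == 0) { print "NA"; exit }   # avoid division by zero
        printf "%.2f\n", 100 * after / before
    }'
}
```

For example, percent_retained "$(count_reads before.fastq)" "$(count_reads after.fastq)" would report the share of reads that survived, where before.fastq and after.fastq are placeholder file names.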

Awesome. I think it is a very useful tool, fastq-pair.

written 13 months ago by yuabrahamliu
h.mon (Brazil) wrote, 13 months ago:

"Awesome. I think it is a very useful tool, fastq-pair."

You are "fixing" something without even knowing how it got broken in the first place; at least, if you do know, you didn't tell us. You didn't tell us the source of the data, and you didn't follow up on some of our questions. Again, what is the output of:

head -n1 R1.fastq
head -n1 R2.fastq
tail -n4 R1.fastq
tail -n4 R2.fastq

For all we know, it is even possible that you are treating two files from different samples as a pair. This can happen, see for example this post. So before fixing anything, try to discover how things got broken in the first place, or you may end up with some really nonsensical results downstream.

written 13 months ago by h.mon