Question: Can't pair-end fastqs after filtering
gravatar for miles.thorburn
15 months ago by
miles.thorburn100 wrote:

I've got sequence data back from illumina. It's pair-end 300bp reads with ~50bp overlap. I originally pair-ended the reads as my second step (first being fastQC), and it worked a treat. But we decided to filter the fastq's first.

I've made another post here where I ask how to filter the fastq's based on sequence similarity to another fastq, and why I did it in case you're interested.

This is what I've done:

  1. FastQC to check for adapters

  2. Uclust search to find reads with 100% sequence similarity across R1's and across R2's (sepereately; we sequenced the same individual twice from different extractions) for each individual.

  3. Filter fastqs based on uclust 'hits' using seqtk - you lose around 75% of reads for the ones I've checked.

  4. Pair-end (I've tried both PEAR and EA-Utils). But I get errors saying the number of sequences is different, with PEAR saying no files are in any of the R2 files.

I've re-run the uclust and filtering steps in case it was truncating files. Nothing. I've used EA-Utils fastq-stats to get a summary of the R2 files, and there are reads there. I've even tried pairing the filtered and the unfiltered reads, and the same error comes up.

I may be missing something obvious, but I'm genuinely at a loss. Any help would be greatly appreciated. If anything is unclear, please let me know.

EDIT: The line executed was:

$TOOL/pear -f $DATA_DIR/$forward_read -r $reverse_read -o $OUT_DIR/$output_name

after I removed all the filtering options when it didn't work the first time. It equates to:

pear -f /data//01_Fastq_Filtering/BVG080--BVC393-2_S97_L001_R1_001_filtered.fastq \
     -r /data//01_Fastq_Filtering/BVG080--BVC393-2_S97_L001_R2_001_filtered.fastq \
    > /data/02_Paired/BVG080-2_paired.fastq`
pair-ending fastq • 466 views
ADD COMMENTlink modified 15 months ago by h.mon31k • written 15 months ago by miles.thorburn100

Please show the full command-lines you've used.

ADD REPLYlink modified 15 months ago • written 15 months ago by h.mon31k

Ah, apologies, I've updated the post now.

ADD REPLYlink written 15 months ago by miles.thorburn100

Sounds like you lost information about the R1/R2 reads headers in step 2/3. If you are sure the information is there you can try to "re-pair" the reads using from BBMap suite.

ADD REPLYlink modified 15 months ago • written 15 months ago by genomax89k

Thanks, I'll give that a go. I was looking for a way to 'validate' fastq's to see where I was losing information, but couldn't find anything. This seems like a potential workaround.

ADD REPLYlink written 15 months ago by miles.thorburn100

Can you show us example fastq headers for R1 and R2?

If you need to overlap R1/R2 then order of recommendations for BBTools is bbmerge first and then do any other operations you need to do on them.

Did you design your experiment so reads will overlap (chose an insert that was smaller than normal)? Reads will merge only when the length of sequencing is more than 1/2 length of the insert. If this was a WGS experiment then that would be a curious thing.

ADD REPLYlink modified 15 months ago • written 15 months ago by genomax89k
gravatar for gb
15 months ago by
gb1.9k wrote:

I think your approach is wrong. First, the answer to your question is that when you want to merge or assemble paired end reads the R1 and R2 need to be equal. A first quick check would be to check the amount of reads, R1 and R2 need to have the same amount of reads. But if I understand you right and you compared R1 of both samples and removed reads I am sure that you will remove other reads in the R2 files. I would guess that you even remove more from the R2 because most of the time the sequence quality is lower. Also keep in mind that "merging" improves quality because you take the bases with the best quality scores.

Also this:

search to find reads with 100% sequence similarity

Even if you want to it like this, you should first do a quality trim and make sure that all the reads are the same length. You can check the length distribution in fastqc. Uclust performs sort of a global alignment so if two reads are "the same" but one read is just one base longer it is not a 100% identity. Maybe you knew this but I still want to mention.

What you want is not impossible but I think it is unnecessary (I can be wrong) and it requires some scripting I think. All the reads that you remove from the R1 files you also need to remove from the R2 files and the other way around. Another small tip, because you are using the usearch tools anyways use fastx_uniques with the -uc output file if you are looking for a 100% match.

I would first do a quality check and maybe some quality trimming with a tool that is made for paired-end data. After that using PEAR and as last comparing the samples. I only have used FLASH for merging and it gives a percentage on how much of the reads are being merged, if this is really low you can also only use the R1 reads. In my experience if you really want "long" high quality reads an overlap of 50bp can be tight (depending on the quality).

EDIT: I assume you used primers, so you did amplicon sequencing but I don't know for sure. Think it helps others to help you if add that to your question. If yes, the used primers will not be found by fastqc and you need to trim them. I do that after merging but can also before depending on the tool. I like cutadapt but there are many others.

ADD COMMENTlink modified 15 months ago • written 15 months ago by gb1.9k

Did you see the original thread this question is referring to: Keeping only matching reads in fastq files ?

ADD REPLYlink modified 15 months ago • written 15 months ago by genomax89k

Yes I did, and now again. Clearly I missed something but don't know what sorry

ADD REPLYlink written 15 months ago by gb1.9k

Firstly, the reasoning for the sampling design is explained in the previous post that's been linked twice in this thread. I don't compare reads between R1 and R2s, I compare reads between RXs of the same individual - we used two separate samples per individual to verify alleles are real as we are working on a highly polymorphic region.

Secondly, thanks for the idea about trimming, I hadn't thought about that. The FastQC reports indicated there was very little length variation in the files I looked at (like 10 of 480), but it merits a closer investigation.

As for the fastx_uniques, that appears to be a tool for removing duplicated reads in a single fastx. That is the opposite of what I want! We aim to use coverage as an indication of copy number variation.

Finally, we did use primers, but the samples came back from the sequencing center demultiplexed and with the primers removed. I even checked in a handful of the sequences manually, as this is a step I am used to doing myself.

ADD REPLYlink written 15 months ago by miles.thorburn100

I know you are not comparing R1 and R2. You are comparing two R1 files with each other of the two extractions at least that is what I understand. The primers, mostly called index adapters to demultiplex are different then the primers used for amplification. But now I realise that in your project those primers does not influence the end result I think. The tool fastx_uniques does not just only removes duplicates. If it finds for example 100 duplicates and it places a size=100 tag in the fasta header. It is similar as clustering at 100% identity but faster. If I read the comment of genomax in the other post I think it works similar as

Your project sounds interesting and people probably love to discuss the project with you but before that, is the question that why it does not merge now answered?

ADD REPLYlink modified 15 months ago • written 15 months ago by gb1.9k

Yeah, I believe I have my answer. Unfortunately, it means it's back to the square one, and we'll have to find another methods to filter reads. I think we will now have two levels of analysis; a conserved data set, and a broad data set. The latter will be only paired reads that are ~97% (threshold not yet set) and the former will be a 100% similarity. We just lose like 80% of reads when using the 100% similarity, and that's why we tried a different approach.

ADD REPLYlink written 15 months ago by miles.thorburn100

Secondly, thanks for the idea about trimming, I hadn't thought about that.

If you do that then your reads will no longer be of the same length and programs like uclust will have trouble identifying sequence duplicate (at least clumpify will).

ADD REPLYlink written 15 months ago by genomax89k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1421 users visited in the last hour