Question: When paired end reads doesn't join, can i consider them single-end in metagenomic analysis?
0
gravatar for fernandalpcosta
8 months ago by
Campinas/UNICAMP
fernandalpcosta0 wrote:

Hello all,

I'm doing a metagenomic analysis on human samples that were sequenced by Miseq platform, paired end reads, 40 to 300pb long (it varies according to FastQC report). After the trim/quality control fase, i ended up with reads of 60 to 190 bp long, but those paired end reads aren't able to join anymore. :(

I thought about 2 solutions:

  1. Consider all reads (R1 and R2) single end reads, concatenate them all and proceed with the metagenomic analysis with QIIME, or;
  2. Use only forward (R1) or reverse (R2) reads for the downstream analysis with QIIME.

Is there a 3th and better option for it? Does any of the solutions i thought about make sense?

Thank you all very much for any help or tips you can provide me.

Fernanda Luz

qiime sequencing metagenomics • 706 views
ADD COMMENTlink modified 3 months ago by h.mon15k • written 8 months ago by fernandalpcosta0
1

You should always merge the reads first (you can adapter trim but NOT quality trim). After merging the reads you could quality trim if needed. BBMerge from BBMap suite is what I recommend that you try for merging unelss you are following an established pipeline line QIIME or Mothur.

ADD REPLYlink written 8 months ago by genomax49k

Thank you for your reply, i will give it a try with BBMerge on the raw reads. I tried to join the reads before trimming them, but i got less than 10% joined reads (i used fastq-join from QIIME pipeline).

ADD REPLYlink written 8 months ago by fernandalpcosta0

If you expect the reads to join (the way you designed your amplicons) and if they are not then something may be wrong with the libraries. If your inserts are short and you have a lot of adapter contamination then you are going to get short reads after adapters are removed. They should overlap but will not give you long reads you expect.

ADD REPLYlink modified 8 months ago • written 8 months ago by genomax49k

Another problem is that i don't have the information about the library preparation, as the size of the overlap region between R1 and R2. I used fastq-join in the default mode. According to FastQC report, i don't have adapter contamination, just a really bad score at the end of the reads.

ADD REPLYlink written 8 months ago by fernandalpcosta0

Did you have a chance to try bbmerge.sh? Curious to see if it helps improve things.

ADD REPLYlink written 8 months ago by genomax49k

I'm going to try it today. As soon as i get a result, i will let you know. Thank you :)

ADD REPLYlink written 8 months ago by fernandalpcosta0

Start with the original data for first bbduk try. Do not worry about bad quality at ends of reads.

ADD REPLYlink modified 8 months ago • written 8 months ago by genomax49k

I ran bbmerge.sh today in one of my samples and got this report:


BBMerge version 37.58

Writing mergable reads merged.

Started output threads.

Total time: 1.617 seconds.

Pairs:                          54443

Joined:                 3442        6.322%

Ambiguous:              45310       83.225%

No Solution:            5691        10.453%

Too Short:              0           0.000%


Avg Insert:             456.8

Standard Deviation:     21.3

Mode:                   460

Insert range:           63 - 489

90th percentile:        460

75th percentile:        460

50th percentile:        460

25th percentile:        460

10th percentile:        443

Not looking good right?

About the bbduk.sh, i don't have the contaminant files for the 'ref' parameter.

ADD REPLYlink modified 8 months ago • written 8 months ago by fernandalpcosta0

For the adapters (if that is what you meant by contaminants) the sequences are available in a file called adapters.fa in the resources directory in BBMap software bundle.

You have a high percentage of Ambiguous: 45310 83.225% overlaps. That could signify that there may be simple repeats like ACACACACAAC at the ends of reads which make overlapping them difficult.

BTW: You don't appear to have many reads. 54443 is relatively a tiny amount of data. Is that correct?

ADD REPLYlink modified 8 months ago • written 8 months ago by genomax49k

Unfortunately is it correct, my data is really small.

I ran bbduk and it only removed 6 reads from my data. Then i reran bbmerge, as expected, got no change at those 6% of merged reads.

It does look like a little like a lost cause right? Should i consider it a single end reads data and continue the analysis with only R1 reads?

ADD REPLYlink written 8 months ago by fernandalpcosta0
1

At least we can be sure that there is likely no adapter contamination your data.

Can you take a sample of reads (from R1/R2 file) and blast them at NCBI? Confirm that they are retrieving hits to the right sequences. It is possible that something went wrong somewhere in the experimental procedure and you don't have the right kind of data.

ADD REPLYlink modified 8 months ago • written 8 months ago by genomax49k

Hey Fernanda, Is there any particular reason why the data is of poor quality (according to FastQC)? Are you running FastQC with default parameters? It seems a waste to throw away such long reads.

ADD REPLYlink written 8 months ago by Kevin Blighe21k

Thank you for your reply Kevin. I ran FastQC with default parameters, yes.

This is a real example of one of my libraries' FastQC report (just the per base sequence quality):

R1 example

R2 example

ADD REPLYlink modified 8 months ago by genomax49k • written 8 months ago by fernandalpcosta0

Wow, the quality is very poor at the 3` end. Yes, now I can understand the trimming to ~190bp max. I think that the quality is poorer than it should be. You should check to ensure that nothing was at error during the sequencing process.

The other comments (acima/above) também te ajudarão.

ADD REPLYlink written 8 months ago by Kevin Blighe21k
0
gravatar for h.mon
3 months ago by
h.mon15k
Brazil
h.mon15k wrote:

Another problem is that i don't have the information about the library preparation, as the size of the overlap region between R1 and R2.

The primers have been removed already? Which primer pair did you use? What is the size of your amplicon?

Several (all?) workflows demand primers to be removed in order to perform the taxonomical assignment, you should not proceed if you don't know if the primers have been removed or not. You should either get these answers from the person who gave you the data, or you could do some detective work: you could check if standard primers are present at the beginning of your reads, then you would know the expected amplicon size.

Do you know if phiX has been added to your sequencing run, or if a laddered primers has been used? The poor quality of your reads may be due to low complexity of the samples. In any case, I would seriously consider if it is worth to proceed analysis with this data, as it seems really bad.

ADD COMMENTlink written 3 months ago by h.mon15k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1589 users visited in the last hour