Question: a confusion about pandaseq assembly
Asked 2.1 years ago by xioli2013

Hi,

I have MiSeq data: paired-end reads, each about 300 bp long. The barcode is trnL(c)/UAA(h).

As a first step, I am trying to assemble the paired-end reads with pandaseq.

This is the command line:

pandaseq -f lane1-s001-indexN716-B-S502-B-ACTCGCTA-CTCTCTAT-V-1_S1_L001_R1_001.fastq -r lane1-s001-indexN716-B-S502-B-ACTCGCTA-CTCTCTAT-V-1_S1_L001_R2_001.fastq -o 50 -F -N -A simple_bayesian > test.fastq

I was expecting:

[forward primer][barcode][reverse primer]

However, I noticed that the assembled sequences look like this:

[forward primer][some sequence][forward primer][some sequence]

and the reverse primer does not match anywhere.

I am not sure how sequences of this form are generated, and I hope you can shed some light on it.
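
One way to check which layout the merged reads actually have is to search each sequence for the forward primer and for the reverse complement of the reverse primer. A merged read is normally reported in the orientation of the forward read, so the reverse primer would be expected in reverse-complemented form at the 3' end; searching for it as written in the primer sheet will find nothing. The following is a minimal Python sketch that assumes exact matches only. The primer sequences are short made-up placeholders, not the real trnL(c)/UAA(h) primers, and `classify` is a hypothetical helper name.

```python
def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def classify(seq, fwd_primer, rev_primer):
    """Describe the primer layout of one merged read (exact matches only)."""
    rev_rc = revcomp(rev_primer)
    n_fwd = seq.count(fwd_primer)
    if seq.startswith(fwd_primer) and seq.endswith(rev_rc) and n_fwd == 1:
        return "expected"            # [forward primer][barcode][reverse primer]
    if n_fwd > 1:
        return "repeated_forward"    # forward primer occurs more than once
    if rev_rc not in seq:
        return "no_reverse_primer"
    return "other"


# Placeholder primers for illustration only (NOT the real trnL primers).
FWD = "ACGTAC"
REV = "GGTTAC"

reads = {
    "good": FWD + "TTTT" + revcomp(REV),
    "odd": FWD + "TTTT" + FWD + "CCCC",
    "missing": FWD + "TTTTCCCC",
}
for name, seq in reads.items():
    print(name, classify(seq, FWD, REV))
```

Real primers rarely match exactly in every read, so a tool that tolerates mismatches would be more robust than this exact-match check.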

forward read: https://drive.google.com/open?id=1WlY0mNUgqemqHAiasxyhmQpbPPZXIQig
reverse read: https://drive.google.com/open?id=1-LhK3lS7hr7eB4fvdnVc02y-q1z_MFI0
output: https://drive.google.com/open?id=1keIISY-rPQEXynBiT_1ve2XWAdNKx12c

xp

miseq pandaseq • 1.5k views
modified 2.1 years ago by Kevin Blighe • written 2.1 years ago by xioli2013

Can you post some of the command output? Were there any warnings or errors in the logs?

Where exactly are you looking when you see: [forward primer][some sequence][forward primer][some sequence] ?

Your results file, i.e., the assembled sequences, is test.fastq

written 2.1 years ago by Kevin Blighe

Hi Kevin, I just added the links to the reads. Hope you can take a look at them.

written 2.1 years ago by xioli2013

I experimented with bbmerge and pandaseq, with and without trimming (Trimmomatic 0.36).

The number of raw reads in R1 is 194543.

No trimming:

$BBMerge in1=V1_R1.fastq in2=V1_R2.fastq out=bbmap_notrim_merged.fq outu=bbmap_notrim_unmerged.fq \
  adapters=NexteraPE-PE.fa ihist=ihist_notrim.txt ecct extend2=20 iterations=5 k=62

bbmerge generated a merged FASTQ with 175003 reads, about 10% loss. The sequences begin with the trnL(c) primer, whereas no UAA(h) primer could be found.

pandaseq -f V1_R1.fastq -r V1_R2.fastq -A simple_bayesian -l 100 -N -F -t 0.8 -w pandaseq_merged.fq

pandaseq generated a FASTQ with 14238 reads, about 93% loss, and I observed the pattern [trnL(c) primer][some sequence][trnL(c) primer][some sequence].

With trimming:

java -jar $Trim PE V1_R1.fastq V1_R2.fastq V1_R1_paired.fastq V1_R1_unpaired.fastq V1_R2_paired.fastq V1_R2_unpaired.fastq \
  ILLUMINACLIP:NexteraPE-PE.fa:2:30:10:8:true LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:100

$BBMerge in1=V1_R1_paired.fastq in2=V1_R2_paired.fastq out=bbmap_notrim_merged.fq outu=bbmap_notrim_unmerged.fq \
  adapters=NexteraPE-PE.fa ihist=ihist_notrim.txt ecct extend2=20 iterations=5 k=62

bbmerge generated 162668 merged reads, about 16% loss. The merged sequences begin with the trnL(c) primer, whereas no UAA(h) primer could be found.

java -jar $Trim PE V1_R1.fastq V1_R2.fastq V1_R1_paired.fastq V1_R1_unpaired.fastq V1_R2_paired.fastq V1_R2_unpaired.fastq \
  ILLUMINACLIP:NexteraPE-PE.fa:2:30:10:8:true LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:100

pandaseq -f V1_R1_paired.fastq -r V1_R2_paired.fastq -A simple_bayesian -l 100 -N -F -t 0.8 -w pandaseq_merged_trim.fq

pandaseq generated 162242 reads, about 16% loss. The merged sequences begin with the trnL(c) primer, whereas no UAA(h) primer could be found.
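
The loss percentages quoted in this post follow directly from the read counts. Here is a small Python sketch of the arithmetic, using only the counts reported here; `percent_loss` is just an illustrative helper name.

```python
def percent_loss(raw_reads, merged_reads):
    """Percentage of raw read pairs that did not end up in the merged output."""
    return 100.0 * (raw_reads - merged_reads) / raw_reads


RAW = 194543  # raw reads in R1, as reported in this post

merged_counts = {
    "bbmerge, no trimming": 175003,
    "pandaseq, no trimming": 14238,
    "bbmerge, trimmed": 162668,
    "pandaseq, trimmed": 162242,
}

for label, merged in merged_counts.items():
    print(f"{label}: {percent_loss(RAW, merged):.1f}% loss")
```

Note that the figures for the trimmed runs are relative to the raw read count, so they include pairs discarded by Trimmomatic as well as pairs that failed to merge.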

I suspect that the low quality of the reads affects pandaseq's matching algorithm more than it affects bbmerge's.

xp

written 2.1 years ago by xioli2013

Hi Xiao, yes, I noticed the huge read loss after PANDAseq just from the sizes of your files.

Did you also check the overall read quality with FastQC? It also runs on Java and can help to identify systematic problems with your reads.

written 2.1 years ago by Kevin Blighe

Hi Kevin,

Here are the FastQC reports. The quality at the 3' end is poor, as is usual for NGS reads: Forward QC, Reverse QC
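
For a quick look at the same trend without opening the reports, the mean Phred quality per read position can be computed directly from a FASTQ file. This is a rough Python sketch, assuming standard 4-line FASTQ records with Phred+33 encoding. `mean_quality_by_position` is a hypothetical helper, the demo records are made up, and this is not how FastQC itself is implemented.

```python
import io


def mean_quality_by_position(handle):
    """Return the mean Phred score at each read position (Phred+33 FASTQ)."""
    totals, counts = [], []
    for line_no, line in enumerate(handle):
        if line_no % 4 != 3:          # only the 4th line of each record holds qualities
            continue
        for pos, char in enumerate(line.rstrip()):
            if pos == len(totals):    # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(char) - 33
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]


# Tiny made-up example in which quality drops towards the 3' end.
demo = io.StringIO("@r1\nACGT\n+\nIII#\n@r2\nACG\n+\nII5\n")
print(mean_quality_by_position(demo))
```

For a real file, the same function could be called on an open handle, for example `with open("V1_R1.fastq") as fh: profile = mean_quality_by_position(fh)`.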

What do you think is the best practice for analyzing metabarcoding data here? Assemble first or trim first?

Thanks for your help

Xiao

written 2.1 years ago by xioli2013

Hi Xiao,

The quality of the reads is very poor at the 3' end. You definitely need to trim these reads before using PANDAseq. To ensure high-quality reads, you could use (with Trimmomatic):

  • LEADING:20
  • TRAILING:20
  • SLIDINGWINDOW:4:30
  • MINLEN:50

However, the best parameters can only be found through experimentation.
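
Combined with the rest of a paired-end Trimmomatic call, the settings listed above might look like the sketch below. It uses Python only to assemble and print the command. The file names and the ILLUMINACLIP step are copied from the command posted earlier in this thread, while the jar path `trimmomatic-0.36.jar` and the helper name `build_trimmomatic_command` are assumptions to adapt to your own setup.

```python
import shlex

TRIMMOMATIC_JAR = "trimmomatic-0.36.jar"  # assumed location; adjust as needed


def build_trimmomatic_command(r1, r2, prefix, steps):
    """Build the argument list for a paired-end Trimmomatic run."""
    outputs = [
        f"{prefix}_R1_paired.fastq", f"{prefix}_R1_unpaired.fastq",
        f"{prefix}_R2_paired.fastq", f"{prefix}_R2_unpaired.fastq",
    ]
    return ["java", "-jar", TRIMMOMATIC_JAR, "PE", r1, r2, *outputs, *steps]


steps = [
    "ILLUMINACLIP:NexteraPE-PE.fa:2:30:10:8:true",  # from the earlier command
    "LEADING:20",
    "TRAILING:20",
    "SLIDINGWINDOW:4:30",
    "MINLEN:50",
]

cmd = build_trimmomatic_command("V1_R1.fastq", "V1_R2.fastq", "V1", steps)
print(shlex.join(cmd))  # inspect first; run with subprocess.run(cmd, check=True)
```

Trimmomatic applies the steps in the order they are given, so MINLEN is placed last to filter on the length that remains after all trimming.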

written 2.1 years ago by Kevin Blighe

Hi Kevin, thanks for the reply. One more question: Should the output of pandaseq assembly look like [forward primer][sequence][reverse primer] without trimming the primers?

written 2.1 years ago by xioli2013

Hi Xiao, yes, that is what the output should be, as per Figure 1 of the published work:

[image: Figure 1 of the PANDAseq paper]

[source: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-31]

Try again with the higher quality reads (after filtering with Trimmomatic), and see what happens. Also, check that the orientation of the reads is correct and that the insert size (expected gap between matching pairs) is not too large.
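
Whether the insert size is compatible with merging can be estimated with simple arithmetic: for two reads of length R covering an amplicon of length A, the expected overlap is 2R - A. Below is a small Python sketch, assuming untrimmed reads of equal length. The amplicon lengths are arbitrary illustrative values, not measurements from this dataset, and `expected_overlap` is a hypothetical helper.

```python
def expected_overlap(read_len, amplicon_len):
    """Expected overlap (bp) between two equal-length reads spanning an amplicon.

    A negative value means the reads leave a gap and cannot be merged.
    A value larger than read_len means the amplicon is shorter than one read,
    so each read runs past the far end of the fragment.
    """
    return 2 * read_len - amplicon_len


READ_LEN = 300  # approximate read length stated in the question

for amplicon_len in (150, 450, 650):  # illustrative lengths only
    overlap = expected_overlap(READ_LEN, amplicon_len)
    if overlap <= 0:
        status = "no overlap, reads cannot be merged"
    elif overlap > READ_LEN:
        status = "amplicon shorter than a read, expect read-through"
    else:
        status = "reads overlap normally"
    print(amplicon_len, overlap, status)
```

If the amplicon turns out to be shorter than the read length, each read continues past the opposite primer into adapter sequence, which makes adapter removal before merging especially important.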

written 2.1 years ago by Kevin Blighe