Question

Contamination in my FASTQC files

7 weeks ago

carlosgonzalezcruz327 ▴ 20

Hi everyone. I hope someone can help me with this problem.

We have sequenced five bacterial genomes (B1, B2, B3, B4 and B5). When I ran FastQC on the reads from my third genome (B3), I got two peaks in the GC content plot, which suggests contamination from another bacterium.

[image: FastQC GC content plot for B3]
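To look at the two peaks outside of FastQC, the per-read GC distribution can be tabulated directly from a FASTQ file. The following is a minimal sketch, not FastQC's own implementation; the helper names and the tiny example reads are made up for illustration, and it assumes plain 4-line FASTQ records.

```python
from collections import Counter

def gc_percent(seq):
    """Return the GC content of a read as a percentage; N bases are ignored."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else 0.0

def read_sequences(fastq_lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:
            yield line.strip()

def gc_histogram(fastq_lines):
    """Count reads per rounded GC percentage."""
    return Counter(round(gc_percent(s)) for s in read_sequences(fastq_lines))

# Made-up example: two low-GC reads and one high-GC read.
example = [
    "@r1", "ATATATGC", "+", "IIIIIIII",
    "@r2", "AATTAAGC", "+", "IIIIIIII",
    "@r3", "GCGCGCAT", "+", "IIIIIIII",
]
hist = gc_histogram(example)
for gc, n in sorted(hist.items()):
    print(f"{gc}% GC: {n} reads")
```

With real data, passing an open file handle instead of the example list gives a table in which a second organism with a different GC content would show up as a second cluster of counts.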

I found that the contamination comes from my second strain (B2), so I used bowtie2 with the --un-conc option to remove it and keep the reads from my third strain. I mapped against the assembly of my B2 strain and a reference genome from NCBI to remove as many contaminating reads as possible. But when I ran FastQC again on the new "clean" FASTQ files from B3, I still got two peaks in GC content, although the second peak is now smaller.

[image: FastQC GC content plot for B3 after bowtie2 filtering]

According to DFAST and TYGS, my B3 strain is possibly Bacillus wiedmannii. I thought I could run bowtie2 against a reference genome for this species and keep only the mapped reads, but I am not sure. Could I do this, or what else can I do?

Thanks in advance
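The two approaches described in the question (dropping pairs that match the B2 contaminant, or keeping pairs that match a trusted reference) correspond to bowtie2's --un-conc and --al-conc options. Below is a small sketch that only builds the command line; the index and file names are hypothetical and the function is an illustration, not a recommended pipeline.

```python
import shlex

def bowtie2_filter_cmd(index_prefix, r1, r2, out_path, keep="unaligned", threads=4):
    """Build a bowtie2 command that writes one class of read pairs to out_path.

    keep="unaligned": pairs that did NOT align concordantly (--un-conc),
                      e.g. to drop pairs matching a contaminant index.
    keep="aligned":   pairs that aligned concordantly (--al-conc),
                      e.g. to keep pairs matching a trusted reference.
    """
    flag = {"unaligned": "--un-conc", "aligned": "--al-conc"}[keep]
    return [
        "bowtie2", "-p", str(threads),
        "-x", index_prefix,        # index built beforehand with bowtie2-build
        "-1", r1, "-2", r2,
        flag, out_path,            # bowtie2 adds .1 / .2 for the two mates
        "-S", "/dev/null",         # the SAM output itself is not needed here
    ]

# Hypothetical names: "B2_index", "B3_R1.fastq" and "B3_R2.fastq" are placeholders.
cmd = bowtie2_filter_cmd("B2_index", "B3_R1.fastq", "B3_R2.fastq", "nohit")
print(shlex.join(cmd))
```

One thing to keep in mind: --un-conc writes every pair that fails to align concordantly, so pairs where only one mate matched B2, or where the mates aligned discordantly, would still end up in the "clean" files. That may be one reason a smaller second peak remains. Keeping only reads that map to a related reference has the opposite risk of discarding strain-specific regions that the reference lacks.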

Fastqc bowtie2 • 411 views

Bacterial whole genome sequencing is not my field, but I don't see why you would get two discrete peaks just from bacterial contamination. Could this be primer dimers or adapters that got sequenced? What do the adapter content and overrepresented sequences modules in FastQC show?

7 weeks ago by ATpoint 82k

Hi, thanks for your answer.

Yes, my FastQC output shows poly-G (~0.1% of the library), but I don't think that was a problem. The real problem was the contamination with my B2 strain. I worked on this all day: I removed all reads with GC content > 47% (that was the minimum GC content of my B2 reads) using reformat.sh from BBTools (reformat.sh in=nohit.1 in2=nohit.2 -out=cg631 -out2=cg632 mingc=0 maxgc=0.47). The result was this:

[image: FastQC GC content plot for B3 after GC filtering]

I think that is a good result, but I'm new to bioinformatics tools, so I hope it is.

7 weeks ago by carlosgonzalezcruz327 ▴ 20

Is your goal to compare the genomes? If you can't redo the sequencing, my advice would be to apply the same filter to all your genomes to avoid unnecessary bias and to ensure your filter doesn't excessively alter your results.

7 weeks ago by yhdist ▴ 70

Hi, thanks. No, my goal is not to compare the genomes, but I will be careful with the filters.

7 weeks ago by carlosgonzalezcruz327 ▴ 20

"i took out all reads with gc content > 47% (that was the min gc content for my B2 fastq/reads)"

Hard to believe there is not a single read with less than 47% GC in B2. This all sounds very unconventional to me, especially filtering by GC content, as this might eliminate genuine regions of B3 with high GC content, but again, not exactly my field.

7 weeks ago by ATpoint 82k
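The concern about losing genuine high-GC regions can be checked before filtering. A rough sketch, assuming a B3 draft assembly or a close reference is available as a plain sequence string: slide read-sized windows along it and count how many exceed the cutoff. The function names, window size, and toy sequence below are illustrative assumptions, not part of any tool.

```python
def gc_fraction(seq):
    """Return the GC fraction (0-1) of a sequence; N bases are ignored."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0

def fraction_of_windows_above(genome, cutoff=0.47, window=150, step=50):
    """Fraction of sliding windows whose GC fraction exceeds the cutoff.

    Each window stands in for a read of that length, so the result is a
    rough estimate of how much genuine sequence a GC cutoff would remove.
    """
    starts = range(0, max(len(genome) - window, 0) + 1, step)
    above = sum(gc_fraction(genome[s:s + window]) > cutoff for s in starts)
    return above / len(starts)

# Toy sequence: mostly AT-rich with one GC-rich island of 100 bases.
toy_genome = "AT" * 150 + "GC" * 50 + "AT" * 100
lost = fraction_of_windows_above(toy_genome, cutoff=0.47, window=100, step=100)
print(f"{lost:.1%} of windows would be removed by the cutoff")
```

If a noticeable share of windows from the target genome sits above the cutoff, a hard GC filter would thin out coverage in exactly those regions.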

I know, but I can't find another solution; it's my best idea so far. I am going to be less aggressive with the GC cutoff for the reads I remove. After that, mapping against the B2 assembly will tell me what percentage of B2 reads remain in the B3 FASTQ files. If the percentage is below 1%, I'll proceed with the downstream analysis.

7 weeks ago by carlosgonzalezcruz327 ▴ 20
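The check described here can be read off the summary that bowtie2 prints to stderr, which ends with an "overall alignment rate" line. A minimal sketch of pulling that number out and comparing it with the 1% threshold; the summary text below is a made-up example and the function names are hypothetical.

```python
import re

def overall_alignment_rate(summary_text):
    """Extract the overall alignment rate (in percent) from a bowtie2 summary."""
    match = re.search(r"([\d.]+)% overall alignment rate", summary_text)
    if match is None:
        raise ValueError("no 'overall alignment rate' line found")
    return float(match.group(1))

def contamination_acceptable(summary_text, max_percent=1.0):
    """True if the share of reads aligning to the contaminant is below the threshold."""
    return overall_alignment_rate(summary_text) < max_percent

# Made-up summary tail, as if the filtered B3 reads were mapped to the B2 assembly.
example_summary = """100000 reads; of these:
  100000 (100.00%) were paired; of these:
0.42% overall alignment rate
"""
rate = overall_alignment_rate(example_summary)
print(rate, contamination_acceptable(example_summary))
```

Note that this rate may also include B3 reads from regions that are conserved between the two strains, so it is probably better read as an upper estimate of the remaining B2 reads.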