Question: Is it possible to estimate the proportion of contamination from GC content?
4 weeks ago by doinelpierrot0 wrote:

Hello all,

I have multiple FASTQ files coming from different samples. Two of them show a significantly different GC content plot. I am wondering: is it possible from there to estimate the percentage of contaminated reads?

Thanks

rna-seq • 140 views
modified 4 weeks ago • written 4 weeks ago by doinelpierrot0

"is it possible from there to estimate the percentage of contaminated reads?"

I don't think so. A shifted GC distribution may hint that there is contamination (e.g. with rRNA or a different species), but you can't determine the percentage of contaminant reads that may be present unless you go looking for those reads directly.
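
To make the "hint" concrete, here is a minimal Python sketch that bins reads by per-read GC percentage so that two samples can be compared. It assumes plain 4-line FASTQ records; the function names and the toy reads are made up for illustration and are not from any particular tool.

```python
from collections import Counter
import io


def read_gc_fractions(handle):
    """Yield the GC fraction of each read in a 4-line-per-record FASTQ stream."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the sequence is the second line of every record
            seq = line.strip().upper()
            if seq:
                yield (seq.count("G") + seq.count("C")) / len(seq)


def gc_histogram(handle, bin_width=5):
    """Count reads per GC-percentage bin (key = lower edge of the bin)."""
    hist = Counter()
    for gc in read_gc_fractions(handle):
        pct = gc * 100
        # Clamp 100% GC into the last bin so the bins cover 0 to 100
        bin_start = min(int(pct // bin_width) * bin_width, 100 - bin_width)
        hist[bin_start] += 1
    return dict(sorted(hist.items()))


# Toy input (hypothetical reads): two AT-rich reads and one GC-rich read
toy_fastq = io.StringIO(
    "@r1\nATATATAT\n+\nIIIIIIII\n"
    "@r2\nATGCATAT\n+\nIIIIIIII\n"
    "@r3\nGGCCGGCC\n+\nIIIIIIII\n"
)
print(gc_histogram(toy_fastq))
```

An extra peak in such a histogram suggests a second population of sequences, but because the GC distributions of different organisms overlap, the histogram alone cannot say which reads, or how many, belong to a contaminant.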

modified 4 weeks ago • written 4 weeks ago by genomax92k

What do you mean by "a significant GC content plot"?

written 4 weeks ago by jared.andrews077.9k

A significantly different GC plot, my bad!

written 4 weeks ago by doinelpierrot0

As genomax mentioned, no, you are not going to be able to determine this from GC content. If you have a large proportion of reads that don't map to the genome of your target organism, there are a few methods you could try.

written 4 weeks ago by jared.andrews077.9k

I am doing de novo assembly. So far I am thinking of doing a pre-assembly with the samples that have a good GC content, then BLASTing all the resulting transcripts and deleting the foreign ones. I would then map all my reads to this cleaned transcriptome, and eventually do a final assembly with all the reads that map.
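
The "BLAST, then delete foreign transcripts" step could be sketched in Python as below. This assumes BLAST was run with the default tabular output (`-outfmt 6`, where columns 11 and 12 are e-value and bit score) and that you have already decided which subject IDs count as foreign; the `foreign_subjects` set, the e-value cutoff, and the toy IDs are all hypothetical.

```python
import io


def foreign_queries(blast_tab, foreign_subjects, max_evalue=1e-10):
    """Return transcript IDs whose best-scoring hit is a foreign subject.

    blast_tab: iterable of lines in BLAST tabular format (-outfmt 6, default columns).
    foreign_subjects: hypothetical set of subject IDs judged to be contaminants.
    """
    best = {}  # query ID -> (bit score, subject ID)
    for line in blast_tab:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        query, subject = cols[0], cols[1]
        evalue, bitscore = float(cols[10]), float(cols[11])
        if evalue > max_evalue:
            continue  # ignore weak hits
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, subject)
    return {q for q, (_, s) in best.items() if s in foreign_subjects}


def filter_fasta(fasta, drop_ids):
    """Yield FASTA lines, skipping records whose ID is in drop_ids."""
    keep = True
    for line in fasta:
        if line.startswith(">"):
            # Record ID = header text up to the first whitespace
            parts = line[1:].split()
            keep = (parts[0] if parts else "") not in drop_ids
        if keep:
            yield line


# Toy data (made up): tx2 has its best hit against a bacterial sequence
hits = [
    "tx1\thost_gene\t99.0\t200\t2\t0\t1\t200\t1\t200\t1e-100\t370\n",
    "tx2\tbacteria_16S\t98.0\t150\t3\t0\t1\t150\t1\t150\t1e-70\t270\n",
]
fasta = io.StringIO(">tx1 len=4\nACGT\n>tx2 len=4\nGGCC\n")
drop = foreign_queries(hits, foreign_subjects={"bacteria_16S"})
print("".join(filter_fasta(fasta, drop)), end="")
```

Transcripts with no hit at all are kept by this sketch, so anything missing from the BLAST database would pass through unnoticed.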

modified 4 weeks ago • written 4 weeks ago by doinelpierrot0

If you know/suspect that there is contamination, it may be best to address it up front before doing the assembly.

written 4 weeks ago by genomax92k

I have thought about it, but I can't BLAST 200 Gb of reads; the data is reduced considerably after assembly. Besides, it seems to be a multi-species contamination, and I don't have the full genomes/transcriptomes of these associated species. So the other alternative, which was to identify the contaminants from a subset and then map against their full genomes/transcriptomes, seems complicated.
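
For the subset idea, a rough Python sketch: draw a random subsample of reads in a single pass (reservoir sampling), search only that subsample, and put an approximate 95% interval on the fraction that hit non-target sequences. The function names, the sample sizes, and the hit counts are assumptions for illustration; the Wilson score interval is one common choice here, not the only one.

```python
import io
import math
import random


def sample_fastq(handle, k, seed=0):
    """Reservoir-sample k records (4 lines each) from a FASTQ stream in one pass."""
    rng = random.Random(seed)
    reservoir = []
    record = []
    n = 0  # number of complete records seen so far
    for line in handle:
        record.append(line)
        if len(record) == 4:
            n += 1
            if len(reservoir) < k:
                reservoir.append(record)
            else:
                j = rng.randrange(n)  # new record is kept with probability k/n
                if j < k:
                    reservoir[j] = record
            record = []
    return reservoir


def wilson_interval(hits, n, z=1.96):
    """Approximate 95% Wilson score interval for the proportion hits/n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Toy FASTQ with 50 made-up reads; keep a random 10 of them
toy = io.StringIO("".join(f"@r{i}\nACGT\n+\nIIII\n" for i in range(50)))
subset = sample_fastq(toy, k=10)
print(len(subset), "records sampled")

# Hypothetical outcome: 120 of 10,000 sampled reads hit a non-target sequence
low, high = wilson_interval(120, 10_000)
print(f"contaminant fraction roughly between {low:.4f} and {high:.4f}")
```

Such an estimate only counts contaminants that are represented in the database you search against, so with unknown associated species it is a lower bound at best.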

modified 4 weeks ago • written 4 weeks ago by doinelpierrot0

Then you may want to treat your data as if it were a metagenomic dataset and use an assembler like metaSPAdes.

modified 4 weeks ago • written 4 weeks ago by genomax92k