Question: Should I remove Kmers identified in FastQC?
7
4.2 years ago by
kezcleal130
United Kingdom
kezcleal130 wrote:

Hi, apologies for this basic question, as I am new to the field. I have been checking my NGS data using FastQC, and the checks fail on the Kmer content section. There appears to be a sequence, TAGATCGGAA, at position 90-100 bp in the reads which is enriched around 12-fold (Obs/Exp Max). However, nothing shows up in the 'Overrepresented sequences' or 'Adapter content' sections of the report (completely flat line). The sequencing was done at BGI and I do not know what primers were used. My question is: should I be trying to remove this Kmer sequence?

If I use grep on the first million reads in my fastq file to look for the sequence, I find only 36 matches, which seems quite a low number:

$ gunzip -c data1.fq.gz | head -4000000 | grep TAGATCGGAA | wc -l
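A slightly stricter version of this count could look like the sketch below. It assumes a standard 4-line-per-record FASTQ, searches only the sequence lines (so header and quality lines cannot match), and reports the hits as a percentage of reads; for scale, 36 hits in 1,000,000 reads is 0.0036%. The demo file and read names are synthetic placeholders so the sketch runs anywhere; substitute data1.fq.gz for real use.

```shell
# Build a tiny synthetic FASTQ (hypothetical reads) so the sketch is self-contained.
tmp=$(mktemp -d)
printf '@r1\nACGTTAGATCGGAA\n+\nIIIIIIIIIIIIII\n@r2\nACGTACGTACGTAC\n+\nIIIIIIIIIIIIII\n@r3\nTAGATCGGAAGAGC\n+\nIIIIIIIIIIIIII\n' \
    | gzip -c > "$tmp/demo.fq.gz"

kmer=TAGATCGGAA
# Line 2 of every 4-line record is the sequence.
# index() returns the 1-based match position, or 0 if the k-mer is absent.
result=$(gunzip -c "$tmp/demo.fq.gz" | head -n 4000000 | awk -v k="$kmer" '
    NR % 4 == 2 { total++; if (index($0, k) > 0) hits++ }
    END { printf "%d %d %.4f\n", hits, total, 100 * hits / total }')

echo "hits, reads examined, percent of reads: $result"
rm -rf "$tmp"
```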

Thanks for the help

fastqc next-gen genome • 5.3k views
modified 3.0 years ago by DataFanatic140 • written 4.2 years ago by kezcleal130
2

This plot in general, and the fold change and p-value in particular, are not informative and in the vast majority of cases only cause confusion.

This plot should be removed from the output of this tool.

written 3.0 years ago by Istvan Albert ♦♦ 81k

Agreed, I really wish that FastQC had big warnings above its plots like "this is only meaningful for whole-genome sequencing".

written 3.0 years ago by Devon Ryan91k

Hi @kezcleal, I see a very similar k-mer plot, with exactly the same k-mers showing up as over-represented. I am wondering if you were able to find the cause of these. My data is Agilent exomes, so I am wondering if there could be adapter/bait/control sequences in the target enrichment kit that are popping up. Thanks!

written 2.8 years ago by rachanaj0
6
4.2 years ago by
Devon Ryan91k
Freiburg, Germany
Devon Ryan91k wrote:

No, there's no point in removing that. It's likely that that's just a bit of adapter contamination (I think the adapter contamination section of FastQC is looking for closer to full-length adapters, rather than just 5-8 bases). Go ahead and trim adapters regardless.

written 4.2 years ago by Devon Ryan91k

Thanks for the response. I also have the problem of not knowing which adapters to try to trim, as this sequence doesn't seem to correspond to the well-known adapters, e.g. Illumina universal, Nextera, TruSeq etc. Should I run something like Trimmomatic and give it all of these potential adapter sequences anyway?

written 4.2 years ago by kezcleal130
4

If you have paired reads, you can find out what the adapter sequences are with BBMerge:

bbmerge.sh in1=r1.fq in2=r2.fq outa=adapters.fa reads=1m

You can also adapter-trim using BBDuk without knowing the adapter sequence using the "tbo" (trimbyoverlap) flag, but it's best to use both the "tbo" flag AND the adapter sequence.
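Combining the two might look roughly like the sketch below. The file names are hypothetical, adapters.fa is assumed to be the file written by the bbmerge.sh command above, and the k-mer settings (ktrim=r k=23 mink=11 hdist=1 tpe) follow the adapter-trimming example in the BBDuk usage guide rather than anything tuned for this data set. The command is only executed if bbduk.sh and the input files are present; otherwise it is printed.

```shell
# Hypothetical input/output names; adjust to your own files.
cmd="bbduk.sh in1=r1.fq in2=r2.fq out1=trimmed_r1.fq out2=trimmed_r2.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo"

# Run only when the tool and all inputs exist; otherwise just show the command.
if command -v bbduk.sh >/dev/null 2>&1 && [ -f r1.fq ] && [ -f r2.fq ] && [ -f adapters.fa ]; then
    $cmd
else
    echo "would run: $cmd"
fi
```

Here ref= supplies the known adapter sequences and tbo adds the overlap-based trimming, so each can catch adapter fragments the other misses.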

written 4.2 years ago by Brian Bushnell16k

Sure, or trim_galore, since its default sequence will probably work.
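A quick way to sanity-check that, assuming the default in question is Trim Galore's Illumina adapter AGATCGGAAGAGC: look for the longest suffix of the reported k-mer that is also a prefix of the adapter. This is only an illustrative sketch using the k-mer from the question.

```shell
kmer=TAGATCGGAA
adapter=AGATCGGAAGAGC   # assumed default: standard Illumina adapter prefix

# Try ever-shorter suffixes of the k-mer until one matches the start of the adapter.
overlap=$(awk -v k="$kmer" -v a="$adapter" 'BEGIN {
    best = ""
    for (i = 1; i <= length(k); i++) {
        suffix = substr(k, i)
        if (substr(a, 1, length(suffix)) == suffix) { best = suffix; break }
    }
    print best
}')

echo "k-mer suffix matching the adapter start: $overlap (${#overlap} bases)"
```

Nine of the ten bases of the k-mer match the start of the adapter, which is consistent with partial adapter read-through near the 3' end of the reads.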

written 4.2 years ago by Devon Ryan91k

The percentage of reads with partial adapters is low, in my case ca. 6%. I think it's best to identify and remove those reads (after all, falsely duplicated sequences could somehow distort e.g. ChIP-seq or DNase-seq results, am I right?), but Trimmomatic doesn't allow that.

written 2.9 years ago by boczniak767640

Don't bother deduplicating before alignment. It's faster and easier post-alignment.

written 2.9 years ago by Devon Ryan91k

I mean, the k-mers could potentially influence mapping to the genome, with reads containing partial adapter sequences generating 'adapter' peaks. Or are they too short to influence mapping?

written 2.9 years ago by boczniak767640

If the k-mers aren't part of an adapter (or something else artificially added) then they shouldn't be removed. They won't bias mapping because they should be there.

written 2.9 years ago by Devon Ryan91k
2
4.2 years ago by
cyril-cros890
France
cyril-cros890 wrote:

FastQC often displays failed checks - it does not mean your data is bad, they are just warnings.
The only thing you should always really check is the quality along the length of your reads, which might force you to do some trimming. In this case, I would say that your reads are ok as they are, and you can proceed to the alignment phase of your workflow.

written 4.2 years ago by cyril-cros890

The per-base quality seems good, I think (yellow boxes all in the green region). Thanks

 

written 4.2 years ago by kezcleal130
0
3.1 years ago by
DataFanatic140
DataFanatic140 wrote:


I have ChIP-seq data with a similar problem. The quality scores are good for this data, so I aligned the reads using Bowtie and used MACS2 for peak calling. Only when I use this particular sample (an input) do I get no peaks for my IP samples; using a different input yields peaks for all IP samples. I would appreciate any feedback.

modified 3.1 years ago • written 3.1 years ago by DataFanatic140
1

Try doing a PCA or clustering of the various input and ChIP samples (e.g., with deepTools); perhaps you have a sample swap.
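With deepTools, that could look roughly like the sketch below. The BAM and output file names are hypothetical, and the BAM files are assumed to be coordinate-sorted and indexed. The commands are only executed if deepTools and the files are present; otherwise they are printed.

```shell
# Hypothetical file names: one BAM per input and IP sample.
bams="inputB.bam inputP.bam B1.bam P1.bam"
summary_cmd="multiBamSummary bins --bamfiles $bams -o coverage.npz"
pca_cmd="plotPCA -in coverage.npz -o pca.png"

# Run only when deepTools and the first BAM are available; otherwise show the commands.
if command -v multiBamSummary >/dev/null 2>&1 && [ -f inputB.bam ]; then
    $summary_cmd && $pca_cmd
else
    printf 'would run:\n  %s\n  %s\n' "$summary_cmd" "$pca_cmd"
fi
```

If samples have been swapped, an input would be expected to sit with the IP samples in the PCA plot, or the other way round.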

written 3.1 years ago by Devon Ryan91k

ChIPseq sample PCA using deepTools

B = control; inputB is the input for this sample
P = treatment; inputP is the input for this sample
The IP for B1 looks unusual; could you please provide your feedback?

modified 3.0 years ago • written 3.0 years ago by DataFanatic140
0
3.0 years ago by
DataFanatic140
DataFanatic140 wrote:

[image not preserved]

modified 3.0 years ago • written 3.0 years ago by DataFanatic140

I suspect that the outlier sample has signal mostly at regions that should be blacklisted. Either way, you have one weird sample that might need to be excluded; that happens.

written 3.0 years ago by Devon Ryan91k