Should I remove Kmers identified in FastQC?
4
8
Entering edit mode
7.0 years ago
kezcleal ▴ 160

Hi, apologies for this basic question as I am new to the field. I have been checking my NGS data using FastQC and the checks fail on the Kmer content section. There appears to be a sequence TAGATCGGAA at position 90-100 bp in the reads which is enriched around 12 fold (Obs/Exp Max). However nothing shows up in the 'Overrepresented sequences' or 'Adapter content' sections of the report (complete flat line). The sequencing was done as BGI and I do not know what primers were used. My question is should I be trying to remove this Kmer sequence?  

If I use grep on the first million reads in my fastq file to look for the sequence I only find 36 which appears to be quite a low number?: 

$ gunzip -c data1.fq.gz | head -4000000 | grep TAGATCGGAA | wc -l

Thanks for the help

next-gen genome FastQC • 6.8k views
ADD COMMENT
2
Entering edit mode

This plot in general and that fold change and pvalue in particular are not informative and, in the vast majority of cases cause only confusion.

This plot should be removed from the output of this tool.

ADD REPLY
0
Entering edit mode

Agreed, I really wish that FastQC had big warnings above their plots like "this is only meaningful for whole-genome sequencing".

ADD REPLY
0
Entering edit mode

Hi @kezcleal I see a very similar kmer plot with exact same kmers showing up as over-represented. I am wondering if you were able to find the cause of these. My data is Agilent exomes so I am wondering if there could be adapters/baits/coltrol sequences in the target enrichment kit that are popping up. Thanks !

ADD REPLY
7
Entering edit mode
7.0 years ago

No, there's no point in removing that. It's likely that that's just a bit of adapter contamination (I think the adapter contamination section of FastQC is looking for closer to full-length adapters, rather than just 5-8 bases). Go ahead and trim adapters regardless.

ADD COMMENT
0
Entering edit mode

Thanks for the response. I also have the problem of not knowing what adapters to try to trim, as this sequence doesn't seem to correspond to the well known adapters e.g. Illumina universal, Nextera, Truseq etc. Should I run something like trimmomatic and give it all of these potential adapter sequences anyway? 

ADD REPLY
4
Entering edit mode

If you have paired reads, you can find out what the adapter sequences are with BBMerge:

bbmerge.sh in1=r1.fq in2=r2.fq outa=adapters.fa reads=1m

You can also adapter-trim using BBDuk without knowing the adapter sequence using the "tbo" (trimbyoverlap) flag, but it's best to use both the "tbo" flag AND the adapter sequence.

ADD REPLY
0
Entering edit mode

Sure, or trim_galore, since its default sequence will probably work.

ADD REPLY
0
Entering edit mode

Percentage of those sequences with partial adapters are low. In my case ca. 6%. I think that its best to identify and remove those reads -after all falsely duplicated sequences could somehow mess eg. CHiP-seq or DNAse-seq results, am I right- but trimmomatic doesn't allow to do that.

ADD REPLY
0
Entering edit mode

Don't bother deduplicating before alignment. It's faster and easier post-alignment.

ADD REPLY
0
Entering edit mode

I mean, k-mers could potentially influence mapping to genome, making reads with partial-adapter sequences generating 'adapter' peaks. Or it's too short to influence mapping?

ADD REPLY
0
Entering edit mode

If the k-mers aren't part of an adapter (or something else artificially added) then they shouldn't be removed. They won't bias mapping because they should be there.

ADD REPLY
2
Entering edit mode
7.0 years ago
cyril-cros ▴ 910

FastQC often displays failed checks - it does not mean your data is bad, they are just warnings.
The only thing you should always really check is the quality along the length of your reads, which might force you to do some trimming. In this case, I would say that your reads are ok as they are, and you can proceed to the alignment phase of your workflow.

ADD COMMENT
0
Entering edit mode

The per base quality seems to be good I think, (yellow boxes all in the green region). Thanks

 

ADD REPLY
0
Entering edit mode
5.8 years ago
ATCG ▴ 370

enter image description here enter image description here enter image description here enter image description here

I have ChIPseq data with a similar problem, the quality scores are good for this data, so I aligned the reads using bowtie and used macs2 for peak calling and only when I use this particular sample( input )I do not get peaks for my IP samples. Using a different input will yield peaks for all IP-samples. I will appreciate any feedback.

ADD COMMENT
1
Entering edit mode

Try doing a PCA or clustering of the various input and ChIP samples (e.g., with deepTools), perhaps you have a sample swap.

ADD REPLY
0
Entering edit mode

ChIPseq sample PCA using deepTools

B=control; inputB is the input for this sample
P=treatment ; inputP is the input for this sample
IP for B1 is unusual, could you please provide your feedback.

ADD REPLY
0
Entering edit mode
5.7 years ago
ATCG ▴ 370

enter image description here

ADD COMMENT
0
Entering edit mode

I suspect that the outlier sample has signal mostly at regions that should be blacklisted. Either way, you have one weird sample that might need to be excluded, that happens.

ADD REPLY

Login before adding your answer.

Traffic: 1602 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6