over presented kmer in fastq
2
1
Entering edit mode
5.1 years ago
Sam ▴ 150

Hi Guys

according fasqc result I have some over presented kmers in the position 25 to 27 of my adapter removed small RNAs for example :

GGCACCA 1885(cont)  0.0(pvalue) 2368.6978(obs/exp max)  27(max obs/exp position)


Should I remove these Kmers ?

fastqc RNA-Seq • 2.5k views
0
Entering edit mode

What is your query sequence? DNA or RNA seq?

0
Entering edit mode

Original post already says that this is smallRNA seq sample.

0
Entering edit mode

it's smallRNA-seq data and already converted to DNA format

3
Entering edit mode
5.1 years ago
GenoMax 122k

Don't do anything. Unless downstream analysis indicate problems. Are you sure you have completely removed all adapter sequence (and this k-mer is not a remnant from that). It is difficult to predict why these "errors" show up. Also keep in mind that k-mer module in FastQC only tracks 2% of data (1 in 50 sequences).

0
Entering edit mode

adapter sequence removed by company, fastqc reported more than 10 overpresented Kmer.

0
Entering edit mode

2
Entering edit mode

If your sequence provider has trimmed the RNA adapter then they may have already trimmed other contaminants too. You can always verify that using bbduk.sh from BBMap suite. If you are familiar with trimmomatic then that will work as well. Small RNA alignments need special care (i.e. ungapped alignments etc). How are you planning to do the downstream analysis?

0
Entering edit mode

 java -jar trimmomatic-0.36.jar SE -phred33 my_data.fq my_data_test.fq ILLUMINACLIP:final_adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:17
result :
Input Reads: 17352297 Surviving: 17348598 (99.98%) Dropped: 3699 (0.02%)
TrimmomaticSE: Completed successfull

0
Entering edit mode

Looks like the data was mostly clean. You lost a few reads but that should be fine.

Is there a reason you are removing additional 3 bases from front and end of each read? Is that recommended for your small RNA kit?

0
Entering edit mode

just to remove low quality base from front and end , however with and without this criteria I have same result.

1
Entering edit mode

See how that affects alignment. You may need to reconsider that if the alignments are affected.

0
Entering edit mode
5.1 years ago
h.mon 34k

Do you have a good reference genome? Map the data to it and check for mismatches by position - if these later kmers are adapter remnants (or other artifact), you will get a much greater mismatch rate for them. You can use bbmap to map and output a mismatch histogram:

bbmap.sh in=file.fastq out=map.bam mhist=mismatch.txt


I never quantified, but I always had the impression Illumina's Real Time Analysis (RTA) adapter removal is not very sensitive and leaves some left-over adapter bases. If the company removed the adapters, they probably used RTA. If the company adapter removal is not complete and leaves a couple of bases behind, then these remnants will be too short to be picked up by most analysis tools.

0
Entering edit mode

I aligned with bbmap the mismatch out put is : (base 1 to 33 ) also I aligned them with no mismatch ( -v 0 ) via Bowtie but about 12 % of sequences did not match with genome.

 #BaseNum   Match1  Sub1    Del1    Ins1    N1  Other1
1   0.97613 0.02377 0.00000 0.00000 0.00000 0.00010
2   0.99235 0.00757 0.00000 0.00000 0.00000 0.00008
3   0.99418 0.00571 0.00000 0.00006 0.00000 0.00005
4   0.99497 0.00488 0.00003 0.00010 0.00000 0.00004
5   0.99572 0.00415 0.00006 0.00010 0.00000 0.00003
6   0.99549 0.00441 0.00009 0.00009 0.00000 0.00002
7   0.99600 0.00392 0.00009 0.00007 0.00000 0.00001
8   0.99673 0.00322 0.00011 0.00005 0.00000 0.00000
9   0.99735 0.00260 0.00010 0.00004 0.00000 0.00000
10  0.99840 0.00157 0.00006 0.00003 0.00000 0.00000
11  0.99867 0.00130 0.00005 0.00003 0.00000 0.00000
12  0.99950 0.00048 0.00004 0.00002 0.00000 0.00000
13  0.99964 0.00034 0.00003 0.00002 0.00000 0.00000
14  0.99722 0.00274 0.00007 0.00003 0.00000 0.00000
15  0.99686 0.00310 0.00007 0.00004 0.00000 0.00000
16  0.99647 0.00348 0.00007 0.00005 0.00000 0.00000
17  0.99490 0.00503 0.00013 0.00006 0.00000 0.00001
18  0.99437 0.00555 0.00012 0.00006 0.00000 0.00002
19  0.99324 0.00666 0.00009 0.00007 0.00000 0.00003
20  0.99082 0.00908 0.00009 0.00006 0.00000 0.00004
21  0.99007 0.00980 0.00005 0.00008 0.00000 0.00006
22  0.98878 0.01107 0.00005 0.00008 0.00000 0.00008
23  0.97457 0.02529 0.00015 0.00006 0.00000 0.00008
24  0.97647 0.02339 0.00006 0.00006 0.00000 0.00008
25  0.93083 0.06826 0.00055 0.00079 0.00000 0.00012
26  0.94358 0.05533 0.00129 0.00098 0.00000 0.00011
27  0.93846 0.06058 0.00027 0.00084 0.00000 0.00012
28  0.93580 0.06357 0.00012 0.00055 0.00000 0.00009
29  0.93180 0.06781 0.00013 0.00032 0.00000 0.00008
30  0.93132 0.06841 0.00003 0.00017 0.00000 0.00010
31  0.89765 0.10215 0.00001 0.00010 0.00000 0.00010
32  0.87379 0.12611 0.00000 0.00000 0.00000 0.00010
33  0.82169 0.17819 0.00000 0.00000 0.00000 0.00012

0
Entering edit mode

What happened to bases 1-19?

Anyway, this does indicate a very high presence of adapter sequence. I suggest that you start with the raw data and remove the adapters yourself, as it looks like it was done incorrectly.

0
Entering edit mode

This is small RNA seq data so we should be careful here. In my experience every smallRNA kit has some odd requirements and instructions to process the data in a certain way. @sam : If you are only doing the analysis you should talk with whoever made the libraries to inquire about kit etc.

0
Entering edit mode

they told me it's clean reads data and ready for downstream analysis.

0
Entering edit mode

but I think problem is detection just 5-8 bases regarding to adapter remnants , trimmomatic could not detect these remnants, could you introduce other script for accurate detection of remnants ?

1
Entering edit mode

Using BBMap:

bbduk.sh in=reads.fq out=trimmed.fq ref=adapters.fa ktrim=r k=15 mink=3 hdist=1 hdist2=0


As for what you should use for "adapters.fa"... well, the actual adapter sequence would be best, for maximal precision. Alternatively you can use BBMap's included "adapters.fa" file in /bbmap/resources/adapters.fa but that contains other adapters as well and so is less precise and not optimal when you are trying to trim very small portions of adapter sequence (mink=2 means it will trim as little as 3bp of adapter sequence, if they match the reference perfectly). So it's best to find out the actual adapter sequences used.

0
Entering edit mode

I got exact adapter sequence and trimmed file with bbduk.sh but again fastqc shows that overpresented kmers. here is report of trimming:

Input:                      17501242 reads      404848372 bases.
KTrimmed:                   1419206 reads (8.11%)   4785530 bases (1.18%)
Total Removed:              39 reads (0.00%)    4785530 bases (1.18%)
Result:                     17501203 reads (100.00%)    400062842 bases (98.82%

0
Entering edit mode

You are not following my suggestion from yesterday. Move forward with the rest of the analysis and worry about k-mers later .. if you are not able to get good alignments.

0
Entering edit mode

Problem is exactly during alignment with genome because more than 13 % of sequences could not match with genome and probably this huge amount is due to artifacts

2
Entering edit mode

You are never going to get 100% alignment with any NGS dataset. Having 13% unmapped sequences is not a show stopper. What smallRNA pipeline are you using for the analysis? Are you using ungapped alignments (are you mainly looking for miRNA?)

0
Entering edit mode

I used Bowtie , I'm looking for miRNAs and siRNA , please check bbmap mismatch report , this mismatch rate could produce false positive /negative result for expression analysis?

0
Entering edit mode

@Brian has recommended the following settings in past for miRNA (or in general for small reads) mapping with bbmap.sh. This will report only perfect matches which is what you would want.

ambig=all vslow perfectmode maxsites=1000 (@Brian had added this: vslow mainly removes masking of low-complexity repetitive kmers, which is not usually a problem but can be with extremely short sequences like microRNAs)

0
Entering edit mode

regarding to bbduk.sh result , I used one adapter sequence with 13 base length as "ref" and 4785530 bases were removed, is it confirm present of adapter remnants ?

0
Entering edit mode

Let us say they were the remnants of adapters (assuming bbduk.sh did its job right). Now try mapping with bbmap.sh with the command options above.

You may also want to look miRquant pipeline for this purpose.

0
Entering edit mode

What are your views on bbmap report?

Read 1 data:        pct reads      num reads        pct bases      num bases

mapped:              88.3481%    15462013    88.2010%      357080427
unambiguous:         57.4457%    10053709    57.7120%      233646092
ambiguous:           30.9024%     5408304    30.4890%      123434335
low-Q discards:       0.0051%         885     0.0050%          20257

perfect best site:   88.3481%    15462013    88.2010%      357080427
semiperfect site:    88.3481%    15462013    88.2010%      357080427

1
Entering edit mode

0
Entering edit mode

bbmap lunched with these flags

ambig=all vslow perfectmode maxsites=1000

0
Entering edit mode

Also add k=9 to those options and re-run. Let us see what we get. Read length of ~ 21 bp is spot on for miRNA.

0
Entering edit mode

for huge genome K=9 is quite time-consuming and seems should be run in a cluster, is it wise to lunch it with previous flags and then just use map.bam sequence for miRNAs expression analysis ?

0
Entering edit mode

BBMap is multi-threaded. If you have access to a cluster that should speed things up. Are you adding threads=n (use an n that makes sense)?

0
Entering edit mode

unfortunately I don't have access to cluster and lunched it in my pc

0
Entering edit mode

Use as many cores as your PC has then. Hopefully you have enough memory (at least 10G free?). You could also align a million reads to begin with to see if alignments improve. Add reads=1000000 option.

0
Entering edit mode

here is result of mapping with flag reads=1000000 k=9 , seems we have similar out put compare to previous command

Read 1 data:        pct reads   num reads   pct bases      num bases

mapped:              88.3973%      883973    88.2497%       20413459
unambiguous:         57.4974%      574974    57.7622%       13361243
ambiguous:           30.8999%      308999    30.4875%        7052216
low-Q discards:       0.0024%          24     0.0024%            557

perfect best site:   88.3973%      883973    88.2497%       20413459
semiperfect site:    88.3973%      883973    88.2497%       20413459

0
Entering edit mode

If you are convinced that this is the best result you are going to get from this dataset then you can now start the actual process of figuring out/counting miRNA's.

0
Entering edit mode

regarding to mapping out put , no need to to use k=9 , right ?

0
Entering edit mode

Yep, it doesn't look like it improved things noticeably, so don't bother. As you noted, k=9 is very slow with a large genome.

0
Entering edit mode

It's possible that you have the wrong (or incomplete) adapter sequences. I searched for "GGCACCA" and it was not present in my adapters file (I did not look for the reverse-complement), though a substring (GCACC) does in fact hit "RNA_Adapter_(RA3)_part_#_15013207". You could, of course, trim the specific overrepresented sequence:

bbduk.sh in=reads.fq out=trimmed.fq literal=GGCACCA ktrim=r k=7 mink=3 hdist=0 hdist2=0


That will guarantee that you pass FastQC's test, with the disadvantage of killing any reads that contain that heptamer. It won't occur very often by chance, though - 4^7 = 16384. Assuming your small RNAs are ~20bp long, roughly 1/820 will be unnecessarily truncated due to this preprocessing, if their sequence is random.

If it were me, I'd grep reads containing "GGCACCA" to determine the actual adapter sequence manually. There is zero possibility that the BBDuk command I suggested previously would fail to eliminate adapter sequences if the adapter sequence given is correct. However, if it is off by even 1 base (for example, "TGGCACC" versus "GGCACCA") then trimming will not be performed correctly. It needs to be exact. Perhaps you could post the adapter sequences you were told were used?

0
Entering edit mode

but I dont have only one kmer , should I do it for all ?

fastqc report for all kmer :

GGCACCA 1595    0.0 2230.5881   27
CGCCATG 235 0.0 1966.1718   27
TCCCGCC 745 0.0 1748.2594   26
TCGGCTG 1210    0.0 1565.6252   27
CGGCACC 595 0.0 1519.4214   26
GGTCGGC 765 0.0 1481.3131   25
ACCCGTT 415 0.0 1224.7118   27
ACCGCCC 670 0.0 1103.4039   27
GCACCGC 285 0.0 1069.2186   25
AATTGCC 1335    0.0 1067.4435   26
CCGCTCT 415 0.0 1002.0369   27
GCGGCAC 645 0.0 900.599 25
CGGTCGG 945 0.0 898.35223   24
CCTAGCC 345 0.0 855.6654    25
TTCCCGC 1360    0.0 826.2367    25
CCACACC 410 0.0 788.8666    27
CACCCGT 255 0.0 781.17145   26
TATCTGG 4125    0.0 777.978 25
GCACCCG 260 0.0 732.5175    25
TTCCACG 1585    0.0 725.06213   26

0
Entering edit mode

but I dont have only one kmer , should I do it for all ?

No... I didn't realize that was the case. Why did you post "GGCACCA" when the most common kmer is "TATCTGG"? That's very misleading.

Edit - actually, I have no idea what those columns mean; I am just assuming column 2 is the raw count.

Since you just have a bunch of random kmers there, none of which match known adapter sequences, I would go with Genomax's advice and just ignore it.

0
Entering edit mode

I think fasqc sort kmer according Obs/Exp Max , and GGCACCA is the 1st one in fastqc report. title of each column is

GGCACCA 1885(cont)  0.0(pvalue) 2368.6978(obs/exp max)  27(max obs/exp position)