I'm working with some metagenomic WGS data of induced sputum. Here's the sequencing information of the samples:
- Type of sequencer: Illumina platform
- Type of reads: Paired-end
- # of reads: ~200M per sample
- Read length: 151
- Library kit: TruSeq Nano DNA kit
I used Trimmomatic to remove adaptor sequences and low-quality ends and then used KneadData to remove the human contaminant. Since the airway is almost sterile, ~99% of reads were discarded during the process. Remaining decontaminated reads were then assembled into contigs(scaffolds) using metaSPAdes, and microbial genes were called from them using Prodigal.
The problem is:
While inspecting assembled scaffolds, I noticed that there are contigs with a bunch of weird repeating sequences. Like:
>NODE_83_length_3627_cov_6.176092 TTCCATTCCTTTATTGCGACTGGATCTCCCTCTGTAACCCAGACTGGAGTGCAGTGGCAC AATCTTAGCTCACATTTCATTTCATCGTTCCATTCCATTCCTCTCCATTCCTTTCCATTC CATTCCACTGGACTCCACTCCAATACACTCCACTCCACTGCATGCCACTCCATTACATTG AATTCCACTCGACACAACTGCATTCCCCTCCACTCCATTCCATTCCACTCCACGCCACTC CACTCCACTGAATTCCATTTCTCTCCACTACATTCCATTCCACTCCATTCCACTCCATTA CACTCCACTCCATTACACTCCACTCCATTACACTCTACTTCACTCCATTCCACTCCATTC CTTTCCATTCCAATCTGTTCCATTCCATTCTATTCCAGTTCATTCCACTACATTCCACTC CATTCCTTTCCATTCCACTCTGTTCCATTCCACTCTATTCCAATCCATTCCGCTACATTC CACTCCAGTCAACTCCACTCCATTCCACTCCACTCCATTCCACTCCATTCCTCTCCACTT
>NODE_85_length_3612_cov_9.758223 AGTATGGTTAAATAGAATGGAGTGGAATAGACTGGAATGGAGTGGAGGTGAATGGAATGG AGTGGGCTGGAATGGACTAGAAAGAAATGGAATGAAATGGAGTGGATTTGAATAGATTGG AATGGAATTGAATGGAATGGAATGAAATGGAATGGAGTAGAGTTGAATGGAGTGGAGTGT ACTGGAGTTGAGTGGAGTGGAATGGAGTGGAGTGGAATGGGGTGGAGTGGACTGGAGTTG ACCATAGTGGAGTGGAGTTTAATGGAATGGAGTAGAGTGGAATGGAGTGGAGAGGAGGGT TATGGAGTGGAGAGGAGTGGAGTTGAATGGAATGGAGTGGAATGGAGTGCCGTTGAGTGG AGTGGATTGGAGTGGATTGGAGTGCAGTGGAATGGAGTGGAGAGGAGTATAAAGGAGTGG AGTGGAGTGGAATAGAGTGGAAAGCAACGGAGTGGAATGGAATGGAATGGAGTGGAATCA AATGGAATGGAATGGAAGGGAATGGAATGGAATTTAGTGGAGTGTAGTGGAATGGAATGG AATTTAGTGGAGTGTAGTGGAATGGAATGGAAAAGAGTGGAATGGAGTGGAGAGTAGTGG
To see the proportion of these strange contigs, I calculated the fold change of the observed frequency of every possible tetranucleotide to the expected value of it (1/256 in this case) for all contig.
FC of a tetranucleotide frequency to the expected value = (# of the tetranucleotide / # of the total tetranucleotide in the sequence) * 256
The following is the histogram of top FC of each contig. It shows clear bimodality: (Peak on the left is located about 3.5, and the other, about 28~29.)
And this is the tetranucleotide FC distribution of contigs made from gut microbiome WGS data.
I did the same on the decontaminated reads, but trinucleotide instead of tetranucleotide (since the reads are only 151bp long). It also showed clear bimodality:
And reads before running KneadData seems normal, compared to the gut metagenomic reads: (Top: my trimmed reads, bottom: gut metagenomic reads)
During the manual inspection of FASTQ files, I also found strange reads that seemed to be assembled to abnormal contigs:
@ST-E00127:973:H2CWWCCX2:1:1101:16498:5634:N:0:ATTGTGAA+GCAATGCA#0/1 GTATGTAGTGTGTGAGGTGTGTGTGGTCGGTGGGTATGTAGTGTGTGTGAGGTGTGTGTGGTGTGTGTGTGTGTGGGGTGTAGGGTCTGTGGGTTTATGGTGTGTGGGGTGTGTGGGGTCTGTGGGCTTTTGGGGTGGGGGTTGTGTGTGG + -AAFFJJJAJJJJFFFJAJJJJJJJJJJ-FJJJJFAJF<AJJFFJJJAJ-----<FJJ-<7---77FJ<FF---7--77<-7-<7-7JJ-7F7-----7AFFJ-A--AJAA7AFJ7FAFFJJF-F<--7-A-F--7-<----77A<FAA-F
The question is:
Where did these strange readings come from? Is this an error in the Illumina platform itself? The quality score itself does not seem to be bad that much. Any helps would be appreciated.