Hi all,
I'm working with some metagenomic WGS data of induced sputum. Here's the sequencing information of the samples:
- Type of sequencer: Illumina platform
- Type of reads: Paired-end
- # of reads: ~200M per sample
- Read length: 151
- Library kit: TruSeq Nano DNA kit
I used Trimmomatic to remove adaptor sequences and low-quality ends and then used KneadData to remove the human contaminant. Since the airway is almost sterile, ~99% of reads were discarded during the process. Remaining decontaminated reads were then assembled into contigs(scaffolds) using metaSPAdes, and microbial genes were called from them using Prodigal.
The problem is:
While inspecting assembled scaffolds, I noticed that there are contigs with a bunch of weird repeating sequences. Like:
>NODE_83_length_3627_cov_6.176092
TTCCATTCCTTTATTGCGACTGGATCTCCCTCTGTAACCCAGACTGGAGTGCAGTGGCAC
AATCTTAGCTCACATTTCATTTCATCGTTCCATTCCATTCCTCTCCATTCCTTTCCATTC
CATTCCACTGGACTCCACTCCAATACACTCCACTCCACTGCATGCCACTCCATTACATTG
AATTCCACTCGACACAACTGCATTCCCCTCCACTCCATTCCATTCCACTCCACGCCACTC
CACTCCACTGAATTCCATTTCTCTCCACTACATTCCATTCCACTCCATTCCACTCCATTA
CACTCCACTCCATTACACTCCACTCCATTACACTCTACTTCACTCCATTCCACTCCATTC
CTTTCCATTCCAATCTGTTCCATTCCATTCTATTCCAGTTCATTCCACTACATTCCACTC
CATTCCTTTCCATTCCACTCTGTTCCATTCCACTCTATTCCAATCCATTCCGCTACATTC
CACTCCAGTCAACTCCACTCCATTCCACTCCACTCCATTCCACTCCATTCCTCTCCACTT
or like:
>NODE_85_length_3612_cov_9.758223
AGTATGGTTAAATAGAATGGAGTGGAATAGACTGGAATGGAGTGGAGGTGAATGGAATGG
AGTGGGCTGGAATGGACTAGAAAGAAATGGAATGAAATGGAGTGGATTTGAATAGATTGG
AATGGAATTGAATGGAATGGAATGAAATGGAATGGAGTAGAGTTGAATGGAGTGGAGTGT
ACTGGAGTTGAGTGGAGTGGAATGGAGTGGAGTGGAATGGGGTGGAGTGGACTGGAGTTG
ACCATAGTGGAGTGGAGTTTAATGGAATGGAGTAGAGTGGAATGGAGTGGAGAGGAGGGT
TATGGAGTGGAGAGGAGTGGAGTTGAATGGAATGGAGTGGAATGGAGTGCCGTTGAGTGG
AGTGGATTGGAGTGGATTGGAGTGCAGTGGAATGGAGTGGAGAGGAGTATAAAGGAGTGG
AGTGGAGTGGAATAGAGTGGAAAGCAACGGAGTGGAATGGAATGGAATGGAGTGGAATCA
AATGGAATGGAATGGAAGGGAATGGAATGGAATTTAGTGGAGTGTAGTGGAATGGAATGG
AATTTAGTGGAGTGTAGTGGAATGGAATGGAAAAGAGTGGAATGGAGTGGAGAGTAGTGG
To see the proportion of these strange contigs, I calculated the fold change of the observed frequency of every possible tetranucleotide to the expected value of it (1/256 in this case) for all contig.
FC of a tetranucleotide frequency to the expected value =
(# of the tetranucleotide / # of the total tetranucleotide in the sequence) * 256
The following is the histogram of top FC of each contig. It shows clear bimodality: (Peak on the left is located about 3.5, and the other, about 28~29.)
And this is the tetranucleotide FC distribution of contigs made from gut microbiome WGS data.
I did the same on the decontaminated reads, but trinucleotide instead of tetranucleotide (since the reads are only 151bp long). It also showed clear bimodality:
And reads before running KneadData seems normal, compared to the gut metagenomic reads: (Top: my trimmed reads, bottom: gut metagenomic reads)
During the manual inspection of FASTQ files, I also found strange reads that seemed to be assembled to abnormal contigs:
@ST-E00127:973:H2CWWCCX2:1:1101:16498:5634:N:0:ATTGTGAA+GCAATGCA#0/1
GTATGTAGTGTGTGAGGTGTGTGTGGTCGGTGGGTATGTAGTGTGTGTGAGGTGTGTGTGGTGTGTGTGTGTGTGGGGTGTAGGGTCTGTGGGTTTATGGTGTGTGGGGTGTGTGGGGTCTGTGGGCTTTTGGGGTGGGGGTTGTGTGTGG
+
-AAFFJJJAJJJJFFFJAJJJJJJJJJJ-FJJJJFAJF<AJJFFJJJAJ-----<FJJ-<7---77FJ<FF---7--77<-7-<7-7JJ-7F7-----7AFFJ-A--AJAA7AFJ7FAFFJJF-F<--7-A-F--7-<----77A<FAA-F
The question is:
Where did these strange readings come from? Is this an error in the Illumina platform itself? The quality score itself does not seem to be bad that much. Any helps would be appreciated.
--
Repeats are very common in genomes, and therefore it is not surprising to see them. My guess is that they are in part from human DNA. You could try RepeatMasker or Dfam even though there is nothing coming up for your examples.
I'll try it. Thanks, Mr. Dondrup.