Dear Biostars Leaders,
I am a bioinformatician, and I have received raw data (BCL files) for ~300 single-cell RNA-Seq samples from a biologist in our lab. I ran bcl2fastq and then ran FastQC on the resulting FASTQ files. In the "Overrepresented sequences" module of the FastQC output, about half of my samples' FASTQ files (~160) are flagged with a WARNING because of poly-T tail sequence(s) and other Clontech sequences. Do I need to trim these sequences before the alignment step? I would appreciate any advice. Other FastQC metrics, such as Basic Statistics, Adapter Content, and Per base sequence quality, PASSED for all of my samples.
The single-cell RNA-Seq was performed using Takara Bio/Clontech's SMART-Seq v4 Ultra Low Input RNA Kit chemistry, and the samples were indexed with Illumina's Nextera XT adapters and index sequences. I made sure to put the adapter sequence and sample indexes (i5 and i7) in the sample sheet that was given as input to bcl2fastq. I used the default bcl2fastq settings, and I believe it performed adapter trimming and demultiplexing automatically.
"GTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT" is the top Poly-T sequence Overrepresented that is present in at least 160 sample fastq files, and FastQC reports its Possible Source as "No Hit". I did a simple google search for this above mentioned Poly-T sequence, and other people seem to have observed the same. Should I ignore this ? or remove/trim this sequence :
http://single-cell.clst.riken.jp/fastqc/GSE68981_QC/SRR2031413_2_fastqc.html
http://waxmanlabvm.bu.edu/waxmanlab/FASTQC/SRR/SRR6576929_1_fastqc.html
Other poly-T tail sequences are reported by FastQC at lower frequencies. Other overrepresented sequences are annotated as “Clontech SMARTer…” or “Clontech Universal Primer Mix…”.
Thanks,
GSR
Here is the complete list of 29 "Overrepresented Sequences" reported for my FASTQ files. Please advise me on how to proceed:
"Overrepresented_Sequence" \t "Possible_Source" \t "Affected_Samples_Count"
GTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT No Hit 160
TATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT No Hit 19
GGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT No Hit 11
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG No Hit 3
GTACATGGGAAGCAGTGGTATCAACGCAGAGTACATGGGAAGCAGTGGTA Clontech SMARTer II A Oligonucleotide (100% over 25bp) 2
AAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAA No Hit 1
ACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT No Hit 1
CAACAACAACAACAACAACAACAACAACAACAACAACAACAACAACAACA No Hit 1
ATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT No Hit 1
GTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTG No Hit 1
TATCAACGCAGAGTACATGGGAAGCAGTGGTATCAACGCAGAGTACATGG Clontech Universal Primer Mix Long (96% over 26bp) 1
ACGCAGAGTACATGGGAAGCAGTGGTATCAACGCAGAGTACATGGGAAGC Clontech Universal Primer Mix Long (96% over 26bp) 1
TTGTTGTTGTTGTTGTTGTTGTTGTTGTTGTTGTTGTTGTTGTTGTTGTT No Hit 1
GGTATCAACGCAGAGTACATGGGAAGCAGTGGTATCAACGCAGAGTACAT Clontech SMARTer II A Oligonucleotide (100% over 25bp) 1
CTTCTTCTTCTTCTTCTTCTTCTTCTTCTTCTTCTTCTTCTTCTTCTTCT No Hit 1
TATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTA No Hit 1
GAGTACATGGGAAGCAGTGGTATCAACGCAGAGTACATGGGAAGCAGTGG Clontech Universal Primer Mix Long (96% over 26bp) 1
CCCATGTACTCTGCGTTGATACCACTGCTTCCCATGTACTCTGCGTTGAT Clontech Universal Primer Mix Long (96% over 26bp) 1
GTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTA No Hit 1
GTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAA No Hit 1
GTATCAACGCAGAGTACATGGGAAGCAGTGGTATCAACGCAGAGTACATG Clontech SMARTer II A Oligonucleotide (100% over 25bp) 1
TTCTTCTTCTTCTTCTTCTTCTTCTTCTTCTTCTTCTTCTTCTTCTTCTT No Hit 1
GAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGA No Hit 1
GTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT No Hit 1
TCTTCTTCTTCTTCTTCTTCTTCTTCTTCTTCTTCTTCTTCTTCTTCTTC No Hit 1
AGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAG No Hit 1
TATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAA No Hit 1
GTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAAA No Hit 1
GGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTA No Hit 1
While many aligners will handle these oddities, you may want to scan and re-trim the data (even though bcl2fastq did it).
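If you want to scan for and trim these reads yourself, here is a minimal Python sketch. It assumes the contaminant is a suffix of the SMART primer (AAGCAGTGGTATCAACGCAGAGTAC, which is visible inside the Clontech hits in the list above) sitting at the start of the read and followed by a poly-T run. The thresholds (`MIN_POLY_T`, `MIN_SUFFIX`, `min_len`) and all function names are made up for illustration, not taken from any kit manual. It treats reads as single-end; for paired-end data you would need to keep mates in sync, and a dedicated trimmer such as cutadapt is probably the better choice.

```python
import gzip
import re

# Assumption: primer sequence as it appears inside the overrepresented
# sequences listed in the question; confirm against the kit documentation.
SMART_PRIMER = "AAGCAGTGGTATCAACGCAGAGTAC"
MIN_POLY_T = 10   # hypothetical: minimum poly-T run that counts as a tail
MIN_SUFFIX = 3    # hypothetical: shortest primer suffix accepted at read start


def build_pattern(primer=SMART_PRIMER, min_t=MIN_POLY_T, min_suffix=MIN_SUFFIX):
    """Regex: any primer suffix at the read start, followed by a poly-T run."""
    suffixes = [primer[i:] for i in range(len(primer) - min_suffix + 1)]
    alternatives = "|".join(suffixes)  # longest suffix is tried first
    return re.compile(rf"^(?:{alternatives})T{{{min_t},}}")


PATTERN = build_pattern()


def trim_read(seq, qual, pattern=PATTERN):
    """Strip a leading primer + poly-T match from the sequence and qualities."""
    match = pattern.match(seq)
    if match is None:
        return seq, qual
    return seq[match.end():], qual[match.end():]


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def trim_fastq(in_path, out_path, min_len=20):
    """Trim every read; drop reads shorter than min_len after trimming."""
    opener = gzip.open if in_path.endswith(".gz") else open
    stats = {"kept": 0, "trimmed": 0, "dropped": 0}
    with opener(in_path, "rt") as fin, open(out_path, "w") as fout:
        for header, seq, qual in read_fastq(fin):
            new_seq, new_qual = trim_read(seq, qual)
            if len(new_seq) != len(seq):
                stats["trimmed"] += 1
            if len(new_seq) < min_len:
                stats["dropped"] += 1
                continue
            fout.write(f"{header}\n{new_seq}\n+\n{new_qual}\n")
            stats["kept"] += 1
    return stats
```

Calling `trim_fastq("sample_R1.fastq.gz", "sample_R1.trimmed.fastq")` (file names are placeholders) returns the kept/trimmed/dropped counts, which gives a quick idea of how much of each sample is affected before you decide whether trimming is worth it.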
Did you look at Appendix C in Takara's manual for this kit? It has instructions on what you need to do.