I am trying to trim paired-end Illumina RNA-seq data that I downloaded from NCBI in SRA format. I converted the .sra file into two FASTQ files (*.sra_1.fastq and *.sra_2.fastq) using fastq-dump. Since I do not know which adapters were used for this dataset, I first tried to identify them in two ways.
First, I ran FastQC on both files. The overrepresented sequences in *.sra_1.fastq were:
Sequence                                            Count  Percentage           Possible Source
GATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTCAACAATCTCGTAT  55984  0.26703954769651916  TruSeq Adapter, Index 13 (97% over 40bp)
TTTTCATCTTGTCGAGTTCAGTCCTTGGCCTTTAACCGGCTCTATTGGTG  26952  0.1285590506129713   No Hit
The only overrepresented sequence in *.sra_2.fastq was the same as the second sequence from *.sra_1.fastq:
Sequence                                            Count  Percentage           Possible Source
TTTTCATCTTGTCGAGTTCAGTCCTTGGCCTTTAACCGGCTCTATTGGTG  25961  0.12383205376088408  No Hit
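For anyone who wants to cross-check these counts without FastQC, here is a rough Python sketch that tallies the most common read prefixes in a FASTQ file. It is only an approximation and not FastQC's actual algorithm; the function names (read_sequences, overrepresented) and the 50 bp prefix length are my own assumptions for illustration, and it assumes plain 4-line FASTQ records.

```python
from collections import Counter


def read_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:  # second line of every record holds the bases
            yield line.strip()


def overrepresented(lines, prefix_len=50, top=5):
    """Return (prefix, count, percentage) for the most common read prefixes.

    Assumption: reads are compared on their first prefix_len bases only.
    """
    counts = Counter()
    total = 0
    for seq in read_sequences(lines):
        counts[seq[:prefix_len]] += 1
        total += 1
    return [(seq, n, 100.0 * n / total) for seq, n in counts.most_common(top)]


# Toy in-memory FASTQ records, made up for illustration only.
demo = [
    "@r1", "ACGTACGT", "+", "IIIIIIII",
    "@r2", "ACGTACGT", "+", "IIIIIIII",
    "@r3", "TTTTCATC", "+", "IIIIIIII",
    "@r4", "ACGTACGA", "+", "IIIIIIII",
]
for seq, count, pct in overrepresented(demo, prefix_len=8, top=2):
    print(seq, count, pct)
```

With a real file, an open file handle can be passed in place of the list, since the function only iterates over lines.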
I ran blastn online on "TTTTCATCTTGTCGAGTTCAGTCCTTGGCCTTTAACCGGCTCTATTGGTG", and the best hit is a mitochondrial sequence fragment from this species. My first question: in this case (RNA-seq), should this sequence be treated as a contaminant, and should I remove it?
Second, I downloaded UniVec from NCBI (http://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/) and ran blastn with the reads from both files (*.sra_1.fastq and *.sra_2.fastq) as queries against it.
The most frequently hit sequence for *.sra_2.fastq is "gnl|uv|NGB00360.1:1-58 Illumina PCR Primer". About 80% of the alignments are at the 3' end of the reads, which is what I expected. The most frequently hit sequence for *.sra_1.fastq is "gnl|uv|NGB00859.1:1 NEBNext Index 13 Primer for Illumina", in agreement with the FastQC result. However, what surprises me is that more than 2/3 of the hits start at position 1 of the query, that is, at the 5' end of the reads. I thought the Illumina index primer should be at the 3' end of the sequenced fragment. Am I missing something here? Or could there be something wrong with these two datasets?
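To make the position check concrete, here is a minimal Python sketch of how the start position of an adapter within each read could be tallied. It uses an exact match on a short seed taken from the start of the FastQC hit above, which is a simplification: blastn tolerates mismatches and this does not. The function name adapter_start_positions, the 12-base seed length, and the toy reads are assumptions made for illustration.

```python
from collections import Counter

# First 33 bases of the overrepresented sequence reported by FastQC above.
ADAPTER = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"


def adapter_start_positions(reads, adapter, seed_len=12):
    """Count the 1-based position at which the adapter seed first occurs.

    Assumption: an exact match of the first seed_len adapter bases is
    treated as an adapter hit; reads without the seed are skipped.
    """
    seed = adapter[:seed_len]
    positions = Counter()
    for read in reads:
        idx = read.find(seed)
        if idx != -1:
            positions[idx + 1] += 1  # convert to 1-based, like BLAST query coordinates
    return positions


# Toy reads, made up for illustration only.
reads = [
    "GATCGGAAGAGCACACGTCTGAACTCC",  # adapter at the very start of the read
    "ACGTACGTACGATCGGAAGAGCACACG",  # adapter after a 10-base insert
    "TTTTCATCTTGTCGAGTTCAGTCCTTG",  # no adapter
]
for pos, n in sorted(adapter_start_positions(reads, ADAPTER).items()):
    print(pos, n)
```

A histogram like this, built from the real reads, would show directly what fraction of adapter hits begin at position 1 versus further into the read.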