Trimming adapters of Hiseq2000 pair end reads
1
0
Entering edit mode
7.9 years ago
Emma ▴ 10

Hi,

I am attempting to trim Illumina RNA-seq data (paired-end) which I downloaded from NCBI in SRA format. I have converted the .sra file into two .fastq(*.sra_1.fastq and *.sra_2.fastq) using fastQ-dump. Since I have no idea what adpters were used for this dataset, I first tried to identify adapters in two ways.

First, I applied FastQC on both files. The overrepresented sequences in *.sra_1.fastq were sequences below:

Sequence Count Percentage Possible Source GATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTCAACAATCTCGTAT 55984 0.26703954769651916 TruSeq Adapter, Index 13 (97% over 40bp) TTTTCATCTTGTCGAGTTCAGTCCTTGGCCTTTAACCGGCTCTATTGGTG 26952 0.1285590506129713 No Hit

Overrepresented sequences in *.sra_2.fastq was: Sequence Count Percentage Possible Source TTTTCATCTTGTCGAGTTCAGTCCTTGGCCTTTAACCGGCTCTATTGGTG 25961 0.12383205376088408 No Hit (same as the second sequence from *.sra_1.fastq)

I blastn "TTTTCATCTTGTCGAGTTCAGTCCTTGGCCTTTAACCGGCTCTATTGGTG" online, and the best hit is mitochondrion sequence fragement of this species. My first question is that, in this case(RNA-seq), should this sequence be treated as contaminant and should I get rid of it?

I then use download univec from NCBI(http://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/) and blastn both files(*.sra_1.fastq and *.sra_2.fastq) against it.

The most hit sequence for *.sra_2.fastq is "gnl|uv|NGB00360.1:1-58 Illumina PCR Primer". About 80% of alignments hit the 3' end of the reads, which meets my expectation. The most hit sequence for *.sra_1.fastq is "gnl|uv|NGB00859.1:1 NEBNext Index 13 Primer for Illumina", same as that of fastQC result. However, what surprise me is that more than 2/3 of total hit start on the position 1 in query, that is the 5' end of the reads. I thought the index primer of illumina should at the 3' end of the sequenced fragment. Do I miss something here? Or there may be something wrong with these two dataset?

Many thanks,

Emma

RNA-Seq adapters • 3.5k views
ADD COMMENT
0
Entering edit mode

Two different questions and it is a bit uninteresting to read it. Can you post the custom adapter sequence part as a separate question and remove it from here ?

ADD REPLY
0
Entering edit mode

Sorry for that. And I have removed the second part as you suggested.

ADD REPLY
0
Entering edit mode
7.9 years ago

My first question is that, in this case(RNA-seq), should this sequence be treated as contaminant and should I get rid of it?

This sequence is not a contaminant because all eukaryotes naturally have mitochondrial RNA/DNA. Depending on your organism, the sequence of the mitochondrial genome may be known. If it is the case, then the mitochondrial RNA is not a problem since it will just map on its own chromosome during the mapping process.

If you really want to filter out mitochondrial reads (I don't recommend it unless you have a good reason to do so), you can always use the mitochondrial chromosome to map the reads on it with bowtie and keep only the unmapped reads.

However, what surprise me is that more than 2/3 of total hit start on the position 1 in query, that is the 5' end of the reads.

You have paired-end data, so it makes sense.

edit : Now I see the problem. It is difficult to imagine how PCR primers are sequenced since amplification occurs after adapter ligation during sample prep. enter image description here

The only explaination I can come with is that there is some kind of unspecific ligation that occurs after the addition of the primers. Now I'm only speculating but with ligation between the primer and the adapters like this :

adapter---primer---adapter

Then after sequencing, the 5' end of the read, i.e, the first base to be sequenced will belong to the primer while the 3' end will extend in the adapter. It would be nice if someone could confirm this hypothesis.

ADD COMMENT
0
Entering edit mode

Thank you very much for the explanation! But I can not understand why PE sequencing will result in such situation.

It may have adapter contaminant on both end. But the contaminants should be different. In this case, it is only one adapter sequence found in the *.sra_1.fastq. I supposed that "gnl|uv|NGB00859.1:1 NEBNext Index 13 Primer for Illumina" should appear on the 3' end (or on 5' end) of at least most reads. But now 2/3 of the hit start on the 5'end, and 1/3 on the 3' end.

Thanks, Emma

ADD REPLY
0
Entering edit mode

Hi, it could be easier to explain if you showed some of the blast results.

What bug me now is your last sentence : "But now 2/3 of the hit start on the 5'end, and 1/3 on the 3' end." A blast alignment can't start at the 3' end of the reads, that would mean an alignment of 0 bases (but it could end at the 3'end) ...

ADD REPLY
0
Entering edit mode

Sorry for the confusion.

What I have is PE reads(hiseq2000) downloaded from SRA. The /path/to/*_1.fa is the first reads of PE reads.

The command I used for blast is:

blastn -query /path/to/*_1.fa -out /path/to/output_1.blast -db /path/to/UniVec -outfmt "6 qseqid qlen sallseqid slen qstart qend evalue length pident" -evalue 1e-10 -num_threads 2 -max_target_seqs 1

The result format is shown below:

head -3 late_veliger_rep1_1.blast2

HWI-ST758:138:C394CACXX:5:1101:5757:2205#0/1 101 gnl|uv|J02482.1:1-5386-49 5435 1 101 3e-49 101 100.00 HWI-ST758:138:C394CACXX:5:1101:9991:2066#0/1 101 gnl|uv|JX069764.1:9741-9848-49 157 65 100 4e-13 36 100.00 HWI-ST758:138:C394CACXX:5:1101:12710:2140#0/1 101 gnl|uv|JX069762.1:9774-9902-49 178 70 101 6e-11 32 100.00

I then tried to find out the most hit sequence: awk -F "\t" '{print $3, $4, $5, $6}' output_1.blast | sort | uniq -c| sort -nr | grep "NGB" |head -10

62660 gnl|uv|NGB00859.1:1-66 66 1 65 ##(gnl|uv|NGB00859.1:1-66 NEBNext Index 13 Primer for Illumina)

14564 gnl|uv|NGB00859.1:1-66 66 1 64

5463 gnl|uv|NGB00859.1:1-66 66 64 101

5312 gnl|uv|NGB00859.1:1-66 66 63 101

4665 gnl|uv|NGB00859.1:1-66 66 62 101

4494 gnl|uv|NGB00860.1:1-66 66 65 101

3929 gnl|uv|NGB00859.1:1-66 66 61 101

3265 gnl|uv|NGB00859.1:1-66 66 60 101

2380 gnl|uv|NGB00859.1:1-66 66 1 63

2273 gnl|uv|NGB00859.1:1-66 66 59 101

What I mean is that some alignment start at the 5' end and some at the 3' end of the query reads. As far as I am concerned, adapter contaminant shold at the 3' end of the read (at least most should at the 3' end). In other words, the start position should not be the first few bases of the reads. However, in this file, most the adapter sequence(gnl|uv|NGB00859.1:1-66) hit the 5' end of the reads.

Thanks, Emma

ADD REPLY
0
Entering edit mode

Thanks for your precisions, I edited my answer accordingly.

ADD REPLY
0
Entering edit mode

Hi all,

Sorry for the late response on this topic. In Emma's blastn results I see that the dataset has a blast hit with JX069762.1 and JX069764.1 (perhaps more which I can't see), which are Cloning vectors (pFosill-2 and -4). In my own dataset (insect RNAseq) I have a lot of hits with these cloning vectors too. Can someone explain why this might be and what it means?

Thanks..

ADD REPLY

Login before adding your answer.

Traffic: 2658 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6