Question: How to remove adapter sequences from the RNA-seq FASTQ files of a published dataset
cfarmeri (Japan) wrote, 3.9 years ago:

Hi, I'm an undergraduate student. Please help me.

I want to generate mapped BAM files from an SRA dataset at NCBI in order to analyze the heterogeneity of mouse ESCs.

The dataset (GSE60749) comes from the paper below.

Roshan M. Kumar et al. Deconstructing transcriptional heterogeneity in pluripotent stem cells. Nature (2014)

Please tell me how to find the adapter sequences used in this experiment and what I should use for quality control (PRINSEQ? ShortRead?).

For example, I want to try the SRA file for GEO sample GSM1486817.

(http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1486817)

Can anyone walk me through the quality-control process for this SRA file?

Thank you for reading this through.

Any help will be appreciated.

iraun (Norway) wrote, 3.9 years ago:

1. Download data

2. Convert from .sra format to .fastq with the SRA Toolkit: http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software.

3. Check the quality using the FastQC package: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
 In the generated report, you will see whether your sequences contain adapters and, if so, their names.

4. Remove the adapters and low-quality sequences using:

       ... and many other tools. I would suggest Trimmomatic.
 

These are the general steps you should follow to perform quality control on the raw data.
Hope it helps.
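
As a rough illustration of step 4, here is a minimal Python sketch of what trimming tools do conceptually: cut a known 3' adapter, then drop low-quality bases from the end of the read. The function names, the `min_overlap` and `min_q` defaults, and the Phred+33 encoding are assumptions made for illustration; this is not Trimmomatic's actual algorithm, which also tolerates mismatches. The example adapter is the sequence commonly listed for Nextera libraries; treat it as an assumption and confirm it against Illumina's adapter documentation.

```python
# Assumption: sequence commonly cited for Nextera read-through; verify before use.
EXAMPLE_ADAPTER = "CTGTCTCTTATACACATCT"


def trim_adapter(seq, qual, adapter, min_overlap=3):
    """Cut the read at a full adapter match, or at an adapter prefix
    hanging off the 3' end (exact matching only)."""
    pos = seq.find(adapter)
    if pos == -1:
        # Try progressively shorter adapter prefixes at the very end of the read.
        longest = min(len(adapter) - 1, len(seq))
        for k in range(longest, min_overlap - 1, -1):
            if seq.endswith(adapter[:k]):
                pos = len(seq) - k
                break
    if pos == -1:
        return seq, qual  # no adapter found, read unchanged
    return seq[:pos], qual[:pos]


def trim_low_quality_tail(seq, qual, min_q=20, offset=33):
    """Remove trailing bases whose Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


if __name__ == "__main__":
    read = "ACGTACGTAC" + EXAMPLE_ADAPTER[:8]
    quals = "I" * len(read)
    print(trim_adapter(read, quals, EXAMPLE_ADAPTER))
    print(trim_low_quality_tail("ACGTAC", "IIII##"))
```

For real data a dedicated trimmer is still the better choice; the sketch only shows what "removing adapters and bad quality sequences" means at the level of a single read.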


You can also reproduce all these steps on the Genestack platform:

1. Import data. Genestack will automatically recognise the file format during import.
2. Run the Raw Reads QC report app, which is based on the FastQC and PRINSEQ tools.
3. Run the Trim adaptors and contaminants app. It's based on fastq-mcf and uses a list of universal adaptors (about 300 sequences).

There are also other preprocessing apps that you can use to improve the quality of your data.

(written 3.9 years ago by Evgeniia Golovina)

Actually, using FastQC is not recommended for finding adapters, for the simple reason that it cannot find adapters it does not know a priori (i.e. adapters that are not in its database). Therefore, if GSE60749 uses adapters that FastQC does not know, FastQC will not find them.

(written 3.9 years ago by enxxx23)

Obviously, if those adapters are not known, FastQC will not recognise them. But in general, well-known adapters are used in the majority of experiments. FastQC reports as over-represented every sequence that is repeated more than X times. These sequences could be adapters, contaminants, polyA... If they are adapters and they are present in the FastQC database, you'll get a "tag" indicating it. If a sequence isn't in the database, you'll still get the sequence but without a "tag"; that is, you won't know whether it is an adapter or another type of repeated sequence. Just try it and see what happens.
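
To make the "over-represented sequence" idea concrete, here is a small Python sketch that counts identical read prefixes and reports those above a chosen fraction of all reads. The function name, the `min_fraction` default and the `prefix_len` truncation are assumptions for illustration; FastQC's own thresholds and sampling may differ.

```python
from collections import Counter


def overrepresented(reads, min_fraction=0.001, prefix_len=50):
    """Return (sequence, count) pairs whose share of all reads
    exceeds min_fraction, most frequent first."""
    total = len(reads)
    if total == 0:
        return []
    # Compare reads on a fixed-length prefix so long reads can still collide.
    counts = Counter(read[:prefix_len] for read in reads)
    return [(seq, n) for seq, n in counts.most_common() if n / total > min_fraction]


if __name__ == "__main__":
    toy_reads = ["AAAA"] * 5 + ["CCCC"] * 3 + ["GGGG", "TTTT"]
    for seq, n in overrepresented(toy_reads, min_fraction=0.25):
        print(seq, n)
```

A sequence flagged this way is only "frequent"; deciding whether it is an adapter, a contaminant or a highly expressed transcript still requires comparing it against known adapter lists or the reference.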

(written 3.9 years ago by iraun)

Actually, our experience is that FastQC fails to find adapters in most cases.

Our experience is also that in 99% of cases researchers do not validate FastQC's results and blindly trust what it reports about adapters.

(written 3.9 years ago by enxxx23)

Incidentally, BBMerge is able to find unknown adapters, if the reads are paired:

bbmerge.sh in=reads.fq outadapter=adapters.fa reads=1m
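
The reason paired reads make this possible can be sketched in a few lines of Python. When the insert is shorter than the read length, both mates read through the insert into the adapter, so the first N bases of read 1 equal the reverse complement of the first N bases of read 2, and whatever follows is adapter. The code below is a simplified illustration of that idea only: the function names and the `min_insert` and `k` parameters are hypothetical, matching is exact, and BBMerge's real implementation is more sophisticated.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def adapter_tails(r1, r2, min_insert=10):
    """If the pair looks like a short insert, return the bases that
    follow the insert in each mate; otherwise return None."""
    longest = min(len(r1), len(r2)) - 1
    for insert_len in range(longest, min_insert - 1, -1):
        if r1[:insert_len] == revcomp(r2[:insert_len]):
            return r1[insert_len:], r2[insert_len:]
    return None


def guess_adapter(pairs, min_insert=10, k=12):
    """Vote on the first k bases of the read-1 tails across all pairs."""
    votes = Counter()
    for r1, r2 in pairs:
        tails = adapter_tails(r1, r2, min_insert)
        if tails and len(tails[0]) >= k:
            votes[tails[0][:k]] += 1
    return votes.most_common(1)[0][0] if votes else None
```

With single-end reads there is no second mate to compare against, which is why this approach needs paired data.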

(written 3.9 years ago by Brian Bushnell)

Indeed, BBMerge is able to find unknown adapters. I use it myself all the time. Another one is http://goo.gl/HpkXv0

There are actually even more tools than these for finding unknown adapters.

(written 3.9 years ago by enxxx23)

Thank you so much everyone!!

To Evgeniia Golovina

I didn't know about the Genestack platform. It may make my analysis much smoother,
since I can skip installing several apps for RNA-seq analysis.

To enxxx23

Your advice is very helpful for me. I would like to try another way.

To iraun

I got some over-represented sequences from the FastQC report.
The same over-represented sequences appear in several samples.

I guess they are adapters, but I can't find the "tag" you mentioned...

To Brian Bushnell

Thanks!! I want to try BBMerge right now.

(written 3.9 years ago by cfarmeri)
h.mon (Brazil) wrote, 3.9 years ago:

You can find the adapters used by reading a bit and googling around. "Nextera XT DNA Sample preparation reagents (Illumina)" were used to prepare the samples (as found here under "study summary", or here under "library").

You can check for adapter contamination with FastQC.

Edit: if the downstream analysis is mapping with BWA, Bowtie2 or any mapper that performs local alignment, adapters should not have a major impact. Besides, the reads pointed to in the question are 25+25 bp, which should be shorter than the typical Nextera insert size, so adapter contamination should be really low.
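
The read-length argument is simple arithmetic: a read only runs into adapter when it is longer than the fragment (insert) it was sequenced from. A tiny Python sketch, with hypothetical function names and made-up insert sizes used purely for illustration:

```python
def adapter_bases_in_read(read_len, insert_len):
    """Number of bases at the 3' end of a read that come from the adapter."""
    return max(0, read_len - insert_len)


def fraction_with_adapter(read_len, insert_lengths):
    """Share of fragments short enough for the read to reach the adapter."""
    if not insert_lengths:
        return 0.0
    hits = sum(1 for n in insert_lengths if n < read_len)
    return hits / len(insert_lengths)


if __name__ == "__main__":
    # Made-up insert sizes; real values come from the library's size distribution.
    example_inserts = [20, 25, 150, 300]
    print(adapter_bases_in_read(25, 300))
    print(fraction_with_adapter(25, example_inserts))
```

With 25 bp reads, only fragments shorter than 25 bp contribute any adapter sequence, which is why contamination is expected to be low here.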


Thanks!!

Your answer gave me a hint!

I could not find the adapter sequences by following this advice, but I will keep searching!!

(I understand that "Nextera XT DNA Sample preparation reagents (Illumina)" were used in this experiment, and I searched the Illumina website for the adapter sequences. However, I still cannot find them...)

(written 3.9 years ago by cfarmeri)