Question

Rna-Seq Paired End 76 Bp Read Data From Sra To Fastq To Bam

0

Entering edit mode

10.3 years ago

dfernan ▴ 760

Hi,

I was wondering if anyone has written a wrapper to process SRA files from SRA to fastq, then do some QC on the fastq and finally map them to the hg19 genome with aligner of preference. If not, I would be more interested to hear on which steps are definitely necessary.

My Data:

rna-seq FULL RNA, paired-end, 76 bp reads. more info: http://sra.dnanexus.com/runs/SRR601549

My plan:

(1) Convert SRA to fastq using fastq-dump. Since I have never used it I am wondering what's the best way to specify the output for later processing. Since it's paired end data seems one would need the .1.fq and the .2.fq files but I am not sure that's the default.

fastq-dump --split-3 myfile.sra

What is the difference between split-3 and --split-files? each read in a separate files, this makes little to no sense, why would anyone want each read in a separate file? Any other option that is recommended for subsequent analysis?

(2) Do quality control for fastq files. Here I am confused as what to do since it's public data and I do not have the adapter's information. I have used trim_galore before so I'd like to use it again but I am not sure which settings would be the best for my data. Specially since I do not have the adapter's information, would the first 13 bases that trim galore uses by default be ok?

trim_galore -o <out_dir> -a <adapter_sequence> --clip_R1 5 --clip_R2 5 --phred33 --stringency 5 -q 20 -e 0.05 --length 38 myfile.1.fq myfile.2.fq

Here I am not sure I wrote the right command, depends on step 1, if I have two: .1.fq and .2.fq files, and also I am not sure how to input that information into trim galore.

(3) Map the data with tophat to hg19.

tophat -r 20 test_ref reads_1.fq reads_2.fq

Here I am also not sure which r parameter to use. Maybe I need to try to dig more information about the RNA library?

Any suggestion to this process would be extremely helpful since I am at the research stage on setting up a wrapper to process multiple SRA files in parallel.

Thanks!

sra bam fastq • 5.5k views

ADD COMMENT • link updated 10.3 years ago by Ashutosh Pandey 12k • written 10.3 years ago by dfernan ▴ 760

score 1 · Answer 1 · 2013-12-30

1) Use fastq-dump --split-3. It will produce three files. One with _1 reads , second with _2 reads and third with orphan reads. Orphan reads are ones that don't have a pair. It may be because of one of several reasons. Sequencer couldn't sequence one of them or it was of a bad quality. Aligners don't check for the read order in two fastq files that represent paired end reads. Aligner will throw error if the read ids are not the same for a paired end reads (except the _1 and _2 part)


SRA mentions:

 --split-files                   Dump each read into a separate file.Files will received suffix corresponding to read number
 --split-3                       Legacy 3-file splitting for mate-pairs:
                                 First 2 biological reads satisfying dumping conditions
                                 are placed in files *_1.fastq and *_2.fastq
                                 If only 1 biological read is dumpable - it is placed in *.fastq

2) I would assume that adapters have already been removed. Generally, people do some pre-processing of read data before submitting it to SRA. Read the info file that came along with the data in SRA. You can also use FastQC tool to see if a k-mer is over represented and if it occurs in the beginning or end of the reads to check for presence of adapters. I don't think just deleting the first 13 bp will help.

3) I usually go with the aligners default settings for the alignment. As most of the aligners are already optimized to work with human genomic data, there is no need to change the default settings. You will have to mention which kind of RNA-seq library was used to generate the paired-end reads. Was it strand specific or unstranded. Generally, they are unstranded. Also, you need to provide the orientation information of the reads and insert sizes. If you dont know anything, then create a small test data comprising of 20,000 paired-end reads. Align it to get the BAM file. You can know use the BAM file and generate histogram of insert size and also check the orientation. You can use this information and rerun your alignment with the original data.