I was wondering whether anyone has written a wrapper that converts SRA files to fastq, runs some QC on the fastq files, and finally maps the reads to the hg19 genome with an aligner of choice. If not, I would be interested to hear which steps are definitely necessary.
Data: RNA-seq (full RNA), paired-end, 76 bp reads. More info: http://sra.dnanexus.com/runs/SRR601549
(1) Convert SRA to fastq using fastq-dump. I have never used it, so I am wondering what the best way is to specify the output for later processing. Since this is paired-end data, it seems one would need the .1.fq and .2.fq files, but I am not sure that is the default.
fastq-dump --split-3 myfile.sra
What is the difference between --split-3 and --split-files? The latter is described as putting each read in a separate file, which makes little sense to me: why would anyone want each read in a separate file? Is any other option recommended for subsequent analysis?
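As far as I understand the fastq-dump documentation, "each read" refers to each read of a spot (mate 1, mate 2), not to each individual sequence. So --split-files writes one file per read index (_1, _2, ...), while --split-3 writes the two mates to _1 and _2 and puts any unpaired reads into a third file. A minimal sketch that only prints the command and the file names I would expect; the helper names and the fastq output directory are made up for illustration:

```shell
# Sketch: print the fastq-dump command and the files --split-3 is expected
# to create. Helper names are hypothetical; nothing is executed here.

# Build the fastq-dump command line for one .sra file.
fastq_dump_cmd() {
    local sra="$1" outdir="${2:-.}"
    echo "fastq-dump --split-3 --outdir $outdir $sra"
}

# List the files --split-3 is expected to write for a paired-end run:
# mate 1, mate 2, and (only if orphan reads exist) a file of unpaired reads.
split3_outputs() {
    local sra="$1" outdir="${2:-.}"
    local run
    run="$(basename "$sra" .sra)"
    echo "$outdir/${run}_1.fastq"
    echo "$outdir/${run}_2.fastq"
    echo "$outdir/${run}.fastq"
}

fastq_dump_cmd myfile.sra fastq
split3_outputs myfile.sra fastq
```

If that is right, the downstream steps would take myfile_1.fastq and myfile_2.fastq rather than .1.fq and .2.fq.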
(2) Do quality control on the fastq files. Here I am unsure what to do, since this is public data and I do not have the adapter information. I have used trim_galore before and would like to use it again, but I am not sure which settings would be best for my data. In particular, since I do not know the adapter, would the first 13 bases of the Illumina adapter that Trim Galore uses by default be OK?
trim_galore -o <out_dir> -a <adapter_sequence> --clip_R1 5 --clip_R2 5 --phred33 --stringency 5 -q 20 -e 0.05 --length 38 myfile.1.fq myfile.2.fq
I am not sure this command is right. It depends on step 1: if I end up with two files, .1.fq and .2.fq, I am not sure how to pass both of them to trim_galore.
(3) Map the data to hg19 with TopHat.
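From the Trim Galore help, my understanding is that paired-end input needs the --paired flag followed by both mate files, and that leaving out -a makes it fall back to its default adapter handling. A sketch under those assumptions; trim_galore_cmd is a hypothetical helper, the _1.fastq/_2.fastq names assume the --split-3 output from step 1, and the thresholds are simply copied from my command above, not recommendations:

```shell
# Sketch: build a paired-end trim_galore command for one run.
# Nothing is executed; the command is only printed.
trim_galore_cmd() {
    local run="$1" indir="${2:-.}" outdir="${3:-trimmed}"
    # No -a given: Trim Galore falls back to its default adapter handling.
    echo "trim_galore --paired -o $outdir --phred33 -q 20 --stringency 5 -e 0.05" \
         "--length 38 --clip_R1 5 --clip_R2 5" \
         "$indir/${run}_1.fastq $indir/${run}_2.fastq"
}

trim_galore_cmd myfile fastq trimmed
```

As far as I can tell, the validated pairs would then be written as myfile_1_val_1.fq and myfile_2_val_2.fq in the output directory.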
tophat -r 20 test_ref reads_1.fq reads_2.fq
Here I am also not sure which value to use for the -r (mate inner distance) parameter. Maybe I need to dig up more information about the RNA library?
Any suggestions on this process would be extremely helpful, since I am at the research stage of setting up a wrapper to process multiple SRA files in parallel.
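As I read the TopHat manual, -r is the expected inner distance between the two mates, i.e. the fragment length minus both read lengths. A small sketch of that arithmetic; the 200 bp fragment size is purely an assumed value, since the real one is not known for this library, and inner_dist is a hypothetical helper:

```shell
# Sketch: derive TopHat's -r (mate inner distance) from an assumed fragment size.
# inner_dist and the 200 bp fragment size are illustrative assumptions.
inner_dist() {
    local fragment_len="$1" read_len="$2"
    echo $(( fragment_len - 2 * read_len ))
}

r="$(inner_dist 200 76)"   # 200 - 2*76 = 48
echo "tophat -r $r test_ref reads_1.fq reads_2.fq"
```

With 76 bp reads, the -r 20 in my command would correspond to fragments of about 172 bp, so the library information does seem to matter here.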
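To make the idea concrete, here is a rough dry-run sketch of the kind of wrapper I have in mind: it prints one chained command line per run, so that the lines can later be handed to a job runner. Everything in it is an assumption for illustration: the pipeline_line helper, the sra/, fastq/ and trimmed/ directories, the _val_1.fq/_val_2.fq names, and the -r 48 placeholder.

```shell
# Sketch of a dry-run wrapper: print the three steps for each run as one
# chained command line. Nothing is executed; all names are hypothetical.
pipeline_line() {
    local run="$1"
    echo "fastq-dump --split-3 --outdir fastq sra/$run.sra" \
         "&& trim_galore --paired -o trimmed fastq/${run}_1.fastq fastq/${run}_2.fastq" \
         "&& tophat -r 48 -o tophat_$run test_ref trimmed/${run}_1_val_1.fq trimmed/${run}_2_val_2.fq"
}

# One line per run ID given on the command line.
for run in "$@"; do
    pipeline_line "$run"
done

# Example for the run mentioned above.
pipeline_line SRR601549
```

Each printed line is independent of the others, so I assume they could be fed to something like GNU parallel to process several runs at once.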