Automatically feeding a good mate inner distance to TopHat2
1
2
Entering edit mode
5.7 years ago
umn_bist ▴ 390

After browsing through Biostars and Seqanswers, I have a general understanding of how mate inner distance is calculated which is useful for TopHat aligner. Because, to my understanding, fragment length and reads can vary, the mate inner distance will also vary.

I could just use a hard filter and set a minimum length of 200 bp but I am hoping to form a smarter solution. Because I am working with 100+ files, ideally I would like to automatically retrieve fragment and read length (excuse my terminology if incorrect) and then calculate and feed the mate inner distance value to TopHat.

I know that Bowtie (and 2?) can provide such parameters but I am curious how one can extract, calculate, input this.

Here is my command line:

perl ~/Desktop/Applications/prinseq-lite-0.20.4/prinseq-lite.pl -fastq {in_1.fastq} -fastq2 {in_2.fastq} -no_qual_header -trim_qual_left 10 -trim_qual_right 10 -trim_right 1 -custom_params "A 75%;T 75%;G 75%;C 75%" min_qual_mean 25 -min_len 40 -out_format 3 -out_good {out} -out_bad {out_singleton}

parallel -j PARALLEL_TASK --xapply -q tophat -p 10 --mate-inner-dist 200 --mate-std-dev=40 --no-coverage-search --output-dir {placeholder} --transcriptome-index /Volumes/My\ Passport/Documents/hg19/transcriptome /Volumes/My\ Passport/Documents/hg19 {1},{2},{3},{4} ::: {out_good_1} ::: {out_good_2} ::: {out_singleton_1} ::: {out_singleton_2} EDIT: Quoting Arun from here: In order to use tophat, you'll have to have used the "-r" parameter. I guess you used -r value as 155 as the mean inner distance. However, its equally important to also provide the "--mate-std-dev".  The way I go about figuring these 2 parameters is this: I use BWA and align the paired end reads and obtain PE.sam file. From that, I use picard tools to obtain only those reads that are aligned to the reference genome. From this samfile, I calculate the inner distance of all pairs. From the vector of all such inner distances, I compute the 1st and 3rd quantiles, Q1 and Q3 and then calculate the inter quartile range IQ = Q3-Q1. From here, I take all the values that fall between Q1 - 2 * IQ to Q3 + 2 * IQ. I compute the mean and standard deviation of all these values (this is similar to what BWA does) and provide it to tophat. It seems to work great. Most of my reads are mapped under the given inner distance and standard deviation and the other (few) reads with larger inner distance. There'll always be a few reads that'll have larger inner distance and I consider them outliers and discard in the computation of mean and SD. Hope this helps. If I just align my PE with BWA, is there a way to automatically feed it into TopHat? RNA-Seq TopHat PE • 2.4k views ADD COMMENT 0 Entering edit mode hi, unfortunately I can't figure out how to use pipeline or like so to automatically detect the inner distance then I started alignment with tophat with default like below tophat -p 8 -G genes.gtf -o pri2h genome FCH3YYFBBXX-HKARAexkEAAARABPEI-204_L6_1-trimmed.fq FCH3YYFBBXX-HKARAexkEAAARABPEI-204_L6_2-trimmed.fq if I am wrong? in process I saw WARNING: read pairing issues detected (check prep_reads.log) ! left reads: min. length=15, max. length=100, 19206254 kept reads (3070 discarded) right reads: min. length=15, max. length=100, 19203774 kept reads (5550 discarded) can I simply use -r 200 in my case too??? library type is HiSeq for Arabidopsis thank you for any comment ADD REPLY 1 Entering edit mode The default is generally OK, you may or may not even get better results by tweaking it. ADD REPLY 7 Entering edit mode 5.7 years ago Here's the github repo for our internal RNAseq pipeline. The general steps are: 1. Down-sample fastq files to a reasonable number (you could also just take the first N reads pairs, but they tend to align poorly). 2. Align to the genome with bowtie2 (bwa would also work) 3. Run RSeQC to infer both strandedness and insert size 4. Parse that and set the tophat (or other aligner) options accordingly. So there's no built-in way to do that with tophat, but it's possible with a wrapper program. Edit: BTW, I didn't write that pipeline, that was done by a colleague. I've used it, though, and appreciate how easy it tends to make my life :) ADD COMMENT 0 Entering edit mode Thank you for your reply. I figured what I was wanting to do is not achievable without coding my own solution. Because I am new to this, I may input a hard value for both inner-distance and mate-std-dev values. For your RNAseq pipeline, just to confirm, can the downsampling, aligning to bowtie2, running rseqc, and setting tophat be done automatically over a set of 100+ fastq files - given input, output, genome are specified. ADD REPLY 1 Entering edit mode It's done per file/pair, so yes. We typically just give the python script a directory containing fastq files and let it figure out that they're paired-end and such. This also allows the insert size to change per sample, which is good because you can get significant variation. One might instead want to do this with snakemake, which would allow for better parallelization. ADD REPLY 0 Entering edit mode This is a huge boon in my sleepless studying. Thank you very much. Questions after scouring the github page. 1. The http://epicenter/wiki/index.php/RNA-seq_pipeline_rna-seq-pc.py) link in README does not work. 2. What constitutes a valid setup table (setup_table.tsv)? This will be a great tool for further downstream analysis. 3. If the (prereq) tools are stored in a different location than where the pipeline is hard configured to, will it manage as long as symbolic links to these tools are in my defaultPATH? Can it manage if my pairs have _1, _2 instead of _R1, _R2?

Upvoted, accepted, bookmarked. Thank you!

1
Entering edit mode

Sorry, this is an internal pipeline so the like is internal too. This is just an example pipeline, you probably want to change our hard-coded paths :)

I'll look up what the setup_table.tsv files look like.

1
Entering edit mode

Ah, right the tsv file is the sample table for DESeq2, so see help(DESeqDataSetFromHTSeqCount) in R.

Edit: I should probably push Fabian to get a biostars account. Then he can directly answer the questions :)

0
Entering edit mode

Sorry for your troubles. I really appreciate your help. I actually have emailed Fabian. Thank you again, Dr. Ryan!