Question

important alignment parameters

4.5 years ago

Sara ▴ 240

I am building a pipeline for RNA-seq data. We generate the data in-house, so I am trying to optimize the pipeline for it. My question is: for the alignment step, which parameters (read length, etc.) should be taken into account to build the best alignment command? I searched for this but did not find a good source that answers the question. BTW, I am using STAR.

alignment • 596 views

updated 4.5 years ago by dsull ★ 5.8k • written 4.5 years ago by Sara ▴ 240

score 0 · Answer 1 · 2019-11-11

If you're unsure, the default options work well (and without any detailed description of your experiment or what you want to gain from it, we really can't recommend much anyway).

Anyhow, here's what I usually use.

First, generate the genome index:

STAR --runThreadN 8 --runMode genomeGenerate --sjdbOverhang 100 --genomeDir /path/to/genomedir --genomeFastaFiles /path/to/genome_fasta_file.fa --sjdbGTFfile /path/to/annotation_file.gtf

Change 8 to the number of threads you want to use (at most the number of cores available on your server). Change 100 to the length of your reads minus 1; if your reads vary in length, use the maximum read length minus 1. For the remaining options, enter the path to the directory where you want to store the genome index, the path to your reference genome fasta file, and the path to your genome annotation file.
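
If you don't know your read length offhand, a rough sketch like the one below can pull it from a FASTQ file. The function name sjdb_overhang is made up for illustration (it is not part of STAR). It reads uncompressed FASTQ on stdin, assumes the standard four-line record layout, and prints the longest read length it sees minus 1.

```shell
# Hypothetical helper (not part of STAR): print max read length minus 1.
# Reads uncompressed FASTQ on stdin; assumes 4 lines per record,
# so the sequence is always the 2nd line of each record.
sjdb_overhang() {
    awk 'NR % 4 == 2 { n = length($0); if (n > max) max = n }
         END { print max - 1 }'
}

# Example use on the first 100,000 reads of a gzip'd file (placeholder path):
#   zcat /path/to/sequencing_reads_1.fastq.gz | head -n 400000 | sjdb_overhang
```

The number it prints is what you would pass to --sjdbOverhang. Sampling only the first reads is an assumption that they are representative; drop the head step to scan the whole file.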

For the actual alignment (assuming paired end reads):

STAR --runThreadN 8 --genomeDir /path/to/genomedir --readFilesIn /path/to/sequencing_reads_1.fastq.gz /path/to/sequencing_reads_2.fastq.gz --outFileNamePrefix /path/to/yourPrefix --readFilesCommand zcat --quantMode GeneCounts

Again, change 8 to the number of threads. Specify your genome index path (defined in the previous step). Specify your fastq sequencing files (which I'll assume are gzip'd -- if not, remove the option: --readFilesCommand zcat), and that should get you going. You can set --outFileNamePrefix to whatever you want your output files to be prefixed with.

You'll get a counts file (that ends in ReadsPerGene.out.tab) that contains the counts for each gene, which you can use for downstream analysis.
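
To get that file into shape for downstream tools, something along these lines should work. It's only a sketch: extract_gene_counts is a name I made up. In the ReadsPerGene.out.tab file, the first four rows are summary rows (N_unmapped, N_multimapping, N_noFeature, N_ambiguous), column 2 holds unstranded counts, and columns 3 and 4 hold the counts for the two stranded cases (read 1 on the same strand as the RNA, or read 2). Which column is right depends on your library prep, so that stays a parameter.

```shell
# Hypothetical helper: print "gene_id<TAB>count" from ReadsPerGene.out.tab.
#   $1 = path to the counts file
#   $2 = count column to keep: 2 = unstranded (default), 3 or 4 = stranded
extract_gene_counts() {
    # Skip the summary rows, whose first field starts with "N_"
    awk -F '\t' -v col="${2:-2}" '$1 !~ /^N_/ { print $1 "\t" $col }' "$1"
}

# Example (placeholder path), keeping the reverse-stranded column:
#   extract_gene_counts /path/to/yourPrefixReadsPerGene.out.tab 4 > counts.tsv
```

If you're not sure about strandedness, compare the N_noFeature values across columns 2 to 4: the column matching your protocol usually has far fewer reads assigned to no feature.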

Some other options to consider: You can do two-pass alignment (--twopassMode Basic), which can map more reads to novel junctions discovered in the first pass (useful if you're interested in splicing). You can also quantify your reads with featureCounts, which offers more counting options than STAR's built-in --quantMode GeneCounts, especially for paired-end reads.
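
As a rough illustration of those two options, the sketch below prints (without running) a two-pass STAR command followed by a featureCounts command. The function name print_two_pass_cmds, the thread count, and every path are placeholders, not anything prescribed by either tool. It assumes paired-end gzip'd reads and asks STAR for a coordinate-sorted BAM so featureCounts has something to read.

```shell
# Hypothetical dry-run helper: print the commands for review instead of running them.
#   $1 = output prefix, $2 = read 1 FASTQ, $3 = read 2 FASTQ
print_two_pass_cmds() {
    cat <<EOF
STAR --runThreadN 8 --genomeDir /path/to/genomedir \\
     --readFilesIn $2 $3 --readFilesCommand zcat \\
     --outFileNamePrefix $1 \\
     --twopassMode Basic --outSAMtype BAM SortedByCoordinate
featureCounts -T 8 -p -a /path/to/annotation_file.gtf \\
     -o ${1}featureCounts.txt ${1}Aligned.sortedByCoord.out.bam
EOF
}

# Example: review the output first, then pipe it to sh to actually run it.
#   print_two_pass_cmds /path/to/yourPrefix reads_1.fastq.gz reads_2.fastq.gz
```

Depending on your featureCounts version, -p alone may only flag the data as paired-end and you may also need --countReadPairs to count fragments rather than reads, so check featureCounts' help output for your install. Strandedness (-s) is left at its default here and should match your library prep.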

But, what I've described above should be sufficient for most purposes. Honestly, to really optimize RNA-seq analysis, you're better off just doing good QC & doing good downstream analysis than fiddling with STAR options.