Dear All,
Based on my understanding of the STAR manual, I am about to run a STAR 2.7.0f mapping pipeline in 2-pass mode for multiple samples from patients with diseases and from healthy individuals, as follows.
Could you please help me check whether all the commands are correct, or let me know if you have any suggestions?
1) Indexing genome with annotations
STAR --runMode genomeGenerate --genomeDir ~/db/hg38/ --genomeFastaFiles ~/db/hg38/hg38.fa --sjdbGTFfile ~/db/hg38/hg38.gtf --runThreadN 30 --sjdbOverhang 89
Note:
- The index is built for a maximum read length of 90 bp (--sjdbOverhang = read length - 1 = 89).
2) First-pass mapping with the indexed genome
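To double-check the overhang value in the note above, here is a minimal sketch that scans a FASTQ stream and prints the longest read length minus 1. The function name sjdb_overhang is my own (hypothetical), and it assumes standard 4-line FASTQ records.

```bash
# Print (longest read length - 1) for a FASTQ stream on stdin.
# Assumes 4-line FASTQ records; sjdb_overhang is a hypothetical helper name.
sjdb_overhang() {
    awk 'NR % 4 == 2 { if (length($0) > m) m = length($0) }
         END { if (m > 0) print m - 1 }'
}
```

It could be used as: zcat sample1.R1.fastq.gz | sjdb_overhang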
STAR --genomeDir ~/db/hg38/ --readFilesIn sample1.R1.fastq.gz sample1.R2.fastq.gz --readFilesCommand zcat --outSAMunmapped Within --outFileNamePrefix sample1. --runThreadN 30
Notes:
- The same command is run for every sample in a for loop, so it generates one SJ.out.tab file per sample.
- Next, I copied the SJ.out.tab files of all the samples into a single folder, "SJ_out".
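For reference, the per-sample loop described in the notes above could be sketched as follows. It only prints the STAR commands so they can be reviewed before being run. The function name print_first_pass_cmds is hypothetical, and it assumes files are named <sample>.R1.fastq.gz and <sample>.R2.fastq.gz with no spaces in the paths; the STAR options are exactly those from step 2.

```bash
# Print one first-pass STAR command per sample found in a FASTQ directory.
# Assumptions: files are named <sample>.R1.fastq.gz / <sample>.R2.fastq.gz,
# paths contain no spaces. print_first_pass_cmds is a hypothetical helper.
print_first_pass_cmds() {
    local fastq_dir=$1 genome_dir=$2 threads=${3:-30}
    local r1 r2 sample
    for r1 in "$fastq_dir"/*.R1.fastq.gz; do
        [ -e "$r1" ] || continue              # glob matched nothing
        r2=${r1%.R1.fastq.gz}.R2.fastq.gz
        if [ ! -e "$r2" ]; then
            echo "missing mate for $r1" >&2   # skip unpaired samples
            continue
        fi
        sample=$(basename "$r1" .R1.fastq.gz)
        echo STAR --genomeDir "$genome_dir" \
             --readFilesIn "$r1" "$r2" --readFilesCommand zcat \
             --outSAMunmapped Within --outFileNamePrefix "$sample." \
             --runThreadN "$threads"
    done
}
```

After checking the printed commands, they could be executed with: print_first_pass_cmds fastq ~/db/hg38/ | bash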
3) Indexing genome with annotations and SJ.out.tab files
STAR --runMode genomeGenerate --genomeDir ~/db/hg38/SJ_Index/ --genomeFastaFiles ~/db/hg38/SJ_Index/hg38.fa --sjdbGTFfile ~/db/hg38/SJ_Index/hg38.gtf --runThreadN 30 --sjdbOverhang 89 --sjdbFileChrStartEnd SJ_out/*.SJ.out.tab
Note:
- Again, the index is built for a maximum read length of 90 bp.
4) Second-pass mapping with the new genome index (annotations plus SJ.out.tab junctions)
STAR --genomeDir ~/db/hg38/SJ_Index/ --readFilesIn sample1.R1.fastq.gz sample1.R2.fastq.gz --readFilesCommand zcat --outSAMunmapped Within --outFileNamePrefix sample1. --runThreadN 30
Notes:
- Again, the same command is run for every sample in a for loop, so it generates mapping files for each sample.
In one of his recent posts, Alex suggested filtering the junctions using the following criteria:
1. Filter out the junctions on chrM, those are most likely to be false.
2. Filter out non-canonical junctions (column5 == 0).
3. Filter out junctions supported by multi-mappers only (column7 == 0).
4. Filter out junctions supported by too few reads (e.g. column7 <= 2).
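The four criteria above could be applied with a single awk filter, sketched below. The function name filter_sj is hypothetical; the mitochondrial chromosome is assumed to be named "chrM" (UCSC-style naming), and the read threshold of 2 is the example value from criterion 4. Note that criterion 3 (column7 == 0) is already covered by criterion 4 (column7 <= 2).

```bash
# Keep only junctions passing the four filters, from one or more SJ.out.tab files.
# SJ.out.tab columns used: 1 = chromosome, 5 = intron motif (0 = non-canonical),
# 7 = number of uniquely mapping reads crossing the junction.
# Assumptions: mitochondrial chromosome is named "chrM"; filter_sj is a
# hypothetical helper name; min_unique is the example threshold from the list.
filter_sj() {
    awk -F'\t' -v min_unique=2 '
        $1 != "chrM" &&       # 1. drop mitochondrial junctions
        $5 > 0 &&             # 2. drop non-canonical junctions
        $7 > min_unique       # 3 + 4. drop multi-mapper-only / low-support junctions
    ' "$@"
}
```

For example: filter_sj SJ_out/*.SJ.out.tab > SJ_out/filtered.SJ.tab
Junctions found in several samples will appear more than once in the combined output.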