Question

A question about the raw RNA-seq processing workflow

1

10 months ago

wyt1995 ▴ 30

Hello, I am a student who recently started studying bioinformatics. My understanding is still limited, so I would appreciate an explanation even if the question is basic. I am currently working with RNA-seq data and I am facing batch effects, apparently from the use of different pipelines and workflows, that are not reduced even with the ComBat method. Therefore, I would like to standardize the analysis using the workflow available on the GDC portal. The code is provided at https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/.

I have already downloaded the reference sequence file (GRCh.38.d1.vd1.fa.tar.gz) and the annotation file (gencode.v36.annotation.gtf.gz) from the GDC website (https://gdc.cancer.gov/about-data/gdc-data-processing/gdc-reference-files).

### Step 1: Building the STAR index.
apps/STAR \
--runMode genomeGenerate \
--genomeDir STAR_genomeGenerate \
--genomeFastaFiles GRCh.38.d1.vd1.fa \
--sjdbOverhang 100 \
--sjdbGTFfile gencode.v36.annotation.gtf \
--runThreadN 8

This creates STAR_genomeGenerate/ and GenomeDir.

### Step 2: Alignment 1st Pass.
apps/STAR \
--genomeDir STAR_genomeGenerate \
--readFilesIn a_1.fastq.gz b_1.fastq.gz c_1.fastq.gz a_2.fastq.gz b_2.fastq.gz c_2.fastq.gz \
--runThreadN 8 \
--outFilterMultimapScoreRange 1 \
--outFilterMultimapNmax 20 \
--outFilterMismatchNmax 10 \
--alignIntronMax 500000 \
--alignMatesGapMax 1000000 \
--sjdbScore 2 \
--alignSJDBoverhangMin 1 \
--genomeLoad NoSharedMemory \
--readFilesCommand zcat \
--outFilterMatchNminOverLread 0.33 \
--outFilterScoreMinOverLread 0.33 \
--sjdbOverhang 100 \
--outSAMstrandField intronMotif \
--outSAMtype None \
--outSAMmode None

However, when I tried to input multiple fastq.gz files in the same way as in the code above (--readFilesIn), I encountered the following error: Segmentation fault (core dumped). So I had to input them one by one. Each run produces SJ.out.tab, Log.out, Log.progress.out, and Log.final.out. In the next step, SJ.out.tab is used as input.

However, as you may know, when I repeat Step 2, a new SJ.out.tab file is generated and the previous SJ.out.tab file is overwritten. The next step, Step 3, is an intermediate index generation step, but I'm uncertain how to incorporate the SJ.out.tab files there.

I would greatly appreciate it if you could provide an explanation for the issue in question.

GDC RNA-seq STAR Ubuntu • 1.3k views

ADD COMMENT • link updated 10 months ago by Zhenyu Zhang ★ 1.2k • written 10 months ago by wyt1995 ▴ 30

1

Hi there, the Segmentation fault (core dumped) seems to be related to memory issues. I would check the core dump created by the software to determine what the exact errors are (maybe you could post its content here as well?).

ADD REPLY • link 10 months ago by Decimus Maximus ▴ 130

0

Are files a, b, and c all from the same sample? Or are you trying to align three different samples all together?

ADD REPLY • link 10 months ago by swbarnes2 14k

0

a, b, and c are all different samples. I have always used the '--readFilesIn a_1.fastq.gz a_2.fastq.gz' format for the --readFilesIn option. However, I noticed on the GDC website that it seems possible to input multiple files at once, so I tried the code format shown above.

ADD REPLY • link 10 months ago by wyt1995 ▴ 30

0

I don't think it means running three totally different samples at the same time. I think it means R1 and R2 of one sample.

ADD REPLY • link 10 months ago by swbarnes2 14k

0

What do I need to show you?

ADD REPLY • link 10 months ago by wyt1995 ▴ 30

score 0 · Answer 1 · 2023-06-09

0

10 months ago

Trivas ★ 1.7k

First, use the flag --outFileNamePrefix to prevent your files from being overwritten. Second, you should pass STAR the R1 and R2 fastq files of a sample together as a pair, not separately. I agree with the commenter that your error is due to running out of memory; STAR likes to hog memory.

I'm not sure what your downstream applications are so I can't help you there, but by saving the SJ.out.tab file with different names, you should be able to troubleshoot a bit better.
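
As a rough sketch of the per-sample approach: the sample names a, b, c come from the question, while the pass1/ directory and the star_pass1_cmd helper are placeholders I made up, and most of the first-pass flags are left out for brevity. The loop only prints one command per sample, so you can check them before running anything.

```shell
# Sketch only: print one first-pass STAR command per sample.
# Assumptions: samples are named a, b, c; outputs go to a hypothetical pass1/ directory.
# Add the remaining first-pass flags from the GDC page before real use.

star_pass1_cmd() {
  # $1 = sample name; R1 and R2 of the SAME sample are passed together
  printf '%s ' \
    apps/STAR \
    --genomeDir STAR_genomeGenerate \
    --readFilesIn "${1}_1.fastq.gz" "${1}_2.fastq.gz" \
    --readFilesCommand zcat \
    --runThreadN 8 \
    --outSAMtype None \
    --outFileNamePrefix "pass1/${1}."
  printf '\n'
}

for sample in a b c; do
  star_pass1_cmd "$sample"
done
```

Create the directory first (mkdir -p pass1), then pipe the output to sh once the commands look right. With a prefix like pass1/a. the junction file should end up as pass1/a.SJ.out.tab instead of overwriting a shared SJ.out.tab.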

ADD COMMENT • link 10 months ago by Trivas ★ 1.7k

0

--outFileNamePrefix does prevent the files from being overwritten. However, if individual SJ.out.tab files are generated and, in the third step, I provide each SJ.out.tab file to create a separate index, it seems I would end up with more than one index instead of a single consistent index as in the first step. I'm not sure if this is the correct approach. Is there a way to unify or consolidate multiple SJ.out.tab files?

ADD REPLY • link 10 months ago by wyt1995 ▴ 30

score 0 · Answer 2 · 2023-06-13

The GDC workflow example shows how to run multiple read groups from the same sample together. If you have multiple samples, you should run each sample separately.
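
To illustrate the difference, here is a minimal sketch of how several read-group files from ONE sample can be passed: STAR's --readFilesIn takes a comma-separated list (no spaces) for read 1, then a space, then the matching comma-separated list for read 2. The lane-level file names and the join_csv helper are hypothetical, not taken from the GDC page.

```shell
# Sketch only: build the --readFilesIn argument for one sample with two read groups.
# File names below are hypothetical lane-level files of sample "a".

# join_csv: join all arguments with commas; the ( ) body runs in a subshell,
# so changing IFS here does not affect the caller.
join_csv() (
  IFS=,
  printf '%s\n' "$*"
)

r1=$(join_csv a_L001_1.fastq.gz a_L002_1.fastq.gz)
r2=$(join_csv a_L001_2.fastq.gz a_L002_2.fastq.gz)

# Commas separate read groups within a mate; the space separates R1 from R2.
echo "--readFilesIn $r1 $r2"
```

Files from different samples (a, b, c in the question) should not be combined this way; each sample gets its own STAR run.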

I know there is a STAR mode in which you can load the index once and keep adding separate inputs for faster processing, but I assume this is not what we are discussing here.
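
On the earlier question about the intermediate index: as I read the GDC page, Step 3 takes the SJ.out.tab from the same sample's first pass, so ending up with one intermediate index per sample is expected when each sample is run separately. A minimal sketch follows, assuming the first-pass junction files were saved as pass1/<sample>.SJ.out.tab. The pass1/ and index2/ paths and the star_index2_cmd helper are hypothetical, and the flags are the ones I recall from Step 3 of that page, so please check them against it.

```shell
# Sketch only: print one intermediate-index command per sample.
# Assumptions: pass1/<sample>.SJ.out.tab exists for each sample;
# index2/<sample> is a hypothetical output directory for that sample's index.

star_index2_cmd() {
  # $1 = sample name
  printf '%s ' \
    apps/STAR \
    --runMode genomeGenerate \
    --genomeDir "index2/${1}" \
    --genomeFastaFiles GRCh.38.d1.vd1.fa \
    --sjdbOverhang 100 \
    --runThreadN 8 \
    --sjdbFileChrStartEnd "pass1/${1}.SJ.out.tab"
  printf '\n'
}

for sample in a b c; do
  star_index2_cmd "$sample"
done
```

The STAR manual also describes --sjdbFileChrStartEnd as accepting several files, which are concatenated, so a pooled junction set is possible, but that would depart from the per-sample GDC workflow.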