Question

Run STAR on .fastq files I1, R1 and R2


2.8 years ago

PianoEntropy ▴ 70

I'm trying to run STAR alignment on 10x data (I tried cellranger, but I need a more customizable tool), and I'm a bit confused about the different fastq files and which ones belong together. All my samples consist of .gz archives that contain multiple files, and the files come in triplets such as *_S1_L001_I1_001.fastq, *_S1_L001_R1_001.fastq and *_S1_L001_R2_001.fastq. I understand that R1 and R2 probably refer to the Illumina paired-end reads, but what is I1?

More concretely, which files should be given as arguments to --readFilesIn, and in which order? From the manual and some examples, I gather that R1 and R2 have to be supplied together, e.g. --readFilesIn *_R1.fastq *_R2.fastq. If I want to align all the reads, do I loop over this command for every R1/R2 pair and ignore the I1 files?

RNA-seq alignment STAR • 6.0k views


score 5 · Accepted Answer · 2021-06-18


2.8 years ago

louiesxscape ▴ 50

Yes, just ignore the I1 files. I1 is the sample index read. (See "10x single cell BAM files" on Dave Tang's blog.)

As for the order, the STARsolo documentation (STAR/STARsolo.md at master · alexdobin/STAR) says:

"Importantly, in the --readFilesIn option, the 1st file has to be cDNA read, and the 2nd file has to be the barcode (cell+UMI) read."
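To illustrate that ordering, here is a minimal dry-run sketch that only prints a STARsolo command for one sample spread over several lanes. Instead of looping over R1/R2 pairs, it uses STAR's comma-separated file lists, with the cDNA (R2) list first and the barcode (R1) list second. Several things here are assumptions rather than facts from this thread: the uncompressed `<sample>_S*_L*_R1_001.fastq` naming, the 10x v3 barcode geometry (16 nt cell barcode + 12 nt UMI), and the helper names `join_by_comma` and `build_star_cmd`, which are hypothetical. The genome index and whitelist paths are placeholders you would replace.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print (not execute) a STARsolo command for one 10x sample.
# Assumptions, not taken from the thread: uncompressed bcl2fastq-style names
# (<sample>_S*_L*_R1_001.fastq), 10x v3 geometry (16 nt barcode + 12 nt UMI),
# and the hypothetical paths passed in as arguments.

# Join all arguments with commas: a b c -> a,b,c
join_by_comma() {
  local IFS=,
  echo "$*"
}

build_star_cmd() {
  local fastq_dir=$1 sample=$2 genome_dir=$3 whitelist=$4
  local r1 r1_files=() r2_files=()

  # Collect R1 (barcode+UMI) files lane by lane and derive the matching R2
  # (cDNA) name, so both lists stay in the same order. I1 files are skipped.
  for r1 in "$fastq_dir/$sample"_S*_L*_R1_001.fastq; do
    [ -e "$r1" ] || continue
    r1_files+=("$r1")
    r2_files+=("${r1%_R1_001.fastq}_R2_001.fastq")
  done

  if [ "${#r1_files[@]}" -eq 0 ]; then
    echo "no R1 files found for $sample in $fastq_dir" >&2
    return 1
  fi

  # cDNA (R2) list first, barcode (R1) list second.
  echo STAR \
    --genomeDir "$genome_dir" \
    --readFilesIn "$(join_by_comma "${r2_files[@]}")" "$(join_by_comma "${r1_files[@]}")" \
    --soloType CB_UMI_Simple \
    --soloCBwhitelist "$whitelist" \
    --soloCBstart 1 --soloCBlen 16 --soloUMIstart 17 --soloUMIlen 12 \
    --outFileNamePrefix "${sample}_"
}

# Print the command only when all four arguments are supplied.
if [ "$#" -eq 4 ]; then
  build_star_cmd "$@"
fi
```

Saved under a name of your choice, it would be called with a FASTQ directory, a sample prefix, a genome index directory and a barcode whitelist, and the printed command can be inspected before running it. For gzipped input you would also need `--readFilesCommand zcat`. The barcode and UMI lengths have to match your chemistry; if the barcode read is longer than their sum, STAR's length check may have to be relaxed with `--soloBarcodeReadLength 0`, so check the manual for your STAR version.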


Thanks, this is a much better explanation than the one on the 10x website! It's tricky that R1 is apparently the barcode read and R2 the cDNA read, so R2 has to be given before R1. Also, my read lengths differ from the ones quoted by Dave Tang: the cDNA reads seem to be 92 nt and the barcode reads 29 nt. I just assumed the longer one (from R2) is the cDNA.
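Rather than assuming, the read lengths can be checked directly. A small sketch, assuming uncompressed FASTQ files where each record is exactly four lines; the function name `first_read_length` is hypothetical:

```shell
# Print the length of the first read in a FASTQ file.
# Line 2 of the file is the sequence of the first record.
first_read_length() {
  awk 'NR == 2 { print length($0); exit }' "$1"
}
```

Calling it on one R1 file and the matching R2 file shows which is the short barcode read and which is the longer cDNA read. For gzipped files, the same awk command can read from `zcat file.fastq.gz` through a pipe.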

Reply by PianoEntropy ▴ 70, 2.8 years ago