Using STAR instead of HISAT2

21 months ago

Chris ▴ 280

Hello all,

I would like to run STAR instead of HISAT2. Here are my current HISAT2 command and the STAR command I tried:

hisat2 -q --rna-strandness R -x HISAT2/grch38/genome -U data/demo_trimmed.fastq | samtools sort -o HISAT2/demo_trimmed.bam

STAR --runThreadN 6 \
--runMode genomeGenerate \
--genomeDir chr1_hg38_index \
--genomeFastaFiles /home/doanc2/data/demo_trimmed.fastq \
--sjdbGTFfile /home/doanc2/hg38/Homo_sapiens.GRCh38.92.gtf \
--sjdbOverhang 99

I got this error:

EXITING because of INPUT ERROR: the file format of the genomeFastaFile: /vcu_gpfs2/home/doanc2/data/demo_trimmed.fastq is not fasta: the first character is '@' (64), not '>'.
 Solution: check formatting of the fasta file. Make sure the file is uncompressed (unzipped).

So I have to convert my FASTQ file to a FASTA file, right?

If so, is this command correct? sed -n '1~4s/^@/>/p;2~4p' demo_trimmed.fastq > demo_trimmed.fasta

I got a new error:

EXITING because of FATAL PARAMETER ERROR: limitGenomeGenerateRAM=31000000000is too small for your genome
SOLUTION: please specify --limitGenomeGenerateRAM not less than 873673523466 and make that much RAM available

So how can I change this parameter as the solution above suggests? Thank you so much!

ADD COMMENT • link updated 21 months ago by Ram 43k • written 21 months ago by Chris ▴ 280

Read the STAR manual, please. You should generate a genome using the actual reference FASTA file, not your FASTQ files.
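
For illustration, here is a minimal sketch of the index step with a reference FASTA in place of the reads. The path reference/genome.fa is a placeholder (an assumption, not a file from this thread) for wherever your uncompressed genome FASTA lives; the other options are copied from your command. The script only prints the command unless you set RUN=1.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder path (assumption): an uncompressed reference genome FASTA, NOT the reads.
GENOME_FASTA="reference/genome.fa"
# GTF path and remaining options are taken from the command in the question.
GTF="/home/doanc2/hg38/Homo_sapiens.GRCh38.92.gtf"

star_index_cmd=(
  STAR --runThreadN 6
  --runMode genomeGenerate
  --genomeDir chr1_hg38_index
  --genomeFastaFiles "$GENOME_FASTA"
  --sjdbGTFfile "$GTF"
  --sjdbOverhang 99
)

# Print the command for review.
printf '%s ' "${star_index_cmd[@]}"; echo

# Only run STAR when explicitly asked to, e.g. RUN=1 ./build_star_index.sh
if [[ "${RUN:-0}" == "1" ]]; then
  mkdir -p chr1_hg38_index   # make sure the output directory exists
  "${star_index_cmd[@]}"
fi
```

The reads (data/demo_trimmed.fastq) only come in later, at the alignment step.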

As for your second question, STAR is literally giving you the solution.
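
To spell that out: the number in the message is a byte count, and the option it names goes on the same STAR command line. The sketch below uses only the two numbers from your error message. It shows the request is roughly 813 GiB, which is far more than a human genome index normally needs and is most likely a side effect of passing reads in as the genome. With the real reference FASTA the default limit may well be enough, though I have not tested that on your data.

```shell
#!/usr/bin/env bash
set -euo pipefail

requested_bytes=873673523466   # value requested in STAR's error message
default_bytes=31000000000      # current limit shown in the same message

# Convert a byte count to whole GiB (integer division, rounds down).
to_gib() { echo $(( $1 / 1024 / 1024 / 1024 )); }

echo "requested: $(to_gib "$requested_bytes") GiB"
echo "default:   $(to_gib "$default_bytes") GiB"

# The option takes the plain byte count and is appended to the
# STAR --runMode genomeGenerate command line.
ram_flag=(--limitGenomeGenerateRAM "$requested_bytes")
printf '%s ' "${ram_flag[@]}"; echo
```

If the requested value is larger than the memory your node actually has, raising the limit will not help; fix the input first.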

ADD REPLY • link 21 months ago by Ram 43k

Thank you so much for your reply!

I see several hg38 files here:

http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/

So which file should I download?

ADD REPLY • link 21 months ago by Chris ▴ 280

Read the README there. Which file do you think would be most useful to you?

ADD REPLY • link 21 months ago by Ram 43k

This one?

hg38.fa.gz - "Soft-masked" assembly sequence in one file. Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case. (again, the most current version of this file is latest/hg38.fa.gz)

ADD REPLY • link 21 months ago by Chris ▴ 280

Sure. Again, it's whichever one is most useful to you. A soft-masked assembly is a great choice. Personally, I'd pick a reference genome from the GENCODE project rather than UCSC, but that's a personal choice because I like EnsEMBL's versioning system. As long as you record the source URLs, file versions and maybe the download dates, you're golden.
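
One way to keep that record, as a rough sketch: a small helper that appends the download date, source URL, file name and checksum to a log file. The function name record_provenance and the log file name reference_provenance.tsv are made up for this example, and sha256sum is assumed to be available (it is on most Linux systems).

```shell
#!/usr/bin/env bash
set -euo pipefail

# record_provenance FILE URL [LOG]   (hypothetical helper, not part of any tool)
# Appends one tab-separated line: download date (UTC), source URL, file name, SHA-256.
record_provenance() {
  local file=$1 url=$2 log=${3:-reference_provenance.tsv}
  local checksum
  checksum=$(sha256sum "$file" | cut -d' ' -f1)
  printf '%s\t%s\t%s\t%s\n' \
    "$(date -u +%F)" "$url" "$(basename "$file")" "$checksum" >> "$log"
}

# Example usage, commented out so nothing is downloaded by accident:
# url="http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz"
# curl -L -O "$url"
# record_provenance hg38.fa.gz "$url"
# gunzip -k hg38.fa.gz   # STAR wants an uncompressed FASTA; -k keeps the .gz
```

Recording the checksum of the compressed file as downloaded makes it easy to confirm later that two people are working from the same reference.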

ADD REPLY • link 21 months ago by Ram 43k

I was confused by the options at UCSC, so I downloaded this instead:

http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz

So HISAT2 doesn't require a reference genome but STAR does, is that correct?

ADD REPLY • link 21 months ago by Chris ▴ 280

So HISAT2 doesn't require a reference genome but STAR does, is that correct?

That's incorrect. You are passing HISAT2/grch38/genome to -x for HISAT2. The manual says that -x accepts an index prefix, and also says that hisat2-build is used to generate index files. For HISAT2 you already have prebuilt index files; for STAR you are now building the equivalent index yourself with STAR --runMode genomeGenerate.
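
To make the parallel concrete, here is a sketch that lines up the two steps for both tools. Nothing is executed; the commands are only printed. The reference path reference/genome.fa and the output prefix STAR/demo_trimmed_ are placeholders I made up; the index locations and read file are the ones from your commands.

```shell
#!/usr/bin/env bash
set -euo pipefail

GENOME_FASTA="reference/genome.fa"   # placeholder (assumption): uncompressed reference FASTA
READS="data/demo_trimmed.fastq"      # read file from the question

# Step 1 (once per genome): build an index from the reference FASTA.
hisat2_index=(hisat2-build "$GENOME_FASTA" HISAT2/grch38/genome)
star_index=(STAR --runMode genomeGenerate --genomeDir chr1_hg38_index
            --genomeFastaFiles "$GENOME_FASTA")

# Step 2 (once per sample): align the reads against that index.
hisat2_align=(hisat2 -q --rna-strandness R -x HISAT2/grch38/genome -U "$READS")
star_align=(STAR --runThreadN 6 --genomeDir chr1_hg38_index
            --readFilesIn "$READS"
            --outSAMtype BAM SortedByCoordinate
            --outFileNamePrefix STAR/demo_trimmed_)   # placeholder output prefix

# Print a labelled command line without running it.
show() { local label=$1; shift; printf '%-14s %s\n' "$label" "$*"; }

show "HISAT2 index:" "${hisat2_index[@]}"
show "STAR index:"   "${star_index[@]}"
show "HISAT2 align:" "${hisat2_align[@]}"
show "STAR align:"   "${star_align[@]}"
```

With a downloaded prebuilt HISAT2 index, step 1 was simply done for you by someone else. The HISAT2 alignment line here leaves out the samtools sort pipe from your original command; STAR's --outSAMtype BAM SortedByCoordinate writes a sorted BAM directly.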

For Pete's sake, read manuals.

ADD REPLY • link 21 months ago by Ram 43k

Thank you so much for your answer!

Would you please explain why this reference genome is split into 8 files?

https://genome-idx.s3.amazonaws.com/hisat/grch38_genome.tar.gz

Also, I ran STAR on a cluster; it has been running for more than 120 minutes and still hasn't finished. HISAT2 on a personal computer takes only 3 minutes, so I guess something is wrong.

ADD REPLY • link 21 months ago by Chris ▴ 280

I do not have the bandwidth to download a file, extract it and do a bunch of comparisons to figure out why it's been split. You can read the manual, the paper and the source code, browse forums to see if it has been addressed anywhere, or even email the author. In all probability, the reference genome hasn't been split; rather, the prepared index consists of 8 files.

As for STAR vs HISAT2, look into benchmarking papers and ensure you're comparing apples to apples. Also, Googling terms such as "STAR vs HISAT2" will point to past discussions such as this one: HISAT2 V.S. STAR

ADD REPLY • link 21 months ago by Ram 43k

Thanks for the detailed answer! As you can see from my screenshot, the genome is split into 8 files. [screenshot]

ADD REPLY • link 21 months ago by Chris ▴ 280

the genome is split into 8 files

Again, no. The prepared index has 8 files. Chris, the manual is extremely clear on what's happening; read it.

From the manual:

Small and large indexes

hisat2-build can index reference genomes of any size. For genomes less than about 4 billion nucleotides in length, hisat2-build builds a “small” index using 32-bit numbers in various parts of the index. When the genome is longer, hisat2-build builds a “large” index using 64-bit numbers. Small indexes are stored in files with the .ht2 extension, and large indexes are stored in files with the .ht2l extension. The user need not worry about whether a particular index is small or large; the wrapper scripts will automatically build and use the appropriate index.
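
As a small illustration of what "the index has 8 files" means in practice, here is a sketch of a check that a prefix passed to -x really points at a complete index, i.e. PREFIX.1.ht2 through PREFIX.8.ht2, or the .ht2l equivalents for a large index. The function name check_hisat2_index is made up for this example.

```shell
#!/usr/bin/env bash
set -euo pipefail

# check_hisat2_index PREFIX   (hypothetical helper, not part of HISAT2)
# Succeeds only if all 8 files PREFIX.1.ht2 ... PREFIX.8.ht2 exist,
# or all 8 of the "large index" equivalents ending in .ht2l.
check_hisat2_index() {
  local prefix=$1 ext i found
  for ext in ht2 ht2l; do
    found=0
    for i in 1 2 3 4 5 6 7 8; do
      if [[ -f "${prefix}.${i}.${ext}" ]]; then
        found=$((found + 1))
      fi
    done
    if (( found == 8 )); then
      echo "complete ${ext} index: ${prefix}"
      return 0
    fi
  done
  echo "incomplete or missing index: ${prefix}" >&2
  return 1
}

# Example usage with the prefix from the question (uncomment to run):
# check_hisat2_index HISAT2/grch38/genome
```

The prefix is the part of the path before .1.ht2, which is why the original command uses -x HISAT2/grch38/genome rather than naming any single file.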

ADD REPLY • link 21 months ago by Ram 43k