Question

Genome Index and Alignment- STAR

0

Entering edit mode

5.4 years ago

carolgalah • 0

Hi...

I'm new to the field of bioinformatics. I'm trying to align RNAseq with STAR, but I'm having problems. I do not know if it's time to generate the genome index or alignment.

I am generating the genome index for Rattus novergicus (ftp://ftp.ensembl.org/pub/release-94/fasta/rattus_norvegicus/dna/). I downloaded the FASTA files (chr1-chr20, chrY, chrX and chr MT) and the GTF file. that was my script:

STAR --runThreadN 8 --runMode genomeGenerate --genomeDir /home/acamar2/files/Rnor_6.0 --genomeFastaFiles /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.1.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.2.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.3.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.4.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.5.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.5.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.6.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.8.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.9.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.10.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.11.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.12.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.13.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.14.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.15.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.16.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.17.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.18.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.19.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.20.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.Y.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.MT.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.X.fa --sjdbGTFfile /home/acamar2/files/annotation/Rattus_norvegicus.Rnor_6.0.94.gtf

The generated files were:

drwxr-xr-x 2 acamar2 acamar2 4,0K Nov 28 15:58 .
drwxr-xr-x 5 acamar2 acamar2   50 Nov 28 15:58 ..
-rw-rw-r-- 1 acamar2 acamar2  217 Nov 28 14:56 chrLength.txt
-rw-rw-r-- 1 acamar2 acamar2  275 Nov 28 14:56 chrNameLength.txt
-rw-rw-r-- 1 acamar2 acamar2   58 Nov 28 14:56 chrName.txt
-rw-rw-r-- 1 acamar2 acamar2  245 Nov 28 14:56 chrStart.txt
-rw-rw-r-- 1 acamar2 acamar2  11M Nov 28 15:44 exonGeTrInfo.tab
-rw-rw-r-- 1 acamar2 acamar2 4,5M Nov 28 15:44 exonInfo.tab
-rw-rw-r-- 1 acamar2 acamar2 606K Nov 28 15:44 geneInfo.tab
-rw-rw-r-- 1 acamar2 acamar2 2,7G Nov 28 15:49 Genome
-rw-rw-r-- 1 acamar2 acamar2 4,0K Nov 28 14:55 genomeParameters.txt
-rw-rw-r-- 1 acamar2 acamar2  21G Nov 28 15:53 SA
-rw-rw-r-- 1 acamar2 acamar2 1,5G Nov 28 15:53 SAindex
-rw-rw-r-- 1 acamar2 acamar2 5,6M Nov 28 15:44 sjdbInfo.txt
-rw-rw-r-- 1 acamar2 acamar2 4,4M Nov 28 15:44 sjdbList.fromGTF.out.tab
-rw-rw-r-- 1 acamar2 acamar2 4,4M Nov 28 15:44 sjdbList.out.tab
-rw-rw-r-- 1 acamar2 acamar2 2,4M Nov 28 15:44 transcriptInfo.tab

I followed the alignment of the samples with the following script:

STAR --runThreadN 8 --genomeDir /home/acamar2/files/Rnor_6.0 --sjdbGTFfile /home/acamar2/files/Rnor_6.0/Rattus_norvegicus.Rnor_6.0.94.gtf --readFilesIn /home/acamar2/rawdata/1_1.fastq.gz,/home/acamar2/rawdata/1_2.fastq.gz,/home/acamar2/rawdata/2_1.fastq.gz,/home/acamar2/rawdata/2_2.fastq.gz,/home/acamar2/rawdata/3_1.fastq.gz,/home/acamar2/rawdata/3_2.fastq.gz,/home/acamar2/rawdata/4_1.fastq.gz,/home/acamar2/rawdata/4_2.fastq.gz,/home/acamar2/rawdata/5_1.fastq.gz,/home/acamar2/rawdata/5_2.fastq.gz,/home/acamar2/rawdata/6_1.fastq.gz,/home/acamar2/rawdata/6_2.fastq.gz,/home/acamar2/rawdata/7_1.fastq.gz,/home/acamar2/rawdata/7_2.fastq.gz,/home/acamar2/rawdata/8_1.fastq.gz,/home/acamar2/rawdata/8_2.fastq.gz --readFilesCommand gunzip --quantMode TranscriptomeSAM GeneCounts --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /home/acamar2/STARoutput/fastq

The generated files were:

drwxr-xr-x 4 acamar2 acamar2 4,0K Nov 28 16:42 .
drwxr-xr-x 7 acamar2 acamar2  115 Nov 28 16:00 ..
-rw-rw-r-- 1 acamar2 acamar2   28 Nov 28 16:42 fastqAligned.sortedByCoord.out.bam
-rw-rw-r-- 1 acamar2 acamar2 567K Nov 28 16:42 fastqAligned.toTranscriptome.out.bam
-rw-rw-r-- 1 acamar2 acamar2 1,8K Nov 28 16:42 fastqLog.final.out
-rw-rw-r-- 1 acamar2 acamar2 461K Nov 28 16:42 fastqLog.out
-rw-r--r-- 1 acamar2 acamar2 268M Nov 19 16:30 Rattus_norvegicus.Rnor_6.0.94.gtf
-rw-rw-r-- 1 acamar2 acamar2  246 Nov 28 16:42 fastqLog.progress.out
-rw-rw-r-- 1 acamar2 acamar2 797K Nov 28 16:42 fastqReadsPerGene.out.tab
-rw-rw-r-- 1 acamar2 acamar2    0 Nov 28 16:42 fastqSJ.out.tab
drwx------ 2 acamar2 acamar2 4,0K Nov 28 16:03 fastq_STARgenome
drwx------ 3 acamar2 acamar2   20 Nov 28 16:42 fastq_STARtmp

But, my files ReadsPerGene and Log.final.out appear zeroed.

N_unmapped  0   0   0
N_multimapping  0   0   0
N_noFeature 0   0   0
N_ambiguous 0   0   0
ENSRNOG00000046319  0   0   0
ENSRNOG00000047964  0   0   0
ENSRNOG00000050370  0   0   0
ENSRNOG00000032365  0   0   0
ENSRNOG00000040300  0   0   0
ENSRNOG00000058808  0   0   0
ENSRNOG00000061316  0   0   0
ENSRNOG00000050129  0   0   0
ENSRNOG00000040316  0   0   0
ENSRNOG00000023659  0   0   0
ENSRNOG00000029897  0   0   0
ENSRNOG00000042852  0   0   0
ENSRNOG00000061806  0   0   0
ENSRNOG00000055877  0   0   0
ENSRNOG00000014303  0   0   0
ENSRNOG00000051899  0   0   0
ENSRNOG00000014330  0   0   0
ENSRNOG00000049505  0   0   0
ENSRNOG00000014916  0   0   0
ENSRNOG00000014996  0   0   0
ENSRNOG00000015239  0   0   0
ENSRNOG00000055508  0   0   0
ENSRNOG00000053710  0   0   0
ENSRNOG00000015552  0   0   0
ENSRNOG00000016041  0   0   0
ENSRNOG00000016054  0   0   0
ENSRNOG00000016381  0   0   0
ENSRNOG00000054005  0   0   0
ENSRNOG00000059137  0   0   0
ENSRNOG00000013160  0   0   0
ENSRNOG00000052788  0   0   0
ENSRNOG00000023549  0   0   0
ENSRNOG00000031859  0   0   0
ENSRNOG00000013351  0   0   0


            Started job on |    Nov 28 17:12:06
                         Started mapping on |   Nov 28 17:15:03
                                Finished on |   Nov 28 17:51:41
   Mapping speed, Million of reads per hour |   0.00

                      Number of input reads |   0
                  Average input read length |   0
                                UNIQUE READS:
               Uniquely mapped reads number |   0
                    Uniquely mapped reads % |   0.00%
                      Average mapped length |   0.00
                   Number of splices: Total |   0
        Number of splices: Annotated (sjdb) |   0
                   Number of splices: GT/AG |   0
                   Number of splices: GC/AG |   0
                   Number of splices: AT/AC |   0
           Number of splices: Non-canonical |   0
                  Mismatch rate per base, % |   -nan%
                     Deletion rate per base |   0.00%
                    Deletion average length |   0.00
                    Insertion rate per base |   0.00%
                   Insertion average length |   0.00
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   0
         % of reads mapped to multiple loci |   0.00%
    Number of reads mapped to too many loci |   0
         % of reads mapped to too many loci |   0.00%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |   0.00%
             % of reads unmapped: too short |   0.00%
                 % of reads unmapped: other |   0.00%
                              CHIMERIC READS:
                   Number of chimeric reads |   0
                        % of chimeric reads |   0.00%

I do not know if I'm generating the genome index or doing the alignment correctly. Can someone help me?

Thanks

RNA-Seq genome alignment STAR • 3.5k views

ADD COMMENT • link updated 5.4 years ago by Kevin Blighe 87k • written 5.4 years ago by carolgalah • 0

1

Entering edit mode

I'm pretty sure you want either zcat or gunzip -c rather than just gunzip as the extraction command.

ADD REPLY • link 5.4 years ago by Devon Ryan 104k

0

Entering edit mode

You don't need to use multiples fasta files, you have the complete genome under ftp://ftp.ensembl.org/pub/release-94/fasta/rattus_norvegicus/dna/Rattus_norvegicus.Rnor_6.0.dna.toplevel.fa.gz

ADD REPLY • link 5.4 years ago by Bastien Hervé 5.3k

0

Entering edit mode

I tried to use the toplevel, but it also didn't work. They gave the same results.

When I run the alignment, do I need to put the FASTA and GTF file in the same folder as the genome?

ADD REPLY • link 5.4 years ago by carolgalah • 0

0

Entering edit mode

No, I think this should be enought. Unzip the genome file :

STAR --runThreadN 8 --runMode genomeGenerate --genomeDir /home/acamar2/files/Rnor_6.0 --genomeFastaFiles /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.toplevel.fa --sjdbGTFfile /home/acamar2/files/annotation/Rattus_norvegicus.Rnor_6.0.94.gtf

ADD REPLY • link 5.4 years ago by Bastien Hervé 5.3k

0

Entering edit mode

I tried to use exactly this script with the toplevel. Is it some mistake in my samples?

Always before generating the genome, I unzipped the FASTA files.

I need to unzip this file: -rw-rw-r-- 1 acamar2 acamar2 2,7G Nov 28 15:49 Genome

ADD REPLY • link 5.4 years ago by carolgalah • 0

0

Entering edit mode

No your index seems OK. Try with two files :

STAR --runThreadN 8 --genomeDir /home/acamar2/files/Rnor_6.0 --sjdbGTFfile /home/acamar2/files/Rnor_6.0/Rattus_norvegicus.Rnor_6.0.94.gtf --readFilesIn /home/acamar2/rawdata/1_1.fastq.gz /home/acamar2/rawdata/1_2.fastq.gz --quantMode TranscriptomeSAM GeneCounts --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /home/acamar2/STARoutput/fastq

ADD REPLY • link 5.4 years ago by Bastien Hervé 5.3k

0

Entering edit mode

--genomeDir /home/acamar2/files/Rnor_6.0

Rnor_6.0 is directory where the files of the genome:

drwxr-xr-x 2 acamar2 acamar2 4.0K Nov 28 15:58.
drwxr-xr-x 5 acamar2 acamar2 50 Nov 28 15:58 ..
-rw-rw-r-- 1 acamar2 acamar2 217 Nov 28 14:56 chrLength.txt
-rw-rw-r-- 1 acamar2 acamar2 275 Nov 28 14:56 chrNameLength.txt
-rw-rw-r-- 1 acamar2 acamar2 58 Nov 28 14:56 chrName.txt
-rw-rw-r-- 1 acamar2 acamar2 245 Nov 28 14:56 chrStart.txt
-rw-rw-r-- 1 acamar2 acamar2 11M Nov 28 15:44 exonGeTrInfo.tab
-rw-rw-r-- 1 acamar2 acamar2 4,5M Nov 28 15:44 exonInfo.tab
-rw-rw-r-- 1 acamar2 acamar2 606K Nov 28 15:44 geneInfo.tab
-rw-rw-r-- 1 acamar2 acamar2 2,7G Nov 28 15:49 Genome
-rw-rw-r-- 1 acamar2 acamar2 4.0K Nov 28 14:55 genomeParameters.txt
-rw-r-r-- 1 acamar2 acamar2 268M Nov 19 16:30 Rattus_norvegicus.Rnor_6.0.94.gtf
-rw-rw-r-- 1 acamar2 acamar2 21G Nov 28 15:53 SA
-rw-rw-r-- 1 acamar2 acamar2 1,5G Nov 28 15:53 SAindex
-rw-rw-r-- 1 acamar2 acamar2 5.6M Nov 28 15:44 sjdbInfo.txt
-rw-rw-r-- 1 acamar2 acamar2 4.4M Nov 28 15:44 sjdbList.fromGTF.out.tab
-rw-rw-r-- 1 acamar2 acamar2 4.4M Nov 28 15:44 sjdbList.out.tab
-rw-rw-r-- 1 acamar2 acamar2 2,4M Nov 28 15:44 transcriptInfo.tab

, is this correct to indicate the directory, not a specific file?

I'll try only with two files! Do you think it is better to generate the genome index again with the toplevel?

thanks

ADD REPLY • link updated 5.4 years ago by Devon Ryan 104k • written 5.4 years ago by carolgalah • 0

0

Entering edit mode

Correct, you just indicate the directory and not a file within it.

ADD REPLY • link 5.4 years ago by Devon Ryan 104k

0

Entering edit mode

The index seems ok but yes try to re-index with the lop level file

ADD REPLY • link 5.4 years ago by Bastien Hervé 5.3k

0

Entering edit mode

I tried everything again ... and nothing worked. I recreated the genome index with toplevel. I ran the STAR using only 1 pair of samples. I tried aligning it with a couple of samples using the toplevel genome, also tried using the genome with all the chr files ... but nothing worked. What will happen? Best to try to generate the genome index with other files ... maybe NCBI?

ADD REPLY • link 5.4 years ago by carolgalah • 0

0

Entering edit mode

May we have the command line you used and the standard output + log file please.

Note that

nothing worked

is the worst explanation about a problem. To be very complete and fully helpful we need all the softwares versions, the links to input files, the command lines used and the output results (log file, standard output).

ADD REPLY • link 5.4 years ago by Bastien Hervé 5.3k

0

Entering edit mode

Sorry for the delay in replying ...

Now it worked, I had not tried the use of gunzip -c, I had forgotten all the time to use the (-c), beginner error. Sorry for the inconvenience!

Another question: Star generates a directory named BAMsort, with several other folders. In those directories should have generated my .bam files, correct? Because those directories are empty.

I generated my genome index with the toplevel file, will it have any differences or problem? Because when I try to generate the genome index using all chr files 1-20, chr Y, MT and X, they are taking a long time and do not end (more than 24hr).

Thanks

ADD REPLY • link 5.4 years ago by carolgalah • 0

0

Entering edit mode

Those directories should be empty after it's done. The sorted BAM file is in the output directory you specified. STAR is a bit bad at cleaning up after itself, so you'll see a bunch of _STAR* folder sitting about that you can delete.

The toplevel fasta should include all of the chromosomes anyway. Double check to ensure you have the right one.

ADD REPLY • link 5.4 years ago by Devon Ryan 104k

0

Entering edit mode

My files after the alignment:

-rw-rw-r-- 1 acamar2 acamar2  22G Dez  3 16:12 fastqAligned.sortedByCoord.out.bam
-rw-rw-r-- 1 acamar2 acamar2  25G Dez  3 15:37 fastqAligned.toTranscriptome.out.bam
-rw-rw-r-- 1 acamar2 acamar2 1,9K Dez  3 16:12 fastqLog.final.out
-rw-rw-r-- 1 acamar2 acamar2  58K Dez  3 16:12 fastqLog.out
-rw-rw-r-- 1 acamar2 acamar2 9,5K Dez  3 16:12 fastqLog.progress.out
-rw-rw-r-- 1 acamar2 acamar2 930K Dez  3 15:37 fastqReadsPerGene.out.tab
-rw-rw-r-- 1 acamar2 acamar2  15M Dez  3 16:12 fastqSJ.out.tab
drwx------ 2 acamar2 acamar2 4,0K Dez  3 14:14 fastq_STARgenome
drwx------ 3 acamar2 acamar2   20 Dez  3 16:12 fastq_STARtmp

So, the directory fastq_STARgenome and fastq_STARtmp, I can delete?

Thanks for your help!

ADD REPLY • link updated 5.4 years ago by Devon Ryan 104k • written 5.4 years ago by carolgalah • 0

0

Entering edit mode

Yes, you can delete them

ADD REPLY • link 5.4 years ago by Devon Ryan 104k

0

Entering edit mode

Hi,

After alignment, I need to run the Cufflinks for assembly and quantify transcripts.

I need to indicate only my file Aligned.sortedByCoord.out.bam, or do I need to indicate my raw data + Aligned.sortedByCoord.out.bam.?

Do I need to index the Aligned.sortedByCoord.out.bam?

thanks

ADD REPLY • link 5.4 years ago by carolgalah • 0

0

Entering edit mode

You only need the BAM file, I don't think it needs to be indexed.

Do not use cufflinks, use stringTie.

ADD REPLY • link 5.4 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks for your reply...

I have two bam: rw-rw-r-- 1 acamar2 acamar2 22G Dez 3 16:12 fastqAligned.sortedByCoord.out.bam -rw-rw-r-- 1 acamar2 acamar2 25G Dez 3 15:37 fastqAligned.toTranscriptome.out.bam

I need use just this fastqAligned.sortedByCoord.out.bam... correct?

Thank you for the advice about StringTie, but my advisor wants me to use Cufflinks. But I'm sure to try StringTie too, to learn!

Thanks for your help!

ADD REPLY • link 5.4 years ago by carolgalah • 0

0

Entering edit mode

Try aligning a single file or pair of them and see if you get something more reasonable.

ADD REPLY • link 5.4 years ago by Devon Ryan 104k

0

Entering edit mode

I tried everything again ... and nothing worked. I recreated the genome index with toplevel. I ran the STAR using only 1 pair of samples. I tried aligning it with a couple of samples using the toplevel genome, also tried using the genome with all the chr files ... but nothing worked. What will happen? Best to try to generate the genome index with other files ... maybe NCBI?

ADD REPLY • link 5.4 years ago by carolgalah • 0

0

Entering edit mode

Stick to ENSEMBL or GENCODE for reference sequences, NCBI will cause you a lot of annoyances (UCSC likely will as well).

ADD REPLY • link 5.4 years ago by Devon Ryan 104k

0

Entering edit mode

Sorry for the delay in replying ...

Now it worked, I had not tried the use of gunzip -c, I had forgotten all the time to use the (-c), beginner error. Sorry for the inconvenience!

Another question: Star generates a directory named BAMsort, with several other folders. In those directories should have generated my .bam files, correct? Because those directories are empty.

I generated my genome index with the toplevel file, will it have any differences or problem? Because when I try to generate the genome index using all chr files 1-20, chr Y, MT and X, they are taking a long time and do not end (more than 24hr).

Thanks

ADD REPLY • link 5.4 years ago by carolgalah • 0