Question: Genome Index and Alignment- STAR
0
gravatar for carolgalah
13 days ago by
carolgalah0
EUA/Atlanta/Emory University
carolgalah0 wrote:

Hi...

I'm new to the field of bioinformatics. I'm trying to align RNAseq with STAR, but I'm having problems. I do not know if it's time to generate the genome index or alignment.

I am generating the genome index for Rattus novergicus (ftp://ftp.ensembl.org/pub/release-94/fasta/rattus_norvegicus/dna/). I downloaded the FASTA files (chr1-chr20, chrY, chrX and chr MT) and the GTF file. that was my script:

STAR --runThreadN 8 --runMode genomeGenerate --genomeDir /home/acamar2/files/Rnor_6.0 --genomeFastaFiles /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.1.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.2.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.3.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.4.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.5.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.5.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.6.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.8.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.9.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.10.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.11.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.12.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.13.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.14.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.15.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.16.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.17.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.18.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.19.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.20.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.Y.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.MT.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.X.fa --sjdbGTFfile /home/acamar2/files/annotation/Rattus_norvegicus.Rnor_6.0.94.gtf

The generated files were:

drwxr-xr-x 2 acamar2 acamar2 4,0K Nov 28 15:58 .
drwxr-xr-x 5 acamar2 acamar2   50 Nov 28 15:58 ..
-rw-rw-r-- 1 acamar2 acamar2  217 Nov 28 14:56 chrLength.txt
-rw-rw-r-- 1 acamar2 acamar2  275 Nov 28 14:56 chrNameLength.txt
-rw-rw-r-- 1 acamar2 acamar2   58 Nov 28 14:56 chrName.txt
-rw-rw-r-- 1 acamar2 acamar2  245 Nov 28 14:56 chrStart.txt
-rw-rw-r-- 1 acamar2 acamar2  11M Nov 28 15:44 exonGeTrInfo.tab
-rw-rw-r-- 1 acamar2 acamar2 4,5M Nov 28 15:44 exonInfo.tab
-rw-rw-r-- 1 acamar2 acamar2 606K Nov 28 15:44 geneInfo.tab
-rw-rw-r-- 1 acamar2 acamar2 2,7G Nov 28 15:49 Genome
-rw-rw-r-- 1 acamar2 acamar2 4,0K Nov 28 14:55 genomeParameters.txt
-rw-rw-r-- 1 acamar2 acamar2  21G Nov 28 15:53 SA
-rw-rw-r-- 1 acamar2 acamar2 1,5G Nov 28 15:53 SAindex
-rw-rw-r-- 1 acamar2 acamar2 5,6M Nov 28 15:44 sjdbInfo.txt
-rw-rw-r-- 1 acamar2 acamar2 4,4M Nov 28 15:44 sjdbList.fromGTF.out.tab
-rw-rw-r-- 1 acamar2 acamar2 4,4M Nov 28 15:44 sjdbList.out.tab
-rw-rw-r-- 1 acamar2 acamar2 2,4M Nov 28 15:44 transcriptInfo.tab

I followed the alignment of the samples with the following script:

STAR --runThreadN 8 --genomeDir /home/acamar2/files/Rnor_6.0 --sjdbGTFfile /home/acamar2/files/Rnor_6.0/Rattus_norvegicus.Rnor_6.0.94.gtf --readFilesIn /home/acamar2/rawdata/1_1.fastq.gz,/home/acamar2/rawdata/1_2.fastq.gz,/home/acamar2/rawdata/2_1.fastq.gz,/home/acamar2/rawdata/2_2.fastq.gz,/home/acamar2/rawdata/3_1.fastq.gz,/home/acamar2/rawdata/3_2.fastq.gz,/home/acamar2/rawdata/4_1.fastq.gz,/home/acamar2/rawdata/4_2.fastq.gz,/home/acamar2/rawdata/5_1.fastq.gz,/home/acamar2/rawdata/5_2.fastq.gz,/home/acamar2/rawdata/6_1.fastq.gz,/home/acamar2/rawdata/6_2.fastq.gz,/home/acamar2/rawdata/7_1.fastq.gz,/home/acamar2/rawdata/7_2.fastq.gz,/home/acamar2/rawdata/8_1.fastq.gz,/home/acamar2/rawdata/8_2.fastq.gz --readFilesCommand gunzip --quantMode TranscriptomeSAM GeneCounts --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /home/acamar2/STARoutput/fastq

The generated files were:

drwxr-xr-x 4 acamar2 acamar2 4,0K Nov 28 16:42 .
drwxr-xr-x 7 acamar2 acamar2  115 Nov 28 16:00 ..
-rw-rw-r-- 1 acamar2 acamar2   28 Nov 28 16:42 fastqAligned.sortedByCoord.out.bam
-rw-rw-r-- 1 acamar2 acamar2 567K Nov 28 16:42 fastqAligned.toTranscriptome.out.bam
-rw-rw-r-- 1 acamar2 acamar2 1,8K Nov 28 16:42 fastqLog.final.out
-rw-rw-r-- 1 acamar2 acamar2 461K Nov 28 16:42 fastqLog.out
-rw-r--r-- 1 acamar2 acamar2 268M Nov 19 16:30 Rattus_norvegicus.Rnor_6.0.94.gtf
-rw-rw-r-- 1 acamar2 acamar2  246 Nov 28 16:42 fastqLog.progress.out
-rw-rw-r-- 1 acamar2 acamar2 797K Nov 28 16:42 fastqReadsPerGene.out.tab
-rw-rw-r-- 1 acamar2 acamar2    0 Nov 28 16:42 fastqSJ.out.tab
drwx------ 2 acamar2 acamar2 4,0K Nov 28 16:03 fastq_STARgenome
drwx------ 3 acamar2 acamar2   20 Nov 28 16:42 fastq_STARtmp

But, my files ReadsPerGene and Log.final.out appear zeroed.

N_unmapped  0   0   0
N_multimapping  0   0   0
N_noFeature 0   0   0
N_ambiguous 0   0   0
ENSRNOG00000046319  0   0   0
ENSRNOG00000047964  0   0   0
ENSRNOG00000050370  0   0   0
ENSRNOG00000032365  0   0   0
ENSRNOG00000040300  0   0   0
ENSRNOG00000058808  0   0   0
ENSRNOG00000061316  0   0   0
ENSRNOG00000050129  0   0   0
ENSRNOG00000040316  0   0   0
ENSRNOG00000023659  0   0   0
ENSRNOG00000029897  0   0   0
ENSRNOG00000042852  0   0   0
ENSRNOG00000061806  0   0   0
ENSRNOG00000055877  0   0   0
ENSRNOG00000014303  0   0   0
ENSRNOG00000051899  0   0   0
ENSRNOG00000014330  0   0   0
ENSRNOG00000049505  0   0   0
ENSRNOG00000014916  0   0   0
ENSRNOG00000014996  0   0   0
ENSRNOG00000015239  0   0   0
ENSRNOG00000055508  0   0   0
ENSRNOG00000053710  0   0   0
ENSRNOG00000015552  0   0   0
ENSRNOG00000016041  0   0   0
ENSRNOG00000016054  0   0   0
ENSRNOG00000016381  0   0   0
ENSRNOG00000054005  0   0   0
ENSRNOG00000059137  0   0   0
ENSRNOG00000013160  0   0   0
ENSRNOG00000052788  0   0   0
ENSRNOG00000023549  0   0   0
ENSRNOG00000031859  0   0   0
ENSRNOG00000013351  0   0   0


            Started job on |    Nov 28 17:12:06
                         Started mapping on |   Nov 28 17:15:03
                                Finished on |   Nov 28 17:51:41
   Mapping speed, Million of reads per hour |   0.00

                      Number of input reads |   0
                  Average input read length |   0
                                UNIQUE READS:
               Uniquely mapped reads number |   0
                    Uniquely mapped reads % |   0.00%
                      Average mapped length |   0.00
                   Number of splices: Total |   0
        Number of splices: Annotated (sjdb) |   0
                   Number of splices: GT/AG |   0
                   Number of splices: GC/AG |   0
                   Number of splices: AT/AC |   0
           Number of splices: Non-canonical |   0
                  Mismatch rate per base, % |   -nan%
                     Deletion rate per base |   0.00%
                    Deletion average length |   0.00
                    Insertion rate per base |   0.00%
                   Insertion average length |   0.00
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   0
         % of reads mapped to multiple loci |   0.00%
    Number of reads mapped to too many loci |   0
         % of reads mapped to too many loci |   0.00%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |   0.00%
             % of reads unmapped: too short |   0.00%
                 % of reads unmapped: other |   0.00%
                              CHIMERIC READS:
                   Number of chimeric reads |   0
                        % of chimeric reads |   0.00%

I do not know if I'm generating the genome index or doing the alignment correctly. Can someone help me?

Thanks

rna-seq star alignment genome • 175 views
ADD COMMENTlink modified 13 days ago by Kevin Blighe33k • written 13 days ago by carolgalah0
1

I'm pretty sure you want either zcat or gunzip -c rather than just gunzip as the extraction command.

ADD REPLYlink written 13 days ago by Devon Ryan86k

You don't need to use multiples fasta files, you have the complete genome under ftp://ftp.ensembl.org/pub/release-94/fasta/rattus_norvegicus/dna/Rattus_norvegicus.Rnor_6.0.dna.toplevel.fa.gz

ADD REPLYlink written 13 days ago by Bastien Hervé2.7k

I tried to use the toplevel, but it also didn't work. They gave the same results.

When I run the alignment, do I need to put the FASTA and GTF file in the same folder as the genome?

ADD REPLYlink written 13 days ago by carolgalah0

No, I think this should be enought. Unzip the genome file :

STAR --runThreadN 8 --runMode genomeGenerate --genomeDir /home/acamar2/files/Rnor_6.0 --genomeFastaFiles /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.toplevel.fa --sjdbGTFfile /home/acamar2/files/annotation/Rattus_norvegicus.Rnor_6.0.94.gtf
ADD REPLYlink written 13 days ago by Bastien Hervé2.7k

I tried to use exactly this script with the toplevel. Is it some mistake in my samples?

Always before generating the genome, I unzipped the FASTA files.

I need to unzip this file: -rw-rw-r-- 1 acamar2 acamar2 2,7G Nov 28 15:49 Genome

ADD REPLYlink written 13 days ago by carolgalah0

No your index seems OK. Try with two files :

STAR --runThreadN 8 --genomeDir /home/acamar2/files/Rnor_6.0 --sjdbGTFfile /home/acamar2/files/Rnor_6.0/Rattus_norvegicus.Rnor_6.0.94.gtf --readFilesIn /home/acamar2/rawdata/1_1.fastq.gz /home/acamar2/rawdata/1_2.fastq.gz --quantMode TranscriptomeSAM GeneCounts --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /home/acamar2/STARoutput/fastq
ADD REPLYlink written 13 days ago by Bastien Hervé2.7k

--genomeDir /home/acamar2/files/Rnor_6.0

Rnor_6.0 is directory where the files of the genome:

drwxr-xr-x 2 acamar2 acamar2 4.0K Nov 28 15:58.
drwxr-xr-x 5 acamar2 acamar2 50 Nov 28 15:58 ..
-rw-rw-r-- 1 acamar2 acamar2 217 Nov 28 14:56 chrLength.txt
-rw-rw-r-- 1 acamar2 acamar2 275 Nov 28 14:56 chrNameLength.txt
-rw-rw-r-- 1 acamar2 acamar2 58 Nov 28 14:56 chrName.txt
-rw-rw-r-- 1 acamar2 acamar2 245 Nov 28 14:56 chrStart.txt
-rw-rw-r-- 1 acamar2 acamar2 11M Nov 28 15:44 exonGeTrInfo.tab
-rw-rw-r-- 1 acamar2 acamar2 4,5M Nov 28 15:44 exonInfo.tab
-rw-rw-r-- 1 acamar2 acamar2 606K Nov 28 15:44 geneInfo.tab
-rw-rw-r-- 1 acamar2 acamar2 2,7G Nov 28 15:49 Genome
-rw-rw-r-- 1 acamar2 acamar2 4.0K Nov 28 14:55 genomeParameters.txt
-rw-r-r-- 1 acamar2 acamar2 268M Nov 19 16:30 Rattus_norvegicus.Rnor_6.0.94.gtf
-rw-rw-r-- 1 acamar2 acamar2 21G Nov 28 15:53 ​​SA
-rw-rw-r-- 1 acamar2 acamar2 1,5G Nov 28 15:53 ​​SAindex
-rw-rw-r-- 1 acamar2 acamar2 5.6M Nov 28 15:44 sjdbInfo.txt
-rw-rw-r-- 1 acamar2 acamar2 4.4M Nov 28 15:44 sjdbList.fromGTF.out.tab
-rw-rw-r-- 1 acamar2 acamar2 4.4M Nov 28 15:44 sjdbList.out.tab
-rw-rw-r-- 1 acamar2 acamar2 2,4M Nov 28 15:44 transcriptInfo.tab

, is this correct to indicate the directory, not a specific file?

I'll try only with two files! Do you think it is better to generate the genome index again with the toplevel?

thanks

ADD REPLYlink modified 13 days ago by Devon Ryan86k • written 13 days ago by carolgalah0

Correct, you just indicate the directory and not a file within it.

ADD REPLYlink written 13 days ago by Devon Ryan86k

The index seems ok but yes try to re-index with the lop level file

ADD REPLYlink written 13 days ago by Bastien Hervé2.7k

I tried everything again ... and nothing worked. I recreated the genome index with toplevel. I ran the STAR using only 1 pair of samples. I tried aligning it with a couple of samples using the toplevel genome, also tried using the genome with all the chr files ... but nothing worked. What will happen? Best to try to generate the genome index with other files ... maybe NCBI?

ADD REPLYlink written 12 days ago by carolgalah0

May we have the command line you used and the standard output + log file please.

Note that

nothing worked

is the worst explanation about a problem. To be very complete and fully helpful we need all the softwares versions, the links to input files, the command lines used and the output results (log file, standard output).

ADD REPLYlink written 12 days ago by Bastien Hervé2.7k

Sorry for the delay in replying ...

Now it worked, I had not tried the use of gunzip -c, I had forgotten all the time to use the (-c), beginner error. Sorry for the inconvenience!

Another question: Star generates a directory named BAMsort, with several other folders. In those directories should have generated my .bam files, correct? Because those directories are empty.

I generated my genome index with the toplevel file, will it have any differences or problem? Because when I try to generate the genome index using all chr files 1-20, chr Y, MT and X, they are taking a long time and do not end (more than 24hr).

Thanks

ADD REPLYlink written 8 days ago by carolgalah0

Those directories should be empty after it's done. The sorted BAM file is in the output directory you specified. STAR is a bit bad at cleaning up after itself, so you'll see a bunch of _STAR* folder sitting about that you can delete.

The toplevel fasta should include all of the chromosomes anyway. Double check to ensure you have the right one.

ADD REPLYlink written 8 days ago by Devon Ryan86k

My files after the alignment:

-rw-rw-r-- 1 acamar2 acamar2  22G Dez  3 16:12 fastqAligned.sortedByCoord.out.bam
-rw-rw-r-- 1 acamar2 acamar2  25G Dez  3 15:37 fastqAligned.toTranscriptome.out.bam
-rw-rw-r-- 1 acamar2 acamar2 1,9K Dez  3 16:12 fastqLog.final.out
-rw-rw-r-- 1 acamar2 acamar2  58K Dez  3 16:12 fastqLog.out
-rw-rw-r-- 1 acamar2 acamar2 9,5K Dez  3 16:12 fastqLog.progress.out
-rw-rw-r-- 1 acamar2 acamar2 930K Dez  3 15:37 fastqReadsPerGene.out.tab
-rw-rw-r-- 1 acamar2 acamar2  15M Dez  3 16:12 fastqSJ.out.tab
drwx------ 2 acamar2 acamar2 4,0K Dez  3 14:14 fastq_STARgenome
drwx------ 3 acamar2 acamar2   20 Dez  3 16:12 fastq_STARtmp

So, the directory fastq_STARgenome and fastq_STARtmp, I can delete?

Thanks for your help!

ADD REPLYlink modified 7 days ago by Devon Ryan86k • written 8 days ago by carolgalah0

Yes, you can delete them

ADD REPLYlink written 7 days ago by Devon Ryan86k

Hi,

After alignment, I need to run the Cufflinks for assembly and quantify transcripts.

I need to indicate only my file Aligned.sortedByCoord.out.bam, or do I need to indicate my raw data + Aligned.sortedByCoord.out.bam.?

Do I need to index the Aligned.sortedByCoord.out.bam?

thanks

ADD REPLYlink written 5 days ago by carolgalah0

You only need the BAM file, I don't think it needs to be indexed.

Do not use cufflinks, use stringTie.

ADD REPLYlink written 5 days ago by Devon Ryan86k

Thanks for your reply...

I have two bam: rw-rw-r-- 1 acamar2 acamar2 22G Dez 3 16:12 fastqAligned.sortedByCoord.out.bam -rw-rw-r-- 1 acamar2 acamar2 25G Dez 3 15:37 fastqAligned.toTranscriptome.out.bam

I need use just this fastqAligned.sortedByCoord.out.bam... correct?

Thank you for the advice about StringTie, but my advisor wants me to use Cufflinks. But I'm sure to try StringTie too, to learn!

Thanks for your help!

ADD REPLYlink written 5 days ago by carolgalah0

Try aligning a single file or pair of them and see if you get something more reasonable.

ADD REPLYlink written 13 days ago by Devon Ryan86k

I tried everything again ... and nothing worked. I recreated the genome index with toplevel. I ran the STAR using only 1 pair of samples. I tried aligning it with a couple of samples using the toplevel genome, also tried using the genome with all the chr files ... but nothing worked. What will happen? Best to try to generate the genome index with other files ... maybe NCBI?

ADD REPLYlink written 12 days ago by carolgalah0

Stick to ENSEMBL or GENCODE for reference sequences, NCBI will cause you a lot of annoyances (UCSC likely will as well).

ADD REPLYlink written 12 days ago by Devon Ryan86k

Sorry for the delay in replying ...

Now it worked, I had not tried the use of gunzip -c, I had forgotten all the time to use the (-c), beginner error. Sorry for the inconvenience!

Another question: Star generates a directory named BAMsort, with several other folders. In those directories should have generated my .bam files, correct? Because those directories are empty.

I generated my genome index with the toplevel file, will it have any differences or problem? Because when I try to generate the genome index using all chr files 1-20, chr Y, MT and X, they are taking a long time and do not end (more than 24hr).

Thanks

ADD REPLYlink written 8 days ago by carolgalah0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1676 users visited in the last hour