Hi...
I'm new to the field of bioinformatics. I'm trying to align RNAseq with STAR, but I'm having problems. I do not know if it's time to generate the genome index or alignment.
I am generating the genome index for Rattus novergicus (ftp://ftp.ensembl.org/pub/release-94/fasta/rattus_norvegicus/dna/). I downloaded the FASTA files (chr1-chr20, chrY, chrX and chr MT) and the GTF file. that was my script:
STAR --runThreadN 8 --runMode genomeGenerate --genomeDir /home/acamar2/files/Rnor_6.0 --genomeFastaFiles /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.1.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.2.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.3.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.4.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.5.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.5.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.6.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.8.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.9.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.10.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.11.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.12.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.13.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.14.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.15.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.16.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.17.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.18.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.19.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.20.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.Y.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.MT.fa /home/acamar2/files/Fasta/Rattus_norvegicus.Rnor_6.0.dna.chromosome.X.fa --sjdbGTFfile /home/acamar2/files/annotation/Rattus_norvegicus.Rnor_6.0.94.gtf
The generated files were:
drwxr-xr-x 2 acamar2 acamar2 4,0K Nov 28 15:58 .
drwxr-xr-x 5 acamar2 acamar2 50 Nov 28 15:58 ..
-rw-rw-r-- 1 acamar2 acamar2 217 Nov 28 14:56 chrLength.txt
-rw-rw-r-- 1 acamar2 acamar2 275 Nov 28 14:56 chrNameLength.txt
-rw-rw-r-- 1 acamar2 acamar2 58 Nov 28 14:56 chrName.txt
-rw-rw-r-- 1 acamar2 acamar2 245 Nov 28 14:56 chrStart.txt
-rw-rw-r-- 1 acamar2 acamar2 11M Nov 28 15:44 exonGeTrInfo.tab
-rw-rw-r-- 1 acamar2 acamar2 4,5M Nov 28 15:44 exonInfo.tab
-rw-rw-r-- 1 acamar2 acamar2 606K Nov 28 15:44 geneInfo.tab
-rw-rw-r-- 1 acamar2 acamar2 2,7G Nov 28 15:49 Genome
-rw-rw-r-- 1 acamar2 acamar2 4,0K Nov 28 14:55 genomeParameters.txt
-rw-rw-r-- 1 acamar2 acamar2 21G Nov 28 15:53 SA
-rw-rw-r-- 1 acamar2 acamar2 1,5G Nov 28 15:53 SAindex
-rw-rw-r-- 1 acamar2 acamar2 5,6M Nov 28 15:44 sjdbInfo.txt
-rw-rw-r-- 1 acamar2 acamar2 4,4M Nov 28 15:44 sjdbList.fromGTF.out.tab
-rw-rw-r-- 1 acamar2 acamar2 4,4M Nov 28 15:44 sjdbList.out.tab
-rw-rw-r-- 1 acamar2 acamar2 2,4M Nov 28 15:44 transcriptInfo.tab
I followed the alignment of the samples with the following script:
STAR --runThreadN 8 --genomeDir /home/acamar2/files/Rnor_6.0 --sjdbGTFfile /home/acamar2/files/Rnor_6.0/Rattus_norvegicus.Rnor_6.0.94.gtf --readFilesIn /home/acamar2/rawdata/1_1.fastq.gz,/home/acamar2/rawdata/1_2.fastq.gz,/home/acamar2/rawdata/2_1.fastq.gz,/home/acamar2/rawdata/2_2.fastq.gz,/home/acamar2/rawdata/3_1.fastq.gz,/home/acamar2/rawdata/3_2.fastq.gz,/home/acamar2/rawdata/4_1.fastq.gz,/home/acamar2/rawdata/4_2.fastq.gz,/home/acamar2/rawdata/5_1.fastq.gz,/home/acamar2/rawdata/5_2.fastq.gz,/home/acamar2/rawdata/6_1.fastq.gz,/home/acamar2/rawdata/6_2.fastq.gz,/home/acamar2/rawdata/7_1.fastq.gz,/home/acamar2/rawdata/7_2.fastq.gz,/home/acamar2/rawdata/8_1.fastq.gz,/home/acamar2/rawdata/8_2.fastq.gz --readFilesCommand gunzip --quantMode TranscriptomeSAM GeneCounts --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /home/acamar2/STARoutput/fastq
The generated files were:
drwxr-xr-x 4 acamar2 acamar2 4,0K Nov 28 16:42 .
drwxr-xr-x 7 acamar2 acamar2 115 Nov 28 16:00 ..
-rw-rw-r-- 1 acamar2 acamar2 28 Nov 28 16:42 fastqAligned.sortedByCoord.out.bam
-rw-rw-r-- 1 acamar2 acamar2 567K Nov 28 16:42 fastqAligned.toTranscriptome.out.bam
-rw-rw-r-- 1 acamar2 acamar2 1,8K Nov 28 16:42 fastqLog.final.out
-rw-rw-r-- 1 acamar2 acamar2 461K Nov 28 16:42 fastqLog.out
-rw-r--r-- 1 acamar2 acamar2 268M Nov 19 16:30 Rattus_norvegicus.Rnor_6.0.94.gtf
-rw-rw-r-- 1 acamar2 acamar2 246 Nov 28 16:42 fastqLog.progress.out
-rw-rw-r-- 1 acamar2 acamar2 797K Nov 28 16:42 fastqReadsPerGene.out.tab
-rw-rw-r-- 1 acamar2 acamar2 0 Nov 28 16:42 fastqSJ.out.tab
drwx------ 2 acamar2 acamar2 4,0K Nov 28 16:03 fastq_STARgenome
drwx------ 3 acamar2 acamar2 20 Nov 28 16:42 fastq_STARtmp
But, my files ReadsPerGene and Log.final.out appear zeroed.
N_unmapped 0 0 0
N_multimapping 0 0 0
N_noFeature 0 0 0
N_ambiguous 0 0 0
ENSRNOG00000046319 0 0 0
ENSRNOG00000047964 0 0 0
ENSRNOG00000050370 0 0 0
ENSRNOG00000032365 0 0 0
ENSRNOG00000040300 0 0 0
ENSRNOG00000058808 0 0 0
ENSRNOG00000061316 0 0 0
ENSRNOG00000050129 0 0 0
ENSRNOG00000040316 0 0 0
ENSRNOG00000023659 0 0 0
ENSRNOG00000029897 0 0 0
ENSRNOG00000042852 0 0 0
ENSRNOG00000061806 0 0 0
ENSRNOG00000055877 0 0 0
ENSRNOG00000014303 0 0 0
ENSRNOG00000051899 0 0 0
ENSRNOG00000014330 0 0 0
ENSRNOG00000049505 0 0 0
ENSRNOG00000014916 0 0 0
ENSRNOG00000014996 0 0 0
ENSRNOG00000015239 0 0 0
ENSRNOG00000055508 0 0 0
ENSRNOG00000053710 0 0 0
ENSRNOG00000015552 0 0 0
ENSRNOG00000016041 0 0 0
ENSRNOG00000016054 0 0 0
ENSRNOG00000016381 0 0 0
ENSRNOG00000054005 0 0 0
ENSRNOG00000059137 0 0 0
ENSRNOG00000013160 0 0 0
ENSRNOG00000052788 0 0 0
ENSRNOG00000023549 0 0 0
ENSRNOG00000031859 0 0 0
ENSRNOG00000013351 0 0 0
Started job on | Nov 28 17:12:06
Started mapping on | Nov 28 17:15:03
Finished on | Nov 28 17:51:41
Mapping speed, Million of reads per hour | 0.00
Number of input reads | 0
Average input read length | 0
UNIQUE READS:
Uniquely mapped reads number | 0
Uniquely mapped reads % | 0.00%
Average mapped length | 0.00
Number of splices: Total | 0
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 0
Number of splices: GC/AG | 0
Number of splices: AT/AC | 0
Number of splices: Non-canonical | 0
Mismatch rate per base, % | -nan%
Deletion rate per base | 0.00%
Deletion average length | 0.00
Insertion rate per base | 0.00%
Insertion average length | 0.00
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 0
% of reads mapped to multiple loci | 0.00%
Number of reads mapped to too many loci | 0
% of reads mapped to too many loci | 0.00%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 0.00%
% of reads unmapped: other | 0.00%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
I do not know if I'm generating the genome index or doing the alignment correctly. Can someone help me?
Thanks
I'm pretty sure you want either
zcat
orgunzip -c
rather than justgunzip
as the extraction command.You don't need to use multiples fasta files, you have the complete genome under ftp://ftp.ensembl.org/pub/release-94/fasta/rattus_norvegicus/dna/Rattus_norvegicus.Rnor_6.0.dna.toplevel.fa.gz
I tried to use the toplevel, but it also didn't work. They gave the same results.
When I run the alignment, do I need to put the FASTA and GTF file in the same folder as the genome?
No, I think this should be enought. Unzip the genome file :
I tried to use exactly this script with the toplevel. Is it some mistake in my samples?
Always before generating the genome, I unzipped the FASTA files.
I need to unzip this file: -rw-rw-r-- 1 acamar2 acamar2 2,7G Nov 28 15:49 Genome
No your index seems OK. Try with two files :
--genomeDir /home/acamar2/files/Rnor_6.0
Rnor_6.0 is directory where the files of the genome:
, is this correct to indicate the directory, not a specific file?
I'll try only with two files! Do you think it is better to generate the genome index again with the toplevel?
thanks
Correct, you just indicate the directory and not a file within it.
The index seems ok but yes try to re-index with the lop level file
I tried everything again ... and nothing worked. I recreated the genome index with toplevel. I ran the STAR using only 1 pair of samples. I tried aligning it with a couple of samples using the toplevel genome, also tried using the genome with all the chr files ... but nothing worked. What will happen? Best to try to generate the genome index with other files ... maybe NCBI?
May we have the command line you used and the standard output + log file please.
Note that
is the worst explanation about a problem. To be very complete and fully helpful we need all the softwares versions, the links to input files, the command lines used and the output results (log file, standard output).
Sorry for the delay in replying ...
Now it worked, I had not tried the use of gunzip -c, I had forgotten all the time to use the (-c), beginner error. Sorry for the inconvenience!
Another question: Star generates a directory named BAMsort, with several other folders. In those directories should have generated my .bam files, correct? Because those directories are empty.
I generated my genome index with the toplevel file, will it have any differences or problem? Because when I try to generate the genome index using all chr files 1-20, chr Y, MT and X, they are taking a long time and do not end (more than 24hr).
Thanks
Those directories should be empty after it's done. The sorted BAM file is in the output directory you specified. STAR is a bit bad at cleaning up after itself, so you'll see a bunch of
_STAR*
folder sitting about that you can delete.The toplevel fasta should include all of the chromosomes anyway. Double check to ensure you have the right one.
My files after the alignment:
So, the directory fastq_STARgenome and fastq_STARtmp, I can delete?
Thanks for your help!
Yes, you can delete them
Hi,
After alignment, I need to run the Cufflinks for assembly and quantify transcripts.
I need to indicate only my file Aligned.sortedByCoord.out.bam, or do I need to indicate my raw data + Aligned.sortedByCoord.out.bam.?
Do I need to index the Aligned.sortedByCoord.out.bam?
thanks
You only need the BAM file, I don't think it needs to be indexed.
Do not use cufflinks, use stringTie.
Thanks for your reply...
I have two bam: rw-rw-r-- 1 acamar2 acamar2 22G Dez 3 16:12 fastqAligned.sortedByCoord.out.bam -rw-rw-r-- 1 acamar2 acamar2 25G Dez 3 15:37 fastqAligned.toTranscriptome.out.bam
I need use just this fastqAligned.sortedByCoord.out.bam... correct?
Thank you for the advice about StringTie, but my advisor wants me to use Cufflinks. But I'm sure to try StringTie too, to learn!
Thanks for your help!
Try aligning a single file or pair of them and see if you get something more reasonable.
I tried everything again ... and nothing worked. I recreated the genome index with toplevel. I ran the STAR using only 1 pair of samples. I tried aligning it with a couple of samples using the toplevel genome, also tried using the genome with all the chr files ... but nothing worked. What will happen? Best to try to generate the genome index with other files ... maybe NCBI?
Stick to ENSEMBL or GENCODE for reference sequences, NCBI will cause you a lot of annoyances (UCSC likely will as well).
Sorry for the delay in replying ...
Now it worked, I had not tried the use of gunzip -c, I had forgotten all the time to use the (-c), beginner error. Sorry for the inconvenience!
Another question: Star generates a directory named BAMsort, with several other folders. In those directories should have generated my .bam files, correct? Because those directories are empty.
I generated my genome index with the toplevel file, will it have any differences or problem? Because when I try to generate the genome index using all chr files 1-20, chr Y, MT and X, they are taking a long time and do not end (more than 24hr).
Thanks