Question

Genome index and alignment with STAR - Rattus norvegicus

5.4 years ago

carolgalah • 0

Hi,

I'm new to bioinformatics. To generate de genome index for Rattus_norvegicus, ENSEMBL offers many FASTA files (Chr1 ou Chr2...) and the toplevel file. Should I download the toplevel file, or is Chr1 or Chr2 enough?

Could you please check my script for the alignment?

    /moreno/STAR-master/bin/Linux_x86_64_static/STAR \
        --runThreadN 8 \
        --genomeDir /home/acamar2/files/reference/genomeindex \
        --sjdbGTFfile /home/acamar2/files/annotation/Rattus_norvegicus.Rnor_6.0.94.gtf \
        --readFilesIn /home/acamar2/rawdata/*.fastq.gz \
        --readFilesCommand zcat \
        --quantMode TranscriptomeSAM GeneCounts \
        --outSAMtype BAM SortedByCoordinate \
        --outFileNamePrefix fastq

Thanks

RNA-Seq genome alignment • 3.2k views

ADD COMMENT • link updated 5.4 years ago by Bruno Fantinatti • 0 • written 5.4 years ago by carolgalah • 0

Hello,

I see you are working with RNA-seq data. I presume you want to map your reads to the whole genome, so you need a FASTA file containing all chromosomes.

On the ENSEMBL FTP server for rat, the README file describes the available files:

TOPLEVEL

These files contain all sequence regions flagged as toplevel in an Ensembl schema. This includes chromosomes, regions not assembled into chromosomes and N padded haplotype/patch regions.

You should have used this file to create your index.

First recreate your index with the proper command line, then do the alignment.

Please copy/paste your index command line into your post.

Also, be careful with the non-English words in your post (de genome, Chr1 ou Chr2).

ADD REPLY • link 5.4 years ago by Bastien Hervé 5.3k
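
One way to see what a downloaded FASTA file actually contains (chromosomes only, or chromosomes plus scaffolds) is to list its sequence names. The snippet below is only a sketch: the function name fasta_seq_names is made up for this illustration, and it assumes the usual FASTA convention that every sequence starts with a ">" header line whose first word is the sequence name.

```shell
# Print the name of every sequence in a FASTA file (plain or gzip-compressed).
# fasta_seq_names is a hypothetical helper, not part of STAR or Ensembl.
fasta_seq_names() {
  case "$1" in
    *.gz) gzip -dc "$1" ;;   # decompress to stdout, leave the file untouched
    *)    cat "$1" ;;
  esac | sed -n 's/^>\([^ ]*\).*/\1/p'   # keep the first word of each header
}

# Example (path taken from the index command posted in this thread):
# fasta_seq_names /home/acamar2/files/RattuN/Rattus_norvegicus.Rnor_6.0.dna.chromosome.toplevel.fa | head
```

If the list only shows a single chromosome, reads from every other chromosome cannot be placed correctly.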

Please,

I still have trouble generating the genome index.

This is my command line:

    /moreno/STAR-master/bin/Linux_x86_64_static/STAR \
        --runThreadN 8 \
        --runMode genomeGenerate \
        --genomeDir /home/acamar2/files/reference \
        --genomeFastaFiles /home/acamar2/files/RattuN/Rattus_norvegicus.Rnor_6.0.dna.chromosome.toplevel.fa \
        --sjdbGTFfile /home/acamar2/annotation/Rattus_norvegicus.Rnor_6.0.94.gtf

When it finished, these were the generated files (I used the toplevel file):

    drwxr-xr-x 2 acamar2 acamar2 4,0K Nov 20 10:26 .
    drwxr-xr-x 5 acamar2 acamar2   53 Nov 27 14:00 ..
    -rw-rw-r-- 1 acamar2 acamar2 5,1K Nov 20 09:28 chrLength.txt
    -rw-rw-r-- 1 acamar2 acamar2  17K Nov 20 09:28 chrNameLength.txt
    -rw-rw-r-- 1 acamar2 acamar2  12K Nov 20 09:28 chrName.txt
    -rw-rw-r-- 1 acamar2 acamar2  11K Nov 20 09:28 chrStart.txt
    -rw-rw-r-- 1 acamar2 acamar2  11M Nov 20 10:17 exonGeTrInfo.tab
    -rw-rw-r-- 1 acamar2 acamar2 4,5M Nov 20 10:17 exonInfo.tab
    -rw-rw-r-- 1 acamar2 acamar2 611K Nov 20 10:17 geneInfo.tab
    -rw-rw-r-- 1 acamar2 acamar2 3,0G Nov 20 10:22 Genome
    -rw-rw-r-- 1 acamar2 acamar2  747 Nov 20 09:26 genomeParameters.txt
    -rw-rw-r-- 1 acamar2 acamar2  22G Nov 20 10:26 SA
    -rw-rw-r-- 1 acamar2 acamar2 1,5G Nov 20 10:26 SAindex
    -rw-rw-r-- 1 acamar2 acamar2 5,7M Nov 20 10:17 sjdbInfo.txt
    -rw-rw-r-- 1 acamar2 acamar2 4,5M Nov 20 10:17 sjdbList.fromGTF.out.tab
    -rw-rw-r-- 1 acamar2 acamar2 4,5M Nov 20 10:17 sjdbList.out.tab
    -rw-rw-r-- 1 acamar2 acamar2 2,5M Nov 20 10:17 transcriptInfo.tab

I think my genome index was not generated correctly, because when I run the alignment the output files are empty. Could you help me, please?

ADD REPLY • link 5.4 years ago by carolgalah • 0
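
The listing above shows non-empty Genome, SA and SAindex files, which suggests that genome generation did write an index and that the empty alignment output may have another cause. A quick sanity check could look like the sketch below. The function name check_star_index is hypothetical, and the file names are simply the ones seen in the listing, not an official list of everything STAR requires.

```shell
# Report index files that are missing or empty in a STAR genome directory.
# The file names come from the directory listing in this thread (assumption:
# these are the ones worth checking); check_star_index is a made-up helper.
check_star_index() {
  dir=$1
  status=0
  for f in Genome SA SAindex chrName.txt chrNameLength.txt genomeParameters.txt; do
    if [ ! -s "$dir/$f" ]; then          # -s: file exists and is not empty
      echo "missing or empty: $dir/$f" >&2
      status=1
    fi
  done
  return "$status"
}

# Example: check_star_index /home/acamar2/files/reference && echo "index looks complete"
```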

5.4 years ago

caggtaagtat ★ 1.9k

Hi,

I would include the unplaced scaffolds when generating the index. They should be included in the toplevel file, but check the documentation to be sure.

The STAR command for the alignment of your files looks good. However, --outFileNamePrefix is a prefix for the output file names, so it should normally contain the directory where you want to save the files: --outFileNamePrefix /path/to/output/dir/prefix. As written, STAR would save its output in the current working directory, with fastq prepended to each file name.

Edit: In Ensembl, the file to use would be the primary_assembly file, if available, since it does not contain haplotype/patch regions, which could potentially disrupt your analysis.

ADD COMMENT • link 5.4 years ago by caggtaagtat ★ 1.9k
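
To make the prefix behaviour concrete, the sketch below prints the BAM path that would result from a few --outFileNamePrefix values. It does not run STAR; star_bam_path is a hypothetical helper, and the sample1_ prefix is an invented example. The suffix Aligned.sortedByCoord.out.bam is the name STAR gives to coordinate-sorted BAM output.

```shell
# Show where the sorted BAM ends up for a given --outFileNamePrefix value.
# STAR appends fixed file names to the prefix; star_bam_path only mimics that.
star_bam_path() {
  printf '%s%s\n' "$1" "Aligned.sortedByCoord.out.bam"
}

star_bam_path fastq
# fastqAligned.sortedByCoord.out.bam  (in the current working directory)
star_bam_path /home/acamar2/STARoutput/fastq
# /home/acamar2/STARoutput/fastqAligned.sortedByCoord.out.bam
star_bam_path /home/acamar2/STARoutput/sample1_
# /home/acamar2/STARoutput/sample1_Aligned.sortedByCoord.out.bam
```

Creating the output directory beforehand (for example with mkdir -p /home/acamar2/STARoutput) is a safe habit, and a prefix ending in "_" or "/" keeps the file names readable.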

5.4 years ago

carolgalah • 0

Hi,

Thank you so much for the answers!

I recreated my index using the TOPLEVEL file.

I'll run the alignment now:

    /moreno/STAR-master/bin/Linux_x86_64_static/STAR \
        --runThreadN 8 \
        --genomeDir /home/acamar2/files/reference/ \
        --sjdbGTFfile /home/acamar2/files/annotation/Rattus_norvegicus.Rnor_6.0.94.gtf \
        --readFilesIn /home/acamar2/rawdata/*.fastq.gz \
        --readFilesCommand zcat \
        --quantMode TranscriptomeSAM GeneCounts \
        --outSAMtype BAM SortedByCoordinate \
        --outFileNamePrefix /home/acamar2/STARoutput/fastq

ADD COMMENT • link 5.4 years ago by carolgalah • 0

5.4 years ago

carolgalah • 0

Hi,

I tried to run the alignment, but the script does not find my raw files ("No such file or directory"):

    Nov 21 09:26:43 ..... Started STAR run
    Nov 21 09:26:43 ..... Loading genome
    Nov 21 09:27:12 ..... Processing annotations GTF
    Nov 21 09:27:20 ..... Inserting junctions into the genome indices
    gunzip: 1_2.fastq.gz: No such file or directory
    gunzip: 2_1.fastq.gz: No such file or directory
    gunzip: 2_2.fastq.gz: No such file or directory
    gunzip: 3_1.fastq.gz: No such file or directory
    gunzip: 3_2.fastq.gz: No such file or directory
    gunzip: 4_1.fastq.gz: No such file or directory
    gunzip: 4_2.fastq.gz: No such file or directory
    gunzip: 5_1.fastq.gz: No such file or directory
    gunzip: 5_2.fastq.gz: No such file or directory
    gunzip: 6_1.fastq.gz: No such file or directory
    gunzip: 6_2.fastq.gz: No such file or directory
    gunzip: 7_1.fastq.gz: No such file or directory
    gunzip: 7_2.fastq.gz: No such file or directory
    gunzip: 8_1.fastq.gz: No such file or directory
    gunzip: 8_2.fastq.gz: No such file or directory
    Nov 21 09:29:43 ..... Started mapping
    Nov 21 09:29:48 ..... Started sorting BAM
    Nov 21 09:29:48 ..... Finished successfully

I tried changing the read command to gunzip, but that did not work either. I also tried listing all of my files by name:

    /moreno/STAR-master/bin/Linux_x86_64_static/STAR \
        --runThreadN 8 \
        --genomeDir /home/acamar2/files/reference/ \
        --sjdbGTFfile /home/acamar2/files/annotation/Rattus_norvegicus.Rnor_6.0.94.gtf \
        --readFilesIn /home/acamar2/rawdata/1_1.fastq.gz,1_2.fastq.gz,2_1.fastq.gz,2_2.fastq.gz,3_1.fastq.gz,3_2.fastq.gz,4_1.fastq.gz,4_2.fastq.gz,5_1.fastq.gz,5_2.fastq.gz,6_1.fastq.gz,6_2.fastq.gz,7_1.fastq.gz,7_2.fastq.gz,8_1.fastq.gz,8_2.fastq.gz \
        --readFilesCommand gunzip \
        --quantMode TranscriptomeSAM GeneCounts \
        --outSAMtype BAM SortedByCoordinate \
        --outFileNamePrefix /home/acamar2/STAR1/fastq

The run only generates an empty folder instead of BAM output.

Should I create a variable for the samples so that STAR reads the variable? For example:

    /home/acamar2/rawdata/fastqc $f/*.fastq.gz

    for filename_1 in $f/*_1_*.fastq.gz
    do
        echo $filename_1
    done
    for filename_2 in $f/*_2_*.fastq.gz
    do
        echo $filename_2
    done

    /moreno/STAR-master/bin/Linux_x86_64_static/STAR \
        --runThreadN 8 \
        --genomeDir /home/acamar2/files/reference/ \
        --sjdbGTFfile /home/acamar2/files/annotation/Rattus_norvegicus.Rnor_6.0.94.gtf \
        --readFilesIn /home/acamar2/rawdata/1_1.fastq.gz,1_2.fastq.gz,2_1.fastq.gz,2_2.fastq.gz,3_1.fastq.gz,3_2.fastq.gz,4_1.fastq.gz,4_2.fastq.gz,5_1.fastq.gz,5_2.fastq.gz,6_1.fastq.gz,6_2.fastq.gz,7_1.fastq.gz,7_2.fastq.gz,8_1.fastq.gz,8_2.fastq.gz \
        --readFilesCommand gunzip \
        --quantMode TranscriptomeSAM GeneCounts \
        --outSAMtype BAM SortedByCoordinate \
        --outFileNamePrefix /home/acamar2/STAR1/fastq

    echo $f"Complete"

    done

Thanks

ADD COMMENT • link 5.4 years ago by carolgalah • 0
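
The gunzip messages in the log name every file except 1_1.fastq.gz, which is the only entry in the comma-separated list that carries the /home/acamar2/rawdata/ directory, so the other entries were probably looked up relative to the working directory. According to the STAR manual, paired-end input to --readFilesIn is given as one comma-separated list of read 1 files, a space, and then the matching list of read 2 files, and --readFilesCommand must write the decompressed reads to standard output (zcat or gunzip -c). The sketch below only builds and prints such an argument string. It assumes that N_1 and N_2 are the two mates of sample N, which the thread does not confirm, and join_mates is a hypothetical helper.

```shell
RAW_DIR=/home/acamar2/rawdata   # directory used in the thread

# Build a comma-separated list of full paths for one mate (1 or 2)
# across samples 1..8. join_mates is a made-up helper for illustration.
join_mates() {
  mate=$1
  list=""
  for sample in 1 2 3 4 5 6 7 8; do
    # ${list:+$list,} adds a comma only when the list is already non-empty
    list="${list:+$list,}$RAW_DIR/${sample}_${mate}.fastq.gz"
  done
  printf '%s\n' "$list"
}

READ1=$(join_mates 1)
READ2=$(join_mates 2)

# Print the arguments instead of running STAR, so they can be inspected first.
echo "--readFilesIn $READ1 $READ2 --readFilesCommand zcat"
```

Note that passing all samples in a single run produces one combined output; mapping each sample separately is usually what is wanted for per-sample counts.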

5.4 years ago

Bruno Fantinatti • 0

Hi, instead of creating a big loop like this, I would create a text file containing the input command for each sample set, one line per sample. Save it, make the file executable using chmod, and then run it inside screen.

If you are new to bioinformatics and to Linux/Unix environments, this way will be easier to get used to. With time you gain experience, and with experience you can start trying different things.

Sometimes we tend to write scripts to speed things up. But when the scripting itself takes too long, it's not a good idea...

Best regards

ADD COMMENT • link 5.4 years ago by Bruno Fantinatti • 0
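
The one-line-per-sample file described above can itself be generated, which avoids typing sixteen paths by hand. The following is a minimal sketch under stated assumptions: the paths are the ones used in this thread, N_1/N_2 are assumed to be the mates of sample N, and the names star_cmd, COMMANDS_FILE and the per-sample output prefix are invented for the illustration. Nothing is executed except writing the file.

```shell
STAR_BIN=/moreno/STAR-master/bin/Linux_x86_64_static/STAR
GENOME_DIR=/home/acamar2/files/reference
GTF=/home/acamar2/files/annotation/Rattus_norvegicus.Rnor_6.0.94.gtf
RAW_DIR=/home/acamar2/rawdata
OUT_DIR=/home/acamar2/STARoutput
COMMANDS_FILE=${COMMANDS_FILE:-star_commands.sh}   # hypothetical file name

# Print the STAR command for one sample on a single line (hypothetical helper).
star_cmd() {
  printf '%s --runThreadN 8 --genomeDir %s --sjdbGTFfile %s --readFilesIn %s %s --readFilesCommand zcat --quantMode TranscriptomeSAM GeneCounts --outSAMtype BAM SortedByCoordinate --outFileNamePrefix %s\n' \
    "$STAR_BIN" "$GENOME_DIR" "$GTF" \
    "$RAW_DIR/$1_1.fastq.gz" "$RAW_DIR/$1_2.fastq.gz" \
    "$OUT_DIR/$1_"
}

# Write a shebang line followed by one command per sample, then make it executable.
{
  echo '#!/bin/sh'
  for sample in 1 2 3 4 5 6 7 8; do
    star_cmd "$sample"
  done
} > "$COMMANDS_FILE"
chmod +x "$COMMANDS_FILE"
```

After reading the generated file to confirm that every path is correct, it can be started inside a screen session as suggested above.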

score 1 · Accepted Answer · 2018-11-20

Why don't you read what the STAR manual says?

https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf

2.2.1 Which chromosomes/scaffolds/patches to include?

It is strongly recommended to include major chromosomes (e.g., for human chr1-22,chrX,chrY,chrM,) as well as un-placed and un-localized scaffolds. Typically, un-placed/un-localized scaffolds add just a few MegaBases to the genome length, however, a substantial number of reads may map to ribosomal RNA (rRNA) repeats on these scaffolds. These reads would be reported as unmapped if the scaffolds are not included in the genome, or, even worse, may be aligned to wrong loci on the chromosomes. Generally, patches and alternative haplotypes should not be included in the genome. Examples of acceptable genome sequence files:

• ENSEMBL: files marked with .dna.primary.assembly, such as: ftp://ftp.ensembl.org/pub/release-77/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

• NCBI: "no alternative - analysis set": ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
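
To see how much the scaffolds contribute in a particular index, the chrNameLength.txt file that STAR writes into the genome directory (it appears in the listing earlier in this thread) can be summarized. This is only a sketch: it assumes the file has two tab-separated columns (sequence name, length) and Ensembl-style chromosome names (numbers, X, Y, MT), and summarize_chr_lengths is a hypothetical helper.

```shell
# Sum sequence lengths separately for named chromosomes and for everything else.
# Assumes lines of the form "<name><TAB><length>"; adjust the pattern for
# UCSC-style names such as chr1.
summarize_chr_lengths() {
  awk -F '\t' '
    $1 ~ /^([0-9]+|X|Y|MT)$/ { chrom += $2; next }
    { other += $2; n_other++ }
    END {
      printf "chromosomes\t%.0f\n", chrom
      printf "scaffolds\t%.0f\t(%d sequences)\n", other, n_other
    }
  ' "$1"
}

# Example: summarize_chr_lengths /home/acamar2/files/reference/chrNameLength.txt
```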