Question

output SAM file from STAR aligner looks incomplete

8.4 years ago

nalandaatmi ▴ 100

Dear All,

I am using the STAR aligner to align FASTQ reads from human DNA against the human reference genome.

Steps I followed to install and run STAR:

1) I cloned the STAR repository onto my Linux machine with git clone https://github.com/alexdobin/STAR.git.

[software@gw2 STAR]$ ls
bin  CHANGES.md  doc  extras  LICENSE  Makefile  README.md  RELEASEnotes.md  source  STAR-Fusion

2) Under bin, I found the STAR executable. Is this the file I need to use for aligning?

[software@gw2 STAR]$  bin/Linux_x86_64/STAR
Usage: STAR  [options]... --genomeDir REFERENCE   --readFilesIn R1.fq R2.fq

3) Generating index for human genome

[software@gw2 STAR]$ /bin/Linux_x86_64/STAR --runMode genomeGenerate --genomeDir /references/STAR_References/ --genomeFastaFiles /references/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome.fa --runThreadN 20
Dec 01 17:10:31 ..... Started STAR run
Dec 01 17:10:31 ... Starting to generate Genome files
Dec 01 17:12:17 ... starting to sort  Suffix Array. This may take a long time...
Dec 01 17:12:41 ... sorting Suffix Array chunks and saving them to disk...
Dec 01 17:55:14 ... loading chunks from disk, packing SA...
Dec 01 18:02:40 ... Finished generating suffix array
Dec 01 18:02:40 ... Generating Suffix Array index
Dec 01 18:07:01 ... Completed Suffix Array index
Dec 01 18:07:01 ... writing Genome to disk ...
Dec 01 18:08:20 ... writing Suffix Array to disk ...
Dec 01 18:16:44 ... writing SAindex to disk
Dec 01 18:17:30 ..... Finished successfully
[software@gw2 STAR]$

4) Command executed for my samples.

$ STAR/bin/Linux_x86_64/STAR --genomeDir /references/STAR_References/ --runThreadN 20 --readFilesIn r1.fastq r2.fastq --outFileNamePrefix Sample_2002.sam

5) Log file:

Dec 02 03:38:55 ..... Started STAR run
Dec 02 03:38:55 ..... Loading genome
Dec 02 03:43:38 ..... Started mapping
Dec 02 03:43:52 ..... Finished successfully

6) I received the following output files:

Sample_2002.samLog.final.out
Sample_2002.samLog.out
Sample_2002.samLog.progress.out
Sample_2002.samSJ.out.tab
Sample_2002.samAligned.out.sam #(The file contents are displayed below)

@HD     VN:1.4
@SQ     SN:chrM LN:16571
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr2 LN:243199373
@SQ     SN:chr3 LN:198022430
@SQ     SN:chr4 LN:191154276
@SQ     SN:chr5 LN:180915260
@SQ     SN:chr6 LN:171115067
@SQ     SN:chr7 LN:159138663
@SQ     SN:chr8 LN:146364022
@SQ     SN:chr9 LN:141213431
@SQ     SN:chr10        LN:135534747
@SQ     SN:chr11        LN:135006516
@SQ     SN:chr12        LN:133851895
@SQ     SN:chr13        LN:115169878
@SQ     SN:chr14        LN:107349540
@SQ     SN:chr15        LN:102531392
@SQ     SN:chr16        LN:90354753
@SQ     SN:chr17        LN:81195210
@SQ     SN:chr18        LN:78077248
@SQ     SN:chr19        LN:59128983
@SQ     SN:chr20        LN:63025520
@SQ     SN:chr21        LN:48129895
@SQ     SN:chr22        LN:51304566
@SQ     SN:chrX LN:155270560
@SQ     SN:chrY LN:59373566
@PG     ID:STAR PN:STAR VN:STAR_2.5.0b  CL:/STAR/bin/Linux_x86_64/STAR   --runThreadN 20   --genomeDir /references/STAR_References/   --readFilesIn /Sample_2002/2002_AGCTAGTG_L002_R1.all_val_1.fq   /Sample_2002/2002_AGCTAGTG_L002_R2.all_val_2.fq      --outFileNamePrefix /Sample_2002/2002.sam
@CO     user command line: /STAR/bin/Linux_x86_64/STAR --genomeDir /references/STAR_References/ --runThreadN 20 --readFilesIn /Sample_2002/2002_AGCTAGTG_L002_R1.all_val_1.fq /Sample_2002/2002_AGCTAGTG_L002_R2.all_val_2.fq --outFileNamePrefix /Sample_2002/2002.sam

NOTHING after this?
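
One way to confirm that a SAM file holds only a header is to count the header lines (those starting with `@`) separately from the alignment records. The following is a minimal sketch in Python; the function name `count_sam_lines` is hypothetical and not part of STAR or any SAM library.

```python
from typing import Iterable, Tuple


def count_sam_lines(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (header_lines, alignment_records) for plain-text SAM content."""
    header = 0
    records = 0
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("@"):
            header += 1  # @HD, @SQ, @PG, @CO ...
        else:
            records += 1  # every other line is one alignment record
    return header, records
```

Passing an open file handle, for example `count_sam_lines(open("Sample_2002.samAligned.out.sam"))`, would return a record count of zero if STAR wrote no alignments at all.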

DNAseq alignment aligner STAR

What's in Sample_2002.samLog.final.out? Also, why did you not use a genome annotation file during the genome generation step to make full use of spliced alignments?

written 8.4 years ago by Michael

Dear Michael,

"Why did you not use a genome annotation file during the genome generation step to make full use of spliced alignments?"

Do you mean the human GTF file? No, I didn't use it. Thank you for pointing that out. I will try to build a new index with the GTF file.

Please find the contents of Sample_2002.samLog.final.out below:

                             Started job on |       Dec 02 03:38:55
                         Started mapping on |       Dec 02 03:43:38
                                Finished on |       Dec 02 03:43:52
   Mapping speed, Million of reads per hour |       0.00

                      Number of input reads |       0
                  Average input read length |       0
                                UNIQUE READS:
               Uniquely mapped reads number |       0
                    Uniquely mapped reads % |       0.00%
                      Average mapped length |       0.00
                   Number of splices: Total |       0
        Number of splices: Annotated (sjdb) |       0
                   Number of splices: GT/AG |       0
                   Number of splices: GC/AG |       0
                   Number of splices: AT/AC |       0
           Number of splices: Non-canonical |       0
                  Mismatch rate per base, % |       -nan%
                     Deletion rate per base |       0.00%
                    Deletion average length |       0.00
                    Insertion rate per base |       0.00%
                   Insertion average length |       0.00
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |       0
         % of reads mapped to multiple loci |       0.00%
    Number of reads mapped to too many loci |       0
         % of reads mapped to too many loci |       0.00%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |       0.00%
             % of reads unmapped: too short |       0.00%
                 % of reads unmapped: other |       0.00%
                              CHIMERIC READS:
                   Number of chimeric reads |       0
                        % of chimeric reads |       0.00%
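
A log in this `key | value` layout can be checked programmatically, which is handy when many samples are run. The following is a rough sketch in Python; `parse_final_log` and `has_input_reads` are hypothetical helper names, and the only field relied on is the "Number of input reads" line shown above.

```python
from typing import Dict, Iterable


def parse_final_log(lines: Iterable[str]) -> Dict[str, str]:
    """Map each 'key | value' line of a Log.final.out file to a dict entry."""
    stats: Dict[str, str] = {}
    for line in lines:
        if "|" not in line:
            continue  # section titles such as 'UNIQUE READS:' carry no value
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def has_input_reads(stats: Dict[str, str]) -> bool:
    """True when the log reports at least one input read."""
    return int(stats.get("Number of input reads", "0")) > 0
```

With a file handle opened on Sample_2002.samLog.final.out, `has_input_reads(parse_final_log(handle))` would be False for the log above, since it reports 0 input reads.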

written 8.4 years ago by nalandaatmi

Number of input reads |       0

There you have it: your input file contained no reads, or it was unreadable or truncated.
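
Checking the input before aligning can be as simple as counting the reads in each FASTQ file. The following is a minimal sketch in Python, assuming the usual four lines per read with no line wrapping; `open_fastq` and `count_fastq_reads` are hypothetical helper names.

```python
import gzip
from typing import IO, Iterable


def open_fastq(path: str) -> IO[str]:
    """Open a FASTQ file as text, handling gzip-compressed (.gz) files."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_fastq_reads(lines: Iterable[str]) -> int:
    """Count reads, assuming exactly four lines per FASTQ record."""
    n_lines = sum(1 for _ in lines)
    return n_lines // 4
```

For example, `count_fastq_reads(open_fastq("r1.fastq"))` returning 0 would match the "Number of input reads | 0" line in the log.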

written 8.4 years ago by Michael

Dear Michael,

Yes, my trimmed FASTQ files are empty. Do I need to start a new question?

I used Trim Galore for adapter trimming. This is its summary:

=== Summary ===

Input filename: //Sample_2002/2002_AGCTAGTG_L002_R1.all.fastq.gz

Total reads processed:               6,289,520
Reads with adapters:                 2,963,516 (47.1%)
Reads written (passing filters):     6,289,520 (100.0%)

Total basepairs processed:   635,241,520 bp
Quality-trimmed:              12,351,960 bp (1.9%)
Total written (filtered):    568,191,752 bp (89.4%)

=== Adapter 1 ===

Sequence: GAGAGCGATCCTTGC; Type: regular 3'; Length: 15; Trimmed: 2963516 times.

=== Summary ===

Input filename: //Sample_2002/2002_AGCTAGTG_L002_R2.all.fastq.gz

Total reads processed:               6,289,520
Reads with adapters:                 1,832,060 (29.1%)
Reads written (passing filters):     6,289,520 (100.0%)

Total basepairs processed:    37,737,120 bp
Quality-trimmed:                 580,324 bp (1.5%)
Total written (filtered):     34,632,200 bp (91.8%)

=== Adapter 1 ===

Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 1832060 times.

Last 3 lines of my trim galore log file:

Total number of sequences analysed: 6289520
Number of sequence pairs removed because at least one read was shorter than the length cutoff (20 bp): 6289520 (100.00%)
Deleting both intermediate output files 2002_AGCTAGTG_L002_R1.all_trimmed.fq.gz and 2002_AGCTAGTG_L002_R2.all_trimmed.fq.gz
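
The two summaries above already hint at why every pair was discarded: dividing "Total basepairs processed" by "Total reads processed" gives the mean read length of each file before trimming. The following is a small worked calculation in Python using only the figures quoted above; the function name `mean_read_length` is hypothetical.

```python
def mean_read_length(total_bp: int, total_reads: int) -> float:
    """Mean bases per read: total basepairs divided by total reads."""
    if total_reads == 0:
        return 0.0
    return total_bp / total_reads


# Figures copied from the two Trim Galore summaries above.
r1_mean = mean_read_length(635_241_520, 6_289_520)
r2_mean = mean_read_length(37_737_120, 6_289_520)
LENGTH_CUTOFF = 20  # bp, the cutoff reported in the log line above

print(f"R1 mean length before trimming: {r1_mean:.1f} bp")
print(f"R2 mean length before trimming: {r2_mean:.1f} bp")
print("R2 shorter than cutoff:", r2_mean < LENGTH_CUTOFF)
```

R1 averages 101 bp per read, but R2 averages only 6 bp, which is already below the 20 bp cutoff before any trimming. That is consistent with 100% of pairs being removed, and it may mean the R2 file is not the true mate file, although that is only a guess from these numbers.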

written 8.4 years ago by nalandaatmi

Michael is right. The only other real caveat I would add about STAR is to make sure you have enough RAM; otherwise the alignment slows to a crawl.

written 8.4 years ago by cyril-cros

Yes, mapping to the human genome may require about 30 GB of RAM.
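
A machine can be checked against that rough figure before a run is started. The following is a minimal sketch in Python, assuming a Linux system where `os.sysconf` exposes the page size and page count; the helper names and the 30 GB default are illustrative, taken from the estimate above rather than from any STAR documentation.

```python
import os


def total_ram_gb() -> float:
    """Total physical memory in GB (relies on os.sysconf, so Linux/Unix only)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    n_pages = os.sysconf("SC_PHYS_PAGES")
    return page_size * n_pages / 1024 ** 3


def enough_ram(available_gb: float, required_gb: float = 30.0) -> bool:
    """Compare available memory with the rough 30 GB estimate quoted above."""
    return available_gb >= required_gb
```

Calling `enough_ram(total_ram_gb())` before launching the alignment gives an early warning on machines with too little memory.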

written 8.4 years ago by Sishuo Wang