Question: output SAM file from STAR aligner looks incomplete
Asked 3.4 years ago by nalandaatmi (United States):

Dear All,

I am using the STAR aligner to align my FASTQ reads from human DNA against the human reference genome.

Steps I followed to install STAR:

1) Using git clone https://github.com/alexdobin/STAR.git, I cloned the STAR repository onto my Linux machine.

[software@gw2 STAR]$ ls
bin  CHANGES.md  doc  extras  LICENSE  Makefile  README.md  RELEASEnotes.md  source  STAR-Fusion

2) Under bin, I found the STAR executable. Is this the file I need to use for aligning?

[software@gw2 STAR]$  bin/Linux_x86_64/STAR
Usage: STAR  [options]... --genomeDir REFERENCE   --readFilesIn R1.fq R2.fq

3) Generating the index for the human genome

[software@gw2 STAR]$ /bin/Linux_x86_64/STAR --runMode genomeGenerate --genomeDir /references/STAR_References/ --genomeFastaFiles /references/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome.fa --runThreadN 20
Dec 01 17:10:31 ..... Started STAR run
Dec 01 17:10:31 ... Starting to generate Genome files
Dec 01 17:12:17 ... starting to sort  Suffix Array. This may take a long time...
Dec 01 17:12:41 ... sorting Suffix Array chunks and saving them to disk...
Dec 01 17:55:14 ... loading chunks from disk, packing SA...
Dec 01 18:02:40 ... Finished generating suffix array
Dec 01 18:02:40 ... Generating Suffix Array index
Dec 01 18:07:01 ... Completed Suffix Array index
Dec 01 18:07:01 ... writing Genome to disk ...
Dec 01 18:08:20 ... writing Suffix Array to disk ...
Dec 01 18:16:44 ... writing SAindex to disk
Dec 01 18:17:30 ..... Finished successfully
[software@gw2 STAR]$

4) Command executed for my samples.

$ STAR/bin/Linux_x86_64/STAR --genomeDir /references/STAR_References/ --runThreadN 20 --readFilesIn r1.fastq r2.fastq --outFileNamePrefix Sample_2002.sam

5) Log file:

Dec 02 03:38:55 ..... Started STAR run
Dec 02 03:38:55 ..... Loading genome
Dec 02 03:43:38 ..... Started mapping
Dec 02 03:43:52 ..... Finished successfully

6) I received the following output files:

Sample_2002.samLog.final.out

Sample_2002.samLog.out

Sample_2002.samLog.progress.out

Sample_2002.samSJ.out.tab

Sample_2002.samAligned.out.sam (The file contents are displayed below)

@HD     VN:1.4
@SQ     SN:chrM LN:16571
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr2 LN:243199373
@SQ     SN:chr3 LN:198022430
@SQ     SN:chr4 LN:191154276
@SQ     SN:chr5 LN:180915260
@SQ     SN:chr6 LN:171115067
@SQ     SN:chr7 LN:159138663
@SQ     SN:chr8 LN:146364022
@SQ     SN:chr9 LN:141213431
@SQ     SN:chr10        LN:135534747
@SQ     SN:chr11        LN:135006516
@SQ     SN:chr12        LN:133851895
@SQ     SN:chr13        LN:115169878
@SQ     SN:chr14        LN:107349540
@SQ     SN:chr15        LN:102531392
@SQ     SN:chr16        LN:90354753
@SQ     SN:chr17        LN:81195210
@SQ     SN:chr18        LN:78077248
@SQ     SN:chr19        LN:59128983
@SQ     SN:chr20        LN:63025520
@SQ     SN:chr21        LN:48129895
@SQ     SN:chr22        LN:51304566
@SQ     SN:chrX LN:155270560
@SQ     SN:chrY LN:59373566
@PG     ID:STAR PN:STAR VN:STAR_2.5.0b  CL:/STAR/bin/Linux_x86_64/STAR   --runThreadN 20   --genomeDir /references/STAR_References/   --readFilesIn /Sample_2002/2002_AGCTAGTG_L002_R1.all_val_1.fq   /Sample_2002/2002_AGCTAGTG_L002_R2.all_val_2.fq      --outFileNamePrefix /Sample_2002/2002.sam
@CO     user command line: /STAR/bin/Linux_x86_64/STAR --genomeDir /references/STAR_References/ --runThreadN 20 --readFilesIn /Sample_2002/2002_AGCTAGTG_L002_R1.all_val_1.fq /Sample_2002/2002_AGCTAGTG_L002_R2.all_val_2.fq --outFileNamePrefix /Sample_2002/2002.sam

NOTHING after this?

 

 


What's in Sample_2002.samLog.final.out? Why did you not use a genome annotation file during the genome generation step to make full use of spliced alignments?

Reply written 3.4 years ago by Michael Dondrup
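
For readers who want to follow the annotation suggestion above, here is a minimal sketch of regenerating the index with a GTF file. The function name `generate_star_index` and the GTF path in the usage note are hypothetical and not from the thread. The `--sjdbOverhang` value is set to read length minus one, which assumes you know your read length.

```shell
# Sketch only: rebuild a STAR index with splice-junction annotations.
# $1 = genome directory, $2 = genome FASTA, $3 = annotation GTF (hypothetical path),
# $4 = read length, $5 = thread count (optional, defaults to 20 as in the thread)
generate_star_index() {
    STAR --runMode genomeGenerate \
         --genomeDir "$1" \
         --genomeFastaFiles "$2" \
         --sjdbGTFfile "$3" \
         --sjdbOverhang $(( $4 - 1 )) \
         --runThreadN "${5:-20}"
}
```

A call might look like `generate_star_index /references/STAR_References/ genome.fa genes.gtf 101`, where `genes.gtf` stands in for whatever annotation file matches the hg19 chromosome names.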

Dear Michael,

"Why did you not use a genome annotation file during the genome generation step to make full use of spliced alignments?"

Do you mean the human GTF file? No, I didn't use it. Thanks for pointing that out. I will try to create a new index using the GTF file.

Please find the content of Sample_2002.samLog.final.out below:

                                 Started job on |       Dec 02 03:38:55
                             Started mapping on |       Dec 02 03:43:38
                                    Finished on |       Dec 02 03:43:52
       Mapping speed, Million of reads per hour |       0.00

                          Number of input reads |       0
                      Average input read length |       0
                                    UNIQUE READS:
                   Uniquely mapped reads number |       0
                        Uniquely mapped reads % |       0.00%
                          Average mapped length |       0.00
                       Number of splices: Total |       0
            Number of splices: Annotated (sjdb) |       0
                       Number of splices: GT/AG |       0
                       Number of splices: GC/AG |       0
                       Number of splices: AT/AC |       0
               Number of splices: Non-canonical |       0
                      Mismatch rate per base, % |       -nan%
                         Deletion rate per base |       0.00%
                        Deletion average length |       0.00
                        Insertion rate per base |       0.00%
                       Insertion average length |       0.00
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       0
             % of reads mapped to multiple loci |       0.00%
        Number of reads mapped to too many loci |       0
             % of reads mapped to too many loci |       0.00%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       0.00%
                     % of reads unmapped: other |       0.00%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%

 

Reply written 3.4 years ago by nalandaatmi
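
The key field in the listing above can be pulled out directly rather than read by eye. This is a minimal sketch, assuming the `|`-separated layout shown in the listing; the function name `star_input_reads` is made up for illustration.

```shell
# Print the value of "Number of input reads" from a STAR Log.final.out file.
# Each line has the form "<label> | <value>", padded with spaces or tabs.
star_input_reads() {
    awk -F'|' '/Number of input reads/ {
        gsub(/[ \t]/, "", $2)   # strip the padding around the number
        print $2
    }' "$1"
}
```

Run as `star_input_reads Sample_2002.samLog.final.out`; a result of 0 means STAR read nothing from the input files.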

Number of input reads |       0

There you have it: your input files contained no reads, or were unreadable or truncated.

Reply written 3.4 years ago by Michael Dondrup
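
One way to check the diagnosis above before mapping is to count the reads in each input file. This is a rough sketch, assuming standard four-line FASTQ records; the function name `fastq_read_count` is made up for illustration.

```shell
# Count reads in a FASTQ file (plain or gzip-compressed).
# Assumes every record occupies exactly four lines.
fastq_read_count() {
    case "$1" in
        *.gz) _lines=$(gzip -dc "$1" | wc -l) ;;   # decompress to stdout, then count
        *)    _lines=$(wc -l < "$1") ;;
    esac
    echo $(( _lines / 4 ))
}
```

Running it on both files passed to `--readFilesIn` shows at once whether either is empty; the two counts should also match for paired-end data.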

Dear Michael,

Yes, my trimmed FASTQ files are empty. Do I need to start a new question?

I used Trim Galore for adapter trimming. Here is its summary:

=== Summary ===

Input filename: //Sample_2002/2002_AGCTAGTG_L002_R1.all.fastq.gz

Total reads processed:               6,289,520
Reads with adapters:                 2,963,516 (47.1%)
Reads written (passing filters):     6,289,520 (100.0%)

Total basepairs processed:   635,241,520 bp
Quality-trimmed:              12,351,960 bp (1.9%)
Total written (filtered):    568,191,752 bp (89.4%)

=== Adapter 1 ===

Sequence: GAGAGCGATCCTTGC; Type: regular 3'; Length: 15; Trimmed: 2963516 times.

=== Summary ===

Input filename: //Sample_2002/2002_AGCTAGTG_L002_R2.all.fastq.gz

Total reads processed:               6,289,520
Reads with adapters:                 1,832,060 (29.1%)
Reads written (passing filters):     6,289,520 (100.0%)

Total basepairs processed:    37,737,120 bp
Quality-trimmed:                 580,324 bp (1.5%)
Total written (filtered):     34,632,200 bp (91.8%)

=== Adapter 1 ===

Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 1832060 times.

Last 3 lines of my Trim Galore log file:

Total number of sequences analysed: 6289520

Number of sequence pairs removed because at least one read was shorter than the length cutoff (20 bp): 6289520 (100.00%)

Deleting both intermediate output files 2002_AGCTAGTG_L002_R1.all_trimmed.fq.gz and 2002_AGCTAGTG_L002_R2.all_trimmed.fq.gz

Reply written 3.4 years ago by nalandaatmi
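
The two Trim Galore summaries above already hint at why every pair was removed. Dividing "Total basepairs processed" by "Total reads processed" gives the mean read length before trimming. A small sketch (the function name `mean_read_length` is made up for illustration):

```shell
# Mean read length = total basepairs processed / total reads processed.
# $1 = total basepairs, $2 = total reads (thousands separators removed)
mean_read_length() {
    echo $(( $1 / $2 ))
}

mean_read_length 635241520 6289520   # R1: prints 101
mean_read_length 37737120 6289520    # R2: prints 6
```

So the R2 reads average only 6 bp even before trimming, which is below the 20 bp length cutoff, and a pair is dropped when either read is too short. One possible explanation, which is an assumption and not confirmed in the thread, is that the R2 file holds index (barcode) reads rather than the second mate.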

Michael is right. The only other real caveat I would add about STAR is to make sure you have enough RAM; otherwise the alignment slows to a crawl.

Reply written 3.4 years ago by cyril-cros

Yeah, about 30 GB of RAM might be required for mapping to the human genome.

Reply written 3.4 years ago by Sishuo Wang
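
A quick way to compare a machine against the roughly 30 GB figure mentioned above is to read the total memory from `/proc/meminfo`. This is a Linux-only sketch, and the function name `total_mem_gib` is made up for illustration.

```shell
# Report total system memory in whole GiB.
# $1 = meminfo-style file (optional, defaults to /proc/meminfo on Linux)
total_mem_gib() {
    # MemTotal is reported in kB; 1 GiB = 1048576 kB.
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1048576 }' "${1:-/proc/meminfo}"
}
```

For example, `[ "$(total_mem_gib)" -ge 30 ] || echo "probably too little RAM for a human index"` gives a rough warning before a run is started.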