STAR:

Question

RNA-Seq error in htseq-count

1

Entering edit mode

5.8 years ago

elb ▴ 250

Hi guys, I really do not know what is happening to my RNA seq data analysis pipeline and hence I'm looking for some help. Although I found some related posts and I tried to follow instructions to solve the problem I'm still here and I'm getting crazy.

Briefly: the steps are the following:

STAR:

for i in *fastq.gz; do STAR --runMode alignReads --genomeLoad LoadAndKeep --readFilesCommand zcat --outSAMtype BAM Unsorted --genomeDir /.../star_indices_overhang100 --readFilesIn $i ${i%.fastq.gz}.fastq.gz --runThreadN 10 --outFileNamePrefix ${i%.fastq.gz} done

This is ok. Then:

   samtools merge S4_merged.bam S4_1_sorted.bam S4_2_sorted.bam S4_3_sorted.bam S4_4_sorted.bam
   samtools sort S4_merged.bam S4_4_sorted
   samtools index S4_4_sorted

then:

      samtools view -F 4 S4_4_sorted | htseq-count -f bam -m intersection-nonempty  -t exon -i gene_id -a 15  -r pos  -s no - /.../hg38.refseq.gtf > Counts_S4.txt

Error:

    100000 GFF lines processed.
    200000 GFF lines processed.
    300000 GFF lines processed.
    400000 GFF lines processed.
    500000 GFF lines processed.
    600000 GFF lines processed.
    700000 GFF lines processed.
    800000 GFF lines processed.
    900000 GFF lines processed.
    926832 GFF lines processed.
    Error occured when reading beginning of SAM/BAM file.
    expected string or Unicode object, file found
   [Exception type: TypeError, raised in csamtools.pyx:477]

       samtools view -H  S4_4_sorted.bam

           @HD  VN:1.4  SO:coordinate
           @SQ  SN:chr1 LN:249250621
           @SQ  SN:chr2 LN:243199373
           @SQ  SN:chr3 LN:198022430
           @SQ  SN:chr4 LN:191154276
           @SQ  SN:chr5 LN:180915260
           @SQ  SN:chr6 LN:171115067
           @SQ  SN:chr7 LN:159138663
           @SQ  SN:chrX LN:155270560
           @SQ  SN:chr8 LN:146364022
           @SQ  SN:chr9 LN:141213431
           @SQ  SN:chr10    LN:135534747
           @SQ  SN:chr11    LN:135006516
           @SQ  SN:chr12    LN:133851895
           @SQ  SN:chr13    LN:115169878
           @SQ  SN:chr14    LN:107349540
           @SQ  SN:chr15    LN:102531392
           @SQ  SN:chr16    LN:90354753
           @SQ  SN:chr17    LN:81195210
           @SQ  SN:chr18    LN:78077248
           @SQ  SN:chr20    LN:63025520
           @SQ  SN:chrY LN:59373566
           @SQ  SN:chr19    LN:59128983
           @SQ  SN:chr22    LN:51304566
           @SQ  SN:chr21    LN:48129895
           @SQ  SN:chr1_gl000192_random LN:547496
           @SQ  SN:chrUn_gl000225   LN:211173
           @SQ  SN:chr4_gl000194_random LN:191469
           @SQ  SN:chr4_gl000193_random LN:189789
           @SQ  SN:chr9_gl000200_random LN:187035
           @SQ  SN:chrUn_gl000222   LN:186861
           @SQ  SN:chrUn_gl000212   LN:186858
           @SQ  SN:chr7_gl000195_random LN:182896
           ........................................................................

             more hg38.refseq.gtf.         
                chr1    hg38_ncbiRefSeq stop_codon  67093005    67093007    0.000000    -   .   gene_id "XM_011541469.1"; transcript_id "XM_011541469.1"; 
                 chr1   hg38_ncbiRefSeq CDS 67093008    67093604    0.000000    -   0   gene_id "XM_011541469.1"; transcript_id "XM_011541469.1"; 
                  chr1  hg38_ncbiRefSeq exon    67092176    67093604    0.000000    -   .   gene_id "XM_011541469.1"; transcript_id "XM_011541469.1"; 
                  chr1  hg38_ncbiRefSeq CDS 67095235    67095421    0.000000    -   1   gene_id "XM_011541469.1"; transcript_id "XM_011541469.1"; 
                   chr1 hg38_ncbiRefSeq exon    67095235    67095421    0.000000    -   .   gene_id "XM_011541469.1"; transcript_id "XM_011541469.1"; 
                   chr1 hg38_ncbiRefSeq CDS 67096252    67096321    0.000000    -   2   gene_id "XM_011541469.1"; transcript_id "XM_011541469.1"; 
                    chr1    hg38_ncbiRefSeq exon    67096252    67096321    0.000000    -   .   gene_id "XM_011541469.1"; transcript_id "XM_011541469.1"; 
                   chr1 hg38_ncbiRefSeq CDS 67103238    67103382    0.000000    -   0   gene_id "XM_011541469.1"; transcript_id "XM_011541469.1";

Can anyone help me please?

RNA-Seq next-gen sequencing • 1.6k views

ADD COMMENT • link updated 5.8 years ago by swbarnes2 14k • written 5.8 years ago by elb ▴ 250

0

Entering edit mode

samtools view -F 4 S4_4_sorted | htseq-count -f bam

Samtools view converts it to sam format, but them you are telling htseq-count that it is a bam file. So skip the view command. (Or drop -f bam)

ADD REPLY • link 5.8 years ago by swbarnes2 14k

score 1 · Answer 1 · 2018-07-12

1

Entering edit mode

5.8 years ago

swbarnes2 14k

samtools view -F 4 S4_4_sorted | htseq-count -f bam

Samtools view converts it to sam format, but them you are telling htseq-count that it is a bam file. So skip the view command. (Or drop -f bam)

Also, merging sorted files will result in a sorted merged file. Are the inputs to merge really sorted? It looks like you are telling STAR not to bother sorting, but then why are the files named "sorted"?

ADD COMMENT • link 5.8 years ago by swbarnes2 14k

0

Entering edit mode

So basically I'm doing some naive mistakes...I try to run properly. Thank you very much.

ADD REPLY • link 5.8 years ago by elb ▴ 250