Question: RNA-Seq error in htseq-count
1
gravatar for elb
9 weeks ago by
elb110
Torino
elb110 wrote:

Hi guys, I really do not know what is happening to my RNA seq data analysis pipeline and hence I'm looking for some help. Although I found some related posts and I tried to follow instructions to solve the problem I'm still here and I'm getting crazy.

Briefly: the steps are the following:

STAR:

for i in *fastq.gz; do STAR --runMode alignReads --genomeLoad LoadAndKeep --readFilesCommand zcat --outSAMtype BAM Unsorted --genomeDir /.../star_indices_overhang100 --readFilesIn $i ${i%.fastq.gz}.fastq.gz --runThreadN 10 --outFileNamePrefix ${i%.fastq.gz} done

This is ok. Then:

   samtools merge S4_merged.bam S4_1_sorted.bam S4_2_sorted.bam S4_3_sorted.bam S4_4_sorted.bam
   samtools sort S4_merged.bam S4_4_sorted
   samtools index S4_4_sorted
  

then:

      samtools view -F 4 S4_4_sorted | htseq-count -f bam -m intersection-nonempty  -t exon -i gene_id -a 15  -r pos  -s no - /.../hg38.refseq.gtf > Counts_S4.txt
  

Error:

    100000 GFF lines processed.
    200000 GFF lines processed.
    300000 GFF lines processed.
    400000 GFF lines processed.
    500000 GFF lines processed.
    600000 GFF lines processed.
    700000 GFF lines processed.
    800000 GFF lines processed.
    900000 GFF lines processed.
    926832 GFF lines processed.
    Error occured when reading beginning of SAM/BAM file.
    expected string or Unicode object, file found
   [Exception type: TypeError, raised in csamtools.pyx:477]
       samtools view -H  S4_4_sorted.bam
  
           @HD  VN:1.4  SO:coordinate
           @SQ  SN:chr1 LN:249250621
           @SQ  SN:chr2 LN:243199373
           @SQ  SN:chr3 LN:198022430
           @SQ  SN:chr4 LN:191154276
           @SQ  SN:chr5 LN:180915260
           @SQ  SN:chr6 LN:171115067
           @SQ  SN:chr7 LN:159138663
           @SQ  SN:chrX LN:155270560
           @SQ  SN:chr8 LN:146364022
           @SQ  SN:chr9 LN:141213431
           @SQ  SN:chr10    LN:135534747
           @SQ  SN:chr11    LN:135006516
           @SQ  SN:chr12    LN:133851895
           @SQ  SN:chr13    LN:115169878
           @SQ  SN:chr14    LN:107349540
           @SQ  SN:chr15    LN:102531392
           @SQ  SN:chr16    LN:90354753
           @SQ  SN:chr17    LN:81195210
           @SQ  SN:chr18    LN:78077248
           @SQ  SN:chr20    LN:63025520
           @SQ  SN:chrY LN:59373566
           @SQ  SN:chr19    LN:59128983
           @SQ  SN:chr22    LN:51304566
           @SQ  SN:chr21    LN:48129895
           @SQ  SN:chr1_gl000192_random LN:547496
           @SQ  SN:chrUn_gl000225   LN:211173
           @SQ  SN:chr4_gl000194_random LN:191469
           @SQ  SN:chr4_gl000193_random LN:189789
           @SQ  SN:chr9_gl000200_random LN:187035
           @SQ  SN:chrUn_gl000222   LN:186861
           @SQ  SN:chrUn_gl000212   LN:186858
           @SQ  SN:chr7_gl000195_random LN:182896
           ........................................................................
             more hg38.refseq.gtf.         
                chr1    hg38_ncbiRefSeq stop_codon  67093005    67093007    0.000000    -   .   gene_id "XM_011541469.1"; transcript_id "XM_011541469.1"; 
                 chr1   hg38_ncbiRefSeq CDS 67093008    67093604    0.000000    -   0   gene_id "XM_011541469.1"; transcript_id "XM_011541469.1"; 
                  chr1  hg38_ncbiRefSeq exon    67092176    67093604    0.000000    -   .   gene_id "XM_011541469.1"; transcript_id "XM_011541469.1"; 
                  chr1  hg38_ncbiRefSeq CDS 67095235    67095421    0.000000    -   1   gene_id "XM_011541469.1"; transcript_id "XM_011541469.1"; 
                   chr1 hg38_ncbiRefSeq exon    67095235    67095421    0.000000    -   .   gene_id "XM_011541469.1"; transcript_id "XM_011541469.1"; 
                   chr1 hg38_ncbiRefSeq CDS 67096252    67096321    0.000000    -   2   gene_id "XM_011541469.1"; transcript_id "XM_011541469.1"; 
                    chr1    hg38_ncbiRefSeq exon    67096252    67096321    0.000000    -   .   gene_id "XM_011541469.1"; transcript_id "XM_011541469.1"; 
                   chr1 hg38_ncbiRefSeq CDS 67103238    67103382    0.000000    -   0   gene_id "XM_011541469.1"; transcript_id "XM_011541469.1";
  

Can anyone help me please?

sequencing rna-seq next-gen • 199 views
ADD COMMENTlink modified 9 weeks ago by swbarnes24.0k • written 9 weeks ago by elb110
samtools view -F 4 S4_4_sorted | htseq-count -f bam

Samtools view converts it to sam format, but them you are telling htseq-count that it is a bam file. So skip the view command. (Or drop -f bam)

ADD REPLYlink written 9 weeks ago by swbarnes24.0k
1
gravatar for swbarnes2
9 weeks ago by
swbarnes24.0k
United States
swbarnes24.0k wrote:
samtools view -F 4 S4_4_sorted | htseq-count -f bam

Samtools view converts it to sam format, but them you are telling htseq-count that it is a bam file. So skip the view command. (Or drop -f bam)

Also, merging sorted files will result in a sorted merged file. Are the inputs to merge really sorted? It looks like you are telling STAR not to bother sorting, but then why are the files named "sorted"?

ADD COMMENTlink modified 9 weeks ago • written 9 weeks ago by swbarnes24.0k

So basically I'm doing some naive mistakes...I try to run properly. Thank you very much.

ADD REPLYlink written 9 weeks ago by elb110
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1402 users visited in the last hour