Question

HTSeq-count : is it same to use .bam file or .sam file?

0

Entering edit mode

2.4 years ago

mujiangxielu ▴ 10

.sam file was generated by following code

samtools sort -n Untreated-3/accepted_hits.bam > Untreated-3_sn.bam 
samtools view -o Untreated-3_sn.sam Untreated-3_sn.bam 
samtools sort Untreated-3/accepted_hits.bam > Untreated-3_s.bam
samtools index Untreated-3_s.bam

.gtf file was downloaded by:

wget ftp://ftp.ensembl.org/pub/release-70/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP5.70.gtf.gz

gunzip Drosophila_melanogaster.BDGP5.70.gtf.gz

when I use htseq-count:

htseq-count -s no -a 10 Untreated-3_sn.sam Drosophila_melanogaster.BDGP5.70.gtf > Untreated-3.count

an error occured:

 file has no sequences defined (mode='r') - is it SAM/BAM format? Consider opening with check_sq=False
  [Exception type: ValueError, raised in libcalignmentfile.pyx:990]

I can use samtools view see the sorted .bam file, but can't the .sam file generated by .bam file, error occured that

[E::sam_parse1] missing SAM header
[W::sam_read1] Parse error at line 1
[main_samview] truncated file.

I find I can use the .bam file to get the count file:

 htseq-count -s no -a 10 Untreated-3_sn.bam Drosophila_melanogaster.BDGP5.70.gtf > Untreated-3.count

It works like this:

100000 GFF lines processed.
200000 GFF lines processed.
300000 GFF lines processed.
358027 GFF lines processed.
100000 alignment record pairs processed.
200000 alignment record pairs processed.
……
9700000 alignment record pairs processed.
9800000 alignment record pairs processed.
9900000 alignment record pairs processed.
10000000 alignment record pairs processed.
10028400 alignment pairs processed.

I'm totally new to bioinfo, please tell me is it OK to get count file by .bam file rather than .sam file? thanks

count htseq • 1.7k views

ADD COMMENT • link 2.4 years ago by mujiangxielu ▴ 10

1

Entering edit mode

It is unclear what you are doing by flipping back and forth between SAM and BAM files at the beginning. That is not needed. Counting programs need co-ordinate sorted files. You are not including the header in your SAM file which is why it is not working. Show us output of head -5 your.sam. BAM files are simply binary representations of SAM files and are identical in all respects except format.

ADD REPLY • link 2.4 years ago by GenoMax 141k

0

Entering edit mode

Thank you very much!

my .sam head 5 are:

SRR031714.2     81      dmel_mitochondrion_genome       12940   50      37M     =       12785   -192    ATCTTTTTTATCGATATGAACTCTCCAAAAAAATTAC   BIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII   AS:i:0       XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:37 YT:Z:UU XS:A:-  NH:i:1
SRR031714.2     161     dmel_mitochondrion_genome       12785   50      37M     =       12940   192     CTTTCGTACTAAAATATCACAATTTTTTAAAGATAGA   IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII<I   AS:i:-6      XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:19T17      YT:Z:UU XS:A:-  NH:i:1
SRR031714.3     81      2L      2298267 0       37M     =       2298144 -160    TTTAATTTCTTGATCCAGCAATCTATTTTTCACAAAC   IIIIIIIIIIIIIIII'IIIIIIIIIIIIIIIIIIII   AS:i:0  XN:i:0  XM:i:0  XO:i:0       XG:i:0  NM:i:0  MD:Z:37 YT:Z:UU NH:i:17 CC:Z:=  CP:i:11567860   HI:i:0
SRR031714.3     337     2L      11567860        0       37M     =       11567737        -160    TTTAATTTCTTGATCCAGCAATCTATTTTTCACAAAC   IIIIIIIIIIIIIIII'IIIIIIIIIIIIIIIIIIII   AS:i:0  XN:i:0       XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:37 YT:Z:UU NH:i:17 CC:Z:=  CP:i:14808103   HI:i:1
SRR031714.3     337     2L      14808103        0       37M     =       14807980        -160    TTTAATTTCTTGATCCAGCAATCTATTTTTCACAAAC   IIIIIIIIIIIIIIII'IIIIIIIIIIIIIIIIIIII   AS:i:0  XN:i:0       XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:37 YT:Z:UU NH:i:17 CC:Z:=  CP:i:17824146   HI:i:2

I flipping back and forth between SAM and BAM because the article I want to reproduce did so, and the article used the generated sam file for HTSeq count…

ADD REPLY • link 2.4 years ago by mujiangxielu ▴ 10

1

Entering edit mode

You are missing headers in your SAM file so that is why you are unable to count. Using samtools view -h -o Untreated-3_sn.sam Untreated-3_sn.bam should fix that.

ADD REPLY • link 2.4 years ago by GenoMax 141k

0

Entering edit mode

Thank you very much!

ADD REPLY • link 2.4 years ago by mujiangxielu ▴ 10

1

Entering edit mode

After aligning the reads with 'hisat2' and generating aligned .bam files as output, I sort them by coordinates (not by name) and then they are ready to be used as input for featureCounts. I think all those steps back and forth between .sam and .bam files are unnecessary. I also don't think you necessarily need to index the files before annotation.

Moreover, I prefer to perform annotation using featureCounts. In case you are going to perform DE analysis afterwards using DESeq2, it implements a function that imports directly the count matrix generated with featureCounts.