Error at the end of HT-seq Analysis and output.txt with 0 bytes in size
1
2
Entering edit mode
7.9 years ago
tiago211287 ★ 1.4k

When using HTseq-count I get an warning at the beguining :

Warning: Read SRR1503280.2 claims to have an aligned mate which could not be found in an adjacent line.


and at the end of the sam file:

74100000 SAM alignment record pairs processed.
74200000 SAM alignment record pairs processed.
74300000 SAM alignment record pairs processed.
74400000 SAM alignment record pairs processed.

Error occured when processing SAM input (line 74468230 of file accepted_hits_filtsort_Controle.sam):
'pair_alignments' needs a sequence of paired-end alignments
[Exception type: ValueError, raised in __init__.py:603]


and my outputfiles were empty:

0       HTseq_Controle_output.txt
0       HTseq_Huntington_Output.txt


I use this code :

python ~/programs/HTSeq-0.6.1/scripts/htseq-count accepted_hits_filtsort_Controle.sam \
~/database/Mus_musculus.primary.gtf > HTseq_Controle_output.txt


I am very newbie but, I really try to look at all the other similar problems people already posted in the past and could not find a solution. Maybe it is trivial and I only need a tip. Could some one help me?

Empty Count HTseq Error • 9.5k views
1
Entering edit mode

Have a look at that line (awk '{if(NR==74468230) print \$0;}' accepted_hits_filtsort_Controle.sam), my guess is that it's truncated. Alternatively, perhaps it's marked as paired but is the last read in the file.

0
Entering edit mode

Thanks for your help. I am still learning this basic commands. Now I learn 'awk'. That line I get with awk is :

SRR1503281.46709390     0       17      8283856 50      50M     *       0       0       GTCTAGGTAGCGGCTTCACCGCCAACGGCACGGCCATGGCTGGAGCGCTG     C@@FFFFFHHHHHIJJJIIJJJIJIJJJIGIJJJJGGEGFGEF>@;<8;>      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50YT:Z:UU XS:A:+  NH:i:1

0
Entering edit mode

That looks fine. What happens if you use the -o some_file.sam option with htseq-count? That should output the reads with annotation as to how each is treated. Then you can just tail some_file.sam to confirm where it's failing and then grep -A 1 -B 10 the read name in the original SAM file to get some context. In general, the error you saw suggests that you have a single-end read when htseq-count is expecting a paired-end read. While the read you posted is a singleton, I'd be surprised if that's sufficient to cause the error.

0
Entering edit mode

The last read in the some_file.sam I've got was:

SRR1503281.46709388     177     2       114047431       50      50M     15      102529168       0       ATCCTGAATATAAGGTAGGCTAAGAGAGAGACATCTCAGAAGCACTTGCG    FIIIGHGIGDIGGIHFF<CHGCDIGGHGHHCFBEGEHFFDDF;DFDD<@?      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50       YT:Z:UU XS:A:-  NH:i:1  XF:Z:__no_feature


And using grep I've got:

SRR1503281.46709388     177     2       114047431       50      50M     15      102529168       0       ATCCTGAATATAAGGTAGGCTAAGAGAGAGACATCTCAGAAGCACTTGCG    FIIIGHGIGDIGGIHFF<CHGCDIGGHGHHCFBEGEHFFDDF;DFDD<@?      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50       YT:Z:UU XS:A:-  NH:i:1
SRR1503281.46709389     177     8       94844877        50      36M5212N14M     =       119585686       24740859        GGCTGGACTGCAGAGGCCATTGCCGAAGGAGCCCAGTCCTTGGGTCTCTC    JIIJJJJIJIFIIJJJJJJJIIIJJJJJJJIJJJJJJHHHHHFFFFF@C@      AS:i:-4 XN:i:0  XM:i:1  XO:i:0  XG:i:0NM:i:1  MD:Z:23T26      YT:Z:UU XS:A:+  NH:i:1

0
Entering edit mode

What happens if you run grep -A 3 -B 3 SRR1503281.46709388 accepted_hits_filtsort_Controle.sam?

0
Entering edit mode

I had to stop for a while because other sequences of other project but, I can not see anything wrong with grep result, can you?

SRR1503281.46709385     161     5       122458132       50      50M     2       14996341        0       CGCCATTGTCATTGGATATGGGGACTCAAAGATTGCACAATCCACTCCAT    @@@DDBDDBDDBBHI@HI>FEDCCFGFFHIFCCFGGIDGGGD@FEGGHGA      AS:i:-5 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:0T49     YT:Z:UU XS:A:-  NH:i:1
SRR1503281.46709386     177     3       152237187       50      50M     6       17340214        0       GCAACACTACACTTGACATTTACAGACTTATTTTTAAAGCTAGAAAATCA    HIFEHDIIGIJIJJIIGGEHGHJGJJHEJIIIEIIJJHHHHDEFFFF@@@      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50       YT:Z:UU XS:A:-  NH:i:1
SRR1503281.46709387     129     MT      6178    50      50M     5       124549557       0       GCTTTATTGTATGAGCCCACCACATATTCACAGTAGGATTAGATGTAGAC    CCCFFFFFHHHGHJJJJJJJJJJJJJJJJJJJJGIIJHIIJIIGIHIIJJ      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50 YT:Z:UU       XS:A:+  NH:i:1
SRR1503281.46709388     177     2       114047431       50      50M     15      102529168       0       ATCCTGAATATAAGGTAGGCTAAGAGAGAGACATCTCAGAAGCACTTGCG    FIIIGHGIGDIGGIHFF<CHGCDIGGHGHHCFBEGEHFFDDF;DFDD<@?      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50       YT:Z:UU XS:A:-  NH:i:1
SRR1503281.46709389     177     8       94844877        50      36M5212N14M     =       119585686       24740859        GGCTGGACTGCAGAGGCCATTGCCGAAGGAGCCCAGTCCTTGGGTCTCTC    JIIJJJJIJIFIIJJJJJJJIIIJJJJJJJIJJJJJJHHHHHFFFFF@C@      AS:i:-4 XN:i:0  XM:i:1  XO:i:0  XG:i:0NM:i:1  MD:Z:23T26      YT:Z:UU XS:A:+  NH:i:1
SRR1503281.46709390     0       17      8283856 50      50M     *       0       0       GTCTAGGTAGCGGCTTCACCGCCAACGGCACGGCCATGGCTGGAGCGCTG    C@@FFFFFHHHHHIJJJIIJJJIJIJJJIGIJJJJGGEGFGEF>@;<8;>      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50 YT:Z:UU XS:A:+NH:i:1
SRR1503281.46709391     0       15      36771138        50      50M     *       0       0       GGGGAAGCGGGTATCTTAGACACTGAGTGGAGCCAGAAAGATCATGCGGC    @@BDDFFFHHHFFHIJJIJJJGHIJJJFGGIJJJJJIJJJFIJJIIJJJJ      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50 YT:Z:UU       XS:A:-  NH:i:1

0
Entering edit mode

So the first 5 reads were originally paired-end and the last 2 were single-end. I don't think that htseq-count supports mixing the two (you can have orphaned reads mixed in, but not pure single-end reads).

0
Entering edit mode

What script/program can I use to cut off this orphaned reads?

8
Entering edit mode
7.9 years ago

To separate out alignments from paired-end datasets and single-end datasets just use samtools:

samtools view -bf 1 foo.bam > foo.paired-end.bam
samtools view -bF 1 foo.bam > foo.single-end.bam


That should suffice. You could also just pipe that into htseq-count.

0
Entering edit mode

Devon Ryan I can not express all my gratefulness for the help! Thanks! I let you know when it finished.

0
Entering edit mode

It worked like a charm. Thank you.

0
Entering edit mode

You are awesome Devon. Even I faced the above issue. For some samples, I received the above error. For sample X's accepted_hits.bam, after separating the paired-end and single-end reads. It is working great.

Now I am running htseq-count separately for paired-end and single-end reads. In order to get counts of globin genes https://www.biostars.org/p/159585/ , I am summing up the paired-end and single-end reads. Am I correct?

Globin genes     Paired-end     Single-end     Total
HBB              1160           18             1178
HBA1             608            3              611
HBA2             3191           35             3226
HBG1             38765          502            39267
HBG2             75801          948            76749
HBD              0              0              0
HBE1             0              0              0
HBZ              0              0              0
HBQ1             11             0              11
MB               2              0              2
CYGB             8              0              8
NGB              0              0              0
Grandtotal     121,052

Special counters           Paired-end     Single-end     Total
__no_feature               52867663       682833         53550496
__ambiguous                4922           74             4996
__too_low_aQual            0              0              0
__not_aligned              0              0              0
__alignment_not_unique     45853924       273697         46127621
GrandTotal     99683113


Do I need to consider total number of reads processed 100,360,207 (Total reads processed for paired-end 99395040 and for single end 965,167 ) or do I need to subtract any special counters from total number of reads for calculating the percentage of reads?

% of reads for globin gene = (121,052/100,360,207)*100=0.12%. Am I correct?

0
Entering edit mode

0.12% is a reasonable enough ball-park. It might be more useful to get the counts for everything (rather than just a handful of genes) and then reporting "the percent of feature-assignable fragments arising from globin genes", since that removes a lot of the background noise (the resulting number would be higher...most likely).

0
Entering edit mode

Dear Devon,

The reason why I need this percentage of reads for globin genes and rRNA genes is to determine whether our new protocol (including the globin and rRNA depletion) works well.

Can you explain a bit on feature assignable fragments and is that feature available in htseqcount?

Because this is a strand specific RNA-seq, I need to include the sense and antisense information for the corresponding genes. I have seen in one of your answers in this post.

Then I computed counts using REVERSE and YES options for strand in htseq-count How to interpret the difference among these three options in strandedness from HTSeq-count. Paste the content from that post for your reference.

Globin genes     Stranded:No     Stranded:Yes     Stranded:Reverse
HBB              40204           40197            7
HBA1             38811           38795            16
HBA2             129847          129770           77
HBG1             1566            1566             0
HBG2             2750            2750             0
HBD              3               3                0
HBE1             1               0                1
HBZ              0               0                0
HBQ1             9               3                6
MB               4               0                4
CYGB             294             2                354
NGB              289             2                319


how to know the genes that belong + (sense) and - (antisense) strand? I understood in this way, read counts from stranded-YES refers to +(sense) strand and stranded-REVERSE refers to -(antisense) strand. Is my understanding correct?

0
Entering edit mode

If you already know that you have a strand-specific dataset then you know the setting to use, since it's dictated by how the library was made. In your case, "Stranded:Yes" would seem to be the correct setting. "Reverse" has few alignments and if a library is unstranded then you expect "Stranded: Yes" and "Stranded: Reverse" to have similar counts and be roughly half of "Stranded: No".

BTW, + is not sense and - is not anti-sense. These are fundamentally different concepts. If you're not familiar with the difference then I would recommend that you pick up an intro genomics book.

0
Entering edit mode

I apology if I am using it wrongly. Devon, but this book also refers sense as + strand and antisense as - strand.

If I want to fetch sense and antisense read counts for RNA-seq data using htseq-count, how do I get it?

In this post, you mentioned to use either featureCounts or htseq-count. Running them twice on the same dataset with the strand setting reversed (e.g., -s yes and then -s reverse) will produce what you want.

I understood in this way, read counts from "Stranded-YES" refers to sense strand read counts and "Stranded-REVERSE" refers to antisense strand read counts. Is this interpretation correct?

0
Entering edit mode

Yikes, that book is horribly wrong. Sense refers to the strand from which a given gene is transcribed. That could be either + or -, since genes can be on either strand.

Edit: Stated somewhat more exactly the sense strand is the strand whose sequence matches that of the transcript, since technically a gene on the + strand is actually transcribed from the - strand, though obviously the reverse complement (i.e., the + strand) is what's thereby produced.

0
Entering edit mode

Do you know how to get read counts for sense and anti sense strand?

0
Entering edit mode

You already have them, that's what the stranded setting in htseq-count is giving you.

0
Entering edit mode

I have to include the sense and antisense information for the corresponding genes. I am looking for something. do I need to write a script for this task? or htseq-count has an option for this kind of result.

Globin    Actual counts     Expecting to split stranded into below 2
genes     Stranded:Yes      Sense Strand      Antisense strand
HBB       40197             some x counts     some y counts
HBA1      38795             some x counts     some y counts
HBA2      129770            some x counts     some y counts
HBG1      1566              some x counts     some y counts
HBG2      2750              some x counts     some y counts
HBD       3                 some x counts     some y counts
HBE1      0                 0                 0
HBZ       0                 0                 0
HBQ1      3                 some x counts     some y counts
MB        0                 0                 0
CYGB      2                 some x counts     some y counts
NGB       2                 some x counts     some y counts

0
Entering edit mode

The "stranded: yes" setting is only one strand. "Stranded: reverse" is the antisense strand (assuming "stranded: yes" is the appropriate setting for the library type, which it appears to be).

0
Entering edit mode

Hi Devon Ryan, I am facing a similar problem although there are some differences. If you don't mind could you please check the question I have recently posted here and provide some insight? Thank you (Error upon running HTseq with paired end aligned BAM (coordinate sorted) files)