Question: Error at the end of HT-seq Analysis and output.txt with 0 bytes in size
2
gravatar for tiago211287
5.1 years ago by
tiago2112871.1k
USA
tiago2112871.1k wrote:

When using HTseq-count I get an warning at the beguining :

Warning: Read SRR1503280.2 claims to have an aligned mate which could not be found in an adjacent line.


and at the end of the sam file:

 

74100000 SAM alignment record pairs processed.
74200000 SAM alignment record pairs processed.
74300000 SAM alignment record pairs processed.
74400000 SAM alignment record pairs processed.

Error occured when processing SAM input (line 74468230 of file accepted_hits_filtsort_Controle.sam):
  'pair_alignments' needs a sequence of paired-end alignments
  [Exception type: ValueError, raised in __init__.py:603]

and my outputfiles were empty:

0       HTseq_Controle_output.txt
0       HTseq_Huntington_Output.txt

I use this code :

python ~/programs/HTSeq-0.6.1/scripts/htseq-count accepted_hits_filtsort_Controle.sam ~/d
atabase/Mus_musculus.primary.gtf > HTseq_Controle_output.txt

 

I am very newbie but, I really try to look at all the other similar problems people already posted in the past and could not find a solution. Maybe it is trivial and I only need a tip. Could some one help me?

 

 

empty error count htseq • 6.8k views
ADD COMMENTlink modified 4.9 years ago by Biostar ♦♦ 20 • written 5.1 years ago by tiago2112871.1k
1

Have a look at that line (awk '{if(NR==74468230) print $0;}' accepted_hits_filtsort_Controle.sam), my guess is that it's truncated. Alternatively, perhaps it's marked as paired but is the last read in the file.

ADD REPLYlink written 5.1 years ago by Devon Ryan92k

Thanks for your help. I am still learning this basic commands. Now I learn 'awk'.
That line I get with awk is :

SRR1503281.46709390     0       17      8283856 50      50M     *       0       0       GTCTAGGTAGCGGCTTCACCGCCAACGGCACGGCCATGGCTGGAGCGCTG     C@@FFFFFHHHHHIJJJIIJJJIJIJJJIGIJJJJGGEGFGEF>@;<8;>      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50YT:Z:UU XS:A:+  NH:i:1

 

ADD REPLYlink written 5.1 years ago by tiago2112871.1k

That looks fine. What happens if you use the "-o some_file.sam" option with htseq-count? That should output the reads with annotation as to how each is treated. Then you can just "tail some_file.sam" to confirm where it's failing and then "grep -A 1 -B 10" the read name in the original SAM file to get some context. In general, the error you saw suggests that you have a single-end read when htseq-count is expecting a paired-end read. While the read you posted is a singleton, I'd be surprised if that's sufficient to cause the error.

ADD REPLYlink written 5.1 years ago by Devon Ryan92k

The last read in the some_file.sam I've got was:

SRR1503281.46709388     177     2       114047431       50      50M     15      102529168       0       ATCCTGAATATAAGGTAGGCTAAGAGAGAGACATCTCAGAAGCACTTGCG    FIIIGHGIGDIGGIHFF<CHGCDIGGHGHHCFBEGEHFFDDF;DFDD<@?      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50       YT:Z:UU XS:A:-  NH:i:1  XF:Z:__no_feature

And using grep I've got:

 

SRR1503281.46709388     177     2       114047431       50      50M     15      102529168       0       ATCCTGAATATAAGGTAGGCTAAGAGAGAGACATCTCAGAAGCACTTGCG    FIIIGHGIGDIGGIHFF<CHGCDIGGHGHHCFBEGEHFFDDF;DFDD<@?      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50       YT:Z:UU XS:A:-  NH:i:1

SRR1503281.46709389     177     8       94844877        50      36M5212N14M     =       119585686       24740859        GGCTGGACTGCAGAGGCCATTGCCGAAGGAGCCCAGTCCTTGGGTCTCTC    JIIJJJJIJIFIIJJJJJJJIIIJJJJJJJIJJJJJJHHHHHFFFFF@C@      AS:i:-4 XN:i:0  XM:i:1  XO:i:0  XG:i:0NM:i:1  MD:Z:23T26      YT:Z:UU XS:A:+  NH:i:1

 

ADD REPLYlink modified 5.1 years ago • written 5.1 years ago by tiago2112871.1k

What happens if you run "grep -A 3 -B 3 SRR1503281.46709388 accepted_hits_filtsort_Controle.sam"?

ADD REPLYlink written 5.1 years ago by Devon Ryan92k

I had to stop for a while because other sequences of other project but, I can not see anything wrong with grep result, can you?

 

 

SRR1503281.46709385     161     5       122458132       50      50M     2       14996341        0       CGCCATTGTCATTGGATATGGGGACTCAAAGATTGCACAATCCACTCCAT    @@@DDBDDBDDBBHI@HI>FEDCCFGFFHIFCCFGGIDGGGD@FEGGHGA      AS:i:-5 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:0T49     YT:Z:UU XS:A:-  NH:i:1
SRR1503281.46709386     177     3       152237187       50      50M     6       17340214        0       GCAACACTACACTTGACATTTACAGACTTATTTTTAAAGCTAGAAAATCA    HIFEHDIIGIJIJJIIGGEHGHJGJJHEJIIIEIIJJHHHHDEFFFF@@@      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50       YT:Z:UU XS:A:-  NH:i:1
SRR1503281.46709387     129     MT      6178    50      50M     5       124549557       0       GCTTTATTGTATGAGCCCACCACATATTCACAGTAGGATTAGATGTAGAC    CCCFFFFFHHHGHJJJJJJJJJJJJJJJJJJJJGIIJHIIJIIGIHIIJJ      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50 YT:Z:UU       XS:A:+  NH:i:1
SRR1503281.46709388     177     2       114047431       50      50M     15      102529168       0       ATCCTGAATATAAGGTAGGCTAAGAGAGAGACATCTCAGAAGCACTTGCG    FIIIGHGIGDIGGIHFF<CHGCDIGGHGHHCFBEGEHFFDDF;DFDD<@?      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50       YT:Z:UU XS:A:-  NH:i:1
SRR1503281.46709389     177     8       94844877        50      36M5212N14M     =       119585686       24740859        GGCTGGACTGCAGAGGCCATTGCCGAAGGAGCCCAGTCCTTGGGTCTCTC    JIIJJJJIJIFIIJJJJJJJIIIJJJJJJJIJJJJJJHHHHHFFFFF@C@      AS:i:-4 XN:i:0  XM:i:1  XO:i:0  XG:i:0NM:i:1  MD:Z:23T26      YT:Z:UU XS:A:+  NH:i:1
SRR1503281.46709390     0       17      8283856 50      50M     *       0       0       GTCTAGGTAGCGGCTTCACCGCCAACGGCACGGCCATGGCTGGAGCGCTG    C@@FFFFFHHHHHIJJJIIJJJIJIJJJIGIJJJJGGEGFGEF>@;<8;>      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50 YT:Z:UU XS:A:+NH:i:1
SRR1503281.46709391     0       15      36771138        50      50M     *       0       0       GGGGAAGCGGGTATCTTAGACACTGAGTGGAGCCAGAAAGATCATGCGGC    @@BDDFFFHHHFFHIJJIJJJGHIJJJFGGIJJJJJIJJJFIJJIIJJJJ      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50 YT:Z:UU       XS:A:-  NH:i:1

 

ADD REPLYlink modified 5.0 years ago • written 5.0 years ago by tiago2112871.1k

So the first 5 reads were originally paired-end and the last 2 were single-end. I don't think that htseq-count supports mixing the two (you can have orphaned reads mixed in, but not pure single-end reads).

ADD REPLYlink written 5.0 years ago by Devon Ryan92k

What script/program can I use to cut off this orphaned reads?

ADD REPLYlink written 5.0 years ago by tiago2112871.1k
7
gravatar for Devon Ryan
5.0 years ago by
Devon Ryan92k
Freiburg, Germany
Devon Ryan92k wrote:

To separate out alignments from paired-end datasets and single-end datasets just use samtools:

samtools view -bf 1 foo.bam > foo.paired-end.bam
samtools view -bF 1 foo.bam > foo.single-end.bam

That should suffice. You could also just pipe that into htseq-count.

ADD COMMENTlink written 5.0 years ago by Devon Ryan92k

Devon Ryan I can not express all my gratefulness for the help! Thanks! I let you know when it finished.

ADD REPLYlink written 5.0 years ago by tiago2112871.1k

It worked like a charm. Thank you.

ADD REPLYlink written 5.0 years ago by tiago2112871.1k

You are awesome Devon. Even I faced the above issue. For some samples, I received the above error. For sample X's accepted_hits.bam, after separating the paired-end and single-end reads. It is working great.

Now I am running htseq-count separately for paired-end and single-end reads. In order to get counts of globin genes Counting number of genes using htseq-count tool., I am summing up the paired-end and single-end reads. Am I correct?

Globin genes Paired-end Single-end Total
HBB 1160 18 1178
HBA1 608 3 611
HBA2 3191 35 3226
HBG1 38765 502 39267
HBG2 75801 948 76749
HBD 0 0 0
HBE1 0 0 0
HBZ 0 0 0
HBQ1 11 0 11
MB 2 0 2
CYGB 8 0 8
NGB 0 0 0
    Grandtotal 121,052

 

Special counters Paired-end Single-end Total
__no_feature 52867663 682833 53550496
__ambiguous 4922 74 4996
__too_low_aQual 0 0 0
__not_aligned 0 0 0
__alignment_not_unique 45853924 273697 46127621
    GrandTotal 99683113

Do I need to consider total number of reads processed 100,360,207  (Total reads processed for paired-end 99395040 and for single end 965,167 ) or do I need to subtract any special counters from total number of reads for calculating the percentage of reads?

% of reads for globin gene = (121,052/100,360,207)*100=0.12%. Am I correct?

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by nalandaatmi70

0.12% is a reasonable enough ball-park. It might be more useful to get the counts for everything (rather than just a handful of genes) and then reporting "the percent of feature-assignable fragments arising from globin genes", since that removes a lot of the background noise (the resulting number would be higher...most likely).

ADD REPLYlink written 4.0 years ago by Devon Ryan92k

Dear Devon,

The reason why I need this percentage of reads for globin genes and rRNA genes is to determine whether our new protocol (including the globin and rRNA depletion) works well. 

Can you explain a bit on feature assignable fragments and is that feature available in htseqcount?

Because this is a strand specific RNA-seq, I need to include the sense and antisense information for the corresponding genes. I have seen in one of your answers in this post How to extract sense and antisense read counts from mapping file (.bam) for RNA-seq data set.

Then I computed counts using REVERSE and YES options for strand in htseq-count How to interpret the difference among these three options in strandedness from HTSeq-count. Paste the content from that post for your reference.

Globin genes Stranded:No Stranded:Yes Stranded:Reverse
HBB 40204 40197 7
HBA1 38811 38795 16
HBA2 129847 129770 77
HBG1 1566 1566 0
HBG2 2750 2750 0
HBD 3 3 0
HBE1 1 0 1
HBZ 0 0 0
HBQ1 9 3 6
MB 4 0 4
CYGB 294 2 354
NGB 289 2 319

how to know the genes that belong + (sense) and - (antisense) strand? I understood in this way, read counts from stranded-YES refers to +(sense) strand and stranded-REVERSE refers to -(antisense) strand. Is my understanding correct?

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by nalandaatmi70

If you already know that you have a strand-specific dataset then you know the setting to use, since it's dictated by how the library was made. In your case, "Stranded:Yes" would seem to be the correct setting. "Reverse" has few alignments and if a library is unstranded then you expect "Stranded: Yes" and "Stranded: Reverse" to have similar counts and be roughly half of "Stranded: No".

BTW, + is not sense and - is not anti-sense. These are fundamentally different concepts. If you're not familiar with the difference then I would recommend that you pick up an intro genomics book.

ADD REPLYlink written 4.0 years ago by Devon Ryan92k

I apology if I am using it wrongly. Devon, but this book also refers sense as + strand and antisense as - strand https://books.google.ca/books?id=Il_mR7T2PxMC&pg=PA667&lpg=PA667&dq=sense+strand+vs+plus+strand&source=bl&ots=KC1T0aJpqO&sig=5fRmneaKCPymksh6qpYvqSIJc0w&hl=en&sa=X&ved=0CDAQ6AEwAzgKahUKEwi4kMydwZ_IAhXH8x4KHYEHCaU#v=onepage&q=sense%20strand%20vs%20plus%20strand&f=false . 

If I want to fetch sense and antisense read counts for RNA-seq data using htseq-count, how do I get it?

In this post (How to extract sense and antisense read counts from mapping file (.bam) for RNA-seq data set), you mentioned to use either featureCounts or htseq-count. Running them twice on the same dataset with the strand setting reversed (e.g., -s yes and then -s reverse) will produce what you want. 

I understood in this way, read counts from "Stranded-YES" refers to sense strand read counts and "Stranded-REVERSE" refers to antisense strand read counts. Is this interpretation correct? 

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by nalandaatmi70

Yikes, that book is horribly wrong. Sense refers to the strand from which a given gene is transcribed. That could be either + or -, since genes can be on either strand.

Edit: Stated somewhat more exactly the sense strand is the strand whose sequence matches that of the transcript, since technically a gene on the + strand is actually transcribed from the - strand, though obviously the reverse complement (i.e., the + strand) is what's thereby produced.

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by Devon Ryan92k

Do you know how to get read counts for sense and anti sense strand?

ADD REPLYlink written 4.0 years ago by nalandaatmi70

You already have them, that's what the stranded setting in htseq-count is giving you.

ADD REPLYlink written 4.0 years ago by Devon Ryan92k

I have to include the sense and antisense information for the corresponding genes. I am looking for something. do I need to write a script for this task? or htseq-count has an option for this kind of result.

  Actual counts Expecting to split stranded into below 2 
Globin genes Stranded:Yes Sense Strand Antisense strand
HBB 40197  some x counts some y counts
HBA1 38795  some x counts some y counts
HBA2 129770  some x counts some y counts
HBG1 1566  some x counts some y counts
HBG2 2750  some x counts some y counts
HBD 3  some x counts some y counts
HBE1 0 0 0
HBZ 0 0 0
HBQ1 3  some x counts some y counts
MB 0 0 0
CYGB 2  some x counts some y counts
NGB 2  some x counts some y counts
ADD REPLYlink written 4.0 years ago by nalandaatmi70

The "stranded: yes" setting is only one strand. "Stranded: reverse" is the antisense strand (assuming "stranded: yes" is the appropriate setting for the library type, which it appears to be).

ADD REPLYlink written 4.0 years ago by Devon Ryan92k

Hi Devon Ryan, I am facing a similar problem although there are some differences. If you don't mind could you please check the question I have recently posted here and provide some insight? Thank you (Error upon running HTseq with paired end aligned BAM (coordinate sorted) files)

ADD REPLYlink written 6 months ago by bnayer260
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 901 users visited in the last hour