Question: HTSeq-count, problem in handling PE data
0
gravatar for barbara.mariotti81
21 months ago by
Italy
barbara.mariotti810 wrote:

Hi everyone,

I am a newborn in the sequencing fields and I have some problem with HTSeq. I know that there are a lot of posts about this problem but, even if I read all of them, I am not able to fix the problem I have.

I have PE RNA-seq data, aligned with TopHat. I want to use HTSeq-count to create the input for DESeq2 analysis.

Once obtained the accepted_hits.bam file what I did was:

samtools flagstat accepted_hits.bam > stat.txt
samtools index accepted_hits.bam
samtools view accepted_hits.bam > accepted_hits.sam
sort -k1,1 -k2,2n accepted_hits.sam > accepted_hits_sort.sam
htseq-count -f sam -m union -r pos -s no -t exon -i gene_id accepted_hits_sort.sam gtfFiles/Homo_sapiens.GRCh38.77.gtf > Sort_Pos.txt

this works fine with an accepted_hits.bam file of 3 GB with around 50 million reads mapped, but whan I moved to a 5 GB, 100 million reads mapped file I got this error:

Error occured when processing SAM input (record #35593187 in file accepted_hits_sort.sam):
  Maximum alignment buffer size exceeded while pairing SAM alignments.
  [Exception type: ValueError, raised in __init__.py:671]

I tried different option:

1- increasing the max buffer size of HTSeq-count (as suggest in http://seqanswers.com/forums/showthread.php?t=41531) ----> same error

2- sorting the bam per name instead of per position ----> Everything work but the feature I count corresponds to reads instead to fragment (probably for the resons described in the last reply of htseq-count for pair-end RNA-seq

3- According to the last reply of previous post I try to rename my reads using

(samtools view -H accepted_hits.bam; accepted_hits.bam | awk '{print substr($1,1,length($1)-1),$0}' | sed 's/ [^   ]*//') | samtools view -bSh - > in.samereadname.bam   

that return me this error:

 

So now my questions are:

1- Is what I did correct?

2- How can I solve the problem with the bigger bam file?

Thank you so much for your responses.

rna-seq htseq • 2.2k views
ADD COMMENTlink written 21 months ago by barbara.mariotti810
2

the correct way of proceeding is the second one. unless you have old data, the read name should be the same for both mates. you can check this by looking at your first bam lines. if you indeed see that the two mates end with \1 and \2, then you can run that one-liner with the correction of my other comment. 

ADD REPLYlink modified 21 months ago • written 21 months ago by Martombo1.7k

Sorry I forgot to report the error obtained from option 3:

accepted_hits.bam: comando non trovato
[samopen] SAM header is present: 194 sequences.
[sam_read1] reference 'ID:TopHat    VN:2.0.13    CL: tophat -p 8 -r 250 -G gtfFiles/Homo_sapiens.GRCh38.77.gtf -o ABC_Repl1 gtfFiles/Bowtie2.2.3Smallindex/Homo_sapiens.GRCh38.dna.primary_assembly ABC_Repl1.R1.fastq.filtered ABC_Repl1.R2.fastq.filtered
' is recognized as '*'.
[main_samview] truncated file.

 

ADD REPLYlink written 21 months ago by barbara.mariotti810
1

the code you used is missing the command: samtools view

which you have to insert after the first semi-colon. like this: 

(samtools view -H accepted_hits.bam; samtools view accepted_hits.bam | awk '{print substr($1,1,length($1)-1),$0}' | sed 's/ [^   ]*//') | samtools view -bSh - > in.samereadname.bam
ADD REPLYlink modified 21 months ago • written 21 months ago by Martombo1.7k

Thank you so much for your precious help...I will inform you about the results.

 

ADD REPLYlink modified 21 months ago • written 21 months ago by barbara.mariotti810
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 656 users visited in the last hour