HTSeq-count, problem in handling PE data
0
0
Entering edit mode
8.4 years ago

Hi everyone,

I am a newborn in the sequencing fields and I have some problem with HTSeq. I know that there are a lot of posts about this problem but, even if I read all of them, I am not able to fix the problem I have.

I have PE RNA-seq data, aligned with TopHat. I want to use HTSeq-count to create the input for DESeq2 analysis.

Once obtained the accepted_hits.bam file what I did was:

samtools flagstat accepted_hits.bam > stat.txt
samtools index accepted_hits.bam
samtools view accepted_hits.bam > accepted_hits.sam
sort -k1,1 -k2,2n accepted_hits.sam > accepted_hits_sort.sam
htseq-count -f sam -m union -r pos -s no -t exon -i gene_id accepted_hits_sort.sam gtfFiles/Homo_sapiens.GRCh38.77.gtf > Sort_Pos.txt

This works fine with an accepted_hits.bam file of 3 GB with around 50 million reads mapped, but whan I moved to a 5 GB, 100 million reads mapped file I got this error:

Error occured when processing SAM input (record #35593187 in file accepted_hits_sort.sam):
  Maximum alignment buffer size exceeded while pairing SAM alignments.
  [Exception type: ValueError, raised in __init__.py:671]

I tried different option:

  1. increasing the max buffer size of HTSeq-count (as suggest in http://seqanswers.com/forums/showthread.php?t=41531) ----> same error
  2. sorting the bam per name instead of per position ----> Everything work but the feature I count corresponds to reads instead to fragment (probably for the resons described in the last reply of htseq-count for pair-end RNA-seq
  3. According to the last reply of previous post I try to rename my reads using

    (samtools view -H accepted_hits.bam; accepted_hits.bam | awk '{print substr($1,1,length($1)-1),$0}' | sed 's/ [^ ]*//') | samtools view -bSh - > in.samereadname.bam

that return me this error:

So now my questions are:

  1. Is what I did correct?
  2. How can I solve the problem with the bigger bam file?

Thank you so much for your responses.

RNA-Seq HTSeq • 6.1k views
ADD COMMENT
2
Entering edit mode

the correct way of proceeding is the second one. unless you have old data, the read name should be the same for both mates. you can check this by looking at your first bam lines. if you indeed see that the two mates end with \1 and \2, then you can run that one-liner with the correction of my other comment.

ADD REPLY
0
Entering edit mode

Sorry I forgot to report the error obtained from option 3:

accepted_hits.bam: comando non trovato
[samopen] SAM header is present: 194 sequences.
[sam_read1] reference 'ID:TopHat    VN:2.0.13    CL: tophat -p 8 -r 250 -G gtfFiles/Homo_sapiens.GRCh38.77.gtf -o ABC_Repl1 gtfFiles/Bowtie2.2.3Smallindex/Homo_sapiens.GRCh38.dna.primary_assembly ABC_Repl1.R1.fastq.filtered ABC_Repl1.R2.fastq.filtered
' is recognized as '*'.
[main_samview] truncated file.
ADD REPLY
1
Entering edit mode

the code you used is missing the command: samtools view

which you have to insert after the first semi-colon. like this:

(samtools view -H accepted_hits.bam; samtools view accepted_hits.bam | awk '{print substr($1,1,length($1)-1),$0}' | sed 's/ [^   ]*//') | samtools view -bSh - > in.samereadname.bam
ADD REPLY
0
Entering edit mode

Thank you so much for your precious help...I will inform you about the results.

ADD REPLY

Login before adding your answer.

Traffic: 3327 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6