Question: How can I extract and count unmapped reads using htseq count?
1
gravatar for farah.dahalan
4.0 years ago by
European Union
farah.dahalan10 wrote:

 

I have few questions regarding htseq count.

I run the htseq in virtual box using this command :

samtools view SMA1.bam | htseq-count -s no -i ID -t gene - Anopheles-gambiae-PEST_BASEFEATURES_AgamP4.2.gff3 > htseq2/SMA1.text

 

biouser@VirtualBox:~/Desktop/sf_farahdata$ samtools view SMA2.bam | htseq-count -s no -i ID -t gene - Anopheles-gambiae-PEST_BASEFEATURES_AgamP4.2.gff3 > htseq2/SMA2.text

100000 GFF lines processed.

175804 GFF lines processed.

Warning: Read HS31_15535:2:2109:18695:7195#14 claims to have an aligned mate which could not be found in an adjacent line.

Warning: Malformed SAM line: RNAME != '*' although flag bit &0x0004 set

Warning: Malformed SAM line: MRNM == '=' although read is not aligned.

Warning: Malformed SAM line: MRNM != '*' although flag bit &0x0008 set

100000 SAM alignment record pairs processed.

200000 SAM alignment record pairs processed.

300000 SAM alignment record pairs processed.

Q1 : What does the warning means?

 

I use this command to pull out unmapped reads :

samtools view -f 4 -c SMA2.bam

 

I attached the results from the reads data that I obtained. It only shows the read information rather than read counts.

<Screen Shot 2015-05-12 at 2.58.00 PM>

Q2 : How can I extracting counts from RNAseq data that does not align to a feature.

Thank you.

 

rna-seq ht-seq count • 1.8k views
ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by farah.dahalan10
0
gravatar for CraigM
4.0 years ago by
CraigM80
Ireland
CraigM80 wrote:

For your second question:

Remember that in this situation 'counts' refers to the number of reads aligned to a particular gene feature.

If you wish to know the total amount of reads which did not align, this can be found at the bottom of the htseq output using the tail command.

If you mean you wish to extract the actual 'reads' which did not align to a feature there are better methods for doing so such as using samtools view with the '-f 4' flag to show only unaligned reads (However this will not include reads which did align to the genome, but not to gene features unless you used an annotation file during alignment and restricted this happening).

This will give you the sam/bam records of the unmapped reads (if you want the fastq sequence, use the bedtools bamtofastq tool).

Another method of finding the unmapped reads is to use tophat for your alignments. The output of this tool by default includes the unmapped reads.

 

This thread may be useful for more information:

How To Filter Mapped Reads With Samtools

ADD COMMENTlink written 4.0 years ago by CraigM80
0
gravatar for farah.dahalan
4.0 years ago by
European Union
farah.dahalan10 wrote:

Hi Craig,

Thank you for your reply.

Maybe my question is not very clear. I use htseq count using annotated reference GFF file and I got around 3million of 'no_feature' reads. I am interested to get the counts on these unannotated regions. I wonder if I could do this using htseq count or any other tools?

Thank you in advance. :)

ADD COMMENTlink written 4.0 years ago by farah.dahalan10

Use the -o option, and then grep the resulting SAM file for no_feature. You'll likely find that these are in introns, around UTRs, and in repeat regions.

ADD REPLYlink written 4.0 years ago by Devon Ryan90k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1443 users visited in the last hour