Question

Feature count is very low using htseq-count

0

Entering edit mode

2.1 years ago

salmon ▴ 10

Hello all,

I performed bbmap on my RNA-seq paired sequence data using following cmd

bbmap.sh in1=J2_R1.fastq in2=J2_R2.fastq out=output_J2.sam ref=im4.fasta nodisk

The header of generated sam file is

@HD VN:1.4  SO:unsorted
@SQ SN:k141_1006    LN:2503
@SQ SN:k141_5512    LN:5393
@SQ SN:k141_4772    LN:4387
@SQ SN:k141_3267    LN:4531

I then used htseq-count using following cmd

 htseq-count -s no -a 10 output_J2.sam im4.gff > J2.count

The J2.count shows following

__no_feature    68032
__ambiguous 0
__too_low_aQual 45285
__not_aligned   21494716
__alignment_not_unique  0

The number of features are very less as compared to reads not aligned. Is it due to less mapping during bbmap (0.52%) Any help highly appreciated!

Thanks

htseq bbmap • 1.6k views

ADD COMMENT • link 2.1 years ago by salmon ▴ 10

1

Entering edit mode

before we go any further, do check if the sequence names are the same between the gff file and the BBmap output file (or thus the fasta reference file)

wait, do you mean that you only have a 0,52% mapping rate from BBmap ? If so, then yes likely that is the cause of those low counts you see (== practically no alignments thus)

ADD REPLY • link 2.1 years ago by lieven.sterck 15k

0

Entering edit mode

Thank you for your reply. Yes mapping from bbmap shows only 0.52%.

ADD REPLY • link 2.1 years ago by salmon ▴ 10

1

Entering edit mode

well, there you go then ... (== is the issue of the low counts) so nothing htseq-count messed up thus.

so we moved the issue on to why BBmap is only mapping 0,52% ?

ADD REPLY • link 2.1 years ago by lieven.sterck 15k

0

Entering edit mode

are you using a transcriptome as reference?

you are using the correct reference for those samples to map to?

ADD REPLY • link 2.1 years ago by lieven.sterck 15k

1

Entering edit mode

and not really a solution for this but just some advice: you might consider changing htseq-count for FeatureCount (more robust, quicker, ... )

ADD REPLY • link 2.1 years ago by lieven.sterck 15k

0

Entering edit mode

Is it due to less mapping during bbmap (0.52%) Any help highly appreciated!

As lieven.sterck said tell us what you are aligning to i.e. type of reference. Do you need to account for splicing? If you do then you need to add relevant options for maxindel=NNNNN and intronlen=NN. Are reads in R1/R2 files in sync? Did you keep them together during trimming (if you did trim). If you have samtools in your $PATH then you can directly write BAM files (use out=file.bam), no need to use SAM files.

Can you show us a sample of your fasta headers? grep ">" im4.fasta | head 4. If there are spaces in your fasta headers they will be passed on to the BAM file. To avoid that you may want to add option trd=t at time of alignment.

ADD REPLY • link 2.1 years ago by GenoMax 141k

0

Entering edit mode

I am using reference downloaded from Patric after performing genome annotation. Genome annotation was performed on the DNA sequence of the same bacteria (for which I am trying to do RNA seq analysis)
Reads R1/R2 are for bacterial RNA sequence samples. I made sure they are aligned well while performing adapter trimming and contaminant filtration. Yes I kept them together during trimming.

I tried to use BAM file. But when I am trying to see the header of it is shows encrypted message like following. So i tried using SAM file. I am not sure why I am getting these encrypted codes.

�����>?׹�u�{�[���������?+v��j��?7]�������G??Vn�i׊�vn�o��k�I+�ݹ�E�ŋ?�?^D_勐��"�����i�?��P��8��l3��<���}�o$
            �R?_Do勜���??�'�����$��~x�]��^˛�?����1[�=iۧ"

head -5 im4.fasta shows following. It shows no spaces.

>k141_1006
ctccggcaccggcagtaacttgcaggccctgatcgatagcacgcgcagcggcgacagccc
ggtgcgcatcgctgcggtgctactggcaatacaccccttctgagcaacaccgacctaatg
tggtcgggggcttgctcccgatggcggcggatctgccacaaatcagtaactgacccaccg
cgatcgggagcaagccccctcccacatttggttttgcggtgttcttgaataggtgtcgga
tcaataaacaccactcgtccgcgcacccgtcctacggttgaaatcccccccgccaccccc
atatgagttctcatcagggcattttcgccccgctgcgtggagactctccccttgaccacc
atcgtttcagtacgtcgccacggcaaagtcgtcatgggcggcgacggccaggtttccctg
ggcaacaccgtgatgaaaggcaacgcgaaaaaagtgcgtcgcctgtaccacggccaggta
cttgccggtttcgcaggcgctaccgccgacgcctttaccctcttcgaacgtttcgaaggc
cagcttgagaaacatcagggccacctggtgcgcgccgctgtcgaacttgccaaagaatgg
cgcaccgaccgttccctcagccgcctggaggccatgctcgccgtcgcgaacaaagacgct
tccctgatcatcaccggtaacggcgatgtggtagagcccgagcatggcctgatcgccatg
ggttccggcggtggctatgcccaagcggcagccagtgcgctgctgaagaaaaccgacctg
tcggcccgtgaaatcgtcgagaccgccctgggtatcgccggcgatatctgcgtgttcacc
aaccacaaccagaccattgaggagcaggacctcgccgagtaagccgtaggcttattcccg
cttgaggccgccagacactatgtccatgaccccccgcgaaatcgtccatgaactcaatcg
ccatatcatcggccaggacgatgccaagcgcgccgttgccattgcgctgcgtaaccgctg
gcgccggatgcaactgcccgaagaactgcgcgttgaagtaacccccaagaacatcctgat
gatcggcccgaccggtgtcggtaaaaccgaaatcgcccggcgcctggccaaattggcgaa
cgccccgttcatcaaggtcgaagcgaccaagttcaccgaagtcggctacgtgggccgtga
tgtcgagtcgatcatccgtgacctggccgatgccgccctgaagatgctgcgcgaacagga
agtgaccaaggtcagccaccgtgccgaagacgcagcggaagagcgcatcctcgacgccct
gctgccaccggcccgcatgggtttcaacgaagacgccgcaccgtcgtcggattccaatac
tcgccagctgttccgcaagcgcctgcgtgaaggccagctggatgacaaggaaatcgagat
cgaagtggctgaagtgtccggcgtggatatttctgccccgcctggcatggaagaaatgac
cagccagttgcagaacctgttcgccaacatgggcaagggcaagaagaaaagccgcaagct
caaggtgaaagaagcgctgaagctggtgcgcgacgaagaagccgggcgcctggtcaatga
ggaagaactcaaggccaaggccctggaagcggtcgagcagcacggcatcgtgtttatcga
tgagatcgacaaagtggccaagcgtggcaactccggtggcgtcgatgtatcccgcgaagg
tgtgcagcgtgacctgttgccgctgatcgaaggctgcacagtcaacaccaagctgggcat
ggtcaagactgaccacattctgtttatcgcttccggtgcgttccacctgagcaagccaag
cgacctggtgcccgagttgcaaggccgcctgccgattcgcgtggaactcaaggcactgac
ccctggcgacttcgagcgcatcctcagcgaaccccacgcctcgctcaccgagcagtaccg
tgagttgctgaaaaccgagggcctgggtatcgagttccaggcggacggtatcaagcgttt
ggcggaaatcgcctggcaggtcaacgagaagaccgagaacatcggtgcccgacgcctgca
taccttgcttgagcgcttgcttgaggaagtgtcgttcagtgccggcgacatggcgggtgc
gcagaatggcgaagcgatcaagatcgatgctgactacgtcaacagccacttgggcgaatt
ggcgcagaacgaagacttgtctcgttatatcctgtaagcctcacacagaactgctttgct
tcaaatgtgggagggggcttgctcccgatgaggtagtgtcaatcaatacatctgtgactg
acccaccgccatcgggggcaagccccctcccactcgcgatcttcacaattttttggaacc
gtgtttctacacccagccgccgtcaacgataaagttttgcgccggaacacatcgccaata
tggccctgtttctggcctcggatgcctcggcgatgtgttcggc

head -5 im4.gff shows:

k141_1006   rast2_export    CDS 2   103 .   +   2   ID=fig|270368.6.peg.1;Name=Phosphoribosylglycinamide formyltransferase (EC 2.1.2.2);Ontology_term=KEGG_ENZYME:2.1.2.2
    k141_1006   rast2_export    CDS 352 882 .   +   1   ID=fig|270368.6.peg.2;Name=ATP-dependent protease subunit HslV (EC 3.4.25.2);Ontology_term=KEGG_ENZYME:3.4.25.2
    k141_1006   rast2_export    CDS 920 2257    .   +   2   ID=fig|270368.6.peg.3;Name=ATP-dependent hsl protease ATP-binding subunit HslU
    k141_5512   rast2_export    CDS 154 1296    .   +   1   ID=fig|270368.6.peg.4;Name=Erythronate-4-phosphate dehydrogenase (EC 1.1.1.290);Ontology_term=KEGG_ENZYME:1.1.1.290

ADD REPLY • link 2.1 years ago by salmon ▴ 10

0

Entering edit mode

yes, you can't just open a BAM file with cat or a text editor or head or such, it is a binary format (aka not human readable thus). You can still do so but then you'll need to go through that samtools tool which can 'translate' it on the fly to human readable format.

Keeping files in BAM format has a huge storage advantage, they are much smaller than their SAM counterparts. So advice is to keep those files as much as possible in BAM format. (just a side node)

The head of your fasta file looks a bit strange. Can you confirm that the sequence starts on the same line as the header? (or is this a copy-paste typo when posting?) If it is on the same line: that's not valid fasta format (well technically it is but it just will not contain all of the sequence you might think it does)

ADD REPLY • link 2.1 years ago by lieven.sterck 15k

0

Entering edit mode

There was a typo for the fasta format of Im4. I updated it as the reads starts from different line.

Also, one thing to mention when I checked the taxonomy (Through GTDB-TK and PATRIC) of bin (that was an input to PATRIC for genome annotation), it showed as Pseudomonas (gammaproteobacteria). But my pure culture sample belongs to different class (betaproteobacteria) and species. I referred this paper where they showed that Pseudomonas is phylo-genetically related to my bacteria. Still, I think due to unrelated taxonomy, my mapping percentage is very low. But I am not sure how to proceed further. I request if you can please help me.

Thank you so much

ADD REPLY • link 2.1 years ago by salmon ▴ 10

0

Entering edit mode

that is of course also an option , that your sample is contaminated (or a sample mix-up). That would indeed also be reflected in the very low mapping percentage . If I would map a plant sample against human genome it will also reflect in very low mapping percentages.

ADD REPLY • link 2.1 years ago by lieven.sterck 15k

0

Entering edit mode

Just one question though as I am using the annotated gff file generated by PATRIC. It shows the featureID as fig|270368.6.peg.1, fig|270368.6.peg.2, fig|270368.6.peg.3 ...so on for each contig ID. I tried to convert these each feature ID to gene ID using ID mapper but I am unable to. Do you think low mapping might be due to these feature ID names.

ADD REPLY • link 2.1 years ago by salmon ▴ 10