Question: HTSeq not finding any features in mapped rnaseq file
0
gravatar for AGC
19 months ago by
AGC0
AGC0 wrote:

**I am using HTSeq to do counts on rnaseq mapped onto a bacterial genome but it is not finding any features. I will share the HTSeq output, my sam file (generated with bbmap), the gff file, and the HTSeq command. It feels like there must be something obvious that I am missing but I haven't been able to figure out what. Thanks for your help.

HTSeq output:

...
gene991 0
gene992 0
gene993 0
gene994 0
gene995 0
gene996 0
gene997 0
gene998 0
gene999 0
no_feature  26940
ambiguous   0
too_low_aQual   0
not_aligned 76674
alignment_not_unique    0

This is a bit of my bbmap generated sam file:

@HD VN:1.4  SO:unsorted
@SQ SN:NC_009523.1 Roseiflexus sp. RS-1, complete genome    LN:5801598
@PG ID:BBMap    PN:BBMap    VN:37.77    CL:java -Djava.library.path=/home/bbmap/jni/ -ea -Xmx49292m align2.BBMap build=1 overwrite=true fastareadlen=500 in=/home/final.contigs_March5.fa ref=/home/Rosei_NCBI_genome.fasta fastareadlen=500 ambig=best printunmappedcount=t outu=combined.input_Rosei_unmapped.fasta.gz out=Rosei_mapped.sam bamscript=bs.sh
qnqme     flag                                  rname                                               pos    mapq cigar mrnm/rnext mpos/pnext    isize/tlen seq
k123_2375 flag=0 multi=11.3491 len=1329_part_2  4   *   0   0   *   *   0   0   CTGGCAATGCTTGGTTTGATAGGTCTACCACTCCTGGTGCGCCGCAGCCCGGTGTTGGGCGGCGCTTTGATTGTGGCAGCAGTCACATACATCGGCTACAACGGGTTACTGAGTGACTGGCATGGAAGCGGTGCATTTGGCGTGCGTCGCCTGACCAGCCTGGCCGCCTGGTATGCGCTTGGTCTTGCGGCTCTGGCAGAAGCGTTGATCCGTCGCAGGCATACGTTGACAATGCTGATCGGTATGGTTGCCGGGAGCTGGTGGATGCTGATGCTGCTCGTTCGCAAAGGTGATCTAGGCGACACCAGCGCAGGCGCTCTTGAAATGCTGAGCATTGCCGATCTCTACCTGGGACGCACTGCGCTTCCAGCCTTGCAGTTGGGGGATTTTCTTAGAAGTGGTTTTGTCTGGGATACGGTGACGACGGCGCCGCTTGAATTCTCGGTGAAAGTGCTTGGTTATTTTGCGATTCTGAGCGGACTGGCGCTGTGGGCGTCCTG    *
k123_2375 flag=0 multi=11.3491 len=1329_part_3  4   *   0   0   *   *   0   0   GAAACTCGCTGCTGGAAGAGCGCCAGAGCGCGCAGCGCCAGGCGTCGGGACGACGATTACCGAATCAATGCACGTGCGTGATTGATGTGCGTCTCAAGATGCGCGTCCCCAAACTGCCTGGGTGCGCAGGCGTCGCGCCCGCACACGACGCTGCGATCCATCACGGAAGCGAGGGCATCTTGCCTGCCGCCACTCCACGGGAGCGGGACGCCTTCGCGCCGTGGCAACCAGTGAACGAACACCAGTGTGTCGCGCAACGTGTTCTGGCTGATGGGATTGACCCCGCCCGTGCGGGCTGAAGCCCTCCCTGAGCGTGTGGAAGCCCCTGC   *
k123_2377 flag=1 multi=5.0000 len=429   4   *   0   0   *   *   0   0   ATTGCCGAAGTGCGTCCGCGCAGCAGCCTCCCTGACAGCGACGCGCTCGACTTCGGCGACGCGATTGTGATGCCCGCGCTCGTCAACGGGCACTGCCATCTGGAATACACCGCGCTACGCGGGCGGAACGACCGCGCGCCGTTTTTCGAGTGGATTCGCGCGTTGGTCGCCCTGAAGGCGCAGTGCCCACCTGAGCTGTGGCTGCCGTCCGCGCTGCTGGGCGCGGCGGAGCTGATTGCCAGCGGCGTCGGCTTCGTGTCGGACAACACCGACGCAGGCGTGAGCGTAGAGGCGTTGGCGCGGGCAGGGCTGCGCGGGCGCGTCTACCAAGAGGTGTTCGGGATTGAGCCTGAACCCGACGATGCGACCATCCTGCGCGAGCTGACCGCCAAGCTGGAGGCGCACCGACAGACGCTGCAGCGCTACGAG   *
k123_2378 flag=0 multi=172.6574 len=339 4   *   0   0   *   *   0   0   CCCGGAAATAGTATCTGAGACTACTTTTCTTGGTCTTCTCTCCCCGGCTTGCGGAGGCAAAACGCTCCGCCCCTTTTGGGGAACCTTTTGTGGAAGAGGAAGAACCTAGGCGCCTTGGGCCGCCCCCTGTCCACCCATCTCCCCAACTAAGGCTCTTGATGGAAATCCAGGGATCCCTGCCAAGCCAGACTGGGGGATCCCTCTCCGGCCAGATTGTCCTGATACATTTACGCTGAACACTGCTGAACATCACTGTTCAACTATGCAGCGATAGCATTCTGTTTGTTTCCATTCGCGTTAGCGGGACGGAGTCCAAACAGAATGGCGGTATCCCGCAAA *
k123_2387 flag=1 multi=3.0000 len=301   4   *   0   0   *   *   0   0   CAGGTGGACGCGGTGGCGTTCCGCCAGCGCGGTGACTTCATGGTAGGCACGGATCTTTTCGCCCTTGTCGGGCGGGTTGGGGACGCAGTGGGACAGGTACAGCAGCTCCACTGCGCCGCCTCAGCTCTTGGCTGCGTCGCCGCTCTTGGCTGTGGACTCTGCCTTGGATCCGCTGTCCGACGATTTCGTCTCCCGGCTCTTGCCGTTGCTTCCGCTGCTGCCGGAACGGCCGTAGTCGGTGACGTACCACCCCGTGCCCTTGAACTGGAGCGCGGGAACCGAGAGCAGCCGCCGCACCGCG   *
k123_2389 flag=1 multi=11.0000 len=500  4   *   0   0   *   *   0   0   CACATAGTTCGCCACCGTGTCCGCCAACTGGCGGAAGGCGACGATGTCGAAGTAATCCACCTGCTCTTCTGCGCCGTTCAGGCGGCTGGGGTGGCGGTGCTGCTGCAGGTCGCTGTACCGTGCGGGACGGCGGGGGCGGCGCGGGGGCGGCGGCAGCCGCTGCGGGCATATCTGCGCCCGCTTCAGGCTCGCCCTCGCGCAGTCGCTCCAGAAACCGTACCGTGTCGGCGACGACTTCGTATGCCTTGCGACGGTTGCCCTCACGGTCGGTGTAGCTGCGCGTCTGCAGTTTGCCCTCCACCAGCACCAGCCTGCCTTTGGTCACATAGTTCGCCACCGTGTCCGCCAACTGGCGGAAGGCGACGATGTCGAAGTAATCCACCTGCTCTTCTGCGCCTGGGCGCACGGGACGGTTGACGGCAATCGCGAAGCGCACCACCGAAGTGCCATCTGGGGTTGCCCGCAGTTCGGGGTCGCGCGTCAGACGCCCTACCAAGATG    *
k123_2391 flag=1 multi=4.0000 len=416   4   *   0   0   *   *   0   0   CAGAACGGCGACGTCGGGCAGTTGTTGGGCAAACGTGCCGTCGTAGGAGTAGGCGATACGGTCCTCTATGCTGGTGAGGACTTTGGACGGACCCAGAACTTCCTGCAATCGGGCAATCCAGACTGGCTCGCTCATACGGGACTCCTCTGATTTTCACCACGAAGACACAAAGACACGAAGAATCAGCAATTGTTAATTTTTTAATTGTTAATTTAACTTCGTGCCTTCGTGTCTTTGTGGTTTTTTACTTCATCGCCTCCAGCACAATTTCGGCGATGTCTTTGACCCGCAGCCGGAGGCGCTCGGTGCGGGCGGCATTGAAGAGGGTGCGCTCGCACTGCTGGCAGGCGGAGACGACGACCTGGGCGCCGGTCTCCGCGGCCTGGCGGATGCGCCGGGCCGCGACGGCCTGGCTG    *
k123_2395 flag=0 multi=48482.4661 len=241   16  NC_009523.1 Roseiflexus sp. RS-1, complete genome   1416779 3   241M    *   0   0   GCGCGTGGAGGCCCGCACCGGTGTGGATTGCAAACCGCTCGGATGACCTGTGGGTAGGGGTGAAATGCCAATCGAAAGCGGAGATAGCTGGTTCTCCCCGAAATGCATTGAGGTGCAACCCGCGGAAGGGGCGCGCGGAGGTAGAGCGCTGGTGTGGTGCGGGGGCTTCACCGCCTGCCAAACGACGCCAAACTGCGAATGCCGCGCGGTCGCCCGCGGAGTGAGACGTGCGGCGCAAACG   *   XT:A:R  NM:i:4  AM:i:3
k123_2399 flag=1 multi=6.4322 len=757   4   *   0   0   *   *   0   0   CTCCACTACGCCTTTGAGCGCGTAGAAGTCGGCAGGTTGGGGCTTGCTTGCCCAGTGCGGCGCACTAAGCAGCCCCGTCATCAGGATCCCCAAATGCGTCCACTCGTGGTAGTTGCCCTGTTCGTCTTGGGCGAACACGCGCCCGATTTCGAACAGGTGCAGGTCGCGGCGGTTGCGACGGCGGTTGTGGTCGGCGGCGCGGATGAGGCTGGACATCAGCGAGTGGCGCAGGCAGCTGAGGTCTTCGCTCATCGGGTTCTGTAGCGATATGGGCTGGTACATCAGCTCGCCGACGCCCTCGCCGCCGCTTTGCAGCGGAGATGGGGCTTCCAGCGTGTGCCCGACGATTTCCTGCAGCCCCGCGCTGAGGAGGATGGTGCGCACGCGCTCGGCGAACGCGCCTTCGGGGCTGTCCTTGCCCTGCGTGGTGTAGCCTGCGGGCGGGACGGCGGGGATGCGCTCGTAGCCCAGAATGCGCCCCAGCTCCTCGATGAGGTCTT    *
k123_2399 flag=1 multi=6.4322 len=757_part_2    4   *   0   0   *   *   0   0   CCTCGCGCTGCAGGTCGGCGCGGTAGGGGGGAACCTGCACGCGGAAGCCGCCCTCTATCGGCTCTGGGCGCAGATTCAGTCCTCCAGCCACGCGCCGCCCGCCTCGCGGATGGTCTGCTCCAACGCGGCGTAAGGGACGCTCTGCGCGACCACAATCGCCGACCGCTGCCATCGCGGCGCCCAACGCGCGGGAGCCCAAAGTCTCGCCGACCGCTTTGGTGAACGCCTGCCAGCCTTCAGCGTCCCAGATGCCAAAG   *
k123_2402 flag=1 multi=3.0000 len=346   4   *   0   0   *   *   0   0   CCAAAGACATAATCAGGCAGAAGCTCAAAAAGCTTTTTCTTTATTTCGTTTTCCATTTCATATCTCACTTCTTAATTTTTTTGATAGTTTTTTTATTGCTTGGAAATAATTTGCTTTAAGTCCCCCAACACTTGTTCCTAAAAGTTTTGAAATTTCTTCATAGCTTAAATTATCAAAATATCGTAGTGCAAATGTTTCACGTTGTTTTTCTGGTAATTTCGTAAGTTCCAAAAGAAACCTTCTTTCCAACTCGTTTCCTAAATAATATTGTTCAGGACTAAAATCTTTTTGTTTTTTAATTACGTCATTATTATCGTCGATTCTCAAAAAACTCATAAGTTTTTTC  *
k123_2418 flag=0 multi=12738.1129 len=247   4   *   0   0   *   *   0   0   CCCCTACCACCCGGGCCCGGCGTACCGGACCCGGATCCACAGCTTCGGTAGGCGGTTTGATCGCCAATCATTTTCGGCGCGACATCGCTCGATGAGTCAGCTGTTACGCACTGTTTAAATGATAGCTGCCTCTAAGCCAACATCCTCACTGTCTAAGCGACGTCACATCCTTTCCACTGAACCGCCTTGAGGCACCTTAGCTGGTGGTCTGGGCTGTTGCCCTCTCGGCCACGGACCTTATCGCTCG *
k123_2419 flag=0 multi=43053.0189 len=229   4   *   0   0   *   *   0   0   ATACACCAGAAGGGTGGTATTTCACTGGCGGCTCCACCCTGACCGGAGCCAGGGCTTCTCAGCCTCCCACCTATGCTACGCATCCAATGCACGCGGCCAATACCAAGTTGTAGTGAAGGTTCAAGGGGTCTTTTCGTCCCGTCGCGGGTAACCGGCATCTTCACCGGTACTACAATTTCACCGAGCTCGCGGCCGAGACAGTGCCCAAGTCGTTATGCCATTCGTGCAG   *
k123_2420 flag=1 multi=6.0000 len=992   4   *   0   0   *   *   0   0   AACTCTCTCAGATGAAGGGATTCTTTGCCGCAGGAGACAAGATCATGCTAATGGTGGAAAGGCACGAAATGCACTCGCACGACATACCCAGGATCGTTTCACTGTTGAGGGAATACGGGATAGACGTTGCTCAGATTCTGATAGGCGATTCTTTTCAGAACGATTTGAGTTCCATGAGCAGGATGAAGCTGCTCGAGGAGCGAGAAACGAAATCCGGCACCAAGGTTGTGAAAAAGCACGTGAGATCGGGTCAGGCTGTGGTACATTCGGGAGACATCGTGGTTCTGGGCAACGTTCATGCAGGCGCAGAGTTGCTCGCGGGTGGATCCATCGTCGTTTTCGGCAACGTCAAGGGTACACTCAGAGCCGGTTTGAACGAGGGAGACAGCGCGATCATCGCCGCTCTGAGCATGCAACCGACATTGATACAGATTTCTGGATATACGCTCAGAGAAGTGTCGAACTACGAAGAACCAGTCATAGTTCAAGTGAAAAATGGT    *
k123_2420 flag=1 multi=6.0000 len=992_part_2    4   *   0   0   *   *   0   0   AAGATCGTTGTGGAAAAAGCCAAAGCTTGATAGGTGTACTGGTCTCCTCATCACAAGCCATCGTTGGAGGAGTCTTAGGAGCGGCAGAGGTGAACGTTTATCATAGCCAAGGATGGTCATAGAATACTGGTGGCATGGGTCGAAACCCCTGTCGTGTTTGGTTGGTATGACTTCTTTTTTGCTCAAAACGTTGTTTGTTGGTAACCAACTTGAATCAACTGGTTTTTGGAGGTAACATGATAGAGATCGAAGTAGTCGAGCTTTGTCTTGAGTCATTGAGGTCAGGTTGTTTTCCGGTCGGTTCGGTAATTACAAATGCTGAAGGACAGATAATCTCGAAGGGTAGAAACACTATTCACGAATCAGAGCCTAAAGGTTATCCAGTCTTTGGTAACAGCATTGCTCATGCTGAATTGAATGCACTAGTTTCTATGAAAAGAATCCCAGACGATTTAGACGTCTCTGATTACGTTATTTATACTTCTATGGAAC    *
k123_2422 flag=0 multi=51274.5645 len=247   0   NC_009523.1 Roseiflexus sp. RS-1, complete genome   1418144 3   247M    *   0   0   CGCTCCGCGCCTGCGTAGGATAGGTGGGAGCCTGAGAACGCGCCCTTGCGGGGGCGCGGGAGGCGCCGGTGAAATACCACTCTGGTGCGGCGTGCGCACTCACCCGTGGACAACGGGGACCGTCGCTGGCGGGTAGTTTGTCTGGGGCGGACGCCTCCTAAAGAGTAACGGAGGCGCGCAACGGCGCCCTCAGGCGGGATGGCAATCCGCCGTGGAGTGCAAAGGCAGAAGGGCGCTTGACTGCAAG *   XT:A:R  NM:i:9  AM:i:3
k123_2425 flag=0 multi=50392.5161 len=247   4   *   0   0   *   *   0   0   ATGCAAAAGGCACGCGGTCACACCCTCTGGCCGAGACCAGAGGAAGCTCCCACTGTTTTGTAGGCATCTGGTTTCAGGTTCTATTTCACTCCCTTCTTAAGGGTGCTTTTCACCTTTCCCTCATGGTACTTGTGCACTATCGGTCATCAGCGAGTATTTAGCCTTACCAGATGGTTCTGGCAGATTCACACGGGACCTCACGTACCCCGTGCTACTCAGGAACACCCACGCGCCGCCTTGGATTTCG *

and this a bit of the gff file that I am using:

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ASM1666v1
#!genome-build-accession NCBI_Assembly:GCF_000016665.1
#!annotation-date 05/15/2017 13:12:56
#!annotation-source NCBI 
##sequence-region NC_009523.1 1 5801598
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=357808
NC_009523.1 RefSeq  region  1   5801598 .   +   .   ID=id0;Dbxref=taxon:357808;Is_circular=true;Name=ANONYMOUS;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=RS-1
NC_009523.1 RefSeq  gene    415 1860    .   +   .   ID=gene0;Name=ROSERS_RS00005;gbkey=Gene;gene_biotype=protein_coding;locus_tag=ROSERS_RS00005;old_locus_tag=RoseRS_0001
NC_009523.1 Protein Homology    CDS 415 1860    .   +   0   ID=cds0;Parent=gene0;Dbxref=Genbank:WP_011954804.1;Name=WP_011954804.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_011954804.1;product=chromosomal replication initiator protein DnaA;protein_id=WP_011954804.1;transl_table=11
NC_009523.1 RefSeq  gene    2021    3196    .   +   .   ID=gene1;Name=ROSERS_RS00010;gbkey=Gene;gene_biotype=protein_coding;locus_tag=ROSERS_RS00010;old_locus_tag=RoseRS_0002
NC_009523.1 Protein Homology    CDS 2021    3196    .   +   0   ID=cds1;Parent=gene1;Dbxref=Genbank:WP_011954805.1;Name=WP_011954805.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_012122338.1;product=class II aldolase;protein_id=WP_011954805.1;transl_table=11
NC_009523.1 RefSeq  gene    3243    3956    .   +   .   ID=gene2;Name=ROSERS_RS00015;gbkey=Gene;gene_biotype=protein_coding;locus_tag=ROSERS_RS00015;old_locus_tag=RoseRS_0003
NC_009523.1 Protein Homology    CDS 3243    3956    .   +   0   ID=cds2;Parent=gene2;Dbxref=Genbank:WP_011954806.1;Name=WP_011954806.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_011954806.1;product=NAD(P)-dependent oxidoreductase;protein_id=WP_011954806.1;transl_table=11
NC_009523.1 RefSeq  gene    4002    4865    .   -   .   ID=gene3;Name=ROSERS_RS00020;gbkey=Gene;gene_biotype=protein_coding;locus_tag=ROSERS_RS00020;old_locus_tag=RoseRS_0004
NC_009523.1 Protein Homology    CDS 4002    4865    .   -   0   ID=cds3;Parent=gene3;Dbxref=Genbank:WP_011954807.1;Name=WP_011954807.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_011954807.1;product=hypothetical protein;protein_id=WP_011954807.1;transl_table=11
NC_009523.1 RefSeq  gene    5453    6856    .   -   .   ID=gene4;Name=ROSERS_RS00025;gbkey=Gene;gene_biotype=protein_coding;locus_tag=ROSERS_RS00025;old_locus_tag=RoseRS_0005
NC_009523.1 Protein Homology    CDS 5453    6856    .   -   0   ID=cds4;Parent=gene4;Dbxref=Genbank:WP_011954808.1;Name=WP_011954808.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_011954808.1;product=NADP oxidoreductase;protein_id=WP_011954808.1;transl_table=11
NC_009523.1 RefSeq  gene    7035    7766    .   -   .   ID=gene5;Name=ROSERS_RS00030;gbkey=Gene;gene_biotype=protein_coding;locus_tag=ROSERS_RS00030;old_locus_tag=RoseRS_0006
NC_009523.1 Protein Homology    CDS 7035    7766    .   -   0   ID=cds5;Parent=gene5;Dbxref=Genbank:WP_011954809.1;Name=WP_011954809.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_011954809.1;product=hypothetical protein;protein_id=WP_011954809.1;transl_table=11
NC_009523.1 RefSeq  gene    7768    8280    .   -   .   ID=gene6;Name=ROSERS_RS00035;gbkey=Gene;gene_biotype=protein_coding;locus_tag=ROSERS_RS00035;old_locus_tag=RoseRS_0007
NC_009523.1 Protein Homology    CDS 7768    8280    .   -   0   ID=cds6;Parent=gene6;Dbxref=Genbank:WP_041332586.1;Name=WP_041332586.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_012122701.1;product=DUF3090 domain-containing protein;protein_id=WP_041332586.1;transl_table=11
NC_009523.1 RefSeq  gene    8314    9039    .   -   .   ID=gene7;Name=ROSERS_RS00040;gbkey=Gene;gene_biotype=protein_coding;locus_tag=ROSERS_RS00040;old_locus_tag=RoseRS_0008
NC_009523.1 Protein Homology    CDS 8314    9039    .   -   0   ID=cds7;Parent=gene7;Dbxref=Genbank:WP_011954811.1;Name=WP_011954811.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_011954811.1;product=phosphoglycerate mutase;protein_id=WP_011954811.1;transl_table=11
NC_009523.1 RefSeq  gene    9360    10886   .   +   .   ID=gene8;Name=ROSERS_RS00045;gbkey=Gene;gene_biotype=protein_coding;locus_tag=ROSERS_RS00045;old_locus_tag=RoseRS_0009

and here is my HTSeq command:

python -m HTSeq.scripts.count -i ID -t gene mappedfile.sam annotationfile.gff >outfile.txt

I have played with the -i and -t designations (a bit blindly) changing them to CDS, but it has not changed the outcome.

Thanks!

rna-seq htseq • 860 views
ADD COMMENTlink modified 19 months ago by h.mon28k • written 19 months ago by AGC0
1

You realize that the tiny fragment of bam you showed shows all but one of those reads to be unaligned? I tried blasting a couple against nr, and got nothing.

ADD REPLYlink written 19 months ago by swbarnes26.9k

Thanks, I've also tried blasting and did get hits that made sense: the unaligned to other organisms in my system, and the aligned to the organism that I am using as ref genome, Roseiflexus. I am using a metatranscriptome and mapping it to a reference genome to learn the expression of just that one organism. I was assuming that the sam file reflected that, reads from other organisms that are unaligned, and reads from 'my' organism that are aligned..

ADD REPLYlink written 19 months ago by AGC0
2
gravatar for h.mon
19 months ago by
h.mon28k
Brazil
h.mon28k wrote:

BEGIN facepalm

The contig names on your bam: k123_2375, etc.

The chromosome name on your gff: NC_009523.1.

A common error is using a gff and reference genome from the same species, but with different naming conventions (like chr1 and 1). This is not even the case here, you are using the annotation of a complete genome from NCBI, and mapping to scaffolds (which you assembled yourself, I guess?).

END facepalm

The first field of a sam file record is the query name, not the reference name, and my answer above is complete bollocks.

Anyway, you are trying to map a metatranscriptome to one reference genome. Unless this is the genome of the most dominant species, you should expect few reads (or contigs, as you are mapping here) to map. And mapping to only one reference genome may induce wrong mappings, from reads / contigs that would align better to other genomes but in their absence, may map to this genome.

ADD COMMENTlink modified 19 months ago • written 19 months ago by h.mon28k

Isn't it the other way around? Fasta format scaffolds are being mapped to the NCBI genome. My feeling is that the spaces in the scaffold names is causing htseq-count to fail. @AGC should have aligned using trd=t option to truncate the read names after the first space. reformat.sh has an option to do this after the alignment. I can't look at the in-line help at the moment.

ADD REPLYlink modified 19 months ago • written 19 months ago by genomax73k

Isn't it the other way around? Fasta format scaffolds are being mapped to the NCBI genome.

(facepalm) Indeed, you are correct.

@AGC:

I am using HTSeq to do counts on rnaseq mapped onto a bacterial genome

And:

I am using a metatranscriptome and mapping it to a reference genome to learn the expression of just that one organism.

You are not mapping RNAseq reads, you are mapping the assembled metatranscriptome contigs onto a reference genome. If I am not mistaken, it is a MEGAHIT assembly. If you want to quantify the expression, you need to map the raw reads. Mapping the transcriptome contigs gives you only what genes are being expressed or not, but not level of expression.

I think swbarnes2 is right, almost nothing mapped. Check the sam flags:

samtools view mappedfile.sam | cut -f2 | sort | uniq -c

The first column are the counts, the second are the sam flags. Yours will be mostly 4 (unmapped), then 0 and 16, I guess.

ADD REPLYlink written 19 months ago by h.mon28k

Thank you so much. This is awesome. It all makes sense! I think I have the raw reads as well (I am new to this project) and will try your suggestions. Thanks!

ADD REPLYlink written 19 months ago by AGC0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1168 users visited in the last hour