RNAseq: Mapping to ORFs
2
0
Entering edit mode
5.2 years ago
chimeric • 0

Hi all,

I conducted an RNAseq experiment, and have mapped my sequences to ORFs for the reference organisms (bacteria). I would like to now generate a count table. Looking through the packages (e.g., HTSeq-count, summerizeOverlaps, Verse), it appears they typically require a GTF or GFF file. However, since I mapped to ORFs, my reads are already annotated and assigned to a gene.

I'm wondering... --Are there any issues with mapping to ORFs? I'm working with bacteria, so I don't think I need to worry about issues with overlapping genes or exons being used in multiple different transcripts (as you would with Eukaryotes). --Are there any tools that can handle BAM files that are already "annotated" because they were mapped to ORFs? --Is there any reason why I should not parse the SAM files and make a count table using Python?

Mostly I'm wondering if I should start over or not, as the mapping is very time consuming. For future work, I would request the scaffolds and GFF files.

Thank you for any input!

RNA-Seq HTSeq-count mapping counts • 1.5k views
1
Entering edit mode
5.2 years ago

You can use samtools idxstats to get the number of alignments per ORF in the file. Note that you should filter out multimappers first (just use a reasonable MAPQ threshold, like 5). Alternatively, a little script could also do the trick nicely.

1
Entering edit mode
5.2 years ago
Asaf 8.6k

That's an interesting approach. I think that you can straight-forward count the number of reads mapped to each gene. However, there are several issues you might want to handle:

• What happens if a read is half in one gene and half in the next gene (genes on the same transcription unit)?
• Did you take the transcribed region (including UTRs) or just CDS?
• Did you map to non-coding RNAs (sRNAs mainly, they might be interesting)

I wonder how much reads you were able to map to the ORFs.