Hello, new to RNA-seq here. I have mapped my fastq reads to a reference genome (transcript.fasta) with HISAT2. The BAM files look very good with >85% reads mapped. However, when I try to use featurecounts, both local and on Galaxy, the results always yield 0 mapped reads. From lurking around the web, it seems there may be a disconnect between my BAM file Headers and the .GFF file that I am using in feature counts.
Example line of .GFF
Seqname Source Feature Start End Score Strand Frame Group
Niben101Ctg00054 maker exon 167 487 . + . ID=Niben101Ctg00054g00001.1:exon:001;Parent=Niben101Ctg00054g00001.1;Alias=snap_masked-Niben101Ctg00054-processed-gene-0.0-mRNA-1:exon:4812
Example line of .BAM
A00247:109:GW190602181st_novaxp:1:1402:19651:8030 419 Niben101Ctg00109Ctg001 108 3 141M = 120 151 CTTCAGACATCTTGGCATGGCATTTGACATGTATCAATCGTTGACACCATAAACAATACCAAGTTGGAGATGCATCAAGGAAAGGAACACCACAAGGTTCATCACAGTAGAAACAAAAGGCAGACATCTCAGGATTCTCAT FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:141 YT:Z:UU NH:i:2 CC:Z:Niben101Scf04375Ctg112 CP:i:5129 HI:i:0
Here is a sample output summary:
Status accepted_hits_Mock-1.bam
Assigned 0
Unassigned_Unmapped 0
Unassigned_MappingQuality 0
Unassigned_Chimera 7331650
Unassigned_FragmentLength 0
Unassigned_Duplicate 0
Unassigned_MultiMapping 10893853
Unassigned_Secondary 0
Unassigned_Nonjunction 0
Unassigned_NoFeatures 26615011
Unassigned_Overlapping_Length 0
Unassigned_Ambiguity 0
Is there any obvious discrepancy that could cause this? In featurecounts, I have used 'ID' as my gene identifier, not gene_id as it is not present in my .gff file. I'm hoping there is a simple error that is causing this complete failure.
Thanks!
How to add images to a Biostars post
I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:
That is indeed likely the most obvious issue.
From the example you provided I think the reference sequences IDs do not seem to correspond (naming schema-wise)
can you double check if they are OK? Did the formatting of the index for the alignment went ok?
The BAM files were generated using a "contigs" .fasta and a "transcripts" .fasta. The transcript format wasx as such:
Is there a way to edit the BAM or .GFF files to create uniformity between these files? All files were downloaded from a single source ftp://ftp.solgenomics.net/genomes/Nicotiana_benthamiana/ It seems odd that they would not work together for this analysis.
hard to tell without having our hands on your data.
why are you using a (genome?) contig and a transcript file as reference? If you merged 2 files into one you have to make sure the IDs are unique over both files.
can you check if the transcript/contig ID I posted before does exist in your contig/transcript file? eg by doing
grep Niben101Ctg00109Ctg001 <contig/transcript file>
Thanks for the advice. I am going to re-align one test sample using only 1 .fa file. I will double check the Genome.fa and transcripts.fa to make sure they have the same ID's as my .GFF file. I am mainly trying to differential gene expression analysis. Would you recommend using only the transcripts.fa? Or should I only use the contigs.fa?
what is the difference between the contigs and the transcript file?
I simplified this. I re-ran HISAT2 with my paired-end fastq files and only the transcripts.fa file as a reference. here is the format of the transcripts.fa
This produced the .BAM file. Here is a sample for idxStats (had ~65% mapped reads)
I then tried to use featureCounts on galaxy with the BAM file shown below:
I used the following .GFF file
I am so confused by the nomenclature. there are no chromosomes, sometimes there are contigs (Ctg) , sometimes there are what I assume scaffolds (Scf).
And did this work better?
You have to absolutely make sure you are using the GFF files that corresponds to the genomic/transcript file you used to do the mapping. Is the GFF file linked to your genomic file or transcript file.
In case of mapping to the genome you will indeed need to use the GFF file, in case of transcripts you don't need to use it, you can then just count the number of reads mappping to each transcript (eg with idxStats), no need to use GFF/FeatureCount then.
Thanks for your help. I think I accomplished what I needed to do with your suggestions. Here is what I did:
The .GFF file that is associated with the genome.fa is unusable. Since I am only interested in known transcripts, I think this was the simplest way forward.
There is a small problem with using idxstats when mapping to the transcriptome. Depending on your organism (which I'm not familiar with), most genes have multiple transcripts overlapping the same sequence. Thus a very large proportion of your reads are likely to be multi-mapping.
If you use
idxstats
, directly on the BAM, then any read that comes from a portion of a gene that is covered by more than one transcript will be double counted. You could filter out the multi-mappers first, usingsamtools view
, reindex, and then useidxstats
again. However, my feeling is that you would probably loose a very large proportion of your reads.Luckily people have thought long and hard about this problem, and there are algorithmns out their for counting reads mapped to transcriptomes to get an accurate quantification:
salmon
andRSEM
. Both can run from a transcriptome BAM file (although the default for salmon is to run from a fastq) and use EM to allocate multimapping reads proportionally to the likely source transcripts.Thanks for pointing this out. I repeated my analysis using Salmon. Overall, the general trends seem highly similar, but this is definitely a more accurate approach.