Hello all,
I have a differential expression analysis pipeline that goes - 1) alignment with HISAT2, 2) featureCounts, 3) Differential expression analysis with DESeq2. I have a list of DE genes from the last step, and now I need to extract the sequences correlating to these genes. The format of my DE genes is 000000F.g00.t0. However, when I try to extract the sequences from my reference genome (which I used in HISAT2), I see that the gene IDs are in the format 000000F.
When I check the SAM files, the IDs are still in the 000000F format. It's only in the featureCounts output that the 000000F.g00.t0 format. So my questions are:
1) Why does featureCounts alter my IDs?
2) How can I get the sequences related to my DE genes? /do I have to remove the .00.t0 and just use the original ID?
Thanks in advance!
It sounds like you have a name mismatch between your reference and annotation files.
featureCountsis probably taking the ID's that are in your GTF files. I am actually surprised that this is working (i.e. you have counts?).Can you show us example lines from your reference and annotation files?
A possible explanation
and
If you have
000000F.g00.t0id's in your reference (and presumably annotation) then it is HISAT2 that is truncating them000000Fin your alignment files. Assuming you are referring to column 3 in SAM record.Reference example lines:
Gene ID examples (
grep '^>' genome.fa | head):Annotation example lines:
I do have 000000F.g00.t0 IDs in my annotation (in column 9, not column 1), and 000000F IDs in my reference.
Also, example SAM lines:
FeatureCounts was run with -g set to Parent, which seems to explain the ID format disparity.
For this example
ID = 000000F.g139.t1so featureCounts is simply using that as gene ID.Since this is annotated as a transcript with a start/stop (4099240 4100215 ) on negative strand you can extract that interval from
000000Fusing examples here: Extract User Defined Region From An Fasta File (and similar threads)