Question

FeatureCounts alters gene IDs?

0

Entering edit mode

2.0 years ago

Shraddha ▴ 90

Hello all,

I have a differential expression analysis pipeline that goes - 1) alignment with HISAT2, 2) featureCounts, 3) Differential expression analysis with DESeq2. I have a list of DE genes from the last step, and now I need to extract the sequences correlating to these genes. The format of my DE genes is 000000F.g00.t0. However, when I try to extract the sequences from my reference genome (which I used in HISAT2), I see that the gene IDs are in the format 000000F.

When I check the SAM files, the IDs are still in the 000000F format. It's only in the featureCounts output that the 000000F.g00.t0 format. So my questions are:

1) Why does featureCounts alter my IDs?

2) How can I get the sequences related to my DE genes? /do I have to remove the .00.t0 and just use the original ID?

Thanks in advance!

hisat2 deseq2 featurecount differential-expression-analysis • 798 views

ADD COMMENT • link updated 2.0 years ago by GenoMax 141k • written 2.0 years ago by Shraddha ▴ 90

0

Entering edit mode

It sounds like you have a name mismatch between your reference and annotation files. featureCounts is probably taking the ID's that are in your GTF files. I am actually surprised that this is working (i.e. you have counts?).

Can you show us example lines from your reference and annotation files?

A possible explanation

FeatureCounts alters gene IDs?

and

When I check the SAM files, the IDs are still in the 000000F format.

If you have 000000F.g00.t0 id's in your reference (and presumably annotation) then it is HISAT2 that is truncating them 000000F in your alignment files. Assuming you are referring to column 3 in SAM record.

ADD REPLY • link 2.0 years ago by GenoMax 141k

0

Entering edit mode

Reference example lines:

>000000F
ctatcttcgaggttgccacctgtatcgaggagttggcgtctagatcacgaacatgtattt
tagctatcgtgagctcacacctgacggatccagctttcgaggtcacatcctcaagtctcg

Gene ID examples (grep '^>' genome.fa | head):

>000000F
>000001F
>000002F
>000003F
>000004F
>000005F

Annotation example lines:

000000F AUGUSTUS        transcript      4099240 4100215 0.08    -       .       ID=000000F.g139.t1;Parent=000000F.g139
000000F AUGUSTUS        exon    4099240 4099781 .       -       .       Parent=000000F.g139.t1
000000F AUGUSTUS        intron  4099782 4099859 0.77    -       .       Parent=000000F.g139.t1

I do have 000000F.g00.t0 IDs in my annotation (in column 9, not column 1), and 000000F IDs in my reference.

Also, example SAM lines:

NB501358:121:HGNHHBGX2:1:11101:10096:8988       0       000023F 2306773 1       75M     *       0       0       CAGCAATATATCTAGAAGAGAATAATAACGTTCAATAGCCTGATTGGTATTACTATCTCCATCTGAGTAAACTCG      AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE     AS:i:0  ZS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0   MD:Z:75 YT:Z:UU NH:i:2
NB501358:121:HGNHHBGX2:1:11101:10096:8988       272     000023F 2310206 1       75M     *       0       0       CGAGTTTACTCAGATGGAGATAGTAATACCAATCAGGCTATTGAACGTTATTATTCTCTTCTAGATATATTGCTG      EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA     AS:i:0  ZS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0   MD:Z:75 YT:Z:UU NH:i:2
NB501358:121:HGNHHBGX2:1:11101:10172:10888      16      001188F 509800  60      75M     *       0       0       GTTCTTCATCTTAATTTTTTGTTTTTTTCTTTGTTTTGGTAGAATATATAAAGTGGGTAGATAGATAGAGAGAGG      EEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA     AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:75  YT:Z:UU NH:i:1

FeatureCounts was run with -g set to Parent, which seems to explain the ID format disparity.

ADD REPLY • link 2.0 years ago by Shraddha ▴ 90

2

Entering edit mode

For this example

000000F AUGUSTUS        transcript      4099240 4100215 0.08    -       .       ID=000000F.g139.t1;Parent=000000F.g139

ID = 000000F.g139.t1 so featureCounts is simply using that as gene ID.

Since this is annotated as a transcript with a start/stop (4099240 4100215 ) on negative strand you can extract that interval from 000000F using examples here: Extract User Defined Region From An Fasta File (and similar threads)

ADD REPLY • link 2.0 years ago by GenoMax 141k