Running htseq-count to "grab" long non-coding gene_id names
9 months ago
dimitrischat ▴ 180

hi all,

New to bioinformatics, so bear with me. I am trying to find long non-coding RNAs in RNA-seq data. When I checked the human GTF file, I found two different labels for long non-coding RNA, "lnc_RNA" and "lncRNA", like so:

NC_000001.11    Gnomon  transcript  29926   31295   .   +   .   gene_id "MIR1302-2HG"; transcript_id "XR_001737835.1"; db_xref "GeneID:107985730"; gbkey "ncRNA"; gene "MIR1302-2HG"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 8 samples with support for all annotated introns"; product "MIR1302-2 host gene, transcript variant X2"; transcript_biotype "lnc_RNA"; 

NC_000001.11    BestRefSeq  gene    34611   36081   .   -   .   gene_id "FAM138A"; transcript_id ""; db_xref "GeneID:645520"; db_xref "HGNC:HGNC:32334"; description "family with sequence similarity 138 member A"; gbkey "Gene"; gene "FAM138A"; gene_biotype "lncRNA"; gene_synonym "F379"; gene_synonym "FAM138F";

"lnc_RNA" is on the "transcript" line, and "lncRNA" is on the "gene" line. My first question is should I choose "lncRNA" ?

And most importantly, how do I get only the "gene_id" names of the entries that have "lncRNA"?

Edit: for the second question I ran grep 'lncRNA' GRCh38.p13_genomic.gtf > GRCh38.p13_genomic_lnc.gtf and proceeded as usual.

But is "lncRNA" the correct choice?

lncRNA htseq • 281 views
In the example you posted above, one attribute is a gene_biotype and the other is a transcript_biotype. Biotypes should be applicable to both genes and transcripts. I am not sure why there is an extra _ in the transcript example. Is that convention followed for all transcripts? If you are doing the analysis at the gene level, then you should select only the gene-level entries, i.e. the "gene" lines whose gene_biotype is "lncRNA".
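
To check which biotype labels actually occur and to pull out the gene-level lncRNA IDs, something like the following Python sketch might help. It assumes a tab-separated GTF with the attribute layout shown in the question; the function names (parse_attributes, biotype_counts, lncrna_gene_ids) are made up for illustration and are not part of htseq-count or any other tool.

```python
import re
from collections import Counter

# Matches key "value" pairs in the 9th (attribute) column of a GTF line.
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def parse_attributes(field):
    """Return attributes as a dict; for repeated keys (e.g. db_xref) keep the first value."""
    attrs = {}
    for key, value in ATTR_RE.findall(field):
        attrs.setdefault(key, value)
    return attrs


def iter_records(lines):
    """Yield (feature_type, attributes) for each data line, skipping comments and short lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        yield cols[2], parse_attributes(cols[8])


def biotype_counts(lines):
    """Count every (feature, attribute name, value) combination for the two biotype attributes."""
    counts = Counter()
    for feature, attrs in iter_records(lines):
        for key in ("gene_biotype", "transcript_biotype"):
            if key in attrs:
                counts[(feature, key, attrs[key])] += 1
    return counts


def lncrna_gene_ids(lines, biotype="lncRNA"):
    """Return gene_id values of 'gene' lines with the given gene_biotype, in file order, without duplicates."""
    ids, seen = [], set()
    for feature, attrs in iter_records(lines):
        if feature == "gene" and attrs.get("gene_biotype") == biotype:
            gene_id = attrs.get("gene_id")
            if gene_id and gene_id not in seen:
                seen.add(gene_id)
                ids.append(gene_id)
    return ids


# Example usage with the file name from the question (not run here):
# with open("GRCh38.p13_genomic.gtf") as handle:
#     gene_ids = lncrna_gene_ids(handle)
```

Running biotype_counts on the full file would show whether "lnc_RNA" is used consistently for transcripts. Also note that htseq-count counts "exon" features by default (-t exon), so if the grep output ends up containing only "gene" lines, it may be safer to count against the full GTF and filter the resulting table by the gene_id list afterwards.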


