Hi all,
I'm working with some high-depth, 100bp PE RNA-Seq data and we'd like to look at both mRNA and lncRNA.
Right now my workflow looks like the following:
Align reads to human genome (GRCh38.primary_assembly.genome.fa) via STAR.
STAR --runMode alignReads --runThreadN 8 --genomeDir INDEX_DIR_HERE --outSAMtype BAM Unsorted --readFilesIn FASTQ_FILEPATHS_HERE
Generate count tables via featureCounts. I have been doing this twice, once per annotation: first to generate a count table from the GENCODE v36 primary assembly annotation, and again to generate a count table from the LNCipedia 5.2 annotation.
featureCounts -T 8 -a GENCODE_OR_LNCIPEDIA_GTF -t exon -s 2 -p -g gene_id -o Counts.txt BAM_FILES
I then use DESeq2 to get differential genes.
The issue I'm running into right now is cutting down on redundancy between the GENCODE annotation and LNCipedia. Since some of the lncRNAs are also in the GENCODE annotation, those get included twice. I've tried using biomaRt to convert Ensembl gene IDs to HGNC symbols, but this is not proving very effective, as not all of the Ensembl lncRNA IDs in GENCODE have HGNC symbols.
What would be the easiest way for me to ensure I get accurate counts of mRNA and lncRNA in one table?
The GENCODE GTF does have lncRNAs in it. Are you excluding those during counting?
No, I'm not sure how to exclude those from GENCODE. That's sort of the problem. I want the more extensive listing of lncRNAs provided by LNCipedia while still getting all the "standard" genes from GENCODE.
You could simply grep -v those entries out of the GENCODE GTF before counting.
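As a rough sketch of that idea, something like the following might work. It assumes that lncRNA records in the GENCODE GTF carry the attribute gene_type "lncRNA" (worth checking in your own file, since older releases used several separate biotypes), and that both GTFs use the same chromosome naming. The function name merge_annotations and the file names in the usage line are hypothetical placeholders, not anything from the thread.

```shell
# Build one combined GTF: GENCODE without its lncRNA genes, plus LNCipedia.
# Usage: merge_annotations GENCODE_GTF LNCIPEDIA_GTF OUT_GTF
merge_annotations() {
    gencode_gtf=$1
    lncipedia_gtf=$2
    out_gtf=$3

    # Assumption: GENCODE marks lncRNA genes (and their transcripts/exons)
    # with gene_type "lncRNA". Drop every line carrying that attribute.
    grep -v 'gene_type "lncRNA"' "$gencode_gtf" > "$out_gtf"

    # Append the LNCipedia records, skipping any '#' comment/header lines.
    grep -v '^#' "$lncipedia_gtf" >> "$out_gtf"
}
```

It would be called as, for example, merge_annotations GENCODE_GTF LNCIPEDIA_GTF combined.gtf, and combined.gtf then passed to featureCounts -a in a single run. Note that this only removes redundancy at the annotation level: reads that overlap both a remaining GENCODE gene and an LNCipedia gene are still ambiguous to featureCounts and are not counted under its default settings.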