I know this question has been answered a couple of times, but I am still confused about how the indexing should be done.
I have RNA-seq data from two conditions, and I am planning to identify both differentially expressed (DE) mRNAs and lncRNAs, using HISAT2 for alignment.
To identify DE lncRNAs from RNA-seq data, I know that I should use the GTF file from the GENCODE website. Below is the order of what I did:
I have two GTF files:
- known_lncRNA.gtf (obtained from GENCODE)
- gencode.v35.annotation.gtf (obtained from GENCODE)
To identify known DE lncRNAs, I performed the steps below:
- Make an index by first taking the splice sites from the known_lncRNA.gtf file:
hisat2_extract_splice_sites.py known_lncRNA.gtf > known_lncRNA_splicSite.ss
- Extract exons from the whole GTF file:
hisat2_extract_exons.py gencode.v35.annotation.gtf > genome.exon
(Or should I use known_lncRNA.gtf here instead of gencode.v35.annotation.gtf?)
- Then build the index:
hisat2-build -p 16 --exon genome.exon --ss known_lncRNA_splicSite.ss genome.fa ./genome_tran
Is this the correct way to build an index specifically for lncRNAs?
I then performed the following steps:
1. Read QC and adapter removal
2. Alignment with HISAT2
3. Read counting with featureCounts
4. DE analysis with DESeq or edgeR
Also, for the featureCounts step, should I use the integrated GTF file (known_lncRNA.gtf + gencode.v35.annotation.gtf) or just known_lncRNA.gtf?
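In case it helps with the featureCounts question, here is a minimal sketch of a check for whether the gene IDs in known_lncRNA.gtf already appear in gencode.v35.annotation.gtf; if they do, concatenating the two files would presumably duplicate those features. It assumes both files carry GENCODE-style gene_id "..." attributes in column 9, and count_missing_gene_ids is a hypothetical helper name, not part of HISAT2 or featureCounts.

```shell
# Hypothetical helper: count unique gene IDs that occur in the first GTF
# (e.g. known_lncRNA.gtf) but not in the second (e.g. the full annotation).
# Assumes a gene_id "..." attribute in column 9 of every feature line.
count_missing_gene_ids() {
    awk -F'\t' '
        /^#/ { next }                          # skip GTF header/comment lines
        match($9, /gene_id "[^"]+"/) {
            # strip the leading gene_id " (9 chars) and the closing quote
            id = substr($9, RSTART + 9, RLENGTH - 10)
            if (FILENAME == ARGV[1]) {
                in_full[id] = 1                # IDs seen in the full annotation
            } else if (!(id in in_full) && !(id in missing)) {
                missing[id] = 1                # first time we see a missing ID
                n++
            }
        }
        END { print n + 0 }                    # prints 0 when nothing is missing
    ' "$2" "$1"
}
```

Usage would be `count_missing_gene_ids known_lncRNA.gtf gencode.v35.annotation.gtf`; a result of 0 would suggest the lncRNA file is a subset of the full annotation, at least at the gene ID level.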
I would really appreciate any hints, as I am stuck at this step.