Reference genome index for LncRNAs (from RNA-seq data)
12 months ago
SHN ▴ 30

Hello All,

I know this question has been answered a couple of times, but I am still confused about how the indexing should be done.

I have RNA-seq data from two conditions. I plan to identify both differentially expressed (DE) mRNAs and lncRNAs, using HISAT2 for alignment.

To identify DE lncRNAs from RNA-seq data, I understand that I should use a GTF file from the GENCODE website. Here is what I did:

I have two GTF files:

1. known_lncRNA.gtf (obtained from GENCODE)
2. gencode.v35.annotation.gtf (obtained from GENCODE)

To identify known DE lncRNAs, I built an index as follows:

• First, extract the splice sites from the known_lncRNA.gtf file:

hisat2_extract_splice_sites.py known_lncRNA.gtf > known_lncRNA_splicSite.ss

• Extract the exons from the whole GTF file (or should I use known_lncRNA.gtf here instead of gencode.v35.annotation.gtf?):

hisat2_extract_exons.py gencode.v35.annotation.gtf > genome.exon

• Then build the index:

hisat2-build -p 16 --exon genome.exon --ss known_lncRNA_splicSite.ss genome.fa ./genome_tran

Is this the correct way to build an index specifically for lncRNAs?

I then performed: 1. read QC and adapter removal, 2. HISAT2 alignment, 3. featureCounts, 4. DESeq or edgeR.

Also, for the featureCounts step, should I use the combined GTF file (known_lncRNA.gtf + gencode.v35.annotation.gtf) or just known_lncRNA.gtf?

I would really appreciate any hints, as I am stuck at this step.

RNA-Seq sequencing assembly sequence • 453 views

Tutorial tag is reserved for actual tutorials that show users how to do something. You are asking questions about what you need to do so please don't use that tag.

12 months ago
Qiongyi ▴ 130

Since you want to get both DE mRNAs and lncRNAs at the same time, you can use gencode.v35.annotation.gtf (obtained from GENCODE). This GTF contains both protein-coding genes and long non-coding RNAs, so there is no need to use known_lncRNA.gtf as well.
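One quick way to confirm that a GENCODE GTF carries both biotypes is to tally its gene records by the gene_type attribute. The following is only a minimal sketch. It assumes the usual GENCODE layout (tab-separated, feature type in column 3, attributes in column 9) and that lncRNA genes are labelled gene_type "lncRNA", which should hold for recent releases but is worth checking on your own file. The function name count_gene_types is made up for illustration.

```shell
# Tally gene records in a GENCODE-style GTF by their gene_type attribute.
# $1: path to the GTF file (assumed layout: 9 tab-separated columns)
count_gene_types() {
    awk -F'\t' '
        # skip header comments; look only at "gene" features
        $0 !~ /^#/ && $3 == "gene" {
            if (match($9, /gene_type "[^"]+"/)) {
                # drop the 11-character prefix (gene_type + space + quote)
                # and the closing quote to keep just the value
                t = substr($9, RSTART + 11, RLENGTH - 12)
                n[t]++
            }
        }
        END { for (t in n) printf "%s\t%d\n", t, n[t] }
    ' "$1" | sort
}
```

Running count_gene_types gencode.v35.annotation.gtf prints one line per biotype with its gene count, so you can see whether lncRNA and protein_coding are both present.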

At the HISAT2 alignment step, you can provide a list of known splice sites with the --known-splicesite-infile option. In your case, the command below generates the known splice sites:

python hisat2_extract_splice_sites.py gencode.v35.annotation.gtf > splicesites.txt


Regarding the indexing step, you can just build a default index from the genome FASTA:

hisat2-build [options]* <reference_in> <ht2_base>
For example: hisat2-build genome.fa genome
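
Putting the pieces above together, a rough sketch of the index, splice-site, and alignment steps might look like this. It assumes paired-end reads. The FASTQ and SAM file names, the thread count, and the helper name plan_alignment are placeholders, not something taken from the original post. The function only prints the commands so they can be reviewed before anything is run.

```shell
# Print the HISAT2 commands for one sample (a dry run; nothing is executed here).
# $1: sample name (hypothetical naming: <sample>_R1.fastq.gz / <sample>_R2.fastq.gz)
plan_alignment() {
    sample="$1"
    # Default index built from the genome FASTA only, without --ss/--exon.
    echo "hisat2-build -p 16 genome.fa genome"
    # Known splice sites taken from the full GENCODE annotation.
    echo "hisat2_extract_splice_sites.py gencode.v35.annotation.gtf > splicesites.txt"
    # Splice sites are supplied at alignment time, not at index time.
    echo "hisat2 -p 16 -x genome --known-splicesite-infile splicesites.txt -1 ${sample}_R1.fastq.gz -2 ${sample}_R2.fastq.gz -S ${sample}.sam"
}
```

plan_alignment ctrl1 prints the three commands. Piping the output to bash (plan_alignment ctrl1 | bash) would run them, provided HISAT2 is on your PATH. For single-end data, the -1/-2 pair would be replaced by -U.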


Thanks for your response. For the featureCounts step, should I use known_lncRNA.gtf? And what should I use the splicesites.txt you mentioned above for?

If I want to identify just the DE lncRNAs, should I build the index with the steps I described above?

hisat2_extract_splice_sites.py known_lncRNA.gtf > known_lncRNA_splicSite.ss

hisat2_extract_exons.py gencode.v35.annotation.gtf > genome.exon

hisat2-build -p 16 --exon genome.exon --ss known_lncRNA_splicSite.ss genome.fa ./genome_tran

(similar to the post https://www.biostars.org/p/288274/). Thanks.


For the featureCounts step, you should use gencode.v35.annotation.gtf, which covers both protein-coding genes and lncRNAs.

"What should I use the splicesites.txt you mentioned above for?" The splicesites.txt file should be passed to HISAT2 at the alignment step. See the HISAT2 manual for details.

"If I want to identify just the DE lncRNA, should I do the step mentioned above for indexing?" If you only want to identify DE lncRNAs, you can simply use the known_lncRNA.gtf file in the featureCounts step. There is no need for the indexing steps you mentioned above.
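
To make the counting step concrete, here is a hedged sketch of how the featureCounts call could be assembled for either annotation file. The thread count, output name, alignment file names, and the helper name featurecounts_cmd are all hypothetical. The -p flag assumes paired-end data, and newer Subread releases may also need --countReadPairs to count fragments rather than reads, so check the documentation for your version.

```shell
# Print a featureCounts command (a dry run; nothing is executed here).
# $1: annotation GTF, $2: output table, remaining arguments: alignment files
featurecounts_cmd() {
    gtf="$1"; out="$2"; shift 2
    # -t exon -g gene_id: count reads over exons, summarised per gene
    # -p: input is paired-end (assumption; drop it for single-end data)
    echo "featureCounts -T 8 -p -t exon -g gene_id -a ${gtf} -o ${out} $*"
}
```

For lncRNAs only, featurecounts_cmd known_lncRNA.gtf lnc_counts.txt ctrl1.sam treat1.sam prints the command to run. Swapping in gencode.v35.annotation.gtf gives counts for both protein-coding genes and lncRNAs from the same alignments.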


Great, thank you for your help.
