Problems with the reference genome in Stringtie
5.1 years ago
iraia.munoa ▴ 130

Hi everybody! I am using an RNA-Seq protocol to identify differentially expressed lncRNAs. I used the reference genome from GENCODE: gencode.vM20.lncRNA_transcripts.fa. I built the index and then ran hisat2:

> hisat2 --dta -q -x mm10_lncRNA_genome -U C-P1_28454_ACAGTG_trimmed.fq.gz -S C-P1_54_L4.sam

Then I converted the SAM files to BAM, sorted them, and created the BAI index.

> samtools view -bS -o C-P1_54_L4_lncRNA.bam C-P1_54_L4_lncRNA.sam
> samtools sort -o C-P1_54_L4_lncRNA_sorted.bam C-P1_54_L4_lncRNA.bam
> samtools index -b C-P1_54_L4_lncRNA_sorted.bam C-P1_54_L4_lncRNA_sorted.bai

Finally, I tried to use stringtie with the GTF file that GENCODE also provides for lncRNAs: gencode.vM20.long_noncoding_RNAs.gtf

But when running stringtie I get a WARNING message:

> stringtie -G gencode.vM20.long_noncoding_RNAs.gtf -l C-P1_54_lncRNA_sorted -B -C C-P1_54_lncRNA_cov.gtf -o C-P1_54_lncRNA_transcripts.gtf -A C-P1_54_lncRNA_gene-abundance.tsv C-P1_54_lncRNA_sorted.bam

WARNING: no reference transcripts were found for the genomic sequences where reads were mapped!
Please make sure the -G annotation file uses the same naming convention for the genome sequences.

I don't understand why I am having this problem, as I am using both reference files (.fa and .gtf) from the same source.

Can someone help me?

Thanks in advance,

Iraia

Stringtie RNA-Seq lncRNA Reference Genome • 2.4k views
5.1 years ago

You need to download this fasta file and redo the mapping. The results will make vastly more sense then.


Thanks Devon for your answer. So this is the general reference genome, the last option on the GENCODE web page (Genome sequence, primary assembly (GRCm38): nucleotide sequence of the GRCm38 primary genome assembly (chromosomes and scaffolds))? I used the lncRNA reference genome described in the question because of a comment here on Biostars: A: Any One please provide protocol for Analysing long noncoding RNA illumina NGS da

Could someone explain why it didn't work?

Thanks again Devon, I will try it!


The links in that answer are to the lincRNA annotation file. As a rule, annotation files refer to genomes rather than transcriptomes.
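
To make the mismatch concrete: reads aligned to a transcript FASTA get transcript identifiers as their reference names, while column 1 of the GTF holds genomic sequence names such as chr1, so the two sets never meet. Below is a minimal Python sketch of that check. The function names and the toy records are made up for illustration and are not taken from the real GENCODE files.

```python
def gtf_seqnames(gtf_lines):
    """Collect the sequence names (column 1) used by a GTF."""
    names = set()
    for line in gtf_lines:
        # Skip comment/header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def sam_header_seqnames(header_lines):
    """Collect reference names from the SN: field of @SQ header lines."""
    names = set()
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue
        for field in line.rstrip("\n").split("\t")[1:]:
            if field.startswith("SN:"):
                names.add(field[3:])
    return names


# Toy inputs (hypothetical names, for illustration only).
toy_gtf = [
    "##toy annotation\n",
    'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "toy_gene_1";\n',
]
toy_header_transcripts = ["@SQ\tSN:toy_transcript_1\tLN:500\n"]  # index built from transcripts
toy_header_genome = ["@SQ\tSN:chr1\tLN:1000\n"]                  # index built from the genome

shared_transcripts = gtf_seqnames(toy_gtf) & sam_header_seqnames(toy_header_transcripts)
shared_genome = gtf_seqnames(toy_gtf) & sam_header_seqnames(toy_header_genome)

print(sorted(shared_transcripts))  # no names in common
print(sorted(shared_genome))       # the chromosome name is shared
```

On real data, the header lines can be taken from `samtools view -H` on the sorted BAM. An empty intersection with the GTF names is the situation the stringtie warning describes.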


Well, when I do the mapping with your file and then run stringtie with the lincRNA annotation file, the warning disappears. But if I look at the stringtie output, I only see the gene names from the lincRNA annotation file. My question, which may come from my misunderstanding of the files I am working with, is whether all the genes that match the annotation are lncRNAs. When I open the GTF annotation file, there is a "gene_type" attribute that gives values such as TEC, lincRNA, antisense, processed transcript... Is there an option in stringtie to maintain this information in the output file?

Another thing: I have a BED file from a lncRNA database (coordinates and feature descriptions). Is there a way to use it for annotation in an RNA-seq pipeline? For example, after identifying the DEGs, could I run bedtools intersect between the coordinates of these genes and the lncRNA BED file? Or is that not the correct way to obtain differentially expressed lncRNAs, and the correct way is the one described in the first question of this post?
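
For reference, the kind of overlap test that an interval intersection performs can be sketched in a few lines of Python. This only illustrates the idea and is not a replacement for bedtools. The function names and toy coordinates are hypothetical, strand is ignored, and all intervals are assumed to already be 0-based half-open as in BED (GTF coordinates are 1-based inclusive, so a GTF start would need 1 subtracted first).

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two 0-based half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def annotate_with_bed(genes, bed_records):
    """Map each gene_id to the names of BED features it overlaps.

    genes:       iterable of (gene_id, chrom, start, end)
    bed_records: iterable of (chrom, start, end, name)
    """
    bed_records = list(bed_records)
    hits = {}
    for gene_id, chrom, start, end in genes:
        hits[gene_id] = [
            name
            for b_chrom, b_start, b_end, name in bed_records
            if b_chrom == chrom and overlaps(start, end, b_start, b_end)
        ]
    return hits


# Toy data (hypothetical identifiers and coordinates).
toy_degs = [
    ("toy_gene_1", "chr1", 100, 200),
    ("toy_gene_2", "chr1", 500, 600),
    ("toy_gene_3", "chr2", 100, 200),
]
toy_bed = [
    ("chr1", 150, 250, "toy_lnc_A"),
    ("chr1", 600, 700, "toy_lnc_B"),  # starts exactly where toy_gene_2 ends: no overlap
    ("chr2", 0, 100, "toy_lnc_C"),    # ends exactly where toy_gene_3 starts: no overlap
]

toy_hits = annotate_with_bed(toy_degs, toy_bed)
print(toy_hits)
```

Whether a positional overlap is enough to call a gene a lncRNA is a separate question. Requiring the same strand or a minimum overlap fraction would be reasonable refinements.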


> Is there an option in stringtie to maintain this information in the output file?

I don't think such an option exists.

You can parse the GTF file to just subset it for lincRNAs. That's easier if transcripts/exons also have the gene_type annotation, of course.
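
A rough Python sketch of that subsetting is below. It assumes every feature line carries a `gene_type "..."` attribute in column 9. The function name and the toy records are invented for illustration.

```python
import re

# Matches the gene_type attribute in the 9th GTF column, e.g. gene_type "lincRNA";
GENE_TYPE_RE = re.compile(r'gene_type "([^"]+)"')


def subset_gtf_by_gene_type(gtf_lines, wanted_types):
    """Keep header lines plus feature lines whose gene_type is in wanted_types."""
    kept = []
    for line in gtf_lines:
        if line.startswith("#"):
            kept.append(line)  # preserve header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF feature line
        match = GENE_TYPE_RE.search(fields[8])
        if match and match.group(1) in wanted_types:
            kept.append(line)
    return kept


# Toy GTF lines (hypothetical records).
toy_gtf = [
    "##toy header\n",
    'chr1\ttoy\tgene\t100\t200\t.\t+\t.\tgene_id "g1"; gene_type "lincRNA";\n',
    'chr1\ttoy\tgene\t300\t400\t.\t-\t.\tgene_id "g2"; gene_type "antisense";\n',
    'chr2\ttoy\tgene\t500\t600\t.\t+\t.\tgene_id "g3"; gene_type "TEC";\n',
]

lincrna_only = subset_gtf_by_gene_type(toy_gtf, {"lincRNA"})
print("".join(lincrna_only), end="")
```

For a real file, the lines would come from iterating over the opened GTF, and the kept lines could be written to a new file and passed to stringtie with -G. Lines without a gene_type attribute are dropped by this sketch, which is why the subsetting is easier when transcripts and exons carry the attribute too.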
