Question

Help for extraction of fasta sequences

0

2.5 years ago

Johan Largo • 0

Hello everyone, I hope you are well.

I am writing because I have a problem with my workflow.

I ran an RNA-seq processing workflow as follows:

quality control - HISAT2 - StringTie - DESeq2

It is a simple, standard workflow, and it gave me meaningful differential expression results. HISAT2 produces SAM files, which I convert to BAM with Samtools so that StringTie can work with them. StringTie then generates GTF output files.

The GTF annotation file that StringTie produces does not contain the sequences of the genes it annotates. StringTie assigns IDs to these genes, and DESeq2 keeps using those IDs further along the workflow.

Unfortunately, reference annotation files can be incomplete, and in that case StringTie simply assigns its own ID to a putative gene.

In DESeq2 I can run the differential expression analysis, and it tells me which genes are overexpressed and which are not. But when I look at the genes with the most activity, I find that they carry the IDs assigned by StringTie.

I would like to extract the FASTA sequences for those IDs and run an alignment (for example with BLAST) to find out which gene each one is likely to be.

I hope I'm not crazy to think this can be done.

extraction anotation fasta • 1.5k views

updated 2.5 years ago by Juke34 8.5k • written 2.5 years ago by Johan Largo • 0

1

What is the reference (FASTA) file? You can probably follow the approach below:

Filter the GTF for the MSTRG IDs of interest (call the result the new GTF).
Run bedtools getfasta with the new GTF (for the IDs of interest) and the reference FASTA.

It would help to post the data itself rather than a description of it, so that the issue is easier to understand.
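The filtering step can be sketched in Python as follows. This is only a minimal illustration, assuming the IDs of interest sit in the `gene_id` attribute of column 9; the function name `filter_gtf_lines` and the example records are hypothetical, not taken from the poster's data.

```python
import re

# Matches the gene_id attribute in the ninth (attributes) column of a GTF line.
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def filter_gtf_lines(lines, wanted_ids):
    """Yield the GTF lines whose gene_id is in wanted_ids, skipping comments."""
    wanted = set(wanted_ids)
    for line in lines:
        if line.startswith("#"):
            continue
        match = GENE_ID_RE.search(line)
        # Exact set membership, so "MSTRG.1" does not also select "MSTRG.10".
        if match and match.group(1) in wanted:
            yield line


# Hypothetical records laid out like StringTie GTF output.
example_gtf = [
    "# example header\n",
    'chr1\tStringTie\ttranscript\t100\t500\t1000\t+\t.\tgene_id "MSTRG.1"; transcript_id "MSTRG.1.1";\n',
    'chr1\tStringTie\ttranscript\t900\t1500\t1000\t-\t.\tgene_id "MSTRG.10"; transcript_id "MSTRG.10.1";\n',
    'chr1\tStringTie\texon\t900\t1500\t1000\t-\t.\tgene_id "MSTRG.10"; transcript_id "MSTRG.10.1"; exon_number "1";\n',
]

kept = list(filter_gtf_lines(example_gtf, {"MSTRG.10"}))
```

Writing `kept` to a file gives the "new GTF" that the second step takes as input alongside the reference FASTA.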

2.5 years ago by cpad0112 21k

0

The reference FASTA file is version 3 of the Canis lupus familiaris genome. It is available on the UCSC portal, along with the annotation file.

OK, I'm going to read up on getfasta to see how it goes, and I'll let you know.

Thanks ;D

2.5 years ago by Johan Largo • 0

0

I think I could take the coordinates that StringTie returns and pass them to getfasta together with the reference genome. I'll try. Thanks.
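The coordinate idea amounts to slicing the reference sequence. Here is a rough Python sketch of that logic, not of what bedtools does internally. GTF coordinates are 1-based and inclusive, so the start is shifted by one for Python slicing. The names `read_fasta` and `extract_region` and the tiny `chrTest` sequence are made up for illustration.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a dict of {sequence name: sequence}."""
    sequences = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                sequences[name] = "".join(chunks)
            # Keep only the first word of the header as the sequence name.
            name = line[1:].split()[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_region(genome, chrom, start, end, strand="+"):
    """Return the sequence for a 1-based, inclusive interval (GTF convention)."""
    region = genome[chrom][start - 1:end]
    return reverse_complement(region) if strand == "-" else region


# Hypothetical toy genome, split over two lines as real FASTA files usually are.
genome = read_fasta([">chrTest toy sequence", "ACGTACGTAA", "GGCC"])
plus_seq = extract_region(genome, "chrTest", 9, 12)
minus_seq = extract_region(genome, "chrTest", 9, 12, strand="-")
```

Loading a whole genome into a dictionary like this is memory-hungry, which is one reason to prefer an indexed tool for real data.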

2.5 years ago by Johan Largo • 0

2

Then you can use the gffread utility to extract transcript sequences with the GTF file you get.
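A transcript sequence differs from a plain coordinate slice because the introns are left out: the exons are joined in order, and the result is reverse-complemented for minus-strand transcripts. The Python sketch below illustrates that idea only. It is not gffread's implementation, and the function name `spliced_transcripts`, the toy genome, and the IDs are hypothetical.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID_RE = re.compile(r'transcript_id "([^"]+)"')
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def spliced_transcripts(genome, gtf_lines):
    """Join the exon sequences of each transcript found in GTF lines.

    genome is a dict of {chromosome name: sequence}.
    Returns a dict of {transcript_id: spliced sequence}.
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID_RE.search(fields[8])
        if match:
            exons[match.group(1)].append(
                (fields[0], int(fields[3]), int(fields[4]), fields[6])
            )

    sequences = {}
    for transcript_id, parts in exons.items():
        # Sort exons by genomic start, then join them (1-based, inclusive).
        parts.sort(key=lambda part: part[1])
        seq = "".join(genome[chrom][start - 1:end] for chrom, start, end, _ in parts)
        if parts[0][3] == "-":
            seq = seq.translate(COMPLEMENT)[::-1]
        sequences[transcript_id] = seq
    return sequences


# Hypothetical toy genome and exon records.
toy_genome = {"chrTest": "AAACCCGGGTTT"}
toy_gtf = [
    'chrTest\tStringTie\ttranscript\t1\t9\t1000\t+\t.\tgene_id "MSTRG.1"; transcript_id "MSTRG.1.1";\n',
    'chrTest\tStringTie\texon\t7\t9\t1000\t+\t.\tgene_id "MSTRG.1"; transcript_id "MSTRG.1.1";\n',
    'chrTest\tStringTie\texon\t1\t3\t1000\t+\t.\tgene_id "MSTRG.1"; transcript_id "MSTRG.1.1";\n',
    'chrTest\tStringTie\texon\t1\t2\t1000\t-\t.\tgene_id "MSTRG.2"; transcript_id "MSTRG.2.1";\n',
    'chrTest\tStringTie\texon\t4\t6\t1000\t-\t.\tgene_id "MSTRG.2"; transcript_id "MSTRG.2.1";\n',
]

transcripts = spliced_transcripts(toy_genome, toy_gtf)
```

The spliced sequences, written out in FASTA format, are what would then go into a BLAST search.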

2.5 years ago by GenoMax 141k

0

Wow, thank you. I had not seen that application on the CCB website. I will try this option too.

2.5 years ago by Johan Largo • 0

1

The same can be done with AGAT: Extracting genomic feature sequences from GTF/GFF files with AGAT

There are many tools that perform this task.

2.5 years ago by Juke34 8.5k