Extracting sequences from fasta by gene ID
3
1
Entering edit mode
3.7 years ago
pipe_cvm2 ▴ 10

Hello,

I am trying to extract the fasta sequences corresponding to the MSTRG genes (RNA sequence/ tuxedo packages) identified during the DESeq2 analysis.

I´ve created a fasta file with all my genes IDs and MSTRGs and the corresponding sequences (transcripts.fa):

>rna-gnl|WGS:LPUQ|mrna.ZEAMMB73_Zm00001d048578 gene=MSTRG.1459
ATGCCAGGCGCCCGCGGCAAGGTCGCCGGAGCCGCCGAGGCGGGCGT...

and I have a list with all the DE MSTRG (res05_all.txt):

MSTRG.34995                     
MSTRG.16561                     
...                     
MSTRG.26406

I´ve tried this and it works for single MSTRG but not when I try to loop it:

grep -a "MSTRG..34995" stringtie_merged.gtf | head -n 1 > MSTRG..34995.bed

bedtools getfasta -fi genome.fa -bed MSTRG..34995.bed -fo MSTRG..34995.bed.fa.out

I´ve also tried samtools faidx without success. Any recommendations?

sequencing RNA-Seq alignment sequence • 1.9k views
ADD COMMENT
0
Entering edit mode
3.7 years ago
GenoMax 141k

Try one of the solutions here: Extract fasta sequences based on ids file

ADD COMMENT
0
Entering edit mode
3.7 years ago

Since you're using R already for DESeq2, I'll give you an R option for posterity.

library("Biostrings")
library("stringr")

# Load genes as character vector.
# You can skip this step by just using the DEseq2 output directly.
genes <- read.table("res05_all.txt", header = FALSE, stringsAsFactors = FALSE)[[1]]
# Load fasta file as Biostrings.
seqs <- readDNAStringSet("transcripts.fa")
# Find sequences with header that contains gene name.
matching_seqs <- seqs[str_detect(names(seqs), str_c(genes, collapse = "|"))]
# Write to fasta file.
writeXStringSet(matching_seqs, "matching_seqs.fasta")
ADD COMMENT
0
Entering edit mode

Thanks for this. really useful.

just a small correction, there is a missing ")" :

Find sequences with header that contains gene name.

matching_seqs <- seqs[str_detect(names(seqs), str_c(genes, collapse = "|"))]

ADD REPLY
0
Entering edit mode

Thanks! I'll fix it.

ADD REPLY
0
Entering edit mode
3.7 years ago
pipe_cvm2 ▴ 10

It´s sorted I used:

cat res05_all.txt | xargs samtools faidx transcripts.fa >> genes_sequences.fa

I used the gene_count_matrix for the DESeq2 analysis so the MSTRG were for the genes and the fasta I was using to extract the sequences were transcripts (different MSTRG). Gene MSTRG: MSTRG18473 Transcript MSTRG: MSTRG18473.1

Thanks!

ADD COMMENT

Login before adding your answer.

Traffic: 1930 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6