Find sequences with header that contains gene name.

Question

Extracting sequences from fasta by gene ID

1

Entering edit mode

3.7 years ago

pipe_cvm2 ▴ 10

Hello,

I am trying to extract the fasta sequences corresponding to the MSTRG genes (RNA sequence/ tuxedo packages) identified during the DESeq2 analysis.

I´ve created a fasta file with all my genes IDs and MSTRGs and the corresponding sequences (transcripts.fa):

>rna-gnl|WGS:LPUQ|mrna.ZEAMMB73_Zm00001d048578 gene=MSTRG.1459
ATGCCAGGCGCCCGCGGCAAGGTCGCCGGAGCCGCCGAGGCGGGCGT...

and I have a list with all the DE MSTRG (res05_all.txt):

MSTRG.34995                     
MSTRG.16561                     
...                     
MSTRG.26406

I´ve tried this and it works for single MSTRG but not when I try to loop it:

grep -a "MSTRG..34995" stringtie_merged.gtf | head -n 1 > MSTRG..34995.bed

bedtools getfasta -fi genome.fa -bed MSTRG..34995.bed -fo MSTRG..34995.bed.fa.out

I´ve also tried samtools faidx without success. Any recommendations?

sequencing RNA-Seq alignment sequence • 1.9k views

ADD COMMENT • link 3.7 years ago by pipe_cvm2 ▴ 10

score 0 · Answer 1 · 2020-07-24

0

Entering edit mode

3.7 years ago

GenoMax 141k

Try one of the solutions here: Extract fasta sequences based on ids file

ADD COMMENT • link 3.7 years ago by GenoMax 141k

score 0 · Answer 2 · 2020-07-24

0

Entering edit mode

3.7 years ago

rpolicastro 13k

Since you're using R already for DESeq2, I'll give you an R option for posterity.

library("Biostrings")
library("stringr")

# Load genes as character vector.
# You can skip this step by just using the DEseq2 output directly.
genes <- read.table("res05_all.txt", header = FALSE, stringsAsFactors = FALSE)[[1]]
# Load fasta file as Biostrings.
seqs <- readDNAStringSet("transcripts.fa")
# Find sequences with header that contains gene name.
matching_seqs <- seqs[str_detect(names(seqs), str_c(genes, collapse = "|"))]
# Write to fasta file.
writeXStringSet(matching_seqs, "matching_seqs.fasta")

ADD COMMENT • link 3.2 years ago by rpolicastro 13k

0

Entering edit mode

Thanks for this. really useful.

just a small correction, there is a missing ")" :

Find sequences with header that contains gene name.

matching_seqs <- seqs[str_detect(names(seqs), str_c(genes, collapse = "|"))]

ADD REPLY • link 3.2 years ago by pipe_cvm2 ▴ 10

0

Entering edit mode

Thanks! I'll fix it.

ADD REPLY • link 3.2 years ago by rpolicastro 13k

score 0 · Answer 3 · 2020-07-24

It´s sorted I used:

cat res05_all.txt | xargs samtools faidx transcripts.fa >> genes_sequences.fa

I used the gene_count_matrix for the DESeq2 analysis so the MSTRG were for the genes and the fasta I was using to extract the sequences were transcripts (different MSTRG). Gene MSTRG: MSTRG18473 Transcript MSTRG: MSTRG18473.1

Thanks!