How to match FASTA headers against a list of names?
5.3 years ago
fec2 ▴ 50

Hi,

I have a multi-FASTA file containing 453 sequences that looks like the following:

>M.Bce12308ORF4755P   GTWWAC 
ATGCGTGACCTGATCGAAGAGCCGGGCGGCGGCGCCGCGAGCGAGGCGGAGGCGGTTCAGCCCGCCGCTGCCGTGCCGCGCGCGCTGCCGTCCGGTATCG

>M.Bce1254ORF9725P   GTWWAC
ATGCGTGACCTGATCGAAGACCCGGGCGGCGGCGCCGCGAGCGAGGCGGAGGCGGTTCAGCCCGCCGCTGCCGTGCCGCGCGCGCTGCCGTCCGGTATCG

I also have a list of 461 sequence names, as below:

M.Bce12308ORF4755P
M.Bce122ORF1082P
M.Bce12308ORF4755P
M.Bce1254ORF9725P

How can I match the name list against the FASTA file, so that I can find out which sequences from the name list are missing from the FASTA file?

Thank you!

Felix


Did you try searching the forum at all? This is one of the most widely addressed problems here.


Hi, I did try searching, but I couldn't find exactly the same issue: what I want is a list of the names that are missing from the FASTA file. Do you have any idea which keywords I should use to search for this? Thanks.


exactly same issue

Although I highly doubt you could not find the exact same case (I recall addressing multi-part sequence identifiers and how to deal with them a few years ago), this approach is not helpful. What you need is not something that you can copy and paste and that "just works": such solutions are rare and don't teach us anything. You need a "pointer", a hint that takes you one step closer to a solution than you are right now. That way, you get to solve the problem yourself while overcoming an obstacle that might have taken quite some time to get past on your own.

5.3 years ago
# IDs in seqs.fa
$ grep '^>' seqs.fa | awk '{print $1}' | sed 's/^>//'
M.Bce12308ORF4755P
M.Bce1254ORF9725P

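# IDs in list.txt that are missing from seqs.fa
# (-F treats each ID as a literal string, so the dots are not regex wildcards)
$ grep -F -w -v -f <(grep '^>' seqs.fa | awk '{print $1}' | sed 's/^>//') list.txt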
M.Bce122ORF1082P
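
The same lookup can also be sketched in awk alone, which avoids process substitution and compares whole IDs exactly rather than as patterns. This is only a minimal sketch under a few assumptions: the function names `missing_ids` and `duplicate_ids` are hypothetical helpers made up for illustration, the ID is taken to be the header text between `>` and the first whitespace, and the list is assumed to hold one name per line. Since the list has more names (461) than the file has sequences (453), the second helper reports names that are repeated in the list.

```shell
# Hypothetical helpers for illustration; not part of any existing tool.

# Print each name from the list (2nd argument) that has no matching
# header ID in the FASTA file (1st argument). Each missing name is
# printed once, in the order it first appears in the list.
missing_ids() {
    awk '
        # First file (FASTA): remember the ID of every header line.
        FILENAME == ARGV[1] {
            if (/^>/) { sub(/^>/, "", $1); seen[$1] = 1 }
            next
        }
        # Second file (name list): skip blank lines, print unseen names once.
        NF && !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$1" "$2"
}

# Print each name that occurs more than once in the list.
duplicate_ids() {
    awk 'NF && ++count[$1] == 2 { print $1 }' "$1"
}
```

With the example data from the question, `missing_ids seqs.fa list.txt` prints `M.Bce122ORF1082P`, and `duplicate_ids list.txt` prints `M.Bce12308ORF4755P`. Files with Windows line endings would need the trailing carriage returns stripped first.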

Thank you very much!

5.3 years ago
Chirag Parsania ★ 2.0k
library(Biostrings)
library(tidyverse)


## target sequence names (from the name list)
target_seq_names <- c("M.Bce12308ORF4755P", "M.Bce122ORF1082P", "M.Bce12308ORF4755P", "M.Bce1254ORF9725P")


## get sequence names from the fasta file
fasta_seq_names <- Biostrings::readDNAStringSet("input.fasta") %>%
        names() %>% ## get names
        gsub(pattern = "\\s.*", replacement = "", x = .) ## clean headers: remove everything after the first whitespace

fasta_seq_names

[1] "M.Bce12308ORF4755P" "M.Bce1254ORF9725P"


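## present in both
intersect(fasta_seq_names, target_seq_names)

[1] "M.Bce12308ORF4755P" "M.Bce1254ORF9725P"


## in the name list but missing from the fasta file
setdiff(target_seq_names, fasta_seq_names)

[1] "M.Bce122ORF1082P"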