Question: Extract fasta sequences from a file using a list in another file.

22 months ago, EBP91 wrote:

I have a question concerning the extraction of sequences from a fasta file (>7000 sequences) using a reference .txt file with sequence headers. I have been playing around and been looking all over the internet to find a solution for this problem, but surprisingly, nothing really matches what I want to do. So, I have two files:

1) a fasta file which looks like this:

>Zotu1
ACTGACAAAGCA
TGCACGTCATTTT
>Zotu2
ATGCATCAGCATA
TGACCCCCGTTTA
>Zotu10
CGTCGAAAAATTT
CGATACACCCTAT
>Zotu22
CGTACGTCCCCTT
CGATATAATATATA

2) a .txt file with a list of sequence names:

Zotu1
Zotu2

Now, I want to use the .txt file to select sequences from the .fasta file. I have two semi-solutions that do part of the job.

OPTION 1:

cat list.txt | awk '{gsub("_","\\_",$0);$0="(?s)^>"$0".*?(?=\\n(\\z|>))"}1' | pcregrep -oM -f - sequences.fas > newfile.fas

Problem: this command returns the full sequences, but it extracts too many of them, because every header that partially matches a string in the .txt file is selected. In this case, Zotu10 and Zotu22 are selected as well.

OPTION 2:

grep -A 1 -wFf list.txt sequences.fas > newfile2.fas

Problem: this command correctly selects only the sequences whose headers completely match the strings in the .txt file, but it returns only the first line of each sequence rather than the full fasta record. The output thus looks like this:

>Zotu1
ACTGACAAAGCA
>Zotu2
ATGCATCAGCATA

I tried combining both solutions, but that did not end well. I would appreciate an elegant solution to this problem, preferably one that builds on the commands I already have.

Many thanks!


Please format your fasta sequences appropriately, using the formatting button: select the text and click the 101010 button, which is in your toolbar when you compose or edit a post.

written 22 months ago by WouterDeCoster

I just did that for the OP, but from next time he or she should do it. Thanks, WouterDeCoster.

written 22 months ago by lakhujanivijay

Thanks! I adjusted it a bit. It should be fine now.

written 22 months ago by EBP91

Did you try googling it, or try any existing solution?

I copied and pasted your title, and the first link I got was this: Extract fasta sequences from a large file using a list of names

written 22 months ago by venu

Yes, I have been entirely through that thread (and several others) before posting here.

written 22 months ago by EBP91

22 months ago, cpad0112 wrote:

with seqtk:

$ seqtk subseq test.fa test.txt
>Zotu1
ACTGACAAAGCATGCACGTCATTTT
>Zotu2
ATGCATCAGCATATGACCCCCGTTTA

with grep:

$ grep -w -A 2 -f test.txt test.fa --no-group-separator
>Zotu1
ACTGACAAAGCA
TGCACGTCATTTT
>Zotu2
ATGCATCAGCATA
TGACCCCCGTTTA

inputs:

$ cat test.fa 
>Zotu1
ACTGACAAAGCA
TGCACGTCATTTT
>Zotu2
ATGCATCAGCATA
TGACCCCCGTTTA
>Zotu10
CGTCGAAAAATTT
CGATACACCCTAT
>Zotu22
CGTACGTCCCCTT
CGATATAATATATA

$ cat test.txt
Zotu1
Zotu2

Did the job! Thanks!

written 22 months ago by EBP91

grep -w -A 2 -f test.txt test.fa --no-group-separator doesn't work if the headers contain special characters (regular-expression metacharacters), which is common. Use grep -w -A 2 -Ff test.txt test.fa --no-group-separator instead; -F searches for fixed strings rather than patterns.

For anyone who has only one sequence line per header in their fasta file, use grep -w -A 1 -Ff test.txt test.fa --no-group-separator instead.

written 3 months ago by alowi33
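
The grep commands above rely on -A N, which prints a fixed number of lines after each match, so they only return complete records when every sequence spans exactly N lines. As a hedged alternative for files where that does not hold, the sketch below uses awk to compare whole IDs and print everything up to the next header. The function name extract_fasta and the demo file names are made up for illustration, and the ID is assumed to be the header text between ">" and the first whitespace.

```shell
# extract_fasta LIST FASTA
# Print every record of FASTA whose ID appears in LIST (one ID per line).
# Assumption: the ID is the header text between ">" and the first whitespace.
extract_fasta() {
    awk '
        # First file (the list): remember each non-empty ID.
        NR == FNR { if ($1 != "") wanted[$1]; next }
        # Header line of the fasta file: decide once for the whole record.
        /^>/      { id = substr($1, 2); keep = (id in wanted) }
        # Print the header and every following line while the record is wanted.
        keep
    ' "$1" "$2"
}

# Demo with the data from this thread, written to a temporary directory.
tmp=$(mktemp -d)
printf '>Zotu1\nACTGACAAAGCA\nTGCACGTCATTTT\n>Zotu2\nATGCATCAGCATA\nTGACCCCCGTTTA\n>Zotu10\nCGTCGAAAAATTT\nCGATACACCCTAT\n>Zotu22\nCGTACGTCCCCTT\nCGATATAATATATA\n' > "$tmp/test.fa"
printf 'Zotu1\nZotu2\n' > "$tmp/test.txt"

extract_fasta "$tmp/test.txt" "$tmp/test.fa" > "$tmp/subset.fa"
cat "$tmp/subset.fa"
```

Because the IDs are looked up as exact array keys rather than as patterns, Zotu1 cannot match Zotu10, and characters such as "|" or "." in an ID need no escaping.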

22 months ago, Joe wrote:

This has been asked a lot, so an existing solution will almost certainly match what you need to do.

You will usually need to linearise your fasta:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < myseqs.fasta > linear.fasta

then:

while read IDS ; do grep "\b$IDS\b" linear.fasta ; done < listofids.txt

You can use \b in the grep command to mark word boundaries so that Seq1 doesn't match Seq11 for example.

Then you can rewrap your sequences (partially) by using tr '\t' '\n'.
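
The three steps above (linearise, select, rewrap) can be chained as in the sketch below. It is only an illustration on made-up demo files in a temporary directory, and it swaps the "\b$IDS\b" pattern for grep -w -F, which also matches whole words only but treats each ID as a literal string, so regular-expression characters in an ID need no escaping.

```shell
# Demo data: file names follow the commands above, contents are illustrative.
tmp=$(mktemp -d)
printf '>Zotu1\nACTGACAAAGCA\nTGCACGTCATTTT\n>Zotu2\nATGCATCAGCATA\nTGACCCCCGTTTA\n>Zotu10\nCGTCGAAAAATTT\nCGATACACCCTAT\n' > "$tmp/myseqs.fasta"
printf 'Zotu1\nZotu2\n' > "$tmp/listofids.txt"

# 1) Linearise: each record becomes "header<TAB>whole sequence" on one line.
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' \
    < "$tmp/myseqs.fasta" > "$tmp/linear.fasta"

# 2) Select the lines containing each ID as a whole word (-w), literally (-F).
# 3) Rewrap: turn the tab back into a newline, giving two-line records.
while read -r id; do
    grep -w -F -- "$id" "$tmp/linear.fasta"
done < "$tmp/listofids.txt" | tr '\t' '\n' > "$tmp/selected.fasta"

cat "$tmp/selected.fasta"
```

The rewrap is only partial in the sense that each sequence comes back on a single line rather than at its original line width. Note also that a whole-word match is applied to the entire linearised line, so an ID that also occurs as a separate word in another record's description would select that record too.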

The code I use is below (using biopython though, so it is a more robust method).

22 months ago, genomax wrote:

Step 1: Get the faSomeRecords utility from Jim Kent at UCSC: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/faSomeRecords (this is the Linux link; OS X and source versions are also available).

Step 2: Make the file executable

$ chmod u+x faSomeRecords

Step 3: Run faSomeRecords

$ ./faSomeRecords
faSomeRecords - Extract multiple fa records
usage:
   faSomeRecords in.fa listFile out.fa

in.fa = Your sequence file

listFile = file with fasta headers/names (one per line)

out.fa = file to store the result

22 months ago, finswimmer wrote:

Strange (or funny): this is the third time today that I have recommended seqkit.

$ seqkit grep -n -f list.txt sequences.fas > newfile2.fas

should do the job.

fin swimmer
