Question: extract multiple sequences from a fasta file with a list of numbers in header
0
gravatar for barvazduck
4.4 years ago by
barvazduck10
Israel
barvazduck10 wrote:

Hi all,

 

I got a large faste file that I need to extract multiple sequences from the header for each sequence is composed of sevreral parameters such as Chromosome, Genetic distance etc, and also has an ID number in the end. I need to extract sequences based on the ID number.

 

I tried using 'grep':

grep -A 2 -wFf LIST.txt IN.fa > OUT.fa

 

But this matches also non-specific numbers. For example if I search for ID: 79695 I also get sequences with IDs such as 1379695 and 7969522.

 

Any idea how to solve this or other solutions?

 

Thanks

sequence • 9.0k views
ADD COMMENTlink modified 4.4 years ago by robert.davey260 • written 4.4 years ago by barvazduck10

Can you post example of your header? Is it ID: 79695 or ID:79695 or just 79695?

ADD REPLYlink modified 4.4 years ago • written 4.4 years ago by PoGibas4.8k

>chr7B, genetic location: 72.6 cM, bin num: 3629, phase id: 1, scaf14, marker id: 1079695

ADD REPLYlink written 4.4 years ago by barvazduck10

This is probably simplest in either biopython or bioperl. They'll handle the presence of multi-line entries transparently and allow you to just use a regex to get out the ID (just store the target IDs in a hash/dict and query that).

ADD REPLYlink written 4.4 years ago by Devon Ryan90k
6
gravatar for robert.davey
4.4 years ago by
robert.davey260
European Union
robert.davey260 wrote:

As long as I'm understanding your question accurately, the simplest way would be to combine grep and sed. If your LIST.txt contains IDs, and you are specifically looking for "marker id: "+<exact ID match>, then the following should work:

grep -A 1 -wFf <( sed -r 's/^/marker id: /' LIST.txt ) IN.fa > OUT.fa

Edit: apologies, missed the -w flag off in my retype. The previous answer failed when the start of the ID field contained a search ID.

ADD COMMENTlink modified 4.4 years ago • written 4.4 years ago by robert.davey260

Thanks,

But this still wont get the excect ID number but also all the nested ones

ADD REPLYlink written 4.4 years ago by barvazduck10

EDIT: Fixed missing -w flag

Seems to work properly for me:

test.fa:

>chr7B, genetic location: 72.6 cM, bin num: 3629, phase id: 1, scaf14, marker id: 1079695
ATGCTAGCTAGCTAGCCGGCTAGCTACTATCGGCTATCGTACGTAGC
>chr7B, genetic location: 72.6 cM, bin num: 7969, phase id: 1, scaf14, marker id: 1079696
GTATCTGGCATCTTACTGACGGCGATCGATGCGCGCTAGCTAGCTAT

When LIST.txt:

davey:~/temp$ cat LIST.txt
7969

davey:~/temp$ grep -A 1 -wFf <( sed -r 's/^/marker id: /' LIST.txt ) test.fa
davey:~/temp$

I get no output. With LIST.txt:

davey:~/temp$ cat LIST.txt
1079696

I get:

davey:~/temp$ grep -A 1 -wFf <( sed -r 's/^/marker id: /' LIST.txt ) test.fa
>chr7B, genetic location: 72.6 cM, bin num: 3629, phase id: 1, scaf14, marker id: 1079696
GTATCTGGCATCTTACTGACGGCGATCGATGCGCGCTAGCTAGCTAT

What bash/grep versions are you using?

ADD REPLYlink modified 4.4 years ago • written 4.4 years ago by robert.davey260

bash  version 4.3.11(1)-release

grep 2.16

Anyway, it works partially because if you use 107969 then you get

>chr7B, genetic location: 72.6 cM, bin num: 3629, phase id: 1, scaf14, marker id: 1079695
ATGCTAGCTAGCTAGCCGGCTAGCTACTATCGGCTATCGTACGTAGC
>chr7B, genetic location: 72.6 cM, bin num: 7969, phase id: 1, scaf14, marker id: 1079696
GTATCTGGCATCTTACTGACGGCGATCGATGCGCGCTAGCTAGCTAT

 

Thanks for the efforts.

 

 

ADD REPLYlink written 4.4 years ago by barvazduck10

I've fixed my previous post. Sorry about the omission of the -w option.

ADD REPLYlink written 4.4 years ago by robert.davey260

Yeah I tried that, seems like something is wrong with my LIST.txt
 

ADD REPLYlink written 4.4 years ago by barvazduck10

Do you have non-standard (i.e. Windows) carriage returns in your file? You can use dos2unix to get rid of them.

It does indeed fail with the following:

1079696^M
1079695^M 

If you open up your LIST.txt in vi, the status bar will help tell you if you have a DOS file or not: "testlist.txt" [dos] 2L, 18C

ADD REPLYlink modified 4.4 years ago • written 4.4 years ago by robert.davey260

Actually it was a very stupid mistake, I just didnt use -w when I checked the result file using another grep line

ADD REPLYlink written 4.4 years ago by barvazduck10
4
gravatar for Pierre Lindenbaum
4.4 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum120k wrote:

sort your list of ID  LIST.txt

linearize the fasta, extract the id after marker_id, sort on the first column, join with the LIST, convert back to fasta.

awk '/^>/ {i=index($0,"marker id:"); printf("%s%s\t%s\t",(N==0?"":"\n"),substr($0,i+11),$0);++N;next;} {printf("%s",$0);} END{printf("\n");}' input.fa|\
sort -t '    '  -k1,1 |\
join -t '    ' -1 1 -2 1 LIST.txt - |\
cut -f 2,3 |\
tr "\t" "\n"
ADD COMMENTlink written 4.4 years ago by Pierre Lindenbaum120k
0
gravatar for Bara'a
4.4 years ago by
Bara'a230
Amman - Jordan
Bara'a230 wrote:

If you are aiming at retrieving certain sequences by ID for further manipulation , you can use the Bio.SeqIO.index()function of Bio.SeqIO module provided in Biopython .

This functions allows you to index moderately large sequence files without consuming much memory as it stores where each record is within the file and parse it on demand given identifiers .

A short Biopython code will do the work :

from Bio import SeqIO

def get_ID (identifier) :

parts= identifier.split (",")

assert len (parts)==6 and parts[0]=="chr" and parts[1]=="genetic location:" and parts[2]=="bin num:" and parts[3]=="phase id:" and parts[4]=="scaf" and parts[5]=="marker id:"

return parts [5]

file_dictionary=SeqIO.index(" your_file.fasta ", " fasta ", key_function=get_ID)

file_dictionary.keys()

handle=open(" selected.fasta" , "w")

for ID in [ "id1", "id2", "id3", ...  ] :

      handle.write( file_dictionary(ID).seq )

handle.close()

You can refer to Biopython tutorial section 5.4.2 for more details and : http://biopython.org/DIST/docs/tutorial/Tutorial.html

Also , you can find other indexing methods for extremely large sequence files in section 5.4 .

Hope you find this useful :)

EDITED to suit your file's header format .

 
 
ADD COMMENTlink modified 4.4 years ago • written 4.4 years ago by Bara'a230
1

I'd hate to write out the IDs by hand in your for loop :) Consider reading them in by file:

with open('LIST.txt','r') as ids:
  for ID in ids:
    ID = ID.rstrip()
    handle.write( file_dictionary(ID).seq )
ADD REPLYlink modified 4.4 years ago • written 4.4 years ago by robert.davey260
1

I wrote that command assuming that she wants to retrieve very few sequences , so why bother reading them from a file ?!

Anyways , All roads lead to Rome !!  :D

 
 
ADD REPLYlink written 4.4 years ago by Bara'a230
1

Quite so! :)

ADD REPLYlink written 4.4 years ago by robert.davey260
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1540 users visited in the last hour