Question

Matching The Entries And Printing Data

0

Entering edit mode

12.7 years ago

NB ▴ 960

Hi ,

I have a file with Id which I want to compare it with other file to get the sequence of a particular id.

File 1

CCDS2.2
CCDS3.1
CCDS30550.1
CCDS30551.1

File 2

>CCDS2.2|Hs37.3|chr1 
MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDG
SGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHL
VMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRA
>CCDS3.1|Hs37.3|chr1
MAAAGSRKRRLAELTVDEFLASGFDSESESESENSPQAETREAREAARSPDKPGGSPSAS
RRKGRASEHKDQLSRLKDRDPEFYKFLQENDQSLLNFSDSDSSEEEEGPFHSLPDVLEEA
SEEEDGAEEGEDGDRVPRGLKGKKNSVPVTVAMVERWKQAAKQRLTPKLFHEVVQAFRAA
VATTRGDQESAEANKFQVTDSAAFNALVTFCIRDLIGCLQKLLFGKVA.
>CCDS4.1|Hs37.3|chr1
MGNSHCVPQAPRRLRASFSRKPSLKGNREDSARMSAGLPGPEAARSGDAAANKLFHYIPG
TDILDLENQRENLEQPFLSVFKKGRRRVPVRNLGKVVHYAKVQLRFQHSQDVSDCYLELF
PAHLYFQAHGSEGLTFQGLLPLTELSVCPLEGSREHAFQITGPLPAP

I want these two files to be compared by comapring ID in the first file with the ID encoded in the second file >CCDS#. If it is same then print the complete sequence.

For example, CCDS2.2 and CCDS3.1 is found in first file and in the second file. So in the output I should have something like this given below

Expected output

column1      column2
CCDS2.2   >CCDS2.2|Hs37.3|chr1
MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDG
SGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHL
VMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRA

CCDS3.1   >CCDS3.1|Hs37.3|chr1
MAAAGSRKRRLAELTVDEFLASGFDSESESESENSPQAETREAREAARSPDKPGGSPSAS
RRKGRASEHKDQLSRLKDRDPEFYKFLQENDQSLLNFSDSDSSEEEEGPFHSLPDVLEEA
SEEEDGAEEGEDGDRVPRGLKGKKNSVPVTVAMVERWKQAAKQRLTPKLFHEVVQAFRAA
VATTRGDQESAEANKFQVTDSAAFNALVTFCIRDLIGCLQKLLFGKVA

CCDS30550.1    NULL

CCDS30551.1    NULL

Can this be done using awk or sed ?

Thank you,

Nandini

sequence extraction id script • 2.7k views

ADD COMMENT • link updated 12.7 years ago by Andrzej Zielezinski 11k • written 12.7 years ago by NB ▴ 960

1

Entering edit mode

This question falls into the general category "I want to to parse a fasta file and do something to it." awk/sed are unlikely to cut it here. As a.zielezinski suggests below, you need to learn the libraries used to parse sequence formats. Any of the Bio* projects (Bioperl, BioPython, BioRuby, BioJava...) will do this.

ADD REPLY • link 12.7 years ago by Neilfws 49k

0

Entering edit mode

Also, and for the record, I will suggest the caption of the question to be a bit more specific. You will get good answers if you ask the right questions.

ADD REPLY • link 12.7 years ago by miquelduranfrigola ▴ 790

score 7 · Answer 1 · 2012-10-25

7

Entering edit mode

12.7 years ago

Andrzej Zielezinski 11k

It can be easily done using BioPython.

from Bio import SeqIO
ids = [line.strip() for line in open('file1.txt') if line.strip()]
out = open('results.txt','w')
for seq_record in SeqIO.parse(open('file2.txt'),'fasta'):
    id = seq_record.id.split('|')[0]
    if id in ids:
        ids.remove(id)
        out.write('%s >%s\n%s\n\n' % (id, seq_record.description, str(seq_record.seq)))
for id in ids:
    out.write('%s NULL\n' % id)

ADD COMMENT • link 12.7 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

thank you very much. the script works for me but I also need entries against the IDs which do not have any sequence, is there any way for that ?

ADD REPLY • link 12.7 years ago by NB ▴ 960

1

Entering edit mode

glad to be helpful. I already edited the source code in order to see ids which do not have any sequence.