Question: Matching The Entries And Printing Data
8.4 years ago, Nandini wrote:

Hi,

I have a file of IDs that I want to compare against a second file in order to retrieve the sequence for each ID.

File 1

CCDS2.2
CCDS3.1
CCDS30550.1
CCDS30551.1

File 2

>CCDS2.2|Hs37.3|chr1 
MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDG
SGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHL
VMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRA
>CCDS3.1|Hs37.3|chr1
MAAAGSRKRRLAELTVDEFLASGFDSESESESENSPQAETREAREAARSPDKPGGSPSAS
RRKGRASEHKDQLSRLKDRDPEFYKFLQENDQSLLNFSDSDSSEEEEGPFHSLPDVLEEA
SEEEDGAEEGEDGDRVPRGLKGKKNSVPVTVAMVERWKQAAKQRLTPKLFHEVVQAFRAA
VATTRGDQESAEANKFQVTDSAAFNALVTFCIRDLIGCLQKLLFGKVA.
>CCDS4.1|Hs37.3|chr1
MGNSHCVPQAPRRLRASFSRKPSLKGNREDSARMSAGLPGPEAARSGDAAANKLFHYIPG
TDILDLENQRENLEQPFLSVFKKGRRRVPVRNLGKVVHYAKVQLRFQHSQDVSDCYLELF
PAHLYFQAHGSEGLTFQGLLPLTELSVCPLEGSREHAFQITGPLPAP

I want to compare the two files by matching each ID in the first file against the ID encoded in the header lines of the second file (>CCDS#). If they match, the complete sequence should be printed.

For example, CCDS2.2 and CCDS3.1 are found in both the first and the second file, so the output should look something like this:

Expected output

column1      column2
CCDS2.2   >CCDS2.2|Hs37.3|chr1
MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDG
SGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHL
VMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRA

CCDS3.1   >CCDS3.1|Hs37.3|chr1
MAAAGSRKRRLAELTVDEFLASGFDSESESESENSPQAETREAREAARSPDKPGGSPSAS
RRKGRASEHKDQLSRLKDRDPEFYKFLQENDQSLLNFSDSDSSEEEEGPFHSLPDVLEEA
SEEEDGAEEGEDGDRVPRGLKGKKNSVPVTVAMVERWKQAAKQRLTPKLFHEVVQAFRAA
VATTRGDQESAEANKFQVTDSAAFNALVTFCIRDLIGCLQKLLFGKVA

CCDS30550.1    NULL

CCDS30551.1    NULL

Can this be done using awk or sed?

Thank you,

Nandini


This question falls into the general category "I want to parse a FASTA file and do something with it." awk/sed are unlikely to cut it here. As a.zielezinski suggests below, you need to learn the libraries used to parse sequence formats. Any of the Bio* projects (BioPerl, Biopython, BioRuby, BioJava...) will do this.

written 8.4 years ago by Neilfws
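
To illustrate the point about parsing: a FASTA record is a header line followed by any number of sequence lines, so the lines have to be grouped into records before IDs can be compared. Below is a minimal standard-library sketch of that grouping step. It assumes well-formed input in which every record starts with '>'; the name `read_fasta` is made up for this example, the sequences are shortened, and the Bio* parsers handle many more edge cases.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # Emit the last record, which has no following header to close it.
    if header is not None:
        yield header, ''.join(chunks)


# Shortened sequences, for illustration only.
example = [
    '>CCDS2.2|Hs37.3|chr1 ',
    'MSKGILQVHP',
    'PICDCPGCRI',
    '>CCDS3.1|Hs37.3|chr1',
    'MAAAGSRKRR',
]
records = list(read_fasta(example))
```

Each header can then be split on '|' to recover the ID, which is the step a single-line awk or sed pattern cannot easily combine with multi-line sequences.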

Also, for the record, I would suggest making the title of the question a bit more specific. You will get better answers if you ask the right questions.

written 8.4 years ago by miquelduranfrigola
8.4 years ago, a.zielezinski wrote:

This can be done easily with Biopython:

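from Bio import SeqIO

# IDs to look up, one per line (blank lines are skipped)
with open('file1.txt') as handle:
    ids = [line.strip() for line in handle if line.strip()]

remaining = set(ids)
with open('results.txt', 'w') as out:
    for seq_record in SeqIO.parse('file2.txt', 'fasta'):
        # The ID is the first '|'-separated field of the header
        seq_id = seq_record.id.split('|')[0]
        if seq_id in remaining:
            remaining.discard(seq_id)
            out.write('%s >%s\n%s\n\n' % (seq_id, seq_record.description, str(seq_record.seq)))
    # IDs without a matching sequence, in their original order
    for seq_id in ids:
        if seq_id in remaining:
            out.write('%s NULL\n' % seq_id)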
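
The same lookup can also be written around a dictionary, which reports every ID in File 1 order and avoids searching a list for each record. This is only a sketch: `match_ids` is a hypothetical helper, and it assumes the records arrive as (header, sequence) pairs, for example built from `SeqIO.parse` as `(rec.description, str(rec.seq))`. The sequences in the example are shortened.

```python
def match_ids(ids, records):
    """Return one output entry per ID: the matching record, or 'NULL'."""
    # Index records by the first '|'-separated field of the header.
    # setdefault keeps the first record if an ID occurs more than once.
    index = {}
    for header, sequence in records:
        index.setdefault(header.split('|')[0], (header, sequence))

    entries = []
    for seq_id in ids:
        if seq_id in index:
            header, sequence = index[seq_id]
            entries.append('%s >%s\n%s' % (seq_id, header, sequence))
        else:
            entries.append('%s NULL' % seq_id)
    return entries


# Shortened sequences, for illustration only.
records = [
    ('CCDS2.2|Hs37.3|chr1', 'MSKGILQVHP'),
    ('CCDS4.1|Hs37.3|chr1', 'MGNSHCVPQA'),
]
result = match_ids(['CCDS2.2', 'CCDS30550.1'], records)
```

Unlike the script above, this variant writes each NULL entry at the position of its ID in File 1 rather than collecting the unmatched IDs at the end; which layout is preferable depends on how the output will be used.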

Thank you very much. The script works for me, but I also need entries for the IDs that do not have any sequence. Is there a way to do that?

written 8.4 years ago by Nandini

Glad to help. I have already edited the source code so that it also lists the IDs that do not have any sequence.

written 8.4 years ago by a.zielezinski

That's really kind, thank you so much!

written 8.4 years ago by Nandini