Matching The Entries And Printing Data
1
0
Entering edit mode
11.5 years ago
NB ▴ 960

Hi ,

I have a file with Id which I want to compare it with other file to get the sequence of a particular id.

File 1

CCDS2.2
CCDS3.1
CCDS30550.1
CCDS30551.1

File 2

>CCDS2.2|Hs37.3|chr1 
MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDG
SGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHL
VMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRA
>CCDS3.1|Hs37.3|chr1
MAAAGSRKRRLAELTVDEFLASGFDSESESESENSPQAETREAREAARSPDKPGGSPSAS
RRKGRASEHKDQLSRLKDRDPEFYKFLQENDQSLLNFSDSDSSEEEEGPFHSLPDVLEEA
SEEEDGAEEGEDGDRVPRGLKGKKNSVPVTVAMVERWKQAAKQRLTPKLFHEVVQAFRAA
VATTRGDQESAEANKFQVTDSAAFNALVTFCIRDLIGCLQKLLFGKVA.
>CCDS4.1|Hs37.3|chr1
MGNSHCVPQAPRRLRASFSRKPSLKGNREDSARMSAGLPGPEAARSGDAAANKLFHYIPG
TDILDLENQRENLEQPFLSVFKKGRRRVPVRNLGKVVHYAKVQLRFQHSQDVSDCYLELF
PAHLYFQAHGSEGLTFQGLLPLTELSVCPLEGSREHAFQITGPLPAP

I want these two files to be compared by comapring ID in the first file with the ID encoded in the second file >CCDS#. If it is same then print the complete sequence.

For example, CCDS2.2 and CCDS3.1 is found in first file and in the second file. So in the output I should have something like this given below

Expected output

column1      column2
CCDS2.2   >CCDS2.2|Hs37.3|chr1
MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDG
SGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHL
VMPEHQSRCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRA

CCDS3.1   >CCDS3.1|Hs37.3|chr1
MAAAGSRKRRLAELTVDEFLASGFDSESESESENSPQAETREAREAARSPDKPGGSPSAS
RRKGRASEHKDQLSRLKDRDPEFYKFLQENDQSLLNFSDSDSSEEEEGPFHSLPDVLEEA
SEEEDGAEEGEDGDRVPRGLKGKKNSVPVTVAMVERWKQAAKQRLTPKLFHEVVQAFRAA
VATTRGDQESAEANKFQVTDSAAFNALVTFCIRDLIGCLQKLLFGKVA

CCDS30550.1    NULL

CCDS30551.1    NULL

Can this be done using awk or sed ?

Thank you,

Nandini

sequence extraction id script • 2.2k views
ADD COMMENT
1
Entering edit mode

This question falls into the general category "I want to to parse a fasta file and do something to it." awk/sed are unlikely to cut it here. As a.zielezinski suggests below, you need to learn the libraries used to parse sequence formats. Any of the Bio* projects (Bioperl, BioPython, BioRuby, BioJava...) will do this.

ADD REPLY
0
Entering edit mode

Also, and for the record, I will suggest the caption of the question to be a bit more specific. You will get good answers if you ask the right questions.

ADD REPLY
7
Entering edit mode
11.5 years ago

It can be easily done using BioPython.

from Bio import SeqIO
ids = [line.strip() for line in open('file1.txt') if line.strip()]
out = open('results.txt','w')
for seq_record in SeqIO.parse(open('file2.txt'),'fasta'):
    id = seq_record.id.split('|')[0]
    if id in ids:
        ids.remove(id)
        out.write('%s >%s\n%s\n\n' % (id, seq_record.description, str(seq_record.seq)))
for id in ids:
    out.write('%s NULL\n' % id)
ADD COMMENT
0
Entering edit mode

thank you very much. the script works for me but I also need entries against the IDs which do not have any sequence, is there any way for that ?

ADD REPLY
1
Entering edit mode

glad to be helpful. I already edited the source code in order to see ids which do not have any sequence.

ADD REPLY
0
Entering edit mode

That's really kind, thank you so much!

ADD REPLY

Login before adding your answer.

Traffic: 2591 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6