Removing Redundant Sequence Based On Genus Name
11.4 years ago
macmath ▴ 170

Dear colleagues,

I need to remove redundant sequences based on taxonomic name (genus name). For example, given [Serratia plymuthica A30] and [Serratia sp. AS12], keep only the first one encountered (or the first after sorting). The result should be a unique set of sequences, one per genus, each keeping its header with the accession info (e.g. ZP_07379498).

Sample Input file

>EKF64793 phenylalanine--tRNA ligase, beta subunit [Serratia plymuthica A30].
MKFSELWLREWVNPAISSEALSDQITMAGLEVDGVEPIAGVFNGVVVGHVVECGQHPNADKLRVTKVNVGGDRLLDIVCGAPNCRTGL
>ZP_07379498 phenylalanyl-tRNA synthetase, beta subunit [Pantoea sp. aB].
MKFSELWLREWVNPALDSAALSEQITMAGLEVDGVEPVAGAFHGVVVGEVVECGQHPNADKLRVTKINVGGERLLDIVCGAPNCRQG
>YP_004500523 Phenylalanyl-tRNA synthetase subunit beta [Serratia sp. AS12].
MKFSELWLREWVNPAISSEALSDQITMAGLEVDGVEPVAGVFNGVVVGHVVECGQHPNADKLRVTKVNVGGDRLLDIVCGAPNCRTG
>ZP_04615044 Phenylalanyl-tRNA synthetase beta chain [Yersinia ruckeri ATCC 29473].
MKFSELWLREWVNPAISSDELAHQITMAGLEVDGVEAVAGEFNGVVVGEVVECGQHPNADKLRVTKVNVGGERLLDIVCGAPNCRQG
>ZP_10294785 phenylalanyl-tRNA ligase subunit beta [Pseudoalteromonas rubra ATCC 29570].
MKFSEKWLREWVNPAIDTEALSEQLSMAGLEVDGVDPVAGDFEGVVIGEVVECGQHPDADKLRVTKVNVGEDELLDIVCGAANCRTG
>ZP_04635334 Phenylalanyl-tRNA synthetase beta chain [Yersinia intermedia ATCC 29909].
MKFSELWLREWVNPAISSDDLAHQITMAGLEVDGVDAVAGEFNGVVIGHVVECGQHPNADKLRVTKIDVGGDRLLD
>ZP_04626227 Phenylalanyl-tRNA synthetase beta chain [Yersinia kristensenii ATCC 33638].
MKFSELWLREWVNPAISSDDLAHQITMAGLEVDGVDAVAGEFNGVVIGHVVECGQHPNADKLRVTKIDVGGERLLDIVCGAPNCRQGLKVAVATVGAVLPGDFKIKAAKLRGEPSEGMLCSFSELAIAEDHDGIIELPADAPIGVDLREYLKLDDKTIEISVTPNRAD
>ZP_04630893 Phenylalanyl-tRNA synthetase beta chain [Yersinia frederiksenii ATCC 33641].
MKFSELWLREWVNPAISSDDLAHKITMAGLEVDGIDPVAGEFNGVVVGHVVECGQHPNADKLRVTKIDVGGDRLLDIVCGAPNCRQGLKVAVATVGAVLPGDFKIKAAKLRGEPSEGMLCSFSELAISEDHDGIIELPADAPIGVDLREYLHLDDKTIEISVTPNRAD
>ZP_09390203 phenylalanine--tRNA ligase, beta subunit [Yokenella regensburgei ATCC 43003].
MKFSELWLREWVNPAVDSEALSDQITMAGLEVDGVEPVAGEFHGVVVGEVVECGQHPNADKLRVTKINVGGERLL
>ZP_04610982 Phenylalanyl-tRNA synthetase beta chain [Yersinia rohdei ATCC 43380].
MKFSELWLREWVNPAISSDDLAHQITMAGLEVDGIDAVAGEFNGVVVGQVVECGQHPNADKLRVTKIDVGGDRLLD
>ZP_04640876 Phenylalanyl-tRNA synthetase beta chain [Yersinia mollaretii ATCC 43969].
MKFSELWLREWVNPAISSDELAHQITMAGLEVDGVESVAGEFNGVVVGHVVECGQHPNADKLRVTKIDVGGERLLDIVCGAPNCRQGLKVAVATVGAVLPGDFKIKAAKLRGEPSEGMLCSFSELAISDDHDGIIELPADAPIGVDVREYLQLNDKTIEISVTPNRAD
>ZP_04627403 Phenylalanyl-tRNA synthetase beta chain [Yersinia bercovieri ATCC 43970].
MKFSELWLREWVNPAISSDALAHQITMAGLEVDGVESVAGEFNGVVVGHVVECGQHPNADKLRVTKIDVGGDRLLD
>ZP_09375718 phenylalanine--tRNA ligase, beta subunit [Hafnia alvei ATCC 51873].
MKFSELWLREWVNPAISSEALSEQITMAGLEVDGVEPVAGEFNGVFVGEVVECGQHPNADKLRVTKVNVGGERLLD
>CBX80523 phenylalanyl-tRNA synthetase, beta subunit [Erwinia amylovora ATCC BAA-2158].
MKFSELWLREWVSPAIDSAALCEQITMAGLEVDGVDAVAGAFHGVVVGDVVECAQHPNADKLRVTKINVGGDRLLDI
>ZP_08825428 Phenylalanyl-tRNA synthetase beta chain [Thiorhodococcus drewsii AZ1].
MRFSEAWLREWVNPPVDTQQLADQLSMAGLEVDAVEPAASAFSGVFVGLVRAIAPHPDAAKLRICSVDVGQGDPLQIICGAANVAEGMRVPVATIGARLPGDFKIKRAKLRGVESFGMICSAKELGLAESSDGILPLPADAPLGEDFRAWLALDDQCIEVDLTPDRG
>ZP_10495587 phenylalanyl-tRNA synthetase subunit beta [Alishewanella aestuarii B11].
MKFSESWLREWVNPALDSTALSEQLSMAGLEVDGMDKVAGDFHGVVVGEVVECGKHPEADKLQVTKVNIGGAELLDIVCGARNCRLGLKVAVATVGAVLPGNFEIKQAKLRGQPSHGMLCSFSELGMADDSDGIIELPADAPIGQDLRQYLALDDLSIEVDLTPND
>ZP_10115070 phenylalanyl-tRNA synthetase, beta subunit [Beggiatoa alba B18LD].
MKFSEQWLRTWVNPQMTTTELVDCLTMAGLEVDDVETVAPAFDNVVVGEVLTIERHPDAEKLKVCQVNTGTESPLTIVCGASNVQAG
>ZP_10063574 phenylalanyl-tRNA synthetase subunit beta [gamma proteobacterium BDW918].
MKFSEQWLREWVNPAVGTDELAAQITMAGLEVDAIDPVAGVFSGVVVAEIVATAPHPDAEKLQVCRVNAGSEEVQIVCGAANARPGIKVPLATLGAVLPGDFKIKKAKLRGVESFGMLCAEEELGLAEKSDGLMELPLDAPVGEDIRVFLGLDDSIIELGLTPNRADC
>ZP_10350440 phenylalanyl-tRNA ligase subunit beta [Alishewanella agri BL06].
MKFSESWLREWVNPALDSTALSEQLSMAGLEVDGMDKVAGDFHGVVVGEVVECGKHPEADKLQVTKVNIGGAELLDIVCGARNCRL
>ZP_09228799 phenylalanyl-tRNA synthetase beta chain [Pseudoalteromonas sp. BSi20311].
MKFSEKWLREWVNPAIDTQALSEQLSMAGLEVDGVEPAAAKFNGVVVGEVIECGQHPDADKLRVTKINVGGDELLDIVCGAPNCRQGI
>ZP_09240506 phenylalanyl-tRNA synthetase beta chain [Pseudoalteromonas sp. BSi20480].
MKFSEKWLREWVNPAIDTQALSEQLSMAGLEVDGVEPAAAKFNGVLVGEVVECGQHPDADKLRVTKINVGGDELLDIVCGAPNCREGI
>ZP_09243405 phenylalanyl-tRNA synthetase beta chain [Pseudoalteromonas sp. BSi20495].
MKFSEKWLREWVNPAIDTQALSEQLSMAGLEVDGVEPAAAKFNGVVVGEVVECGQHPDADKLRVTKINVGGDELLDIVCGAANCRLGI

Example output file

>EKF64793 phenylalanine--tRNA ligase, beta subunit [Serratia plymuthica A30].
MKFSELWLREWVNPAISSEALSDQITMAGLEVDGVEPIAGVFNGVVVGHVVECGQHPNADKLRVTKVNVGGDRLLDIVCGAPNCRTGL
>ZP_07379498 phenylalanyl-tRNA synthetase, beta subunit [Pantoea sp. aB].
MKFSELWLREWVNPALDSAALSEQITMAGLEVDGVEPVAGAFHGVVVGEVVECGQHPNADKLRVTKINVGGERLLDIVCGAPNCRQG
>ZP_04615044 Phenylalanyl-tRNA synthetase beta chain [Yersinia ruckeri ATCC 29473].
MKFSELWLREWVNPAISSDELAHQITMAGLEVDGVEAVAGEFNGVVVGEVVECGQHPNADKLRVTKVNVGGERLLDIVCGAPNCRQG
>ZP_10294785 phenylalanyl-tRNA ligase subunit beta [Pseudoalteromonas rubra ATCC 29570].
MKFSEKWLREWVNPAIDTEALSEQLSMAGLEVDGVDPVAGDFEGVVIGEVVECGQHPDADKLRVTKVNVGEDELLDIVCGAANCRTG
>ZP_09390203 phenylalanine--tRNA ligase, beta subunit [Yokenella regensburgei ATCC 43003].
MKFSELWLREWVNPAVDSEALSDQITMAGLEVDGVEPVAGEFHGVVVGEVVECGQHPNADKLRVTKINVGGERLL
>ZP_09375718 phenylalanine--tRNA ligase, beta subunit [Hafnia alvei ATCC 51873].
MKFSELWLREWVNPAISSEALSEQITMAGLEVDGVEPVAGEFNGVFVGEVVECGQHPNADKLRVTKVNVGGERLLD
>CBX80523 phenylalanyl-tRNA synthetase, beta subunit [Erwinia amylovora ATCC BAA-2158].
MKFSELWLREWVSPAIDSAALCEQITMAGLEVDGVDAVAGAFHGVVVGDVVECAQHPNADKLRVTKINVGGDRLLDI
>ZP_08825428 Phenylalanyl-tRNA synthetase beta chain [Thiorhodococcus drewsii AZ1].
MRFSEAWLREWVNPPVDTQQLADQLSMAGLEVDAVEPAASAFSGVFVGLVRAIAPHPDAAKLRICSVDVGQGDPLQIICGAANVAEGMRVPVATIGARLPGDFKIKRAKLRGVESFGMICSAKELGLAESSDGILPLPADAPLGEDFRAWLALDDQCIEVDLTPDRG
>ZP_10495587 phenylalanyl-tRNA synthetase subunit beta [Alishewanella aestuarii B11].
MKFSESWLREWVNPALDSTALSEQLSMAGLEVDGMDKVAGDFHGVVVGEVVECGKHPEADKLQVTKVNIGGAELLDIVCGARNCRLGLKVAVATVGAVLPGNFEIKQAKLRGQPSHGMLCSFSELGMADDSDGIIELPADAPIGQDLRQYLALDDLSIEVDLTPND
>ZP_10115070 phenylalanyl-tRNA synthetase, beta subunit [Beggiatoa alba B18LD].
MKFSEQWLRTWVNPQMTTTELVDCLTMAGLEVDDVETVAPAFDNVVVGEVLTIERHPDAEKLKVCQVNTGTESPLTIVCGASNVQAG
>ZP_10063574 phenylalanyl-tRNA synthetase subunit beta [gamma proteobacterium BDW918].
MKFSEQWLREWVNPAVGTDELAAQITMAGLEVDAIDPVAGVFSGVVVAEIVATAPHPDAEKLQVCRVNAGSEEVQIVCGAANARPGIKVPLATLGAVLPGDFKIKKAKLRGVESFGMLCAEEELGLAEKSDGLMELPLDAPVGEDIRVFLGLDDSIIELGLTPNRADC
>ZP_09228799 phenylalanyl-tRNA synthetase beta chain [Pseudoalteromonas sp. BSi20311].
MKFSEKWLREWVNPAIDTQALSEQLSMAGLEVDGVEPAAAKFNGVVVGEVIECGQHPDADKLRVTKINVGGDELLDIVCGAPNCRQGI
protein • 2.0k views
11.4 years ago

This will do the trick:

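import re
from Bio import SeqIO

# Genus = first word inside the square brackets of the FASTA description line
genus_pattern = re.compile(r'\s\[([^\s]+)\s')
seen_genera = set()

with open('redundant_genus.fa') as fh, open('nonredundant_genus.fa', 'w') as out:
    for seq_record in SeqIO.parse(fh, 'fasta'):
        match = genus_pattern.search(seq_record.description)
        if match is None:
            continue  # header has no [organism] tag, so the record is skipped
        genus = match.group(1)
        # Write only the first record encountered for each genus
        if genus not in seen_genera:
            seen_genera.add(genus)
            out.write(seq_record.format('fasta'))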
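
If Biopython is not available, the same first-seen-per-genus filter can be sketched with the standard library alone. This is only an illustration, not a replacement for a real FASTA parser: the function name dedupe_by_genus is hypothetical, and it assumes every header starts with '>' on its own line and carries the organism in square brackets, with the genus as the first word.

```python
import re


def dedupe_by_genus(fasta_text):
    """Return FASTA text keeping only the first record seen for each genus."""
    seen = set()
    kept = []
    keep_current = False
    for line in fasta_text.splitlines():
        if line.startswith('>'):
            # Assumption: the genus is the first word inside [...] in the header
            match = re.search(r'\[(\S+)', line)
            genus = match.group(1) if match else None
            # Records without an [organism] tag are dropped
            keep_current = genus is not None and genus not in seen
            if keep_current:
                seen.add(genus)
        # Header and sequence lines follow the decision made at the header
        if keep_current:
            kept.append(line)
    return '\n'.join(kept) + ('\n' if kept else '')
```

To keep the first record in sorted order rather than file order, the records would need to be sorted before this filter is applied.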
11.4 years ago
macmath ▴ 170

a.zielezinski, thank you for your suggestion.
