Hi guys :D
I'm working with distance matrices produced by Clustal Omega for moderately large FASTA files, each of which combines sequences from two different plant species.
I was about to finish the script and code the final pipeline step, which retrieves the actual sequences corresponding to the IDs given in the distance matrices using the Biopython function SeqIO.index().
Then I realized that the original FASTA files contain duplicate IDs for different sequences. The duplicates arise because a single sequence can carry SSRs at several positions, and I extracted the left and right flanking regions for each SSR, so every flank has the same ID as its parent sequence. As a result, SeqIO.index() fails with this error:
Traceback (most recent call last):
  File "C:\Users\Al-Hammad\Desktop\Test Sample\dictionary.py", line 9, in <module>
    dictionary=SeqIO.index("Left(Brachypodium_Brachypodium).fasta","fasta",IUPAC.unambiguous_dna)
  File "C:\Python34\lib\site-packages\Bio\SeqIO\__init__.py", line 856, in index
    key_function, repr, "SeqRecord")
  File "C:\Python34\lib\site-packages\Bio\File.py", line 275, in __init__
    raise ValueError("Duplicate key '%s'" % key)
ValueError: Duplicate key 'BRADI5G06067.1'
Tool completed with exit code 1
Here's a sample from one of my files:
>BRADI5G06067.1 cdna:novel chromosome:v1.0:5:7642747:7642899:-1 gene:BRADI5G06067 transcript:BRADI5G06067.1 description:"" Startpos_in_parent=24 Startpos_here=24 Length=26
ATGTATCTCCAACAACAACAACA
>BRADI5G06067.1 cdna:novel chromosome:v1.0:5:7642747:7642899:-1 gene:BRADI5G06067 transcript:BRADI5G06067.1 description:"" Startpos_in_parent=54 Startpos_here=54 Length=34
ATGTATCTCCAACAACAACAACAACGACGACGACGACGACGACGACGACAACG
>BRADI5G06067.1 cdna:novel chromosome:v1.0:5:7642747:7642899:-1 gene:BRADI5G06067 transcript:BRADI5G06067.1 description:"" Startpos_in_parent=102 Startpos_here=102 Length=26
ATGTATCTCCAACAACAACAACAACGACGACGACGACGACGACGACGACAACGACAACAACAACAACAACAACAACAACAACAACAAGAACGACGACGACG
My question is: what is the best, safest and most efficient way to rename the duplicate IDs for different sequences? And do I have to recompute the distance matrices with the unique IDs after renaming, or can I simply relabel the duplicates in the existing matrices with their new unique IDs?
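To make the renaming part of the question concrete, here is a minimal sketch of what it could look like in plain Python, without Biopython. The function name rename_duplicate_ids and the _1, _2, ... suffix scheme are assumptions made for illustration only. The sketch also assumes that no existing ID already ends in such a suffix, since that would create a new clash.

```python
from collections import Counter


def rename_duplicate_ids(fasta_lines):
    """Return FASTA lines in which every repeated ID gets a _1, _2, ... suffix.

    Hypothetical helper: the ID is taken to be the first word after ">".
    IDs that occur only once are left unchanged.
    """
    lines = [line.rstrip("\r\n") for line in fasta_lines]

    def is_header(line):
        return line.startswith(">") and line[1:].strip() != ""

    def header_body(line):
        # Header text without ">" and without leading spaces.
        return line[1:].lstrip()

    # First pass: count how often each ID occurs in the whole file.
    totals = Counter(header_body(l).split(None, 1)[0] for l in lines if is_header(l))

    # Second pass: number the occurrences of every repeated ID in file order.
    seen = Counter()
    renamed = []
    for line in lines:
        if is_header(line):
            body = header_body(line)
            seq_id = body.split(None, 1)[0]
            if totals[seq_id] > 1:
                seen[seq_id] += 1
                new_id = "%s_%d" % (seq_id, seen[seq_id])
                # Keep the rest of the description exactly as it was.
                line = ">" + new_id + body[len(seq_id):]
        renamed.append(line)
    return renamed


if __name__ == "__main__":
    sample = [
        ">BRADI5G06067.1 Startpos_here=24",
        "ATGTATCTCCAACAACAACAACA",
        ">BRADI5G06067.1 Startpos_here=54",
        "ATGTATCTCCAACAACAACAACAACGACG",
    ]
    print("\n".join(rename_duplicate_ids(sample)))
```

Reading the file with open(...) and writing the returned lines joined by "\n" to a new file would leave the original FASTA untouched. Because the whole file is held in memory, this only suits files that fit in RAM.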
I'm really confused about this, and a little worried about recomputing, since producing the matrices takes nearly 4 days.
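If recomputing can be avoided, relabeling the existing matrix might look like the rough sketch below. It rests on two unverified assumptions: that the matrix file is PHYLIP-like (a first line with the sequence count, then one row per sequence starting with its ID), and that the rows appear in the same order as the records in the FASTA file. If the second assumption does not hold, the numbered labels would point at the wrong sequences. The function name relabel_matrix_rows is hypothetical.

```python
from collections import Counter


def relabel_matrix_rows(matrix_lines):
    """Append _1, _2, ... to repeated row IDs, in order of appearance.

    Assumes (not verified) a PHYLIP-like layout: the first line holds the
    number of sequences and every following row starts with the sequence ID.
    The distance values themselves are never modified.
    """
    lines = [line.rstrip("\r\n") for line in matrix_lines]
    if not lines:
        return []
    count_line = lines[0]
    rows = [line for line in lines[1:] if line.strip()]

    # Count each row ID so that unique IDs can be left alone.
    totals = Counter(row.split(None, 1)[0] for row in rows)

    seen = Counter()
    relabeled = [count_line]
    for row in rows:
        body = row.lstrip()
        row_id = body.split(None, 1)[0]
        if totals[row_id] > 1:
            seen[row_id] += 1
            # Replace only the ID; keep the distances exactly as written.
            row = "%s_%d%s" % (row_id, seen[row_id], body[len(row_id):])
        relabeled.append(row)
    return relabeled


if __name__ == "__main__":
    matrix = [
        "3",
        "BRADI5G06067.1 0.0 0.1 0.2",
        "BRADI5G06067.1 0.1 0.0 0.3",
        "OTHER.1 0.2 0.3 0.0",
    ]
    print("\n".join(relabel_matrix_rows(matrix)))
```

A quick sanity check before trusting the result would be to compare the list of relabeled row IDs with the list of renamed FASTA IDs and confirm that they are identical and in the same order.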
I found this: http://stackoverflow.com/questions/7815553/iterate-through-fasta-entries-and-rename-duplicates/7836747#7836747 but it wasn't useful in my case. I'm working on 64-bit Windows 7 with Python 3.4.
I also found this: Is There A Way To Skip Existing Keys In Seq.Io.To_Dict? Or Is There A Better Way Altogether? but I believe it addresses the opposite of my case. I tried it anyway, and it ran on my files seemingly forever. Unfortunately, it wasn't that clear to me :\
I desperately need this :( 😔
Any help would be appreciated. Thanks in advance.