Question: Rename Duplicate ID's for Different Fasta Sequences on Windows 7
0
gravatar for Bara'a
5.6 years ago by
Bara'a250
Amman - Jordan
Bara'a250 wrote:

Hi guys :D

I'm working with distance matrices produced by clustal omega for moderately large fasta files combining sequences of two different plant species in each .

When I was about to finish the script and code the final pipeline step ; which is retrieving the actual sequences corresponding to ID's given in the distance matrices using the biopython function SeqIO.index() ... I realized that the original fasta files have duplicate ID's for different sequences resulting from different positions of SSR's on the same sequence , in which I extracted the left and right flanking regions for each SSR .

Traceback (most recent call last):
  File "C:\Users\Al-Hammad\Desktop\Test Sample\dictionary.py", line 9, in <module>
    dictionary=SeqIO.index("Left(Brachypodium_Brachypodium).fasta","fasta",IUPAC.unambiguous_dna)
  File "C:\Python34\lib\site-packages\Bio\SeqIO\__init__.py", line 856, in index
    key_function, repr, "SeqRecord")
  File "C:\Python34\lib\site-packages\Bio\File.py", line 275, in __init__
    raise ValueError("Duplicate key '%s'" % key)
ValueError: Duplicate key 'BRADI5G06067.1'

Tool completed with exit code 1

Here's a sample of one of my files :

>BRADI5G06067.1 cdna:novel chromosome:v1.0:5:7642747:7642899:-1 gene:BRADI5G06067 transcript:BRADI5G06067.1 description:"" Startpos_in_parent=24 Startpos_here=24 Length=26 
ATGTATCTCCAACAACAACAACA 
>BRADI5G06067.1 cdna:novel chromosome:v1.0:5:7642747:7642899:-1 gene:BRADI5G06067 transcript:BRADI5G06067.1 description:"" Startpos_in_parent=54 Startpos_here=54 Length=34 
ATGTATCTCCAACAACAACAACAACGACGACGACGACGACGACGACGACAACG 
>BRADI5G06067.1 cdna:novel chromosome:v1.0:5:7642747:7642899:-1 gene:BRADI5G06067 transcript:BRADI5G06067.1 description:"" Startpos_in_parent=102 Startpos_here=102 Length=26 ATGTATCTCCAACAACAACAACAACGACGACGACGACGACGACGACGACAACGACAACAACAACAACAACAACAACAACAACAACAAGAACGACGACGACG

My question is : what is the best , safest and most efficient way to rename the duplicate ID's for different sequences ?! and do I have to recompute the distance matrices again with the unique ID's  after renaming or can I simply map the duplicates with their corresponding new unique values on the surface ?!

I'm really confused about that , and a little worried about the recomputing if considered since it's time consuming and takes nearly 4 days to produce the matrices .

I found this : http://stackoverflow.com/questions/7815553/iterate-through-fasta-entries-and-rename-duplicates/7836747#7836747 but it wasn't useful in my case , I'm working on a windows 7 64bit platform and python 3.4

Also I found this : Is There A Way To Skip Existing Keys In Seq.Io.To_Dict? Or Is There A Better Way Altogether? but I believe it was the opposite of my case , I tried it though and ran on my files infinitely !! It wasn't that clear to me , for my bad luck :\

I desperately need this :( 😔

Any help would be appreciated , thanks in advance .

 
 
 
ADD COMMENTlink modified 5.6 years ago • written 5.6 years ago by Bara'a250
4
gravatar for RamRS
5.6 years ago by
RamRS30k
Baylor College of Medicine, Houston, TX
RamRS30k wrote:

For the moment, I'd suggest replacing blank spaces in the header with double underscore characters (just a placeholder, you can use anything but underscores are among the most docile of characters)

You can install GnuWin32 tools for windows. Then, use this command:

sed -re "/^>/s/ /__/g" input.fa | sed -e "s/\_\_$//" >trInput.fa
ADD COMMENTlink modified 11 months ago • written 5.6 years ago by RamRS30k

@RamRS ... Would you please explain more?!

I didn't get it clearly of how replacing with double underscore would solve the problem!!

ADD REPLYlink modified 11 months ago by RamRS30k • written 5.6 years ago by Bara'a250
1

With black spaces, FASTA parsers split the header into ID and description at the first blank space. When no spaces are encountered in the header, the entire header is taken as the ID.

All your headers are unique, the ID segments parsed by the FASTA parser are not. Replaced with underscores, the IDs become unique and get rid of the duplicate ID problem.

ADD REPLYlink written 5.6 years ago by RamRS30k

@RamRS... So, you're saying that following the approach you provided the header will be taken as whole to be unique, if so then can I access the ID segments later by str.split("__") or simply seq.id?! I need them to be written to a csv file as a reference to their sequences.

BTW: how can I do what you suggested in python?! obviously, I can't do that manually.

ADD REPLYlink modified 11 months ago by RamRS30k • written 5.6 years ago by Bara'a250
2

Like Brian said, step line by line in the file, if the line starts with a '>', replace blank spaces with __ and write to output file. Else, write input line as such.

And you can access the individual segments by seq.id.split('__')

Also, if you're doing bioinformatics on Windows, you're doing it wrong. Dual boot Ubuntu or install it on a VM. It will mark your official entry into the world of bioinformatics :)

ADD REPLYlink modified 5.6 years ago • written 5.6 years ago by RamRS30k

@RamRS ... I managed to solve this issue below based on your guidance and instructions , seriously I can't thank you enough :)

Regarding your golden advice about Ubuntu and bioinformatics best practices and supportive platforms ... I really appreciate it and will take it into serious consideration in my next Phd. research with GOD willing :D

ADD REPLYlink modified 11 months ago by RamRS30k • written 5.6 years ago by Bara'a250
2

You can use the "key_function" parameter of SeqIO.to_dict to substitute the white space with double underscores.

from Bio import SeqIO
handle = open("yourFile.fa", "rU")
record_dict = SeqIO.to_dict(SeqIO.parse(handle, "fasta"), key_function = lambda rec : rec.description.replace(" ", "__"))
handle.close()
print(sorted(record_dict))
ADD REPLYlink modified 11 months ago by RamRS30k • written 5.6 years ago by Siva1.7k
1

This looks like a good option, Siva. But this will still parse duplicate IDs and a description with spaces replaced by __s. I think you might need to join ID and desc with a __ as well.

record_dict = SeqIO.to_dict(SeqIO.parse(handle, "fasta"), key_function = lambda rec : rec.id + "__" + rec.description.replace(" ", "__")
ADD REPLYlink modified 11 months ago • written 5.6 years ago by RamRS30k
1

Thanks for the input, Ram. What you are saying makes sense. However, it seems the whole FASTA header line except the > symbol is read as description (without the need for concatenating the ID). I used the sample FASTA sequences from the OP and this is the output of

print(sorted(record_dict))

['BRADI5G06067.1__cdna:novel__chromosome:v1.0:5:7642747:7642899:-1__gene:BRADI5G06067__transcript:BRADI5G06067.1__description:""__Startpos_in_parent=102__Startpos_here=102__Length=26', 'BRADI5G06067.1__cdna:novel__chromosome:v1.0:5:7642747:7642899:-1__gene:BRADI5G06067__transcript:BRADI5G06067.1__description:""__Startpos_in_parent=24__Startpos_here=24__Length=26', 'BRADI5G06067.1__cdna:novel__chromosome:v1.0:5:7642747:7642899:-1__gene:BRADI5G06067__transcript:BRADI5G06067.1__description:""__Startpos_in_parent=54__Startpos_here=54__Length=34']

And this is the output of printing the value of first key (EDIT: first sequence in the example. I had to escape the double quotes)

print record_dict["BRADI5G06067.1__cdna:novel__chromosome:v1.0:5:7642747:7642899:-1__gene:BRADI5G06067__transcript:BRADI5G06067.1__description:\"\"__Startpos_in_parent=24__Startpos_here
=24__Length=26"]

ID: BRADI5G06067.1
Name: BRADI5G06067.1
Description: BRADI5G06067.1 cdna:novel chromosome:v1.0:5:7642747:7642899:-1 gene:BRADI5G06067 transcript:BRADI5G06067.1 description:"" Startpos_in_parent=24 Startpos_here=24 Length=26
Number of features: 0
Seq('ATGTATCTCCAACAACAACAACA', SingleLetterAlphabet())
ADD REPLYlink modified 11 months ago by RamRS30k • written 5.6 years ago by Siva1.7k
1

That is so weird! This is the first time I'm seeing ID being recorded as part of the description!

ADD REPLYlink written 5.6 years ago by RamRS30k

@RamRS ... why do you think that's happening in your opinion ?!

ADD REPLYlink modified 11 months ago by RamRS30k • written 5.6 years ago by Bara'a250
1

I wish I knew. I'll probably take a look into the why when I get some free time. Thank you for this unique case, BTW!

ADD REPLYlink written 5.6 years ago by RamRS30k

@Siva ... I truly appreciate your effort in this solution, it seems very promising.

I will try it on my files some other time when I'm not in a rush for fast solutions.

Many thanks.

ADD REPLYlink modified 11 months ago by RamRS30k • written 5.6 years ago by Bara'a250
3
gravatar for Brian Bushnell
5.6 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

You can do this with the BBMap package's reformat tool. If a file already has unique names it won't do anything. Otherwise, it will append _2 to the second instance of a name, and so forth.

In Linux:

reformat.sh in=sequences.fasta out=renamed.fasta uniquenames

In Windows you would replace "reformat.sh" like this:

java -ea -Xmx1g -cp C:\bbmap\current\ jgi.ReformatReads in=sequences.fasta out=renamed.fasta uniquenames

...where that would be the correct command if you extracted BBMap to the C: root.

Edit - looking back, I'm not sure if this is relevant or not. My comment is for situations where you actually have identical names. If the problem is just that a parser is truncating your names due to the presence of whitespace, then RamRS's suggestion of replacing whitespace with underscores would be the way to go.

ADD COMMENTlink modified 11 months ago by RamRS30k • written 5.6 years ago by Brian Bushnell17k

@BrianBushnell ... I don't think it's a matter of truncating name's whitespaces, it's about identical names (identifiers) for different sequences.

I need them to be renamed, so I can refer to them as dictionaries ... that's all.

Is there anyway rather than installing external tools ... python related solutions, maybe?! like modules with renaming facilities .

ADD REPLYlink modified 11 months ago by RamRS30k • written 5.6 years ago by Bara'a250
1

If you don't want to install addition tools, then you could just write a program step through the file line-by-line, and add "_(line number)" to lines that start with ">".

ADD REPLYlink written 5.6 years ago by Brian Bushnell17k

@Brian Bushnell ... that's exactly how I solved this issue, thank you very much :)

ADD REPLYlink modified 11 months ago by RamRS30k • written 5.6 years ago by Bara'a250
3
gravatar for Bara'a
5.6 years ago by
Bara'a250
Amman - Jordan
Bara'a250 wrote:

So , After looking at the options I have so far from the answers above ... I picked this approach to solve the issue for the reasons listed below and it works very fine for me :D

from Bio import SeqIO

from Bio.Alphabet import IUPAC

files=["Left(Aestivum_Japonica).fasta" , "Right(Aestivum_Japonica).fasta"]

for i in range(len(files)):
    output_handle = open("$"+files[I],"w")
    for idx,record in enumerate(SeqIO.parse(files[I],"fasta",IUPAC.unambiguous_dna)):
        record.id=record.id+"_"+str(idx)
        SeqIO.write(record,output_handle,"fasta")
    output_handle.close()

dictionary=SeqIO.index("$Left(Brachypodium_Brachypodium).fasta","fasta",IUPAC.unambiguous_dna)

Reasons why I picked this solution:

  1. It's more robust, integrated and consistent with the pythonic complicated pipeline I'm working on.
  2. It's more safe - to me - to rename the duplicates than taking the risk of assigning the whole header as a unique identifier like suggested by RamRS (I hesitated 'cause I also found it a little confusing to see the ID as part of the description)!!
  3. Renaming all sequences based on sequence No. in the file is much faster and more efficient - to me - than checking every single id in the file for duplicates and renaming with unique ones.

I'm pretty sure that the other provided solutions will work just as expected, I haven't tried them though.

Again , thank you guys so much ... I'm so grateful for your help.

ADD COMMENTlink modified 11 months ago by RamRS30k • written 5.6 years ago by Bara'a250
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 734 users visited in the last hour