Question: Removing illegal Characters from an Alignment file for RaxML
gravatar for apulunuj
2.2 years ago by
apulunuj20 wrote:

Hi everyone,

I am trying to run raxml on a directory of alignments. Unfortunately, my alignment sequence ID's contain illegal characters( the characters " ; " , " : " and " , "). I figured that I could make a regular expression to match these characters and replace them with an underscore.

I don't have any experience with regular expressions and would like help on this?

Thanks in advance!

unix python • 907 views
ADD COMMENTlink modified 2.2 years ago by Alex Reynolds30k • written 2.2 years ago by apulunuj20
gravatar for Alex Reynolds
2.2 years ago by
Alex Reynolds30k
Seattle, WA USA
Alex Reynolds30k wrote:

This demonstrates a regex pattern that may help with replacements:

$ echo "foo;bar:baz,bing\$beep*bop;bloop" | sed -e 's/[;|:|,]/_/g'

That [;|:|,] is a pattern specifying matches on those characters. The underscore character is what sed replaces them with when there is a match, or "hit". The g specifier does a global replacement, where there are multiple hits. Without it, sed will replace the first hit and leave the rest untouched, e.g., see the following result and compare it with the first result:

$ echo "foo;bar:baz,bing\$beep*bop;bloop" | sed -e 's/[;|:|,]/_/'

If you are using Python, take a look at the re library:

>>> import re
>>> str = "foo;bar:baz,bing\$beep*bop;bloop"
>>> re.sub('[;|:|,]', '_', str)
ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by Alex Reynolds30k

Thanks Alex much appreciated

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by apulunuj20

Could use this to look at an entire directory as well? @Alex or am I not understanding how to use the sed command?

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by apulunuj20

Perhaps combine a for loop with walk, to loop over all files in a directory:

Inside the loop, you could perhaps run a function to modify the alignments:

  1. If alignments are in BAM format, convert to SAM (use subprocess.check_call in Python).
  2. Open the SAM file to read line by line (with open("reads.sam", "r") as in, open("readsModified.sam", "w") as out: etc.).
  3. Per line, break up the line into elements, rename sequence IDs in the specific field containing the ID (QNAME?) using the regex substitution. You probably don't want to run the regex on the entire line as it will likely corrupt the quality string and possibly the attributes, as well as the record.
  4. Write the modified SAM line out to the readsModified.sam file.
  5. Convert the modified SAM file back to BAM, using different filenames at each write and conversion step, so that you don't clobber the original files.
  6. Cleanup: remove the SAM files, as they aren't needed any longer.

The biostars search engine can help with how to convert BAM->SAM and back again. Stack Overflow and the answer above for the rest.

ADD REPLYlink written 2.2 years ago by Alex Reynolds30k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1466 users visited in the last hour