Question: Removing illegal Characters from an Alignment file for RaxML
gravatar for apulunuj
18 months ago by
apulunuj0 wrote:

Hi everyone,

I am trying to run raxml on a directory of alignments. Unfortunately, my alignment sequence ID's contain illegal characters( the characters " ; " , " : " and " , "). I figured that I could make a regular expression to match these characters and replace them with an underscore.

I don't have any experience with regular expressions and would like help on this?

Thanks in advance!

unix python • 651 views
ADD COMMENTlink modified 18 months ago by Alex Reynolds29k • written 18 months ago by apulunuj0
gravatar for Alex Reynolds
18 months ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

This demonstrates a regex pattern that may help with replacements:

$ echo "foo;bar:baz,bing\$beep*bop;bloop" | sed -e 's/[;|:|,]/_/g'

That [;|:|,] is a pattern specifying matches on those characters. The underscore character is what sed replaces them with when there is a match, or "hit". The g specifier does a global replacement, where there are multiple hits. Without it, sed will replace the first hit and leave the rest untouched, e.g., see the following result and compare it with the first result:

$ echo "foo;bar:baz,bing\$beep*bop;bloop" | sed -e 's/[;|:|,]/_/'

If you are using Python, take a look at the re library:

>>> import re
>>> str = "foo;bar:baz,bing\$beep*bop;bloop"
>>> re.sub('[;|:|,]', '_', str)
ADD COMMENTlink modified 18 months ago • written 18 months ago by Alex Reynolds29k

Thanks Alex much appreciated

ADD REPLYlink modified 18 months ago • written 18 months ago by apulunuj0

Could use this to look at an entire directory as well? @Alex or am I not understanding how to use the sed command?

ADD REPLYlink modified 18 months ago • written 18 months ago by apulunuj0

Perhaps combine a for loop with walk, to loop over all files in a directory:

Inside the loop, you could perhaps run a function to modify the alignments:

  1. If alignments are in BAM format, convert to SAM (use subprocess.check_call in Python).
  2. Open the SAM file to read line by line (with open("reads.sam", "r") as in, open("readsModified.sam", "w") as out: etc.).
  3. Per line, break up the line into elements, rename sequence IDs in the specific field containing the ID (QNAME?) using the regex substitution. You probably don't want to run the regex on the entire line as it will likely corrupt the quality string and possibly the attributes, as well as the record.
  4. Write the modified SAM line out to the readsModified.sam file.
  5. Convert the modified SAM file back to BAM, using different filenames at each write and conversion step, so that you don't clobber the original files.
  6. Cleanup: remove the SAM files, as they aren't needed any longer.

The biostars search engine can help with how to convert BAM->SAM and back again. Stack Overflow and the answer above for the rest.

ADD REPLYlink written 18 months ago by Alex Reynolds29k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1895 users visited in the last hour