Question: Removing columns coded by -, N or ? in alignments
Written 15 months ago by NicoN64 (United Kingdom/London/qmul):

Hello,

I was wondering whether there is a way, using Perl or Python, to remove columns from an MSA that contain only gaps (-) and/or N and/or ? (that is, everything except a nucleotide). I need to run it on several MSAs.

seq1  
ATCGNN-??ATCG  
seq2  
ATCGNN--ATCGCG  
seq3  
ATCG-?NCGAAAAA

(Columns 5, 6 and 7 should be removed.)

I know that some tools, such as trimAl, have an option to remove gappy positions, but I don't know whether it can be adapted to target positions that contain only the characters '-', 'N' and '?'. I want to keep ?, N and - in other parts of the alignments, so I don't think the tr command can help.

Thank you


Did you try degap from the scikit-bio Python library? http://scikit-bio.org/docs/0.2.3/generated/skbio.alignment.Alignment.degap.html

modified 15 months ago • written 15 months ago by cpad0112 (12k)

Your problem is sophisticated enough to warrant a custom script. You can use a couple of Python dicts for this: one to hold id-sequence key-value pairs, and the other to hold position and unique-bases-at-that-position key-value pairs. The second dict is built from the first. By filtering the second dict, you can pick the positions to keep and write out those positions alone.

A word of caution about this approach: you'll run into problems with long sequences or when the MSA has a ton of sequences, not to mention sequences of unequal length, which you'll have to handle.
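The two-dict idea described above could be sketched roughly as follows. This is only an illustration, not a tested tool: the function name `drop_missing_columns` and the `MISSING` set are made up for this example, the sequences are the ones from the question, and reading or writing alignment files is left out. Sequences shorter than the longest one are assumed to simply have no character at the trailing positions.

```python
# Characters treated as "not a nucleotide" (taken from the question).
MISSING = set("-N?")


def drop_missing_columns(records, missing=MISSING):
    """Return a copy of `records` without columns made up only of `missing`.

    records: dict mapping sequence id -> aligned sequence (the first dict).
    """
    if not records:
        return {}
    width = max(len(seq) for seq in records.values())

    # Second dict: position -> set of characters seen at that position.
    # Shorter sequences contribute nothing past their own end.
    column_chars = {pos: set() for pos in range(width)}
    for seq in records.values():
        for pos, char in enumerate(seq):
            column_chars[pos].add(char)

    # Keep a position if at least one character there is not "missing".
    keep = [pos for pos in range(width) if column_chars[pos] - missing]

    return {
        name: "".join(seq[pos] for pos in keep if pos < len(seq))
        for name, seq in records.items()
    }


msa = {
    "seq1": "ATCGNN-??ATCG",
    "seq2": "ATCGNN--ATCGCG",
    "seq3": "ATCG-?NCGAAAAA",
}
for name, seq in drop_missing_columns(msa).items():
    print(name, seq)
# seq1 ATCG??ATCG
# seq2 ATCG-ATCGCG
# seq3 ATCGCGAAAAA
```

Each per-position set holds at most one entry per distinct character, so the second dict grows with the alignment length rather than with the number of sequences; the sequences themselves still have to fit in memory.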

Also, please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

modified 15 months ago • written 15 months ago by RamRS (25k)

You can have a look at this. Alternatively, you may wish to try UGENE for removing gaps. For other characters, it might be possible to convert them into '-' first?

written 15 months ago by Sishuo Wang (190)

You can use the sed command if you just want to remove these characters: sed 's/[N?-]//g' filename

modified 15 months ago • written 15 months ago by cvu (150)

How would sed account for the across-all-sequences ("columns") part? sed can only process one line at a time.
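To make the difference concrete, here is a small Python sketch contrasting per-line deletion (what the sed one-liner does) with column-wise filtering. The rows are the question's sequences cut to their first 9 columns so that all three have the same length; that truncation and the variable names are choices made for this illustration only.

```python
# First 9 columns of the three sequences from the question.
rows = ["ATCGNN-??", "ATCGNN--A", "ATCG-?NCG"]
missing = set("-N?")

# Per-line deletion: every -, N and ? is dropped, one line at a time.
delete_table = str.maketrans("", "", "-N?")
per_line = [row.translate(delete_table) for row in rows]
print(per_line)   # ['ATCG', 'ATCGA', 'ATCGCG'] -> rows no longer line up

# Column-wise filtering: transpose the rows into columns, drop columns
# in which every character is missing, then transpose back into rows.
columns = zip(*rows)
kept = [col for col in columns if not set(col) <= missing]
by_column = ["".join(chars) for chars in zip(*kept)]
print(by_column)  # ['ATCG??', 'ATCG-A', 'ATCGCG'] -> still aligned
```

Note that `zip` stops at the shortest row, so this transposition trick only works as written when all sequences have equal length.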

modified 15 months ago • written 15 months ago by RamRS (25k)