Question

Remove site with only missing data

0

Entering edit mode

6 months ago

Leane • 0

Hi! I have a multi-individual fasta file, and from this file I'm trying to remove all sites (i.e. nucleotides at the same position in the different fasta files) that contain only Ns. Do you know how I can do this? Thanks

Fasta • 1.0k views

ADD COMMENT • link updated 6 months ago by Pierre Lindenbaum 166k • written 6 months ago by Leane • 0

0

Entering edit mode

that's not clear to me. Give us an example.

ADD REPLY • link 6 months ago by Pierre Lindenbaum 166k

0

Entering edit mode

For example, if I have 4 individuals in my files:

>seq1
AAATTTGNNNNNNCN
>seq2
AAAATTCNNNNNGCA
>seq3
AAATTAGNNNANGCA
>seq4
AAATTACNNNNNNCA

I want to remove positions with 'N' from all my individuals' sequences, and obtain something like this:

>seq1
AAATTTGNNCN
>seq2
AAAATTCNGCA
>seq3
AAATTAGAGCA
>seq4
AAATTACNNCA

And keep my alignment intact.

ADD REPLY • link updated 6 months ago by Pierre Lindenbaum 166k • written 6 months ago by Leane • 0

score 1 · Answer 1 · 2025-03-21

1

Entering edit mode

6 months ago

Pierre Lindenbaum 166k

assuming two lines per fasta record.

linearize the fasta, print one base per sequence, get the places where there is a N, count the places where there is 4 'N'. create a pattern for cut

CUT=`cat input.fa | paste - - | cut -f2 | while read S; do echo "${S}" | fold -w1 | awk '($1=="N") {print NR}' ; done  | sort | uniq -c | awk '($1==4) {print $2;}' | paste -sd','`

cat input.fa | while read N; do echo $N; read S ; echo "${S}" | cut --complement -c "${CUT}" ; done

>seq1
AAATTTGNNCN
>seq2
AAAATTCNGCA
>seq3
AAATTAGAGCA
>seq4
AAATTACNNCA

ADD COMMENT • link 6 months ago by Pierre Lindenbaum 166k

0

Entering edit mode

This script fails to detect the CUT object (which is therefore empty) and as a result, it cannot find the positions where all individuals have missing data (N). Sorry, I'm a beginner, but is it because I have long sequences that span more than one line? I've tried concatenating my sequences, and it still doesn't work. I also tried extracting the headers and sequences of the file using awk (with this command: awk 'BEGIN {RS=">"; ORS=""; FS="\n"} {if (NF>1) {print $2}}'), but with no result. Do you know how to fix my issue? Thanks in advance!

ADD REPLY • link 6 months ago by Leane • 0

0

Entering edit mode

e I have long sequences that span more than one line?

yes, see : to convert a fasta to two-lines records

ADD REPLY • link 6 months ago by Pierre Lindenbaum 166k