Hi! I have a multi-individual FASTA file, and I'm trying to remove all sites (i.e. nucleotides at the same position across the different sequences) that contain only Ns. Do you know how I can do this? Thanks
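Assuming two lines per FASTA record (one header line, one sequence line):
Linearize the FASTA, print one base per line for each sequence, record the positions that hold an 'N', and keep the positions where 'N' occurs 4 times (once per sequence). Join those positions into a comma-separated list for cut.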
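# CUT = comma-separated, 1-based positions that are 'N' in all 4 sequences (change 4 to your number of sequences)
CUT=$(cat input.fa | paste - - | cut -f2 | while read -r S; do echo "${S}" | fold -w1 | awk '($1=="N") {print NR}' ; done | sort | uniq -c | awk '($1==4) {print $2;}' | paste -sd',' -)
# print each header unchanged and remove the CUT positions from each sequence (--complement needs GNU cut)
cat input.fa | while read -r N; do echo "${N}"; read -r S ; echo "${S}" | cut --complement -c "${CUT}" ; done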
>seq1
AAATTTGNNCN
>seq2
AAAATTCNGCA
>seq3
AAATTAGAGCA
>seq4
AAATTACNNCA
This script leaves the CUT variable empty, so it cannot find the positions where all individuals have missing data (N). Sorry, I'm a beginner, but is it because I have long sequences that span more than one line? I've tried concatenating my sequences, and it still doesn't work. I also tried extracting the headers and sequences from the file using awk (with this command: awk 'BEGIN {RS=">"; ORS=""; FS="\n"} {if (NF>1) {print $2}}'), but with no result. Do you know how to fix my issue? Thanks in advance!
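In case it helps with sequences that span several lines, here is a minimal sketch in plain awk. It is not the script posted above but a different approach: it joins wrapped sequence lines first, then drops every column that is 'N' (or 'n') in all sequences. The function name drop_all_n_columns is made up for illustration, and the sketch assumes all sequences have the same length (i.e. the file is an alignment) and that the file fits in memory.

```shell
# Hypothetical helper: remove alignment columns that are N in every sequence.
# Assumes an aligned FASTA (equal-length sequences); wrapped lines are joined.
drop_all_n_columns() {
  awk '
    # A header line starts a new record.
    /^>/ { name[++n] = $0; next }
    # Any other line is sequence: strip blanks and carriage returns, then append.
    { gsub(/[ \t\r]/, ""); seq[n] = seq[n] $0 }
    END {
      len = length(seq[1])
      # A column is kept as soon as one sequence has something other than N there.
      for (j = 1; j <= len; j++) {
        keep[j] = 0
        for (i = 1; i <= n; i++)
          if (toupper(substr(seq[i], j, 1)) != "N") { keep[j] = 1; break }
      }
      # Print each header followed by its sequence on one line, kept columns only.
      for (i = 1; i <= n; i++) {
        out = ""
        for (j = 1; j <= len; j++)
          if (keep[j]) out = out substr(seq[i], j, 1)
        print name[i]
        print out
      }
    }
  ' "$1"
}
```

Usage would be something like: drop_all_n_columns input.fa > filtered.fa (filtered.fa is just a placeholder name). Note that in the four-sequence example in this thread, seq3 has no N at all, so no position is N in all four sequences and nothing would be removed. If the goal is to drop every column that contains at least one N, the test inside the loop would need to be inverted.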
That's not clear to me. Give us an example.
For example, if I have 4 individuals in my file:
I want to remove positions with 'N' from all my individuals' sequences, and obtain something like this:
And keep my alignment intact.