using awk: can handle multiple Ns, case insensitive for Ns:
$ awk '{gsub("[Nn]+","\n>\n");}1' test.fa | awk '/^>/ {if ($0 == ">") {$0=prev} prev=$0}1' | awk '/^>/ {getline seq} {if(seq!="") {print $0"\n"seq}}' | awk '(/^>/ && a[$0]++) {$0=$0"_"a[$0]}1'
>chr1
ATCGCTGATCGAT
>chr1_2
CTAGCTAGCTAGCGTAGCTGCTAG
>chr1_3
CGTAGCCGTAGCT
>chr2
CCTGA
>chr2_2
GTAGT
>chr3
ATG
>chr4
C
>chr4_2
T
input:
$ cat test.fa
>chr1
ATCGCTGATCGATNCTAGCTAGCTAGCGTAGCTGCTAGNCGTAGCCGTAGCT
>chr2
CCTGANGTAGT
>chr3
ATGNNNNNN
>chr4
NNNNCnT
make sure that sequences are single line. If they are not, use following command instead of above line:
$ seqkit seq -w0 test.fa | awk '{gsub("[Nn]+","\n>\n");}1' | awk '/^>/ {if ($0 == ">") {$0=prev} prev=$0}1' | awk '/^>/ {getline seq} {if(seq!="") {print $0"\n"seq}}' | awk '(/^>/ && a[$0]++) {$0=$0"_"a[$0]}1'
Download seqkit from here.
•
link
modified 8 months ago
•
written
8 months ago by
cpad0112 ♦ 11k
What do you want to happen when you reach 100 Ns in a row?
Sorry,I have forget it. if there exist many N base , we also want to split it util there have 'A','T','G'or 'C' base