Question

How to extract uppercase/lowercase information from a FASTA file and output position and +/-

8.0 years ago

MJS • 0

I'm looking for help on how to extract uppercase/lowercase information from a multi-FASTA file. I have been using the 'SNPable regions' method to mask a non-reference genome (http://lh3lh3.users.sourceforge.net/snpable.shtml), so my data is currently in the following format:

>Contig1
TCCcaccgcaacgaactg
>Contig2
ggTCgtgaagtaaaaaaa
>etc...

I want to compare this to another masking approach and need to reformat it. For a simple comparison, I'd like it formatted as contig / position / + for uppercase or - for lowercase, i.e.:

Contig1 1 +
Contig1 2 +
Contig1 3 +
Contig1 4 -
Contig1 5 -
.......
.......
Contig2 1 -
Contig2 2 -
Contig2 3 +
Contig2 4 +
Contig2 5 -
....
....

Any simple fix?

uppercase lowercase fasta extract • 3.1k views

score 2 · Answer 1 · 2016-04-20

A quick and simple solution is to run awk on the FASTA file: split each sequence line into an array of characters and check with a regex whether each character is lowercase:

awk '{if(NR%2==1){id=substr($1,2)};if(NR%2==0){n=split($0,a,""); for(i=1;i<=n;i++){if(match(a[i],/[a-z]/)){x="-"}else{x="+"}; printf "%s\t%d\t%s\n" , id,i,x}}}' test.fa

This will result in:

Contig1 1       +
Contig1 2       +
Contig1 3       +
Contig1 4       -
Contig1 5       -
...
Contig1 16      -
Contig1 17      -
Contig1 18      -
Contig2 1       -
Contig2 2       -
Contig2 3       +
Contig2 4       +
Contig2 5       -
Contig2 6       -
...

Note that this will only work if your FASTA file is structured like the one in your example, i.e. with each sequence on a single line directly after its header.

[Edit:] The script had an off-by-one error (i=0 instead of i=1).
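
For FASTA files whose sequences wrap over several lines, one possible variant keeps a running position per contig instead of relying on alternating header/sequence lines. This is only a minimal sketch, not the command from the answer: the function name `fasta_case_positions` is hypothetical, and it assumes that every character that is not a lowercase letter (including N) should be reported as +, which mirrors the behaviour of the one-liner above.

```shell
# Hypothetical helper: print "contig <TAB> position <TAB> +/-" for every base.
# Assumes header lines start with ">" and the contig name is the first word.
fasta_case_positions() {
  awk '
    # Header line: remember the contig name and restart the position counter.
    /^>/ { id = substr($1, 2); pos = 0; next }
    # Sequence line: walk over the characters, continuing the running position.
    {
      n = length($0)
      for (i = 1; i <= n; i++) {
        c = substr($0, i, 1)
        # A character that changes under toupper() is a lowercase letter.
        printf "%s\t%d\t%s\n", id, ++pos, (c != toupper(c) ? "-" : "+")
      }
    }
  ' "$@"
}

# Usage (file name taken from the answer above):
#   fasta_case_positions test.fa
```

Because the position counter is reset only on header lines, the number of sequence lines per contig does not matter.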

score 0 · Answer 2 · 2016-04-20

MJS • 0

Thanks a lot Michael, very elegant solution :)

Ahh, you are correct with regard to the structure of the FASTA file. I oversimplified it. In truth it is a multi-FASTA file with a line break every 60 nt, so your quick fix only works for the first sequence line following >Contig. It is a very standard file, like so:

>Contig1
ggaccagagaggttctcaccttcagtgcggcgatgaagttgtgTCCCGtCatcccgtcac
cagacatccacccgttgctgcgccaatcacagaactccttaaagccagttccctgatatg
acgccaaaaacttggcttctcgggctgctgcccgcgcctttcttgaagcgttcaacccgg
>Contig2
aacgcgtgttcgttgctgctgttgggttatgcaGTTTTGACCGTGGCGCAAAATACAAGA
AGCATAGCGCAAAGTGACGTTATTTAGCGATCAGTGAACACGCGAGCATTGACTAACGGA
AAAAGGGAAAAAGCATACGTACTGCTAACGCAGGCGCTCAGCCTGACGAAGGCGACGCGT
>etc
...

Could a quick change to your solution handle this? Otherwise, should I remove all line breaks within the sequences and rerun it?

Search for "linearize fasta" and pipe the output of that into michael.ante's solution.

ADD REPLY • link 8.0 years ago by 5heikki 11k

Correct, I'd go for the FASTA Formatter from the FASTX-Toolkit.

ADD REPLY • link 8.0 years ago by michael.ante ★ 3.8k
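
The "linearize, then pipe" idea from the two replies above could be sketched in plain awk as follows. This is an assumption-laden illustration rather than the tool either commenter had in mind: the function name `linearize_fasta` is hypothetical, and it assumes the input starts with a header line and that header lines begin with ">".

```shell
# Hypothetical helper: join wrapped sequence lines so that each record
# becomes exactly two lines (header, then the full sequence).
linearize_fasta() {
  awk '
    # Header: close the previous sequence line (if any), then print the header.
    /^>/ { if (NR > 1) printf "\n"; print; next }
    # Sequence line: append it without a trailing newline.
    { printf "%s", $0 }
    # Terminate the final sequence line.
    END { if (NR > 0) printf "\n" }
  ' "$@"
}

# Usage: the two-line output can be piped into the awk command from
# Answer 1, dropping its file argument so that it reads standard input:
#   linearize_fasta test.fa | awk '...'
```

The result has the header/sequence alternation that the NR%2 test in Answer 1 relies on.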

Thank you for the help

ADD REPLY • link 8.0 years ago by MJS • 0