Question: How to extract upper/lowercase from fasta to output position and +/-
MJS wrote, 3.1 years ago:

I'm looking for help on how to extract uppercase/lowercase information from a multi-FASTA file. I have been using the 'SNPable regions' method to mask a non-reference genome (http://lh3lh3.users.sourceforge.net/snpable.shtml), so at present my data is in the following format:

>Contig1
TCCcaccgcaacgaactg
>Contig2
ggTCgtgaagtaaaaaaa
>etc...

I want to compare this to another masking approach and need to reformat it. For a simple comparison, I'd like it formatted as contig / position / + for uppercase or - for lowercase, i.e.:

Contig1 1 +
Contig1 2 +
Contig1 3 +
Contig1 4 -
Contig1 5 -
.......
.......
Contig2 1 -
Contig2 2 -
Contig2 3 +
Contig2 4 +
Contig2 5 -
....
....

Any simple fix?

michael.ante wrote, 3.1 years ago:

A quick and simple solution is to run awk on the FASTA file: split each sequence line into an array of characters and check with a regex whether each character is lowercase or not:

awk '{if(NR%2==1){id=substr($1,2)};if(NR%2==0){n=split($0,a,""); for(i=1;i<=n;i++){if(match(a[i],/[a-z]/)){x="-"}else{x="+"}; printf "%s\t%d\t%s\n" , id,i,x}}}' test.fa

This will result in:

Contig1 1       +
Contig1 2       +
Contig1 3       +
Contig1 4       -
Contig1 5       -
...
Contig1 16      -
Contig1 17      -
Contig1 18      -
Contig2 1       -
Contig2 2       -
Contig2 3       +
Contig2 4       +
Contig2 5       -
Contig2 6       -
...

Note that this only works if your FASTA file is structured like the one in your example, i.e. each header line is followed by exactly one sequence line.

[Edit:] The script had an off-by-one error (the loop started at i=0 instead of i=1).
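If the sequences are wrapped over several lines, the NR%2 test above no longer applies. A minimal sketch of a variant, assuming only that header lines start with ">": it restarts a position counter at each header and keeps counting across the sequence lines that follow. The function name fasta_case_positions is made up for illustration, and the case test uses toupper() instead of the regex above, which should give the same +/- calls.

```shell
# Hypothetical helper (the name is an assumption, not part of the original answer).
# Reads FASTA from the given files, or from stdin when no file is given.
fasta_case_positions() {
  awk '
    /^>/ {                       # header line starts a new record
      id = substr($1, 2)         # drop ">" and keep the first word as the ID
      pos = 0                    # restart the position counter
      next
    }
    {
      n = length($0)
      for (i = 1; i <= n; i++) {
        c = substr($0, i, 1)
        # a lowercase letter differs from its uppercase form: "-", otherwise "+"
        printf "%s\t%d\t%s\n", id, ++pos, (c != toupper(c) ? "-" : "+")
      }
    }
  ' "$@"
}

# Usage (output file name is only an example):
#   fasta_case_positions test.fa > positions.tsv
```

Because the counter only resets on a header, the line width of the input should not matter.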

MJS wrote, 3.1 years ago:

Thanks a lot Michael, very elegant solution :)

Ahh, you are right about the structure of the FASTA file; I oversimplified it. It is actually a multi-FASTA file with a line break after every 60 nt, so your quick fix only works for the first sequence line following each >Contig header. It is a very standard file, like so:

>Contig1
ggaccagagaggttctcaccttcagtgcggcgatgaagttgtgTCCCGtCatcccgtcac
cagacatccacccgttgctgcgccaatcacagaactccttaaagccagttccctgatatg
acgccaaaaacttggcttctcgggctgctgcccgcgcctttcttgaagcgttcaacccgg
>Contig2
aacgcgtgttcgttgctgctgttgggttatgcaGTTTTGACCGTGGCGCAAAATACAAGA
AGCATAGCGCAAAGTGACGTTATTTAGCGATCAGTGAACACGCGAGCATTGACTAACGGA
AAAAGGGAAAAAGCATACGTACTGCTAACGCAGGCGCTCAGCCTGACGAAGGCGACGCGT
>etc
...

Could a quick fix to your solution handle this? Otherwise, should I remove the line breaks within each sequence and rerun it?


Search for "linearize fasta" and pipe the output of that into michael.ante's solution.

written 3.1 years ago by 5heikki

Correct, I'd go for the FASTA formatter from the FASTX-Toolkit.

written 3.1 years ago by michael.ante
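For readers without a dedicated tool at hand, linearizing can also be sketched in plain awk. This is only an illustration, assuming headers start with ">" and everything up to the next header belongs to one sequence; the function name linearize_fasta is made up here.

```shell
# Hypothetical helper (the name is an assumption): joins wrapped sequence
# lines so that every header is followed by exactly one sequence line.
linearize_fasta() {
  awk '
    /^>/ {
      if (seen) printf "\n"      # close the previous sequence line
      seen = 1
      print                      # header stays on its own line
      next
    }
    { printf "%s", $0 }          # append sequence chunk without a newline
    END { if (seen) printf "\n" }
  ' "$@"
}

# Usage: feed the result to the two-line awk solution from the answer above,
# which then reads from stdin instead of test.fa:
#   linearize_fasta test.fa | awk "<one-liner from the answer above>"
```

After this step each record occupies two lines, which is the layout the NR%2 approach expects.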

Thank you for the help

written 3.1 years ago by MJS