Question: converting lowercase letters to uppercase letters in fasta file
0
2.4 years ago by
kk.mahsa100
kk.mahsa100 wrote:

I downloaded an assembled genome from NCBI in .fna format. A sample of its content looks like this:

GAATGTCACGGCGAGTGAAACTACGTAAATTAAATATCTATTGTGAAGAACatgttctaagagtttttttcagagcattc
tgcctcattttcgaatCTAAACTTAGGTAAGAGTTTGAAATAAGGGTAAATGTTTCTTGATGACCATATggcttgtatgg
tggatgaaagttctttAAACCACATGctacaactcagtaatgaatgatTGTCGAATCCGAGATGCATGTAGCGTATTTGA
AACATGGAACATCACAATGtgtgaaactatgtaaattacatatttcttgggtagaactcgctccaagagtaTTTTTCTGC

What do the lowercase letters mean? How can I convert all nucleotides to uppercase in a FASTA file?
Also, the sequence identifiers look like this:
>NW_011509460.1 unplaced genomic scaffold, scaffold730, whole genome shotgun sequence

>NW_011509461.1 unplaced genomic Scaffold670, whole genome shotgun sequence

and I want to convert them to:

>NW_011509460.1

>NW_011509461.1

fna next-gen fasta genome • 4.4k views
modified 2.4 years ago by Joe14k • written 2.4 years ago by kk.mahsa100
1

What's the link between your question and the title of this thread?

written 2.4 years ago by WouterDeCoster42k

R has a toupper() function. Say you have your fasta file saved as an object called df; then you can use something like:

data.frame(lapply(df, function(u) {
if (is.character(u)) return(toupper(u))
else return(u)
 }))

In the R cmd line type

 ?toupper

for more info.

written 2.4 years ago by theobroma221.1k
1

The toupper() R function has worked well for me. I was using assembly software that wrote its output in lowercase, and I needed to convert this to uppercase for use in downstream quality-check software. The simple R script I used was:

fasta = readLines("/path-to-fasta-file/filename.fa")
new = toupper(fasta)
writeLines(new, "/path-to-new-file/newfilename.fa")
written 21 months ago by kwathen-dunn10
4
2.4 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

With BBMap:

reformat.sh in=file.fasta out=fixed.fasta trd tuc -Xmx1g

As WouterDeCoster mentioned, often converting the case is not necessary, but it depends. Some programs will ignore lowercase letters; if you want to guarantee they will not be ignored, I recommend converting them. Whether that's a good idea depends on the analysis.

Personally, I don't like masked fasta files, though soft-masked (converted to lowercase) are vastly preferable to hard-masked (converted to N) since the information is still recoverable. People in our gene annotation group inform me that masking is crucial to good annotation, but I really don't understand why that should be the case; it seems like a fundamental flaw in the annotation software they are using.
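To illustrate the soft-masked versus hard-masked distinction above, here is a minimal Python sketch. The sequence, the masked interval, and the function names `soft_mask` and `hard_mask` are made up for illustration; real masking tools decide the intervals themselves.

```python
def soft_mask(seq, start, end):
    """Lowercase seq[start:end]; the bases themselves are kept."""
    return seq[:start] + seq[start:end].lower() + seq[end:]


def hard_mask(seq, start, end):
    """Replace seq[start:end] with N; the original bases are lost."""
    return seq[:start] + "N" * (end - start) + seq[end:]


original = "GAATGTCACGGCGAGT"
soft = soft_mask(original, 4, 10)
hard = hard_mask(original, 4, 10)

# Uppercasing undoes soft-masking, but nothing can undo hard-masking.
print(soft, soft.upper() == original)
print(hard, hard.upper() == original)
```

This is why converting a soft-masked file to uppercase is always possible, while a hard-masked file can only be restored from the unmasked original.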

modified 2.3 years ago • written 2.4 years ago by Brian Bushnell16k

After trying the awk command suggested by 5heikki, I ran reformat.sh and it worked too. Thanks a lot.

written 2.3 years ago by kk.mahsa100
4
2.4 years ago by
5heikki8.6k
Finland
5heikki8.6k wrote:

The field separator is a space. If a line doesn't begin with ">", output it in uppercase; otherwise output only the first field, unchanged.

awk 'BEGIN{FS=" "}{if(!/^>/){print toupper($0)}else{print $1}}' in.fna > out.fna
modified 2.4 years ago • written 2.4 years ago by 5heikki8.6k

Thanks, it worked fine.

written 2.3 years ago by kk.mahsa100
3
2.4 years ago by
Joe14k
United Kingdom
Joe14k wrote:

In Python, the .upper() method returns an uppercase copy of any string.

e.g.

>>> dna = 'ATAGCATGCagcatc'
>>> dna.upper()
'ATAGCATGCAGCATC'

I would suggest reading your FASTA files in with BioPython rather than writing your own entire script. It should be easy enough to figure out, but I agree with everyone else: only force the letters to uppercase if you're certain you have to.
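For a file where only the case and the header text need changing, a standard-library sketch may be enough. The code below is an illustration, not a replacement for a proper parser: the function names `clean_fasta_lines` and `clean_fasta_file` are hypothetical, and it assumes a plain-text FASTA where every header line starts with ">".

```python
def clean_fasta_lines(lines, trim_headers=True):
    """Yield FASTA lines with sequence uppercased and headers optionally trimmed."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Keep only the first whitespace-delimited field, e.g. ">NW_011509460.1".
            yield line.split()[0] if trim_headers else line
        else:
            yield line.upper()


def clean_fasta_file(in_path, out_path, trim_headers=True):
    """Stream in_path to out_path line by line, so the genome is never held in memory."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in clean_fasta_lines(src, trim_headers):
            dst.write(line + "\n")


example = [
    ">NW_011509460.1 unplaced genomic scaffold, scaffold730, whole genome shotgun sequence",
    "GAATGTCACGGCGAGTGAAACTACGTAAATTAAATATCTATTGTGAAGAACatgttctaagag",
]
cleaned = list(clean_fasta_lines(example))
print("\n".join(cleaned))
```

Calling `clean_fasta_file("in.fna", "out.fna")` (file names assumed) would apply the same transformation to a whole file. Header lines are left in their original case, unlike calling an uppercase function on every line.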

written 2.4 years ago by Joe14k
1
2.4 years ago by
h.mon28k
Brazil
h.mon28k wrote:

To answer properly, it would help to know where and how you downloaded the genome, but I suppose the lowercase letters represent soft-masked repeat sequences (they could also represent introns and intergenic regions, or some other annotation). You could convert everything to uppercase with EMBOSS seqret, among other tools.

I suppose you want to convert:

>NW_011509460.1 unplaced genomic scaffold, scaffold730, whole genome shotgun sequence

to:

>NW_011509460.1

right? The markdown interprets the > as quotes and removes it, you have to use >\>.

There are dozens of posts about editing fasta headers, did you try searching? You can use sed, awk, perl, python... You can build a solution starting from this post, for example.

written 2.4 years ago by h.mon28k

I resolved the identifier problem using a sed command:

sed 's/unplaced.*$//' input.fasta > new_file.fasta

But I have not converted lowercase to uppercase yet. Does anyone have a suggestion for how to do it?

modified 5 months ago by RamRS24k • written 2.4 years ago by kk.mahsa100
2

You realize that those lowercase nucleotides mean something? Why would you want to convert them in the first place?

written 2.4 years ago by WouterDeCoster42k

I found that "Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case."

I want to use this FASTA file as the reference in a SNP calling project. So, is converting lowercase letters to uppercase in the FASTA file necessary? Will the SNP calling results differ before and after unmasking the repeat sequences?
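Before deciding, it can be useful to see how much of a reference is actually soft-masked. A minimal Python sketch, assuming a plain-text FASTA and using a hypothetical helper named `count_soft_masked` with made-up sample lines:

```python
def count_soft_masked(lines):
    """Return (lowercase_bases, total_bases) over the sequence lines of a FASTA."""
    lower = total = 0
    for line in lines:
        if line.startswith(">"):
            continue  # header lines are not sequence
        seq = line.strip()
        total += len(seq)
        lower += sum(1 for base in seq if base.islower())
    return lower, total


sample = [
    ">NW_011509460.1",
    "GAATGTCACGatgttctaag",
    "TTTTTCTGCa",
]
lower, total = count_soft_masked(sample)
print(f"{lower} of {total} bases are soft-masked")
```

For a real file, the same function can be given an open file handle, since iterating over a handle yields its lines.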

written 2.4 years ago by kk.mahsa100

It's not necessary, for the tools I'm aware of.

written 2.4 years ago by WouterDeCoster42k