Question

converting lowercase letters to uppercase letters in fasta file

0

6.8 years ago

kk.mahsa ▴ 140

I downloaded an assembled genome from NCBI in .fna format. Its content looks like this (a sample):

GAATGTCACGGCGAGTGAAACTACGTAAATTAAATATCTATTGTGAAGAACatgttctaagagtttttttcagagcattc tgcctcattttcgaatCTAAACTTAGGTAAGAGTTTGAAATAAGGGTAAATGTTTCTTGATGACCATATggcttgtatgg tggatgaaagttctttAAACCACATGctacaactcagtaatgaatgatTGTCGAATCCGAGATGCATGTAGCGTATTTGA AACATGGAACATCACAATGtgtgaaactatgtaaattacatatttcttgggtagaactcgctccaagagtaTTTTTCTGC

What do the lowercase letters mean? How can I convert all nucleotides to uppercase in a FASTA file?
Also, the sequence identifiers look like this:
>NW_011509460.1 unplaced genomic scaffold, scaffold730, whole genome shotgun sequence

>NW_011509461.1 unplaced genomic Scaffold670, whole genome shotgun sequence

I want to convert them to:

>NW_011509460.1

>NW_011509461.1

genome next-gen fasta fna • 16k views

ADD COMMENT • link updated 6.8 years ago by Joe 21k • written 6.8 years ago by kk.mahsa ▴ 140

1

What's the link between your question and the title of this thread?

ADD REPLY • link 6.8 years ago by WouterDeCoster 47k

0

R has a toupper() function. Say you have your FASTA file saved as an object called df; then you can use something like:

data.frame(lapply(df, function(u) {
  if (is.character(u)) return(toupper(u))
  else return(u)
}))

In the R cmd line type

 ?toupper

for more info.

ADD REPLY • link 6.8 years ago by theobroma22 ★ 1.2k

1

The toupper() R function has worked well for me. I was using assembly software that wrote its output in lowercase, and I needed to convert this to uppercase for use in downstream quality-check software. The simple R script I used was:

fasta = readLines("/path-to-fasta-file/filename.fa")
new = toupper(fasta)
writeLines(new, "/path-to-new-file/newfilename.fa")

ADD REPLY • link 6.2 years ago by kwathen-dunn ▴ 10

3

6.8 years ago

Joe 21k

In Python you can get an uppercase copy of any string with the .upper() method.

e.g.

>>> dna = 'ATAGCATGCagcatc'
>>> dna.upper()
'ATAGCATGCAGCATC'

I would suggest reading your fastas in with BioPython rather than writing your own entire script. Should be easy enough to figure out - but I agree with everyone else, only coerce the capital letters if you're certain you have to.
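For readers without BioPython installed, here is a minimal plain-Python sketch of the same idea. It is only an illustration: the function name `clean_fasta` is made up for this example, and it assumes a simple text FASTA where header lines start with ">" and the wanted identifier is the first whitespace-delimited token.

```python
def clean_fasta(lines):
    """Yield FASTA lines with shortened headers and uppercase sequence.

    Hypothetical helper for illustration; not part of any library.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Keep only the first token, e.g. ">NW_011509460.1".
            yield line.split(None, 1)[0]
        else:
            # Sequence line: convert soft-masked (lowercase) bases to uppercase.
            yield line.upper()


example = [
    ">NW_011509460.1 unplaced genomic scaffold, scaffold730, whole genome shotgun sequence",
    "GAATGTCACGGCGAGTGAAACTACGTAAATTAAATATCTATTGTGAAGAACatgttctaagag",
]
cleaned = list(clean_fasta(example))
print("\n".join(cleaned))
```

For a real file, the same generator could be fed an open file handle and each yielded line written to the output file followed by a newline.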

ADD COMMENT • link 6.8 years ago by Joe 21k

1

6.8 years ago

h.mon 35k

To answer properly, it would be nice to know where and how you downloaded the genome, but I suppose the lowercase letters represent soft-masked repeat sequences (they could also represent introns and intergenic regions, or some other annotation). You could convert everything to uppercase with EMBOSS seqret, among other tools.
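If it helps to see how much of a file is actually lowercase before deciding anything, a rough Python sketch could look like the following. The function name `soft_masked_fraction` and the three-line sample are invented for illustration; the code only assumes that header lines start with ">".

```python
def soft_masked_fraction(lines):
    """Return the fraction of sequence characters that are lowercase.

    Hypothetical helper for illustration; header lines (">") are skipped.
    """
    lower = 0
    total = 0
    for line in lines:
        if line.startswith(">"):
            continue
        seq = line.strip()
        total += len(seq)
        lower += sum(1 for base in seq if base.islower())
    return lower / total if total else 0.0


# Made-up sample: 4 lowercase bases out of 12.
sample = [">seq1 demo", "ACGTacgt", "AAAA"]
fraction = soft_masked_fraction(sample)
print(f"{fraction:.2%} of bases are lowercase")
```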

I suppose you want to convert:

>NW_011509460.1 unplaced genomic scaffold, scaffold730, whole genome shotgun sequence

to:

>NW_011509460.1

right? The markdown interprets the > as a quote marker and removes it; you have to escape it as \>.

There are dozens of posts about editing fasta headers, did you try searching? You can use sed, awk, perl, python... You can build a solution starting from this post, for example.

ADD COMMENT • link 6.8 years ago by h.mon 35k

0

I resolved the identifier problem using this sed command:

sed 's/unplaced.*$//' input.fasta > new_file.fasta

But I have not converted lowercase to uppercase yet. Does anyone have a suggestion for how to do it?

ADD REPLY • link updated 4.9 years ago by Ram 43k • written 6.8 years ago by kk.mahsa ▴ 140

2

You realize that those lowercase nucleotides mean something? Why would you want to convert them in the first place?

ADD REPLY • link 6.8 years ago by WouterDeCoster 47k

0

I found that "Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case."

I want to use this FASTA file as the reference in a SNP calling project. So, is converting lowercase letters to uppercase in the FASTA file necessary? Will the SNP calling results differ before and after unmasking the repeat sequences?

ADD REPLY • link 6.8 years ago by kk.mahsa ▴ 140

0

It's not necessary, for the tools I'm aware of.

ADD REPLY • link 6.8 years ago by WouterDeCoster 47k

score 8 · Accepted Answer · 2017-07-05

8

6.8 years ago

5heikki 11k

Field separator is space, if line doesn't begin with ">" output uppercase, else output first field as is.

awk 'BEGIN{FS=" "}{if(!/^>/){print toupper($0)}else{print $1}}' in.fna > out.fna

ADD COMMENT • link 6.8 years ago by 5heikki 11k

0

Thanks, it worked fine.

ADD REPLY • link 6.8 years ago by kk.mahsa ▴ 140

score 4 · Accepted Answer · 2017-07-05

With BBMap:

reformat.sh in=file.fasta out=fixed.fasta trd tuc -Xmx1g

As WouterDeCoster mentioned, often converting the case is not necessary, but it depends. Some programs will ignore lowercase letters; if you want to guarantee they will not be ignored, I recommend converting them. Whether that's a good idea depends on the analysis.

Personally, I don't like masked fasta files, though soft-masked (converted to lowercase) are vastly preferable to hard-masked (converted to N) since the information is still recoverable. People in our gene annotation group inform me that masking is crucial to good annotation, but I really don't understand why that should be the case; it seems like a fundamental flaw in the annotation software they are using.
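The point about recoverability can be illustrated with a toy Python sketch. The sequence, the repeat interval, and the helper names `soft_mask` and `hard_mask` are all invented for this example; real masking tools work on annotated repeat coordinates.

```python
# Toy sequence with a made-up repeat in the middle (0-based, half-open interval).
original = "ACGT" + "T" * 7 + "ACGT"
repeat_start, repeat_end = 4, 11


def soft_mask(seq, start, end):
    """Lowercase the masked interval; the bases themselves are kept."""
    return seq[:start] + seq[start:end].lower() + seq[end:]


def hard_mask(seq, start, end):
    """Replace the masked interval with N; the original bases are lost."""
    return seq[:start] + "N" * (end - start) + seq[end:]


soft = soft_mask(original, repeat_start, repeat_end)
hard = hard_mask(original, repeat_start, repeat_end)

# Uppercasing undoes soft-masking, but cannot restore hard-masked bases.
print(soft.upper() == original)
print(hard.upper() == original)
```

Uppercasing the soft-masked string gives back the original sequence, while the hard-masked string stays full of N no matter what is done to it afterwards.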