Converting lowercase letters to uppercase letters in a FASTA file
4.2 years ago
kk.mahsa ▴ 140

I downloaded an assembled genome from NCBI in .fna format. Its content looks like this (a sample):

GAATGTCACGGCGAGTGAAACTACGTAAATTAAATATCTATTGTGAAGAACatgttctaagagtttttttcagagcattc tgcctcattttcgaatCTAAACTTAGGTAAGAGTTTGAAATAAGGGTAAATGTTTCTTGATGACCATATggcttgtatgg tggatgaaagttctttAAACCACATGctacaactcagtaatgaatgatTGTCGAATCCGAGATGCATGTAGCGTATTTGA AACATGGAACATCACAATGtgtgaaactatgtaaattacatatttcttgggtagaactcgctccaagagtaTTTTTCTGC

What do the lowercase letters mean? How can I convert all nucleotides to uppercase in a FASTA file?
Also, the sequence identifiers look like this:
>NW_011509460.1 unplaced genomic scaffold, scaffold730, whole genome shotgun sequence

>NW_011509461.1 unplaced genomic Scaffold670, whole genome shotgun sequence

I want to convert them to:

>NW_011509460.1

>NW_011509461.1

genome next-gen fasta fna • 9.0k views

R has a toupper() function. Say you have your fasta file saved as an object called df; then you can use something like:

data.frame(lapply(df, function(u) {
  if (is.character(u)) return(toupper(u))
  else return(u)
}))


In the R cmd line type

 ?toupper



The toupper() R function has worked well for me. I was using assembly software that wrote its output in lowercase and needed to convert this to uppercase for use in downstream quality-check software. The simple R script I used was:

fasta = readLines("/path-to-fasta-file/filename.fa")
new = toupper(fasta)
writeLines(new, "/path-to-new-file/newfilename.fa")

4.2 years ago
5heikki 9.9k

Field separator is space, if line doesn't begin with ">" output uppercase, else output first field as is.

awk 'BEGIN{FS=" "}{if(!/^>/){print toupper($0)}else{print $1}}' in.fna > out.fna


Thanks, it worked fine.

4.2 years ago

With BBMap:

reformat.sh in=file.fasta out=fixed.fasta trd tuc -Xmx1g


As WouterDeCoster mentioned, often converting the case is not necessary, but it depends. Some programs will ignore lowercase letters; if you want to guarantee they will not be ignored, I recommend converting them. Whether that's a good idea depends on the analysis.

Personally, I don't like masked fasta files, though soft-masked (converted to lowercase) are vastly preferable to hard-masked (converted to N) since the information is still recoverable. People in our gene annotation group inform me that masking is crucial to good annotation, but I really don't understand why that should be the case; it seems like a fundamental flaw in the annotation software they are using.
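
To make the "recoverable" point concrete, here is a minimal Python sketch. The function names `hard_mask` and `unmask` and the short test sequence are made up for illustration and do not come from any tool mentioned in this thread. Uppercasing a soft-masked sequence restores every base, whereas a hard-masked sequence has already lost them.

```python
def hard_mask(seq: str) -> str:
    """Replace every lowercase (soft-masked) base with N."""
    return "".join("N" if base.islower() else base for base in seq)


def unmask(seq: str) -> str:
    """Remove soft-masking by uppercasing; the bases themselves are unchanged."""
    return seq.upper()


# Hypothetical soft-masked fragment: the lowercase run marks a repeat.
soft = "GAACatgttcTAAG"
hard = hard_mask(soft)

print(hard)          # GAACNNNNNNTAAG
print(unmask(soft))  # GAACATGTTCTAAG  (original bases recovered)
print(unmask(hard))  # GAACNNNNNNTAAG  (masked bases are gone for good)
```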


After trying the awk command suggested by 5heikki, I ran reformat.sh and it worked too. Thanks a lot.

4.2 years ago
Joe 19k

In Python you can get an uppercase copy of any string with the .upper() method.

e.g.

>>> dna = 'ATAGCATGCagcatc'
>>> dna.upper()
'ATAGCATGCAGCATC'


I would suggest reading your fastas in with BioPython rather than writing your own entire script. Should be easy enough to figure out - but I agree with everyone else, only coerce the capital letters if you're certain you have to.
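
If installing Biopython is not an option, a plain standard-library sketch along these lines may be enough for this particular job. It assumes a simple FASTA file where header lines start with ">" and the identifier is the first whitespace-delimited field. The function name `clean_fasta` is hypothetical, and the demo record reuses the header from the question with a shortened sequence.

```python
import io


def clean_fasta(src, dst):
    """Copy FASTA text from src to dst.

    Header lines are cut at the first whitespace (identifier only);
    all other lines are converted to uppercase.
    """
    for line in src:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Keep only the first field, e.g. ">NW_011509460.1".
            dst.write(line.split(None, 1)[0] + "\n")
        else:
            dst.write(line.upper() + "\n")


# Small in-memory demo; real use would pass open file handles instead.
example = io.StringIO(
    ">NW_011509460.1 unplaced genomic scaffold, scaffold730, "
    "whole genome shotgun sequence\n"
    "GAACatgttcTAAG\n"
)
out = io.StringIO()
clean_fasta(example, out)
print(out.getvalue(), end="")
```

For a file on disk, the same function can be called inside `with open("in.fna") as src, open("out.fna", "w") as dst:` (file names taken from the awk answer above). It streams line by line, so the whole genome never needs to be in memory.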

4.2 years ago
h.mon 33k

To answer properly, it would be nice to know where and how you downloaded the genome, but I suppose the lower-case letters represent soft-masked repeat sequences (they could also represent introns and intergenic regions, or some other annotation). You could convert everything to upper case with EMBOSS seqret, among other tools.

I suppose you want to convert:

>NW_011509460.1 unplaced genomic scaffold, scaffold730, whole genome shotgun sequence

to:

>NW_011509460.1

right? Markdown interprets the > as a quote marker and removes it; you have to escape it as \>.

There are dozens of posts about editing fasta headers, did you try searching? You can use sed, awk, perl, python... You can build a solution starting from this post, for example.


I resolved the identifier problem using this sed command:

sed 's/ unplaced.*$//' input.fasta > new_file.fasta


But I have not converted lowercase to uppercase yet. Does anyone have a suggestion for how to do it?


You realize that those lowercase nucleotides mean something? Why would you want to convert them in the first place?


I found that "Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case."

I want to use this fasta file as the reference in a SNP calling project. So, is converting lowercase letters to uppercase in the fasta file necessary? Will the SNP calling results differ before and after unmasking the repeat sequences?


It's not necessary, for the tools I'm aware of.
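
If you are curious how much of a reference is soft-masked before deciding what to do with it, a rough count is easy to get. This is only a sketch: the function name `soft_masked_fraction` and the tiny demo record are invented for illustration, and it simply counts lowercase characters on non-header lines.

```python
def soft_masked_fraction(lines):
    """Return the fraction of sequence characters that are lowercase.

    Lines starting with ">" are treated as headers and skipped.
    Returns 0.0 if there is no sequence at all.
    """
    masked = 0
    total = 0
    for line in lines:
        if line.startswith(">"):
            continue
        seq = line.strip()
        total += len(seq)
        masked += sum(1 for base in seq if base.islower())
    return masked / total if total else 0.0


# Hypothetical two-line record: 6 of the 14 bases are lowercase.
demo = [">scaffold_demo", "GAACatgttc", "TAAG"]
print(f"{soft_masked_fraction(demo):.1%} of bases are soft-masked")
```

Because an open file handle iterates line by line, it can be passed in directly in place of the demo list.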
