Converting lowercase letters to uppercase letters in a FASTA file
4
0
4.2 years ago
kk.mahsa ▴ 140

I downloaded an assembled genome from NCBI in .fna format. Its content looks like this (a sample):

GAATGTCACGGCGAGTGAAACTACGTAAATTAAATATCTATTGTGAAGAACatgttctaagagtttttttcagagcattc
tgcctcattttcgaatCTAAACTTAGGTAAGAGTTTGAAATAAGGGTAAATGTTTCTTGATGACCATATggcttgtatgg
tggatgaaagttctttAAACCACATGctacaactcagtaatgaatgatTGTCGAATCCGAGATGCATGTAGCGTATTTGA
AACATGGAACATCACAATGtgtgaaactatgtaaattacatatttcttgggtagaactcgctccaagagtaTTTTTCTGC

What do the lowercase letters mean? How can I convert all nucleotides to uppercase in the FASTA file?
Also, the sequence identifiers look like this:
>NW_011509460.1 unplaced genomic scaffold, scaffold730, whole genome shotgun sequence

>NW_011509461.1 unplaced genomic Scaffold670, whole genome shotgun sequence

and I want to convert them to:

>NW_011509460.1

>NW_011509461.1

genome next-gen fasta fna • 9.0k views
1

What's the link between your question and the title of this thread?

0

R has a toupper() function. Say you have your FASTA file saved as an object called df; then you can use something like:

data.frame(lapply(df, function(u) {
  # Uppercase character columns; leave every other column unchanged
  if (is.character(u)) toupper(u) else u
}))

At the R prompt, type

 ?toupper

for more info.

1

The toupper() R function has worked well for me. I was using assembly software that wrote its output in lowercase, and I needed to convert it to uppercase for use in downstream quality-check software. The simple R script I used was:

fasta = readLines("/path-to-fasta-file/filename.fa")
new = toupper(fasta)
writeLines(new, "/path-to-new-file/newfilename.fa")
6
4.2 years ago
5heikki 9.9k

The field separator is a space. If a line doesn't begin with ">", print it in uppercase; otherwise print only the first field, as is.

awk 'BEGIN{FS=" "}{if(!/^>/){print toupper($0)}else{print $1}}' in.fna > out.fna
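For anyone who would rather do this in Python, here is a minimal sketch of the same rule using only the standard library. The function name clean_fasta_lines and the two-line example are made up for illustration; they are not from any package.

```python
def clean_fasta_lines(lines):
    """Yield FASTA lines with trimmed headers and uppercase sequence."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Header: keep only the text before the first whitespace.
            yield line.split(None, 1)[0]
        else:
            # Sequence (or blank) line: convert lowercase bases to uppercase.
            yield line.upper()


# Hypothetical in-memory example; a real run would iterate over an open file.
example = [
    ">NW_011509460.1 unplaced genomic scaffold, scaffold730, whole genome shotgun sequence\n",
    "GAATGTCACGatgttctaag\n",
]
cleaned = list(clean_fasta_lines(example))
print("\n".join(cleaned))
```

To process a file, pass an open file handle (for example in.fna) instead of the list and write each yielded line plus a newline to the output file. Because it is a generator, the whole genome never has to be held in memory.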
0

thanks, it worked fine

4
4.2 years ago

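With BBMap's reformat.sh (trd trims each header at the first whitespace; tuc converts the sequence to uppercase):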

reformat.sh in=file.fasta out=fixed.fasta trd tuc -Xmx1g

As WouterDeCoster mentioned, often converting the case is not necessary, but it depends. Some programs will ignore lowercase letters; if you want to guarantee they will not be ignored, I recommend converting them. Whether that's a good idea depends on the analysis.

Personally, I don't like masked fasta files, though soft-masked (converted to lowercase) are vastly preferable to hard-masked (converted to N) since the information is still recoverable. People in our gene annotation group inform me that masking is crucial to good annotation, but I really don't understand why that should be the case; it seems like a fundamental flaw in the annotation software they are using.
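To make the soft-masked versus hard-masked difference concrete, here is a small Python sketch. The helper hard_mask and the toy sequence are invented for illustration. It shows that uppercasing a soft-masked sequence gives back the original bases, while a hard-masked sequence has lost them.

```python
def hard_mask(seq):
    """Replace every lowercase (soft-masked) base with N."""
    return "".join("N" if base.islower() else base for base in seq)


# Toy soft-masked sequence: the lowercase run stands in for a masked repeat.
soft = "GTGAAGAACatgttctaagTTTGAA"
hard = hard_mask(soft)

print(soft.upper())  # soft mask removed, all bases still present
print(hard)          # masked bases are gone and cannot be restored
```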

0

After trying the awk command suggested by 5heikki, I ran reformat.sh and it worked too. Thanks a lot.

3
4.2 years ago
Joe 19k

In Python you can simply convert any string to uppercase with the .upper() method.

e.g.

>>> dna = 'ATAGCATGCagcatc'
>>> dna.upper()
'ATAGCATGCAGCATC'

I would suggest reading your fastas in with BioPython rather than writing your own entire script. Should be easy enough to figure out - but I agree with everyone else, only coerce the capital letters if you're certain you have to.

1
4.2 years ago
h.mon 33k

To answer properly, it would be nice to know where and how you downloaded the genome, but I suppose the lowercase letters represent soft-masked repeat sequences (they could also represent introns and intergenic regions, or some other annotation). You could convert everything to uppercase with EMBOSS seqret, among other tools.

I suppose you want to convert:

>NW_011509460.1 unplaced genomic scaffold, scaffold730, whole genome shotgun sequence

to:

>NW_011509460.1

right? Markdown interprets the > as a quote marker and removes it; you have to escape it as \>.

There are dozens of posts about editing fasta headers, did you try searching? You can use sed, awk, perl, python... You can build a solution starting from this post, for example.

0

I resolved the identifier problem with a sed command:

sed 's/unplaced.*$//' input.fasta > new_file.fasta

But I have not converted lowercase to uppercase yet. Does anyone have a suggestion for how to do it?
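As a side note, the sed substitution above can be sketched in Python like this. The function name trim_header is hypothetical. One small difference from the sed pattern: sed leaves the space before "unplaced" at the end of the header, whereas the \s* in this regex removes it.

```python
import re


def trim_header(line):
    """Drop 'unplaced ...' and everything after it from a FASTA header line."""
    if line.startswith(">"):
        # \s* also strips the whitespace that precedes "unplaced".
        return re.sub(r"\s*unplaced.*$", "", line)
    # Sequence lines are returned unchanged.
    return line


header = ">NW_011509460.1 unplaced genomic scaffold, scaffold730, whole genome shotgun sequence"
print(trim_header(header))
```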

2

You realize that those lowercase nucleotides mean something? Why would you want to convert them in the first place?

0

I found that "Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case."

I want to use this FASTA file as the reference in a SNP-calling project. So, is converting the lowercase letters to uppercase necessary? Will the SNP-calling results differ before and after unmasking the repeat sequences?
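Before deciding, it may help to see how much of the reference is actually soft-masked. Here is a rough Python sketch; the function name soft_masked_fraction and the toy records are invented for illustration, and it simply counts lowercase characters on non-header lines.

```python
def soft_masked_fraction(lines):
    """Return the fraction of sequence characters that are lowercase."""
    masked = 0
    total = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and headers
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    return masked / total if total else 0.0


# Toy example; a real run would pass an open FASTA file handle instead.
example = [">scaffold_demo", "ACGTacgt", "TTTTTTTT"]
fraction = soft_masked_fraction(example)
print(f"{fraction:.2%} of bases are soft-masked")
```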

0

It's not necessary, for the tools I'm aware of.
