Question

Behaviour Of Aligners For Occurrence Of Lowercase And Upper Case Bases In A Fasta File

13.2 years ago

Varun Gupta ★ 1.3k

Hi everyone. I created my own genome for some genes. In each gene sequence, upper-case bases (A, C, G, T) represent exons, while lower-case bases (a, c, g, t) represent introns. My question is how aligners like gsnap, bowtie, and tophat (which uses bowtie for indexing) would treat such a genome. Is it necessary to convert all the bases to upper case (because the reads in fastq files are in upper case), or won't it make any difference at all?

Hope to hear from you

Regards

V

fasta • 7.9k views

updated 13.2 years ago by Vikas Bansal ★ 2.4k • written 13.2 years ago by Varun Gupta ★ 1.3k

score 2 · Answer 1 · 2012-05-08

Most aligners do not care about lower and upper case unless the aligner has an option to skip lower case and you have turned it on. Is there a specific reason you made the intronic regions lower case? One reason I can see is that you do not want your reads to map to intronic regions. In that case I would suggest replacing those regions with N's. If you have some other reason, and you want to keep them in lower case while still letting reads map at those positions, then just tell your aligner not to differentiate between lower and upper case (or run it with defaults, depending on what the aligner's manual says).
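
As a rough illustration of the two options above (upper-casing everything, or replacing the lower-case intronic bases with N's), here is a minimal Python sketch. The function name `transform_fasta` and its `mode` argument are made up for this example and do not come from any aligner or library; it assumes a plain FASTA file where header lines start with `>`.

```python
def transform_fasta(lines, mode="upper"):
    """Yield FASTA lines with the sequence case normalised.

    mode="upper": convert every base to upper case.
    mode="mask":  replace lower-case bases with 'N', keep upper-case ones.
    Header lines (starting with '>') are passed through unchanged.
    """
    if mode not in ("upper", "mask"):
        raise ValueError(f"unknown mode: {mode!r}")
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Header line: never touch the record name or description.
            yield line
        elif mode == "upper":
            yield line.upper()
        else:
            # Only lower-case letters are masked; everything else is kept.
            yield "".join("N" if base.islower() else base for base in line)


# Toy record: exons in upper case, introns in lower case (hypothetical data).
example = [">gene1", "ACGTacgtACGT", "ttgaCCGA"]
print("\n".join(transform_fasta(example, mode="upper")))
print("\n".join(transform_fasta(example, mode="mask")))
```

To process a real file you could pass an open file handle as `lines` and write the yielded lines to a new FASTA file, keeping the original untouched.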

I would also suggest that whenever you have a problem like this, you create a very small test fastq file (or fasta file of reads) and a small genome file (fasta), and then run your favourite aligner on them. It will hardly take 5 minutes to run, and the good thing is that you can play around with different options and learn a lot of new things (I did the same thing and, believe me, every time I learn something new).
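
One possible way to build such a test set is sketched below in Python. Everything in it is an assumption made for illustration: the function name `make_test_files`, the file names, the sizes (two 100-base upper-case "exons" around a 100-base lower-case "intron"), and the constant quality string. Reads are error-free, forward-strand, upper-case copies of the genome, and each read name records its 1-based start position so it can be compared with the position the aligner reports.

```python
import random


def make_test_files(genome_path="test_genome.fa", reads_path="test_reads.fq",
                    read_len=30, n_reads=20, seed=0):
    """Write a tiny mixed-case genome (FASTA) and matching reads (FASTQ)."""
    rng = random.Random(seed)
    # Upper-case "exons" flanking a lower-case "intron" (random toy sequence).
    exon1 = "".join(rng.choice("ACGT") for _ in range(100))
    intron = "".join(rng.choice("acgt") for _ in range(100))
    exon2 = "".join(rng.choice("ACGT") for _ in range(100))
    genome = exon1 + intron + exon2

    with open(genome_path, "w") as fa:
        fa.write(">test_gene\n")
        for i in range(0, len(genome), 60):  # wrap sequence at 60 columns
            fa.write(genome[i:i + 60] + "\n")

    with open(reads_path, "w") as fq:
        for n in range(n_reads):
            start = rng.randrange(len(genome) - read_len + 1)
            # Reads are upper case, as they would be in a typical fastq file.
            seq = genome[start:start + read_len].upper()
            # Read name carries the 1-based start position in the genome.
            fq.write(f"@read{n}_pos{start + 1}\n{seq}\n+\n{'I' * read_len}\n")
    return genome


genome = make_test_files()
print(f"wrote genome of {len(genome)} bases and 20 reads")
```

After indexing `test_genome.fa` and aligning `test_reads.fq` with the aligner you want to check, reads whose name says they start inside the lower-case stretch (positions 101 to 200) show directly whether that aligner treats lower-case reference bases differently.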

score 0 · Answer 2 · 2012-05-08

13.2 years ago

Istvan Albert 102k

The meaning of lower and upper case can vary; lower case may mark repetitive regions, for example.

In general, aligners treat upper- and lower-case bases identically unless clearly stated otherwise.

Hi

Usually lower case indicates repeat sequences, also known as soft masking, but in my case I just took the sequence of a gene, and lower case and upper case only represent introns and exons respectively. So when I map my fastq reads to the genome (which has both upper-case and lower-case bases), will it cause any problems?

13.2 years ago by Varun Gupta ★ 1.3k

score 0 · Answer 3 · 2012-05-08

13.2 years ago

Jeremy Leipzig 23k

Lower-case reference sequences can mean "mask" (i.e. don't align to this region) to some older aligners like BLAT, but not many short-read aligners care about this. See: http://www.biostars.org/post/show/3232/which-aligners-recognize-soft-masked-repeats-in-reference-sequences/

Hi

So since in my case the lower-case bases do not mean "mask", should I convert them to upper case?

Regards

13.2 years ago by Varun Gupta ★ 1.3k

It probably won't matter; see what happens.

13.2 years ago by Jeremy Leipzig 23k