Behaviour Of Aligners For Occurrence Of Lowercase And Uppercase Bases In A FASTA File
12.0 years ago
Varun Gupta ★ 1.3k

Hi everyone, I created my own genome for some genes. In each gene sequence, uppercase bases (A, C, G, T) represent exons, while lowercase bases (a, c, g, t) represent introns. My question is: how would aligners like GSNAP, Bowtie, and TopHat (which uses Bowtie for indexing) treat such a genome? Is it necessary to convert all the bases to uppercase (because FASTQ files have reads in uppercase), or won't it make any difference at all?

Hope to hear from you

Regards

V

fasta • 6.8k views
12.0 years ago
Vikas Bansal ★ 2.4k

Most aligners do not care about lowercase versus uppercase unless the aligner has an option for it and you have told it to skip lowercase. Is there any specific reason you made the intronic regions lowercase? One reason I can see is that you do not want your reads to map to intronic regions. For that, I would suggest replacing those regions with N's. Or, if you have some other reason, want to keep them in lowercase, and also want your reads to map at those positions, then just tell your aligner (or run it with defaults, whatever the aligner manual says) not to differentiate between lowercase and uppercase.
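If you do go the N route, here is a minimal Python sketch of the idea, assuming a plain FASTA file where lowercase marks the regions to hide. The function names `hard_mask_lowercase` and `hard_mask_fasta` are made up for illustration and do not come from any aligner or library.

```python
def hard_mask_lowercase(sequence: str) -> str:
    """Replace every lowercase base with 'N'; uppercase bases are kept as-is."""
    return "".join("N" if base.islower() else base for base in sequence)


def hard_mask_fasta(lines):
    """Yield FASTA lines with lowercase bases masked.

    Header lines (starting with '>') pass through unchanged, so lowercase
    letters in sequence names are not touched.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line
        else:
            yield hard_mask_lowercase(line)


if __name__ == "__main__":
    # Hypothetical record: uppercase = exon, lowercase = intron.
    example = [">gene1\n", "ACGTacgtacGGCC\n"]
    for masked_line in hard_mask_fasta(example):
        print(masked_line)
```

Note that this keeps the coordinates unchanged, because each masked base is replaced one-for-one rather than deleted.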

I would also suggest that whenever you have a question like this, you create a very small test FASTQ file (or a FASTA file of reads) and a small genome file (FASTA), and then run your favourite aligner on them. It will take hardly 5 minutes to run, and the good thing is that you can play around with different options and learn a lot of new things (I did the same thing and, believe me, I learn something new every time).
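A small test set like that could be generated with something along these lines. This is only a sketch: the toy sequence, the file names, and the helper names (`make_reads`, `fastq_text`, `write_test_files`) are all invented for the example, and the reads are error-free substrings of the reference with a constant dummy quality string.

```python
from pathlib import Path

# Hypothetical toy gene: uppercase = exon, lowercase = intron.
TOY_GENE = "ACGTTGCAAGCTTAGGCTAAcgttagcatcgatcgatgcaaGGATCCGATTACAGGCATT"


def make_reads(reference, read_length, step):
    """Cut overlapping reads out of the reference and uppercase them,
    since reads in FASTQ files are normally uppercase."""
    reads = []
    for start in range(0, len(reference) - read_length + 1, step):
        sequence = reference[start:start + read_length].upper()
        reads.append((f"read_{start}", sequence))
    return reads


def fastq_text(reads):
    """Format (name, sequence) pairs as FASTQ with a dummy quality of 'I'."""
    records = []
    for name, sequence in reads:
        records.append(f"@{name}\n{sequence}\n+\n{'I' * len(sequence)}\n")
    return "".join(records)


def write_test_files(out_dir, reference=TOY_GENE, read_length=20, step=10):
    """Write the mixed-case genome and the uppercase reads into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    fasta_path = out_dir / "toy_genome.fa"
    fastq_path = out_dir / "toy_reads.fq"
    fasta_path.write_text(f">toy_gene\n{reference}\n")
    fastq_path.write_text(fastq_text(make_reads(reference, read_length, step)))
    return fasta_path, fastq_path


if __name__ == "__main__":
    fasta, fastq = write_test_files("toy_alignment_test")
    print(f"Wrote {fasta} and {fastq}")
```

Index the toy genome and align the toy reads with whichever aligner you are testing, then check whether the reads that came from the lowercase stretch still map.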


Hi Vikas

Well, the only reason I have lowercase and uppercase bases in my FASTA file is that I copied and pasted the sequence from a database, which makes this distinction, probably to make the sequence easier to read. I was about to convert everything to uppercase, but since I will be viewing my reads along with the junctions.bed file in IGV, I thought I would keep the genes as they are. If the aligner works the same way as it does with uppercase bases, that is fine for me, at least when I view the result in IGV. I know it won't make much difference, though.

Thanks for the help

12.0 years ago

Lowercase and uppercase can mean different things in different files; lowercase often marks repetitive regions, for example.

In general, aligners treat uppercase and lowercase bases identically unless clearly stated otherwise.


Hi

Usually lowercase indicates repeat sequences, which is known as soft masking, but in my case I just took the sequence of a gene, and lowercase and uppercase only represent introns and exons respectively. So when I map my FASTQ reads to the genome (which has both uppercase and lowercase bases), will it cause any problems?

12.0 years ago

Lowercase reference sequences can mean "mask" (i.e. don't align to this region) to some older aligners like BLAT, but not many short-read aligners care about this. See: http://www.biostars.org/post/show/3232/which-aligners-recognize-soft-masked-repeats-in-reference-sequences/


Hi

So, since in my case the lowercase bases do not mean "masked", should I convert them to uppercase?

Regards


Probably won't matter; try it and see what happens.
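If you want to be safe and convert anyway, uppercasing a FASTA file is simple. A rough sketch, assuming an ordinary text FASTA; the function name `uppercase_fasta` and the file names in the usage part are placeholders, not anything standard:

```python
def uppercase_fasta(lines):
    """Yield FASTA lines with sequence lines uppercased.

    Header lines (starting with '>') are left alone so that sequence
    names keep their original case.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line
        else:
            yield line.upper()


def convert_file(input_path, output_path):
    """Read a FASTA file and write an uppercase copy to output_path."""
    with open(input_path) as source, open(output_path, "w") as target:
        for converted in uppercase_fasta(source):
            target.write(converted + "\n")


if __name__ == "__main__":
    # Hypothetical record, printed instead of written to disk.
    for converted in uppercase_fasta([">Gene1 exons_upper\n", "ACGTacgtNNnn\n"]):
        print(converted)
```

Keep the original mixed-case file as well if you still want the exon/intron distinction for viewing.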
