Question

how to build bowtie2 reference sequences

4.0 years ago

chrisgr ▴ 20

Hi, I can't seem to find answers to a few questions, even after hours of research. I want to make a reference sequence for bowtie2. The reference consists of a few RNA sequences of different lengths.

I first tried this by simply placing all sequences after each other with N's in between them (to allow reads to extend past the reference a bit? I don't know if that is a good thing to do): NNNNAGTGATCGGANNNNNNNAGCGTGATCGCATCGANNNNNNNNAGCGTGAGGAATAGTCTCGCATCGANNNNNNNN etc. It seemed to work, but I wanted to assign names to the sequences so that later I can see which reference sequence my reads mapped to. So I created a reference sequence by making a multifasta of them, like this:

>seq1
NNNNAGTGATCGGANNNN

>seq2
NNNNAGCGTGATCGCATCGANNNN

>seq3
NNNNAGCGTGAGGAATAGTCTCGCATCGANNNN

etc.

This also seemed to work for Bowtie2. However, the .sam files resulting from both reference genomes are different. I don't know why. Should I even use N's to allow extended sequences or is that not needed? And for some reason I cannot upload the second reference (where I assigned names) to IGV tools for visualization. Am I doing something wrong?

I realize there are a few questions here; they can be summarized as this one: How could I assign names to the different reference sequences in my reference fasta file?

Thanks in advance!

bowtie2 reference sequence bowtie2-build IGV tools • 1.2k views

Updated 4.0 years ago by GenoMax 141k • written 4.0 years ago by chrisgr ▴ 20

"By simply placing all sequences after each other with N's in between them"

You should not need to append any N's. A multi-fasta file is fine to use.

Are your reference sequences really < 25 bp long? There may be other tools that could be used instead of bowtie2, if that is the case. What kind of data do you want to align to this file?

"How could I assign names to the different reference sequences in my reference fasta file?"

Just like you did above (in the example that I have formatted using the code option).
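For illustration, here is a minimal Python sketch (not from the original thread) of writing such a named multi-FASTA, with one `>name` header per sequence and no flanking N's. The function name `to_multifasta` and the dictionary layout are assumptions made for this example.

```python
def to_multifasta(records, width=60):
    """Return multi-FASTA text: one '>name' header per sequence, wrapped lines."""
    lines = []
    for name, seq in records.items():
        lines.append(f">{name}")
        # Wrap long sequences; the short ones here fit on a single line.
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"


# Sequences from the question, with the flanking N's removed.
records = {
    "seq1": "AGTGATCGGA",
    "seq2": "AGCGTGATCGCATCGA",
    "seq3": "AGCGTGAGGAATAGTCTCGCATCGA",
}
fasta_text = to_multifasta(records)
print(fasta_text)
```

Saved to a file (say `refs.fa`, a hypothetical name), this can be indexed with `bowtie2-build refs.fa refs` and used with `bowtie2 -x refs -U reads.fastq -S out.sam`. The header names then appear as the reference names in the SAM output.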

"However, the .sam files resulting from both reference genomes are different."

No wonder since you converted your multiple fasta sequences into a single one as far as the aligner is concerned.
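To make the difference concrete: with the concatenated reference, every alignment reports the same reference name and a position inside one long sequence, so the original sequence has to be recovered from the coordinate. The sketch below shows that translation under assumed names (`build_concatenated`, `locate`) and an assumed fixed spacer of eight N's; it is an illustration, not something bowtie2 does for you.

```python
import bisect


def build_concatenated(parts, spacer="NNNNNNNN"):
    """Join sequences with N spacers; return the joined string and each part's start."""
    pieces, starts, pos = [], [], 0
    for name, seq in parts:
        starts.append((pos, name, len(seq)))  # 0-based start of this part
        pieces.append(seq)
        pieces.append(spacer)
        pos += len(seq) + len(spacer)
    return "".join(pieces), starts


def locate(starts, pos0):
    """Map a 0-based position (pos0 >= 0) to (name, offset), or None inside a spacer."""
    idx = bisect.bisect_right([s for s, _, _ in starts], pos0) - 1
    start, name, length = starts[idx]
    offset = pos0 - start
    return (name, offset) if offset < length else None


parts = [("seq1", "AGTGATCGGA"), ("seq2", "AGCGTGATCGCATCGA")]
concat, starts = build_concatenated(parts)
# SAM POS is 1-based, so subtract 1 before calling locate().
print(locate(starts, 20))
```

With a multi-FASTA reference none of this bookkeeping is needed, because the aligner reports the sequence name directly.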

"And for some reason I cannot upload the second reference (where I assigned names) to IGV tools for visualization. Am I doing something wrong?"

Most likely. A simple multi-fasta file should easily be recognized by IGV.

Reply 4.0 years ago by GenoMax 141k

"You should not need to append any N's. A multi-fasta file is fine to use."

Alright, thanks. Good to hear that this way is fine and I don't need to use the N's.

"Are your reference sequences really < 25 bp long? There may be other tools that could be used instead of bowtie2, if that is the case. What kind of data do you want to align to this file?"

Yes, very short indeed. I just want to map fastq reads to them, to see how many mismatches there are and what the full-length distribution is.

"Just like you did above (in the example I have formatted using code option)."

Ah, thank you.

"No wonder since you converted your multiple fasta sequences into a single one as far as the aligner is concerned."

Right, I thought that because I placed the N's in between them, the reads should still map the same, but I guess there might be some exceptions.

"Most likely. A simple multi-fasta file should easily be recognized by IGV."

Yes, this is really strange. I started doubting whether what I did was correct because this didn't work. But I'll give it some more thought.

Thanks a lot for the help!

Reply 4.0 years ago by chrisgr ▴ 20

This is somewhat of an outside-the-box application, since you have very small reference sequences. You could literally grep them out of the reads, or use a tool like fuzznuc (from EMBOSS) after converting your fastq sequences to fasta.
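As a rough illustration of the "grep them out of the reads" idea, the Python sketch below counts reads that contain each reference as an exact substring. It assumes plain 4-line FASTQ records, looks at the forward strand only, and allows no mismatches, so it cannot answer the mismatch question by itself. The function name and the demo reads are made up for the example.

```python
import io


def count_exact_hits(fastq_handle, references):
    """Count reads whose sequence contains each reference as an exact substring."""
    counts = {name: 0 for name in references}
    for line_no, line in enumerate(fastq_handle):
        if line_no % 4 != 1:  # in 4-line FASTQ, the sequence is the 2nd line
            continue
        read = line.strip().upper()
        for name, seq in references.items():
            if seq in read:
                counts[name] += 1
    return counts


references = {"seq1": "AGTGATCGGA", "seq2": "AGCGTGATCGCATCGA"}
# Made-up demo reads; in practice pass open("reads.fastq") instead.
demo_fastq = io.StringIO(
    "@read1\nTTAGTGATCGGATT\n+\nIIIIIIIIIIIIII\n"
    "@read2\nAGCGTGATCGCATCGA\n+\nIIIIIIIIIIIIIIII\n"
    "@read3\nCCCCCCCC\n+\nIIIIIIII\n"
)
hits = count_exact_hits(demo_fastq, references)
print(hits)
```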

I am not sure if IGV expects the references to be of a certain size. Perhaps that is why it is having trouble with your multi-fasta file.

Are you sure these alignments are working? Can you post a couple of representative entries from your SAM file?

Reply 4.0 years ago by GenoMax 141k

It seems to work well; the results make a lot of sense, at least. Why would it not work? The tRNA sequences are synthetic, so they can't map to multiple places on the reference, and I like the log files I can make from the .sam files. One of the .sam files looks like this; is this what you asked for? :) Here is an imgur link: https://imgur.com/wvX2kaQ It's not very sharp :/

BTW, I figured out how to upload the reference seqs to IGV as a multifasta: when I upload it, I have to select one of the sequences I want to see afterwards. It doesn't show anything right away, so I thought it failed.
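For reference, a small Python sketch of the kind of per-reference summary that can be made from a SAM file once the sequences have names. It counts mapped primary alignments by the RNAME column (column 3); the function name and the demo records are invented for illustration and are not taken from my actual output.

```python
import io
from collections import Counter


def reads_per_reference(sam_handle):
    """Count mapped primary alignments per reference name (SAM column 3, RNAME)."""
    counts = Counter()
    for line in sam_handle:
        if line.startswith("@"):  # header lines such as @HD, @SQ, @PG
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname = int(fields[1]), fields[2]
        if flag & 0x4:  # 0x4: read is unmapped
            continue
        if flag & 0x100 or flag & 0x800:  # secondary or supplementary alignment
            continue
        counts[rname] += 1
    return counts


# Made-up demo records; in practice pass open("out.sam") instead.
demo_sam = io.StringIO(
    "@SQ\tSN:seq1\tLN:10\n"
    "@SQ\tSN:seq2\tLN:16\n"
    "r1\t0\tseq1\t1\t42\t10M\t*\t0\t0\tAGTGATCGGA\tIIIIIIIIII\n"
    "r2\t0\tseq2\t1\t42\t16M\t*\t0\t0\tAGCGTGATCGCATCGA\tIIIIIIIIIIIIIIII\n"
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tCCCCCCCC\tIIIIIIII\n"
    "r4\t0\tseq1\t1\t42\t10M\t*\t0\t0\tAGTGATCGGA\tIIIIIIIIII\n"
)
per_ref = reads_per_reference(demo_sam)
print(dict(per_ref))
```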

Reply 4.0 years ago by chrisgr ▴ 20

When your reference target is that small, I was not sure if the aligner would be able to properly soft-clip longer reads (or perhaps you have trimmed them already). It certainly looks like it is working from the image.
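Note that bowtie2 only soft-clips in `--local` mode; in the default end-to-end mode the whole read has to align. If you want to check how much of each read was clipped, the CIGAR column (column 6) of the SAM file records it as `S` operations. A minimal Python sketch, with `soft_clipped` being a name chosen for this example:

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped(cigar):
    """Return (leading, trailing) soft-clipped base counts for a SAM CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) can sit outside soft clips; ignore them here.
    ops = [o for o in ops if o[1] != "H"]
    lead = ops[0][0] if ops and ops[0][1] == "S" else 0
    trail = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return lead, trail


print(soft_clipped("5S10M3S"))  # read extends past both ends of the aligned part
print(soft_clipped("10M"))      # no clipping
```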

If you have a lot of sequence redundancy in your dataset, you could look at simplifying it by using this tool: "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files". You can actually add counts for sequences to the fastq headers, among other things.

"I have to select one of the sequences I want to see afterwards."

That is correct. For IGV, each of these is a chromosome, and if you have a lot of them, then they would not show up at the top of the first page.
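If it helps, the names IGV offers as "chromosomes" are simply the FASTA header names. A small Python sketch to list them with their lengths follows; `fasta_lengths` is a name chosen for this example, and the input is the multifasta from the question, N's and blank lines included.

```python
def fasta_lengths(fasta_text):
    """Return {sequence name: length} for multi-FASTA text, in file order."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:  # tolerate blank lines between records
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # name = header text up to first whitespace
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


example = """>seq1
NNNNAGTGATCGGANNNN

>seq2
NNNNAGCGTGATCGCATCGANNNN

>seq3
NNNNAGCGTGAGGAATAGTCTCGCATCGANNNN
"""
lengths = fasta_lengths(example)
for seq_name, seq_len in lengths.items():
    print(seq_name, seq_len)
```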