Question

Mapping fastq files to a fasta reference using galaxy

10.8 years ago

delods0 • 0

Hello,

I am new to Galaxy and to bioinformatics in general. I have a reference library in FASTA format and data in FASTQ format that I wish to map to this reference library (both shown below).

I have tried using Bowtie for Illumina on Galaxy, but I do not think I am doing the right thing.

reference in fasta:

>A1B1C1D1
CCCTGTACACTTCCTCAAGTTGCTGAAATGATGGCTTTCTAAACCATCCCACTC
>A1B1C1D2
CCCTGTACACTTCCTCAAGTTGCTGAAATGATGGCTTTCTACTACATCCCACTC
>A1B1C1D3
CCCTGTACACTTCCTCAAGTTGCTGAAATGATGGCTTTCTAAGACATCCCACTC

and so on...

fastq file I am trying to map to it:

@QM60Z:09341:09049
GCAGTACCAACCTGTACACCACTCAAGTTTTATGGATGATGCTCTTCTAAAACCGTCCCACTCTGTAGTCAGG
+
=<<=<<<+/+/16568864/477..*-...).9:56669944865266777.9/5599/5689:8:99:<<>0

@QM60Z:09367:09119
CCTGACTACAGAGTGGGATGTAGTAGTTTGGCATCATTTCAGCAACTTGAGTCTTGTGTACAGGGTTGGTACTGC
+
6.59:<<<<<<=;<;<3;<::7:78788084889:::=29::9959:5:;:8.-*-2425566:29575688:::

and so on...

where @QM60Z:09341:09049 is the location of the sample I am trying to map to my reference library

Essentially, I would like my results to say (for example): A1B1C1D1 was present in @QM60Z:09367:09119, @QM60Z:004387:0837 ... and how many times A1B1C1D1 appears in the sample I am trying to map.

I am sure this should not be a very hard thing to do, but as I am new to this and have spent hours trying, I would appreciate your help!

Ram · Answer 1 · 2015-01-28

10.8 years ago

Ram 45k

This task is a bit more complicated than "not a hard thing to do" :)

Use BioPerl or Biopython:

1. Extract the FASTA content of the FASTQ file.
2. For each sequence ref_seq in the reference FASTA file, match ref_seq against every q_seq in the FASTA file from Step 1, then print the ID of ref_seq, the names of the q_seqs that match, and a count of those q_seqs.
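
A minimal sketch of the two steps above in plain Python (standard library only rather than Biopython, purely to keep it self-contained). The helper names, the tiny demo records, and the rule that a read "matches" when it contains the reference or sits entirely inside it are all assumptions for illustration. It handles exact matches only.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def read_fastq(handle):
    """Yield (name, sequence) pairs, assuming four lines per FASTQ record."""
    lines = [line.strip() for line in handle if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq = lines[i], lines[i + 1]
        yield header[1:].split()[0], seq.upper()


def exact_matches(refs, reads):
    """Map each reference ID to the reads that contain it or lie within it."""
    hits = {ref_id: [] for ref_id, _ in refs}
    for ref_id, ref_seq in refs:
        for read_id, read_seq in reads:
            if ref_seq in read_seq or read_seq in ref_seq:
                hits[ref_id].append(read_id)
    return hits


# Made-up demo data, not taken from the files in the question.
demo_fasta = StringIO(">refA\nACGTACGTAA\n>refB\nTTTTGGGGCC\n")
demo_fastq = StringIO(
    "@r1\nACGTACGTAA\n+\nIIIIIIIIII\n"
    "@r2\nGTACGT\n+\nIIIIII\n"
    "@r3\nCCCCCCCC\n+\nIIIIIIII\n"
)

refs = list(read_fasta(demo_fasta))
reads = list(read_fastq(demo_fastq))
hits = exact_matches(refs, reads)

for ref_id, read_ids in hits.items():
    print(ref_id, len(read_ids), ",".join(read_ids))
```

With real files, the `StringIO` objects would be replaced by `open("library.fasta")` and `open("reads.fastq")` (hypothetical file names). Comparing every read against every reference is fine for a small library but gets slow on large ones.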

Thank you RamRS. I am new to Python too, so I will have a go at it. The task does get trickier, however: because my experimental data (i.e. the FASTQ file) is the product of PCR, I will also have reverse complements of my library (I have created a FASTA file with the reverse complements too, so this may not be a problem). Moreover, not all my experimental fragments will match my reference perfectly, since many of them will be shorter or a few bases off (the problems with experimental data). I cannot just discard these bases. Is there any way to have "incomplete" matches appear and, if so, to give a quantity and a probability for how likely they are to be linked to a sequence in my library?
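
To make the reverse-complement and "incomplete match" idea concrete, here is a rough sketch that slides each read (and its reverse complement) along every reference without gaps and counts mismatches. The function names, the `max_mismatches` cutoff, and the demo sequences are hypothetical. Note that a mismatch count is not a probability, and insertions or deletions are not handled; an aligner's mapping quality (in the SAM format, a Phred-scaled estimate that the mapping position is wrong) is the more usual measure.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def best_hits(read, refs, max_mismatches=2):
    """Return the (ref_id, strand, mismatches) placements with the fewest mismatches.

    Several entries are returned when references tie, which flags the read
    as ambiguous. Reads longer than a reference are skipped for that reference.
    """
    scores = {}
    for ref_id, ref_seq in refs.items():
        for strand, query in (("+", read), ("-", reverse_complement(read))):
            for offset in range(len(ref_seq) - len(query) + 1):
                window = ref_seq[offset:offset + len(query)]
                mismatches = sum(a != b for a, b in zip(query, window))
                if mismatches <= max_mismatches:
                    key = (ref_id, strand)
                    scores[key] = min(mismatches, scores.get(key, mismatches))
    if not scores:
        return []
    lowest = min(scores.values())
    return sorted(
        (ref_id, strand, mm)
        for (ref_id, strand), mm in scores.items()
        if mm == lowest
    )


# Made-up references that differ at a single base, like a barcode library.
demo_refs = {"refA": "ACGTTGCAAGGCT", "refB": "ACGTTGCATGGCT"}

exact_read = best_hits("GTTGCAAGG", demo_refs)      # exact fragment of refA
reverse_read = best_hits("AGCCATGGA", demo_refs)    # reverse strand of refB, one base off
ambiguous_read = best_hits("ACGTTGC", demo_refs)    # does not cover the differing base

print(exact_read)
print(reverse_read)
print(ambiguous_read)
```

A read that ties between several references cannot be assigned from sequence alone, so it is probably worth reporting those separately rather than counting them towards any one library member.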

Reply written 10.8 years ago by delods0 (updated 3.6 years ago by Ram)

This is dangerously approaching the territory of problems that are too complex to solve without getting personally involved in the project. I'm gonna let this one be handled by people with more experience who might have encountered similar problems before, but if you're gonna involve partial matches where the FASTQ read is longer than the reference, that's something I have never heard of.

Reply by Ram, 3.6 years ago

Okay, thank you for your help! The FASTQ reads are actually the same length as or shorter than the reference.

Reply written 10.8 years ago by delods0 (updated 3.6 years ago by Ram)

Ah, then you might be able to tweak the command-line version of bwa to get closer to your goal. The BAM file should give you the reads that align to each reference, which you can then pivot to get your result.
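
A possible sketch of that pivot step, assuming the BAM has first been converted to text SAM (for example with `samtools view -h`). It relies only on the standard SAM columns (read name, flag, reference name, mapping quality) and the standard flag bits for unmapped, secondary, and supplementary records. The function name, the `min_mapq` parameter, and the demo records are made up for illustration.

```python
from collections import defaultdict

UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def pivot_sam(sam_lines, min_mapq=0):
    """Group primary-alignment read names by the reference they aligned to."""
    reads_by_ref = defaultdict(list)
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        qname, flag, rname, mapq = fields[0], int(fields[1]), fields[2], int(fields[4])
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue  # keep one primary alignment per read
        if mapq < min_mapq:
            continue  # drop reads the aligner could not place confidently
        reads_by_ref[rname].append(qname)
    return dict(reads_by_ref)


# Made-up SAM records, not real alignments of the data in the question.
demo_sam = [
    "@SQ\tSN:refA\tLN:54",
    "read1\t0\trefA\t1\t37\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "read2\t16\trefA\t3\t37\t8M\t*\t0\t0\tGTACGTAA\tIIIIIIII",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tCCCCCCCC\tIIIIIIII",
    "read4\t0\trefB\t1\t0\t8M\t*\t0\t0\tTTTTGGGG\tIIIIIIII",
]

all_hits = pivot_sam(demo_sam)
confident_hits = pivot_sam(demo_sam, min_mapq=10)

for ref_id, read_ids in all_hits.items():
    print(ref_id, len(read_ids), ",".join(read_ids))
```

Because the library members differ by only a few bases, reads that fit several references equally well tend to get a low mapping quality, so a `min_mapq` filter is one way to keep only the unambiguous assignments.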

Reply by Ram, 3.6 years ago