Mapping FASTQ files to a FASTA reference using Galaxy
10.8 years ago
delods0 • 0

Hello,

I am new to Galaxy and to bioinformatics in general. I have a reference library in FASTA format and data in FASTQ format that I wish to map to this reference library (both are shown below).

I have tried using Bowtie for Illumina on Galaxy, but I do not think I am doing the right thing.

reference in fasta:

>A1B1C1D1
CCCTGTACACTTCCTCAAGTTGCTGAAATGATGGCTTTCTAAACCATCCCACTC
>A1B1C1D2
CCCTGTACACTTCCTCAAGTTGCTGAAATGATGGCTTTCTACTACATCCCACTC
>A1B1C1D3
CCCTGTACACTTCCTCAAGTTGCTGAAATGATGGCTTTCTAAGACATCCCACTC

and so on...

fastq file I am trying to map to it:

@QM60Z:09341:09049
GCAGTACCAACCTGTACACCACTCAAGTTTTATGGATGATGCTCTTCTAAAACCGTCCCACTCTGTAGTCAGG
+
=<<=<<<+/+/16568864/477..*-...).9:56669944865266777.9/5599/5689:8:99:<<>0

@QM60Z:09367:09119
CCTGACTACAGAGTGGGATGTAGTAGTTTGGCATCATTTCAGCAACTTGAGTCTTGTGTACAGGGTTGGTACTGC
+
6.59:<<<<<<=;<;<3;<::7:78788084889:::=29::9959:5:;:8.-*-2425566:29575688:::

and so on...

where @QM60Z:09341:09049 is the location of the sample I am trying to map to my reference library.

Essentially, I would like my results to say, for example, that A1B1C1D1 was present in @QM60Z:09367:09119, @QM60Z:004387:0837, ..., and how many times A1B1C1D1 appears in the sample I am trying to map.

I am sure this should not be a very hard thing to do, but I am new to this and have already spent hours trying, so I would appreciate your help!

galaxy • 3.5k views
10.8 years ago
Ram 45k

This task is a bit more complicated than "not a hard thing to do" :)

Use BioPerl or Biopython:

  1. Extract the FASTA content (read names and sequences) of the FASTQ file.
  2. For each sequence ref_seq in the reference FASTA file, compare ref_seq against every q_seq in the FASTA file from step 1, then print the ID of ref_seq, the names of the q_seqs that match, and a count of those q_seqs.
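
The two steps above can be sketched in plain Python without BioPerl or Biopython. This is only a minimal sketch under stated assumptions: the function names (`read_fasta`, `read_fastq`, `match_reads`) are made up for illustration, FASTQ records are assumed to span exactly four lines, and a "match" is assumed to mean exact containment in either direction (the read inside the reference, or the reference inside the read). The toy sequences in the demo are invented, not taken from the files in the question.

```python
from collections import defaultdict
from io import StringIO


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def read_fastq(handle):
    """Yield (read_id, sequence) pairs, assuming 4-line FASTQ records."""
    lines = [line.strip() for line in handle if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        # Line i is "@read_id", line i + 1 is the sequence.
        yield lines[i][1:].split()[0], lines[i + 1].upper()


def match_reads(references, reads):
    """Map each reference name to the IDs of the reads that match it.

    Assumption: a read matches when it is a substring of the reference
    or the reference is a substring of the read (exact matching only).
    """
    hits = defaultdict(list)
    for read_id, read_seq in reads:
        for ref_name, ref_seq in references:
            if read_seq in ref_seq or ref_seq in read_seq:
                hits[ref_name].append(read_id)
    return hits


# Toy demo with invented data; real files would be opened with open(...).
demo_refs = list(read_fasta(StringIO(">refA\nACGTACGT\n>refB\nTTTTGGGG\n")))
demo_reads = list(read_fastq(StringIO("@r1\nCGTA\n+\nIIII\n@r2\nTTGG\n+\nIIII\n")))
for ref_name, read_ids in match_reads(demo_refs, demo_reads).items():
    print(ref_name, len(read_ids), ",".join(read_ids))
```

Each output line holds a reference ID, the number of matching reads, and their names. Exact matching will miss reads with sequencing errors or reads from the opposite strand, so it is a starting point rather than a replacement for an aligner.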

Thank you RamRS. I am new to Python too, so I will have a go at it. The task does get trickier, however. Because my experimental data (i.e. the FASTQ file) is the product of PCR, it will also contain reverse complements of my library (I have created a FASTA file with the reverse complements too, however, so this may not be a problem). Moreover, not all of my experimental fragments will match my reference perfectly, since many of them will be shorter or a few bases off (the problems with experimental data). I cannot just discard these fragments. Is there any way to have "incomplete" matches appear and, if so, to give a quantity and a probability of how likely they are to be linked to a sequence in my library?
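
To make the two complications concrete (reverse-complement reads and reads that are shorter or a few bases off), here is a rough Python sketch. It is an illustration under assumptions, not an established method: the helper names and the `max_mismatches` threshold are invented, only substitutions are tolerated (no insertions or deletions), and reads longer than the reference are skipped. The mismatch count it reports is not a probability; a proper aligner reports a mapping quality for that purpose.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def best_window_mismatches(read, ref):
    """Smallest number of mismatches between the read and any window of
    the reference with the same length as the read.

    Returns None when the read is empty or longer than the reference.
    """
    if not read or len(read) > len(ref):
        return None
    best = len(read)
    for start in range(len(ref) - len(read) + 1):
        window = ref[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        best = min(best, mismatches)
    return best


def classify_read(read, references, max_mismatches=2):
    """Return (ref_name, strand, mismatches) for the best hit, or None.

    Both the read and its reverse complement are tried against every
    reference; ties keep the first hit found.
    """
    best_hit = None
    for ref_name, ref_seq in references:
        for strand, candidate in (("+", read), ("-", reverse_complement(read))):
            mismatches = best_window_mismatches(candidate, ref_seq)
            if mismatches is None or mismatches > max_mismatches:
                continue
            if best_hit is None or mismatches < best_hit[2]:
                best_hit = (ref_name, strand, mismatches)
    return best_hit
```

Because the reverse complement is computed on the fly, a second FASTA file of reverse-complemented references is not needed with this approach. The brute-force window scan is slow for large inputs, which is one reason dedicated aligners exist.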

This is dangerously approaching the territory of problems that are too complex to solve without getting personally involved in the project. I'm gonna let this one be handled by people with more experience that might have encountered similar problems before, but if you're gonna involve partial matches where the FASTQ is longer than the reference, that's something I have never heard of.

Okay, thank you for your help! The FASTQ is actually the same length as or shorter than the reference.

Ah, then you might be able to tweak the command-line version of bwa to get closer to your goal. The BAM file should give you the reads that align to each reference, which you can then pivot to get to your result.
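
The "pivot" step could look roughly like the Python sketch below. It assumes the alignments are available as SAM text (for example the output of `samtools view` on the BAM file) and relies only on the standard SAM columns: read name in column 1, bitwise flag in column 2, reference name in column 3. The function name `reads_per_reference` is made up for illustration, and skipping secondary and supplementary alignments is a choice made here so that each read is counted once.

```python
from collections import defaultdict

# SAM flag bits: unmapped read, secondary alignment, supplementary alignment.
SKIP_FLAGS = 0x4 | 0x100 | 0x800


def reads_per_reference(sam_lines):
    """Group aligned read names by the reference they align to.

    sam_lines is any iterable of SAM text lines; header lines that start
    with "@" are ignored.
    """
    hits = defaultdict(list)
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        read_id, flag, ref_name = fields[0], int(fields[1]), fields[2]
        if flag & SKIP_FLAGS or ref_name == "*":
            continue
        hits[ref_name].append(read_id)
    return hits


def print_report(hits):
    """Print one line per reference: ID, read count, read names."""
    for ref_name in sorted(hits):
        read_ids = hits[ref_name]
        print(ref_name, len(read_ids), ",".join(read_ids), sep="\t")
```

One way to use it is to pipe `samtools view` output into a script that calls `print_report(reads_per_reference(sys.stdin))`. Reads on the reverse strand are reported by the aligner with flag bit 0x10 set, so they end up under the same reference name without a separate reverse-complement library.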
