Hello,
I am new to Galaxy and to bioinformatics in general. I have a reference library in FASTA format and data in FASTQ format that I wish to map to this reference library (examples of both are shown below).
I have tried using Bowtie for Illumina on Galaxy, but I do not think I am doing it correctly.
reference in fasta:
>A1B1C1D1
CCCTGTACACTTCCTCAAGTTGCTGAAATGATGGCTTTCTAAACCATCCCACTC
>A1B1C1D2
CCCTGTACACTTCCTCAAGTTGCTGAAATGATGGCTTTCTACTACATCCCACTC
>A1B1C1D3
CCCTGTACACTTCCTCAAGTTGCTGAAATGATGGCTTTCTAAGACATCCCACTC
and so on...
fastq file I am trying to map to it:
@QM60Z:09341:09049
GCAGTACCAACCTGTACACCACTCAAGTTTTATGGATGATGCTCTTCTAAAACCGTCCCACTCTGTAGTCAGG
+
=<<=<<<+/+/16568864/477..*-...).9:56669944865266777.9/5599/5689:8:99:<<>0
@QM60Z:09367:09119
CCTGACTACAGAGTGGGATGTAGTAGTTTGGCATCATTTCAGCAACTTGAGTCTTGTGTACAGGGTTGGTACTGC
+
6.59:<<<<<<=;<;<3;<::7:78788084889:::=29::9959:5:;:8.-*-2425566:29575688:::
and so on...
where @QM60Z:09341:09049 is the location of the sample I am trying to map to my reference library
Essentially, I would like my results to say, for example, that A1B1C1D1 was present in @QM60Z:09367:09119, @QM60Z:004387:0837 ..., and how many times A1B1C1D1 appears in the sample I am trying to map.
I am sure this should not be a very hard thing to do, but as I am new to this and have already spent hours trying, I would appreciate your help!
Thank you, RamRS. I am new to Python too, so I will have a go at it. The task does get trickier, however: because my experimental data (i.e. the FASTQ file) is the product of PCR, it will also contain reverse complements of my library sequences (I have created a FASTA file with the reverse complements too, so this may not be a problem). Moreover, not all of my experimental fragments will match my reference perfectly, since many of them will be shorter or a few bases off (the usual problems with experimental data). I cannot just discard them. Is there any way to have "incomplete" matches appear and, if so, to give a count and a probability of how likely each one is to be linked to a sequence in my library?
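To make the reverse-complement and "a few bases off" points concrete, here is a rough Python sketch. It assumes each read is no longer than the reference, allows substitutions only (no insertions or deletions), and uses a made-up threshold `max_mismatches=3`; the function names are hypothetical too. It reports a mismatch count rather than a probability. For a probability-like score, a real aligner reports a mapping quality (MAPQ) for each read, so treat this as an illustration of the idea and not a replacement for bwa or Bowtie.

```python
# Sketch only: brute-force, gap-free matching on both strands.

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def min_mismatches(read, ref):
    """Fewest mismatches when sliding `read` along `ref` without gaps.

    Returns None if the read is longer than the reference.
    """
    if len(read) > len(ref):
        return None
    best = None
    for start in range(len(ref) - len(read) + 1):
        window = ref[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if best is None or mismatches < best:
            best = mismatches
    return best


def best_hit(read, references, max_mismatches=3):
    """Return (name, strand, mismatches) for the closest reference, or None.

    `max_mismatches` is an arbitrary illustrative threshold (an assumption).
    """
    best = None
    for name, ref in references.items():
        # Try the read as given ("+") and its reverse complement ("-").
        for strand, query in (("+", read.upper()), ("-", reverse_complement(read))):
            mm = min_mismatches(query, ref.upper())
            if mm is not None and mm <= max_mismatches:
                if best is None or mm < best[2]:
                    best = (name, strand, mm)
    return best


references = {
    "A1B1C1D1": "CCCTGTACACTTCCTCAAGTTGCTGAAATGATGGCTTTCTAAACCATCCCACTC",
    "A1B1C1D2": "CCCTGTACACTTCCTCAAGTTGCTGAAATGATGGCTTTCTACTACATCCCACTC",
}

# A made-up read: the last 20 bases of A1B1C1D2, reverse-complemented.
read = reverse_complement(references["A1B1C1D2"][-20:])
print(best_hit(read, references))
```

Because the library entries differ by only a few bases, a short or error-containing read can sit equally close to several references. This sketch just keeps the first best hit, which is one reason a proper aligner with mapping qualities is the safer route.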
This is dangerously approaching the territory of problems that are too complex to solve without getting personally involved in the project. I'm gonna let this one be handled by people with more experience that might have encountered similar problems before, but if you're gonna involve partial matches where the FASTQ is longer than the reference, that's something I have never heard of.
Okay, thank you for your help! The FASTQ reads are actually the same length as or shorter than the reference.
Ah, then you might be able to tweak the command-line version of bwa to get closer to your goal. The BAM file should give you the reads that align to each reference, which you can then pivot to get your result.
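The pivot step could look something like the sketch below. It assumes the BAM has already been converted to text SAM (for example with `samtools view`) and relies only on the standard SAM columns: read name (QNAME) in column 1, FLAG in column 2, and reference name (RNAME) in column 3. The function name `reads_per_reference` and the three example records are made up for illustration; they are not real alignments of the reads in the question.

```python
from collections import defaultdict

# Standard SAM FLAG bits.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def reads_per_reference(sam_lines):
    """Group primary, mapped reads by the reference (RNAME) they align to."""
    hits = defaultdict(list)
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with "@").
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        qname, flag, rname = fields[0], int(fields[1]), fields[2]
        # Ignore unmapped reads and non-primary alignments.
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue
        hits[rname].append(qname)
    return dict(hits)


# Hypothetical SAM records (SEQ and QUAL replaced by "*" for brevity).
example_sam = [
    "@SQ\tSN:A1B1C1D1\tLN:54",
    "@SQ\tSN:A1B1C1D2\tLN:54",
    "read_1\t0\tA1B1C1D1\t1\t60\t54M\t*\t0\t0\t*\t*",
    "read_2\t16\tA1B1C1D2\t1\t60\t54M\t*\t0\t0\t*\t*",   # reverse strand
    "read_3\t0\tA1B1C1D1\t5\t60\t50M\t*\t0\t0\t*\t*",
    "read_4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",               # unmapped
]

result = reads_per_reference(example_sam)
for ref_name, read_names in sorted(result.items()):
    print(f"{ref_name}: {len(read_names)} reads ({', '.join(read_names)})")
```

This gives, for each library entry, both the list of reads and the count asked for in the question. Reads that align to the reverse strand (FLAG bit 0x10) are still counted under the same reference name, so a separate reverse-complement FASTA should not be needed when going through an aligner.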