Question

Inserting Gaps into FASTA files for Alignment to Reference


2.7 years ago

lnrrnl ▴ 20

Hi there,

I am attempting to clean paired-end (PE) FASTQ data. I have decided to perform quality control on R1 and R2 separately. My goal is to measure the editing efficiency and to find indels and "real" mutations. I have been using AliView to view the FASTA alignment so far. I've noticed that some of the sequences would align if a gap were inserted at various positions.

Additionally, I have used usegalaxy.org for the majority of my NGS processing. I begin with FastQC (and the barcode splitter), followed by Trimmomatic, then Bowtie2. Bowtie2 outputs both the BAM files and the aligned FASTQ reads. From here, I was wondering if anyone had suggestions on how to insert gaps relative to my reference genome FASTA file.

If anyone has successfully inserted gaps to align to the reference with Bowtie2, please let me know what parameters you tweaked and how you decided what to set them to!

While I can manually do this in AliView, I am working with 100,000+ sequences and that is obviously not feasible.

Below is a screenshot of aligned sequences with an example of a sequence that I would like to insert a gap for.

The image shows a screenshot of sequences in AliView with an example of a sequence that would need a gap inserted.

Please let me know if you have any suggestions. I am comfortable working in Galaxy and with Python. Thank you so much!

mapping Alignment galaxy FASTA gap • 1.7k views

updated 2.7 years ago by Istvan Albert 100k • written 2.7 years ago by lnrrnl ▴ 20


How does that look in an alignment viewer meant to handle NGS reads like IGV or IGB? My guess is that there are indels in the alignments that are marked appropriately in the original BAM file but which are being ignored by the viewer.

2.7 years ago by Devon Ryan 104k

score 0 · Answer 1 · 2021-07-28

There should be no need to alter your reads to match the reference.

The alignment is the information on how one sequence matches another.

When you align your reads with Bowtie2, the SAM record for each alignment contains all the relevant information on where the read matches and how it matches: exactly, with mismatches, or with insertions and deletions. Admittedly, extracting that information from the SAM format is a bit circuitous.
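If you do want to see the gaps explicitly, the CIGAR string in each SAM record already says where they go. Below is a minimal sketch in plain Python that projects a read onto reference coordinates. The function name `gapped_read` is made up for illustration, and treating `N` (skipped region) the same as a deletion is an assumption of this sketch; in practice you would pull the sequence and CIGAR out of each record with a SAM/BAM library.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def gapped_read(seq, cigar):
    """Project a read onto reference coordinates using its CIGAR string.

    Returns (aligned, insertions). `aligned` has one character per reference
    position covered by the alignment, with '-' where the read has a deletion.
    `insertions` lists (offset_in_aligned, inserted_bases), because inserted
    bases have no reference column of their own.
    """
    aligned = []
    insertions = []
    pos = 0  # current index into the read sequence
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":        # consumes both read and reference
            aligned.append(seq[pos:pos + n])
            pos += n
        elif op in "DN":       # consumes reference only: pad with gaps
            aligned.append("-" * n)
        elif op == "I":        # consumes read only: record, do not place
            offset = sum(len(chunk) for chunk in aligned)
            insertions.append((offset, seq[pos:pos + n]))
            pos += n
        elif op == "S":        # soft clip: bases are in SEQ but unaligned
            pos += n
        # H (hard clip) and P (padding) consume neither read nor reference
    return "".join(aligned), insertions


aligned, insertions = gapped_read("ACGTTTGA", "3M2D2M1I2M")
print(aligned)      # ACG--TTGA
print(insertions)   # [(7, 'T')]
```

The gapped string starts at the alignment's leftmost reference position (the POS field of the SAM record), so padding it on the left by POS - 1 characters lines it up with the reference.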

To that end I would recommend minimap2 and its PAF output (Pairwise mApping Format). When minimap2 is run with the `--cs` option, each record carries a `cs` tag that lists the variations in a more explicit manner:

https://github.com/lh3/miniasm/blob/master/PAF.md
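As a rough illustration of why the `cs` tag is easier to work with, here is a small Python sketch that turns a short-form `cs` string into a list of substitutions, insertions, and deletions. The function name `parse_cs` and the event tuples are my own invention, not part of minimap2; the sketch ignores the intron (`~`) operator and the long-form (`=`) encoding, and the positions are offsets from the start of the alignment on the reference, not absolute coordinates.

```python
import re

# Short-form cs operators:
#   :N   N identical bases
#   *xy  substitution, reference base x replaced by query base y
#   +seq insertion relative to the reference
#   -seq deletion relative to the reference
CS_RE = re.compile(r":(\d+)|\*([a-z])([a-z])|\+([a-z]+)|-([a-z]+)")


def parse_cs(cs):
    """Return a list of (kind, ref_offset, detail) events from a cs string."""
    if cs.startswith("cs:Z:"):
        cs = cs[5:]  # drop the SAM/PAF tag prefix if it is still attached
    events = []
    ref_pos = 0  # offset from the alignment start on the reference
    for m in CS_RE.finditer(cs):
        match_len, ref_base, qry_base, ins, dele = m.groups()
        if match_len is not None:
            ref_pos += int(match_len)
        elif ref_base is not None:
            events.append(("sub", ref_pos, ref_base + ">" + qry_base))
            ref_pos += 1
        elif ins is not None:
            # an insertion does not advance the reference position
            events.append(("ins", ref_pos, ins))
        elif dele is not None:
            events.append(("del", ref_pos, dele))
            ref_pos += len(dele)
    return events


for event in parse_cs("cs:Z::6-ata:10+gtc:4*at:3"):
    print(event)
# ('del', 6, 'ata')
# ('ins', 19, 'gtc')
# ('sub', 23, 'a>t')
```

Adding the target start column of the PAF record to each offset gives the 0-based reference coordinate, and tallying these events across reads is one way to get at indel and mutation frequencies.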