Question

Creating an index from multiple .fsa files

0

Entering edit mode

23 days ago

theworldsagainstme • 0

Hi all, I'm new to bioinformatics, and I've been given a set of NGS paired reads (~150bp length) from brewer's yeast that I'm trying to detect variants in. What I'm struggling with is the fact that my reference genome comes in separate .fasta files for each chromosome and mitochondrial dna.

Initially, I was able to align my sequences using HISAT2 with the following steps

1- Concatenating all chromosome and mitochondrial genome files into a single file, and build the index

cat $GENOME_DIR/*chr*.fsa $GENOME_DIR/*mt*.fsa > $GENOME_DIR/all_genomes.fsa

hisat2-build $GENOME_DIR/all_genomes.fsa ./$GENOME_DIR/$INDEX_BASE

2 - Using HISAT2 to align and convert the sam to bam

HISAT2_INDEX=$ReferenceFolder"/yeast_index"
OUTPUT_DIR="./Reads"
READ1=$1
READ2=$2

# Align reads to the combined genome
hisat2 -x $HISAT2_INDEX -1 $READ1 -2 $READ2 --new-summary --summary-file ./Reports/alignment_summary.txt \
| samtools view -b -o $OUTPUT_DIR/alignment.bam - #convert stdout to bam file

The reports suggest the alignment works well. However when trying to detect variants with the specialised detector, Insurveyor I get the warning that my reference genome all_genomes.fsa has a problem

pyfaidx.FastaIndexingError: Line length of fasta file is not consistent! Inconsistent line found in >>tpg|BK006937.2| at line 5278.

I can reproduce the error with pyfaidx alone. However, when using Biopython's SeqIO.read() I'm able to read the file with no problem but I cannot see "lines". This got me wondering if my concatenation and/or indexing is wrong? Especially as the chromosome is marked as III for the whole read

ID: >tpg|BK006937.2|
Name: >tpg|BK006937.2|
Description: >tpg|BK006937.2| [organism=Saccharomyces cerevisiae S288c] [strain=S288c] [moltype=genomic] [chromosome=III] [note=R64-1-1]
Number of features: 0
Seq('CCCACACACCACACCCACACCACACCCACACACCACACACACCACACCCACACA...ATA')

What I'd like help with is whether the concatenation is the correct step or have I got a bug in my code (I'm relatively new to shell scripting)? Doing a bit of reading around it seems as if most people suggest using BWA-mem2 instead of HISAT2, but would that cause the problem I'm encountering too?

Thanks in advance for your help

HISAT2 BWA Alignment • 491 views

ADD COMMENT • link 20 days ago by theworldsagainstme • 0

1

Entering edit mode

Your approach is fine. There seems to be some issue with the specific sequence being flagged. You can try to validate the sequence using (LINK) seqkit seq -v fasta_file.

ADD REPLY • link 23 days ago by GenoMax 141k

0

Entering edit mode

Thanks GenoMax, This is just for anyone else who runs into a similar issue.

I've run through each file with seqkit seq along with the concatenated one. Only the concatenated one produces an error, where it the seqkit seq -v all_genomes_v2.fsa produces the error

seq: invalid DNAredundant lebtter: > at 3101994

Using BioPython SeqIO I targeted the location of the error

SeqRecord = SeqIO.read("./ReferenceGenome/all_genomes_v2.fsa", 'fasta')
print(SeqRecord[3101990:3101999])

This shows that there appears to be a problem with the concatenation, where the header becomes part of the sequence.

ID: >tpg|BK006937.2|
Name: >tpg|BK006937.2|
Description: >tpg|BK006937.2| [organism=Saccharomyces cerevisiae S288c] [strain=S288c] [moltype=genomic] [chromosome=III] [note=R64-1-1]
Number of features: 0
Seq('[organism')

I'll report back when I find out about what I did wrong with the concatenation.

ADD REPLY • link 21 days ago by theworldsagainstme • 0

0

Entering edit mode

This shows that there appears to be a problem with the concatenation, where the header becomes part of the sequence.

That probably means the file before the "problem" one is lacking proper EOF character. You could simply add a carriage return at the end of that file.

ADD REPLY • link 21 days ago by GenoMax 141k

0

Entering edit mode

Amazing. Thanks GenoMax I've spent a few hours trying all different types of concatenations suggested but none worked I'll give that a go.

Post-Script: Seems as if all the EOFs are "%" markers. No idea if this is a normal marker?

ADD REPLY • link 21 days ago by theworldsagainstme • 0

score 1 · Accepted Answer · 2024-04-08

As seen in the comments to the original post, the issue was in the quality of the .fsa files I had received. They had an incorrect ending. As GenoMax suggested simply adding a carriage return to the end of each of the reference files fixed the issue.

The lesson from this issue is to always check the quality and formats of your data!