Hi all, I'm new to bioinformatics, and I've been given a set of NGS paired reads (~150bp length) from brewer's yeast that I'm trying to detect variants in. What I'm struggling with is the fact that my reference genome comes in separate .fasta files for each chromosome and mitochondrial dna.
Initially, I was able to align my sequences using HISAT2 with the following steps
1- Concatenating all chromosome and mitochondrial genome files into a single file, and build the index
cat $GENOME_DIR/*chr*.fsa $GENOME_DIR/*mt*.fsa > $GENOME_DIR/all_genomes.fsa
hisat2-build $GENOME_DIR/all_genomes.fsa ./$GENOME_DIR/$INDEX_BASE
2 - Using HISAT2 to align and convert the sam to bam
HISAT2_INDEX=$ReferenceFolder"/yeast_index"
OUTPUT_DIR="./Reads"
READ1=$1
READ2=$2
# Align reads to the combined genome
hisat2 -x $HISAT2_INDEX -1 $READ1 -2 $READ2 --new-summary --summary-file ./Reports/alignment_summary.txt \
| samtools view -b -o $OUTPUT_DIR/alignment.bam - #convert stdout to bam file
The reports suggest the alignment works well. However when trying to detect variants with the specialised detector, Insurveyor I get the warning that my reference genome all_genomes.fsa
has a problem
pyfaidx.FastaIndexingError: Line length of fasta file is not consistent! Inconsistent line found in >>tpg|BK006937.2| at line 5278.
I can reproduce the error with pyfaidx alone. However, when using Biopython's SeqIO.read() I'm able to read the file with no problem but I cannot see "lines". This got me wondering if my concatenation and/or indexing is wrong? Especially as the chromosome is marked as III
for the whole read
ID: >tpg|BK006937.2|
Name: >tpg|BK006937.2|
Description: >tpg|BK006937.2| [organism=Saccharomyces cerevisiae S288c] [strain=S288c] [moltype=genomic] [chromosome=III] [note=R64-1-1]
Number of features: 0
Seq('CCCACACACCACACCCACACCACACCCACACACCACACACACCACACCCACACA...ATA')
What I'd like help with is whether the concatenation is the correct step or have I got a bug in my code (I'm relatively new to shell scripting)? Doing a bit of reading around it seems as if most people suggest using BWA-mem2 instead of HISAT2, but would that cause the problem I'm encountering too?
Thanks in advance for your help
Your approach is fine. There seems to be some issue with the specific sequence being flagged. You can try to validate the sequence using (LINK)
seqkit seq -v fasta_file
.Thanks GenoMax, This is just for anyone else who runs into a similar issue.
I've run through each file with seqkit seq along with the concatenated one. Only the concatenated one produces an error, where it the
seqkit seq -v all_genomes_v2.fsa
produces the errorUsing BioPython SeqIO I targeted the location of the error
This shows that there appears to be a problem with the concatenation, where the header becomes part of the sequence.
I'll report back when I find out about what I did wrong with the concatenation.
That probably means the file before the "problem" one is lacking proper EOF character. You could simply add a carriage return at the end of that file.
Amazing. Thanks GenoMax I've spent a few hours trying all different types of concatenations suggested but none worked I'll give that a go.
Post-Script: Seems as if all the EOFs are "%" markers. No idea if this is a normal marker?