Question: Failure in alignment with BWA using fusion reference genome
gravatar for marongiu.luigi
18 days ago by
Germany, Mannheim, UMM
marongiu.luigi340 wrote:

Dear all,

I have used the fasta genome provided by NCBI as reported in this post . The headers of this file are:

>chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38
>chr2  AC:CM000664.2  gi:568336022  LN:242193529  rl:Chromosome  M5:f98db672eb0993dcfdabafe2a882905c  AS:GRCh38
>chr3  AC:CM000665.2  gi:568336021  LN:198295559  rl:Chromosome  M5:76635a41ea913a405ded820447d067b0  AS:GRCh38
>chrUn_GL000218v1  AC:GL000218.1  gi:224183305  LN:161147  rl:unplaced  M5:1d708b54644c26c7e01c2dad5426d38c  AS:GRCh38
>chrEBV  AC:AJ507799.2  gi:86261677  LN:171823  rl:decoy  M5:6743bd63b3ff2b5b8985d8933c53290a  SP:Human_herpesvirus_4  tp:circular

I need to use a fusion genome built by concatenating this human genome with one obtained from selected virus sequences. This virus genome is formed by a single header and a long stretch of nucleotides derived from individual virus sequences.

Following the indications given in this other post I prepared the header for the virus genome as follows:

>chrV  AC:XXXXXXXX.1  gi:00000000  LN:370064105  rl:Chromosome   M5:5aa5be7025d7baa666a8651e0909e4ce  AS:1  SP:All_viruses  tp:linear

I made up accession number AC to XXXXXXXX.1 because there is no real entry for my made-up genome in Genbank & NCBI; since the IDs given in the human genome are 8 digit long, I gave a 8 letters fake entry and a ".1" because this is first time I am using this genome (maybe I should have used two letter, 6 numbers?).

Same for the GI number: the made up genome is not recorded in GenBank, thus I simply gave a fake 8 digit number.

LN is the length of the genome, I treated it as a real chromosome and M5 derives for md5sum I made on the fasta file. AS and SP are free text fields (I assumed) and the genome is linear.

By the way, I separated the fields with two spaces.

I concatenated the human genome and the made up virus genome with `cat <human.fa> <virus.fa> > <fusion.fa> and I prepared the indices for this genome and aligned the samples with

bwa index -a bwtsw <fusion.fa>
bwa mem -t 8 -R <read_group> <fusion.fa> <R1.fq.gz> <R2.fq.gz> | \
samtools sort -o <file_ALN-SRT.sam>

However, I got this error message:

[bns_restore_core] Parse error reading <fusion.fa>.amb

and the SAM file is virtually empty:

cat <file_ALN-SRT.sam>  
@HD VN:1.3  SO:coordinate

May I ask what I got wrong? When aligning against either one or the other genome separately the alignment is OK, thus it must be a problem with the headers I guess.

Thank you

alignment software error • 121 views
ADD COMMENTlink written 18 days ago by marongiu.luigi340

Does this help?

ADD REPLYlink written 17 days ago by finswimmer7.8k

That would be weird because I could align to the individual genomes individually. How could spaces be introduced by simply concatenating two files? And that post indicates that "spaces in the header are OK". Anyway, I'll try this, thank you.

ADD REPLYlink written 17 days ago by marongiu.luigi340

Nope, it did not. I tried with both sed -i 's/\s*$//g' (for spaces in the sequence) and sed -i 's/^[^>]\s*$//g' (for spaces in the header) followed by bwa index but the result was always the sae.

ADD REPLYlink written 17 days ago by marongiu.luigi340

I also tried with printf "field\t...field\n" but I got the same error.

ADD REPLYlink written 3 days ago by marongiu.luigi340

Let's check if the line ending terminators are correct.

$ file <fusion.fa>
<fusion.fa>: ASCII text

If the output is <fusion.fa>: ASCII text, with CRLF line terminators you should install dos2unix and run dos2unix <fusion.fa>.

fin swimmer

ADD REPLYlink modified 3 days ago • written 3 days ago by finswimmer7.8k

That was a good tip that I shall remember, but there no Windows here:

$ file GRCh38.fa 
GRCh38.fa: ASCII text
$ file virChr10510.fa 
virChr10510.fa: ASCII text
$ file fusion38-10k.fa
fusion38-10k.fa: ASCII text
ADD REPLYlink written 3 days ago by marongiu.luigi340

Could you please try to reindex fusion38-10k.fa. But this time without any parameters.

For -a bwtsw` you used I see a warning in the manual about it:

Warning: `-a bwtsw' does not work for short genomes, while `-a is' and
         `-a div' do not work not for long genomes.

Without setting the algorithm bwa will auto select the best one.

fin swimmer

ADD REPLYlink written 2 days ago by finswimmer7.8k

exactly the same error...

ADD REPLYlink written 20 hours ago by marongiu.luigi340
  • Can you index and align against <human.fa> only?
  • Can you index and align against <virus.fa> only?
  • What's the output of grep "chrV" -A 5 -B 5 <fusion.fa>?
ADD REPLYlink written 18 hours ago by finswimmer7.8k
  • yes
  • yes
  • $ grep "chrV" -A 5 -B 5 fusion38-10k.fa

>chrV  AC:KI270741.1  gi:86261678  LN:370064105  rl:Chromosome   M5:5aa5be7025d7baa666a8651e0909e4ce  AS:GRCh38  SP:All_viruses
ADD REPLYlink modified 16 hours ago by finswimmer7.8k • written 17 hours ago by marongiu.luigi340
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1721 users visited in the last hour