Question

Creating BWA index files from custom fasta file

0

Entering edit mode

4.4 years ago

robbertijsselsteijn • 0

Hello! I've recently started working with NGS data and I got stuck and was wondering if anyone could help me. Before I started I used some sequencing data from a colleague of mine to get my pipeline up and running and so far that works.

What I am doing is the following:

Create index files of hg19 using BWA index

Use BWA mem to align my fastq.gz files

Use Samtools to create BAM files

Use lofreq to call the variants

This all works completely fine using the genomic data from my colleague but my approach was a little different. I sequenced cDNA and I was wondering if I could use a fasta file created by myself of the gene of interest which I downloaded from UCSC. When I try to create index files from that fasta using BWA index, it does so very rapidly (the cDNA is only 700bp), but afterwards it gives me Segmentation fault (core dumped). I think this has to do with my custom fasta and accompanying index files. Any thoughts?

Thanks in advance!

The custom fasta I want to use:

> [HPRT cDNA NGS full sequence.xdna]  - 1289 bp
TTACCTCACTGCTTTCCGGAGCGGTAGCACCTCCTCCGCCGGCTTCCTCCTCAGACCGCTTTTTGCCGCGAGCCGACCGG
TCCCGTCATGCCGACCCGCAGTCCCAGCGTCGTGATTAGCGATGATGAACCAGGTTATGACCTAGATTTGTTTTGTATAC
CTAATCATTATGCCGAGGATTTGGAAAAAGTGTTTATTCCTCATGGACTGATTATGGACAGGACTGAAAGACTTGCTCGA
GATGTCATGAAGGAGATGGGAGGCCATCACATTGTGGCCCTCTGTGTGCTCAAGGGGGGCTATAAGTTCTTTGCTGACCT
GCTGGATTACATTAAAGCACTGAATAGAAATAGTGATAGATCCATTCCTATGACTGTAGATTTTATCAGACTGAAGAGCT
ACTGTAATGATCAGTCAACGGGGGACATAAAAGTTATTGGTGGAGATGATCTCTCAACTTTAACTGGAAAGAATGTCTTG
ATTGTTGAAGATATAATTGACACTGGTAAAACAATGCAAACTTTGCTTTCCCTGGTTAAGCAGTACAGCCCCAAAATGGT
TAAGGTTGCAAGCTTGCTGGTGAAAAGGACCTCTCGAAGTGTTGGATACAGGCCAGACTTTGTTGGATTTGAAATTCCAG
ACAAGTTTGTTGTTGGATATGCCCTTGACTATAATGAGTACTTCAGGGATTTGAATCACGTTTGTGTCATTAGTGAAACT
GGAAAAGCCAAATACAAAGCCTAAGATGAGCGCAAGTTGAATCTGCAAATACGAGGAGTCCTGTTGATGTTGCCAGTAAA
ATTAGCAGGTGTTCTAGTCCTGTGGCCATCTGCCTAGTAAAGCTTTTTGCATGAACCTTCTATGAATGTTACTGTTTTAT
TTTTAGAAATGTCAGTTGCTGCGTCCCCAGACTTTTGATTTGCACTATGAGCCTATAGGCCAGCCTACCCTCTGGTAGAT
TGTCGCTTATCTTGTAAGAAAAACAAATCTCTTAAATTACCACTTTTAAATAATAATACTGAGATTGTATCTGTAAGAAG
GATTTAAAGAGAAGCTATATTAGTTTTTTAATTGGTATTTTAATTTTTATATATTCAGGAGAGAAAGATGTGATTGATAT
TGTTAATTTAGACGAGTCTGAAGCTCTCGATTTCCTATCAGTAACAGCATCTAAGAGGTTTTGCTCAGTGGAATAAACAT
GTTTCAGCAGTGTTGGCTGTATTTTCCCACTTTCAGTAAATCGTTGTCAACAGTTCCTTTTAAATGCAAATAAATAAATT
CTAAAAATT

The code:

bwa index fastafile.fa

bwa mem fastafile.fa sample1Fw.fastq.gz sample1Rv.fastq.gz > sample1.sam

The error:
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 66226 sequences (10000126 bp)...
./bwascript-test.sh: line 2: 26915 Segmentation fault      (core dumped)

line 2 is the bwa mem I typed above.

alignment RNA-Seq genome sequencing • 7.2k views

ADD COMMENT • link updated 4.4 years ago by ATpoint 82k • written 4.4 years ago by robbertijsselsteijn • 0

0

Entering edit mode

I don't think in principle there should be any problem doing this. I've mapped reads to short bespoke references in the past. As ATpoint says, without a bit more direct information we can't be of much help.

For some initial troubleshooting steps, perhaps try decimating your data. Seg faults typically imply not enough memory, so perhaps reducing the number of reads so you have less mapping and less to discard might help pinpoint the issue. It seems unlikely that its memory based though (but if you could share some hardware specs that would also help).

Perhaps consider trying bowtie or minimap and see if they also choke on your data.

ADD REPLY • link 4.4 years ago by Joe 21k

0

Entering edit mode

Yeah I thought of memory as well, but it also gave me the error when I ran with 20 reads :0

ADD REPLY • link 4.4 years ago by robbertijsselsteijn • 0

0

Entering edit mode

Does the command run when not inside your script?

ADD REPLY • link 4.4 years ago by Joe 21k

0

Entering edit mode

Nope same problem sadly! Thanks for the help so far.

ADD REPLY • link 4.4 years ago by robbertijsselsteijn • 0

0

Entering edit mode

Can you post the full output of

bwa index fastafile.fa

This should work without any issues unless your index is not being made properly.

Normally you should see something like this being printed to screen:

bwa index -p junk te.fa
[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.00 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.00 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index -p junk te.fa
[main] Real time: 0.006 sec; CPU: 0.003 sec

ADD REPLY • link 4.4 years ago by GenoMax 142k

0

Entering edit mode

bwa index HPRTmouse10.fa
[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.00 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.00 sec
[main] Version: 0.7.16a-r1181
[main] CMD: bwa index HPRTmouse10.fa
[main] Real time: 0.034 sec; CPU: 0.005 sec

Seems to work so far. Its the next step that gives the problem. If I use hg19 reference that works but aligning to genomic sequence for cDNA is unlogical. So I assume there's something wrong with my input for reference?

ADD REPLY • link updated 4.4 years ago by GenoMax 142k • written 4.4 years ago by robbertijsselsteijn • 0

0

Entering edit mode

aligning to genomic sequence for cDNA

What does that mean?

As long as your cDNA fasta sequence is ok, it should work. Example I posted above had about a kb of sequence in the reference. I can align a set of fastq reads to this index without any issues using bwa mem.

ADD REPLY • link 4.4 years ago by GenoMax 142k

score 1 · Answer 1 · 2019-12-11

1

Entering edit mode

4.4 years ago

ATpoint 82k

I cannot reproduce this with your cDNA sequence and a subset I copied out of it and used as a read in a fastq file: I suggest you delete the index and start over. Also make sure that your fastq file is fine and try to run the bwa mem command manually outside of the script to see if the issue is with the script. Can you show the content of the script?

Edit: => The issue was due to hidden characters as the fasta file was saved on a Windows system but then exported to Unix, see OPs comment below. enter image description here

ADD COMMENT • link 4.4 years ago by ATpoint 82k

0

Entering edit mode

I don't know how but when i reproduced what you did it seemed to work. Maybe its the (base) at the bottom? This time I created the file in my unix system instead of exporting as fasta from serialcloner on windows, this might have solved it as well.

Either way thanks!

ADD REPLY • link 4.4 years ago by robbertijsselsteijn • 0

2

Entering edit mode

You're welcome. Yes, system incompatibilities when it comes to file formatting is indeed an issue, therefore it is recommended to stick with Unix systems. The (base) simply indicates that it is my base environment being active (conda package manager) but this has nothing to do with bwa. Good to hear it works now!

ADD REPLY • link 4.4 years ago by ATpoint 82k