Question: Creating BWA index files from custom fasta file
0
gravatar for robbertijsselsteijn
3 months ago by
robbertijsselsteijn0 wrote:

Hello! I've recently started working with NGS data and I got stuck and was wondering if anyone could help me. Before I started I used some sequencing data from a colleague of mine to get my pipeline up and running and so far that works.

What I am doing is the following:

Create index files of hg19 using BWA index

Use BWA mem to align my fastq.gz files

Use Samtools to create BAM files

Use lofreq to call the variants

This all works completely fine using the genomic data from my colleague but my approach was a little different. I sequenced cDNA and I was wondering if I could use a fasta file created by myself of the gene of interest which I downloaded from UCSC. When I try to create index files from that fasta using BWA index, it does so very rapidly (the cDNA is only 700bp), but afterwards it gives me Segmentation fault (core dumped). I think this has to do with my custom fasta and accompanying index files. Any thoughts?

Thanks in advance!

The custom fasta I want to use:

> [HPRT cDNA NGS full sequence.xdna]  - 1289 bp
TTACCTCACTGCTTTCCGGAGCGGTAGCACCTCCTCCGCCGGCTTCCTCCTCAGACCGCTTTTTGCCGCGAGCCGACCGG
TCCCGTCATGCCGACCCGCAGTCCCAGCGTCGTGATTAGCGATGATGAACCAGGTTATGACCTAGATTTGTTTTGTATAC
CTAATCATTATGCCGAGGATTTGGAAAAAGTGTTTATTCCTCATGGACTGATTATGGACAGGACTGAAAGACTTGCTCGA
GATGTCATGAAGGAGATGGGAGGCCATCACATTGTGGCCCTCTGTGTGCTCAAGGGGGGCTATAAGTTCTTTGCTGACCT
GCTGGATTACATTAAAGCACTGAATAGAAATAGTGATAGATCCATTCCTATGACTGTAGATTTTATCAGACTGAAGAGCT
ACTGTAATGATCAGTCAACGGGGGACATAAAAGTTATTGGTGGAGATGATCTCTCAACTTTAACTGGAAAGAATGTCTTG
ATTGTTGAAGATATAATTGACACTGGTAAAACAATGCAAACTTTGCTTTCCCTGGTTAAGCAGTACAGCCCCAAAATGGT
TAAGGTTGCAAGCTTGCTGGTGAAAAGGACCTCTCGAAGTGTTGGATACAGGCCAGACTTTGTTGGATTTGAAATTCCAG
ACAAGTTTGTTGTTGGATATGCCCTTGACTATAATGAGTACTTCAGGGATTTGAATCACGTTTGTGTCATTAGTGAAACT
GGAAAAGCCAAATACAAAGCCTAAGATGAGCGCAAGTTGAATCTGCAAATACGAGGAGTCCTGTTGATGTTGCCAGTAAA
ATTAGCAGGTGTTCTAGTCCTGTGGCCATCTGCCTAGTAAAGCTTTTTGCATGAACCTTCTATGAATGTTACTGTTTTAT
TTTTAGAAATGTCAGTTGCTGCGTCCCCAGACTTTTGATTTGCACTATGAGCCTATAGGCCAGCCTACCCTCTGGTAGAT
TGTCGCTTATCTTGTAAGAAAAACAAATCTCTTAAATTACCACTTTTAAATAATAATACTGAGATTGTATCTGTAAGAAG
GATTTAAAGAGAAGCTATATTAGTTTTTTAATTGGTATTTTAATTTTTATATATTCAGGAGAGAAAGATGTGATTGATAT
TGTTAATTTAGACGAGTCTGAAGCTCTCGATTTCCTATCAGTAACAGCATCTAAGAGGTTTTGCTCAGTGGAATAAACAT
GTTTCAGCAGTGTTGGCTGTATTTTCCCACTTTCAGTAAATCGTTGTCAACAGTTCCTTTTAAATGCAAATAAATAAATT
CTAAAAATT

The code:

bwa index fastafile.fa

bwa mem fastafile.fa sample1Fw.fastq.gz sample1Rv.fastq.gz > sample1.sam

The error:
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 66226 sequences (10000126 bp)...
./bwascript-test.sh: line 2: 26915 Segmentation fault      (core dumped)

line 2 is the bwa mem I typed above.

ADD COMMENTlink modified 3 months ago by ATpoint32k • written 3 months ago by robbertijsselsteijn0

I don't think in principle there should be any problem doing this. I've mapped reads to short bespoke references in the past. As ATpoint says, without a bit more direct information we can't be of much help.

For some initial troubleshooting steps, perhaps try decimating your data. Seg faults typically imply not enough memory, so perhaps reducing the number of reads so you have less mapping and less to discard might help pinpoint the issue. It seems unlikely that its memory based though (but if you could share some hardware specs that would also help).

Perhaps consider trying bowtie or minimap and see if they also choke on your data.

ADD REPLYlink written 3 months ago by Joe16k

Yeah I thought of memory as well, but it also gave me the error when I ran with 20 reads :0

ADD REPLYlink written 3 months ago by robbertijsselsteijn0

Does the command run when not inside your script?

ADD REPLYlink written 3 months ago by Joe16k

Nope same problem sadly! Thanks for the help so far.

ADD REPLYlink written 3 months ago by robbertijsselsteijn0

Can you post the full output of

bwa index fastafile.fa

This should work without any issues unless your index is not being made properly.

Normally you should see something like this being printed to screen:

bwa index -p junk te.fa
[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.00 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.00 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index -p junk te.fa
[main] Real time: 0.006 sec; CPU: 0.003 sec
ADD REPLYlink modified 3 months ago • written 3 months ago by genomax80k
bwa index HPRTmouse10.fa
[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.00 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.00 sec
[main] Version: 0.7.16a-r1181
[main] CMD: bwa index HPRTmouse10.fa
[main] Real time: 0.034 sec; CPU: 0.005 sec

Seems to work so far. Its the next step that gives the problem. If I use hg19 reference that works but aligning to genomic sequence for cDNA is unlogical. So I assume there's something wrong with my input for reference?

ADD REPLYlink modified 3 months ago by genomax80k • written 3 months ago by robbertijsselsteijn0

aligning to genomic sequence for cDNA

What does that mean?

As long as your cDNA fasta sequence is ok, it should work. Example I posted above had about a kb of sequence in the reference. I can align a set of fastq reads to this index without any issues using bwa mem.

ADD REPLYlink modified 3 months ago • written 3 months ago by genomax80k
1
gravatar for ATpoint
3 months ago by
ATpoint32k
Germany
ATpoint32k wrote:

I cannot reproduce this with your cDNA sequence and a subset I copied out of it and used as a read in a fastq file: I suggest you delete the index and start over. Also make sure that your fastq file is fine and try to run the bwa mem command manually outside of the script to see if the issue is with the script. Can you show the content of the script?

Edit: => The issue was due to hidden characters as the fasta file was saved on a Windows system but then exported to Unix, see OPs comment below. enter image description here

ADD COMMENTlink modified 3 months ago • written 3 months ago by ATpoint32k

I don't know how but when i reproduced what you did it seemed to work. Maybe its the (base) at the bottom? This time I created the file in my unix system instead of exporting as fasta from serialcloner on windows, this might have solved it as well.

Either way thanks!

ADD REPLYlink written 3 months ago by robbertijsselsteijn0
2

You're welcome. Yes, system incompatibilities when it comes to file formatting is indeed an issue, therefore it is recommended to stick with Unix systems. The (base) simply indicates that it is my base environment being active (conda package manager) but this has nothing to do with bwa. Good to hear it works now!

ADD REPLYlink written 3 months ago by ATpoint32k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1056 users visited in the last hour