Question

Error with samtools faidx...Different line length in sequence

1

Entering edit mode

6.4 years ago

oars ▴ 200

I've tried making a "fai" file using the below command:

samtools faidx genome.fa

But I get an error:

[E::fai_build_core] Different line length in sequence '(null)'
Could not build fai index /media/dcpattie/backup/Week08/genome.fa.fai

I found this old thread: Error while doing indexing of fasta file using SAMTOOL faidx

Is the best course of action the following sequence of steps?

java -jar picard.jar NormalizeFasta \
      I=input_genome.fa \
      O=normalized_genome.fa

samtools faidx • 11k views

ADD COMMENT • link updated 6.2 years ago by Biostar 20 • written 6.4 years ago by oars ▴ 200

2

Entering edit mode

if the following command shows different length of lines.

awk '/^>/ {print;next;} {print length($0);}'  genome.fa  | uniq

then you'll have to reformat your fasta sequence with NormalizeFasta

ADD REPLY • link 6.4 years ago by Pierre Lindenbaum 163k

0

Entering edit mode

First - many thanks for your reply!

now, the results...

30
50
21
>chr2
50
23
>chr3
50
30
>chr4
50
26
>chr5
50
10
>chr6
50
17
>chr7
50
13
>chr8
50
22
>chr9
50
31
>chr10
50
47
>chr11
50
16
>chr12
50
45
>chr13
50
28
>chr14
50
40
>chr15
50
42
>chr16
50
3
>chr17
50
10
>chr18
50
48
>chr19
50
33
>chr20
50
20
>chr21
50
45
>chr22
50
16
>chrX
50
10
>chrY
50
16

ADD REPLY • link 6.4 years ago by oars ▴ 200

0

Entering edit mode

It seems your >chr1 header is corrupted, here is what I get:

>chr1
50
22

So the genome is formatted with 50 bases per line, with variations in the last line for each chromosome - the differences between my results and yours for these last lines is probably due to different versions of the genome.

What is the output of head -n2 genome.fa?

ADD REPLY • link 6.4 years ago by h.mon 35k

0

Entering edit mode

>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

ADD REPLY • link 6.4 years ago by oars ▴ 200

1

Entering edit mode

So why the output of awk '/^>/ {print;next;} {print length($0);}' genome.fa | uniq is:

30
50
21

It should be:

>chr1
50
21

ADD REPLY • link 6.4 years ago by h.mon 35k

0

Entering edit mode

oars : As an aside, if your genome.fa is corrupt then your bwa indexes are no good either.

ADD REPLY • link 6.4 years ago by GenoMax 144k

0

Entering edit mode

I don't know what went wrong, here is my commands:

$ wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
$ tar xvfz chromFa.tar.gz
$ for x in {1..22} X Y;do cat chr$x.fa;done > genome.fa

ADD REPLY • link 6.4 years ago by oars ▴ 200

1

Entering edit mode

Are there any blank lines in your genome.fa file?

ADD REPLY • link 6.4 years ago by GenoMax 144k

0

Entering edit mode

I'm not sure, its a 3GB file. I pulled the file from the UCSC website, then after making my genome.fa output file I used...

bwa index genome.fa

...to create five additional files, all named with the prefix genome.fa (.sa; .ann; .amb; .pac; .bwt).

Now I'm trying to make a genome.fa.fai file but this is much harder than advertised.

ADD REPLY • link 6.4 years ago by oars ▴ 200