Question: Error with samtools faidx...Different line length in sequence
0
gravatar for oars
23 months ago by
oars160
oars160 wrote:

I've tried making a "fai" file using the below command:

samtools faidx genome.fa

But I get an error:

[E::fai_build_core] Different line length in sequence '(null)'
Could not build fai index /media/dcpattie/backup/Week08/genome.fa.fai

I found this old thread: Error while doing indexing of fasta file using SAMTOOL faidx

Is the best course of action the following sequence of steps?

java -jar picard.jar NormalizeFasta \
      I=input_genome.fa \
      O=normalized_genome.fa
samtools faidx • 3.6k views
ADD COMMENTlink modified 21 months ago by Biostar ♦♦ 20 • written 23 months ago by oars160
1

Are there any blank lines in your genome.fa file?

ADD REPLYlink written 23 months ago by genomax78k

I'm not sure, its a 3GB file. I pulled the file from the UCSC website, then after making my genome.fa output file I used...

bwa index genome.fa

...to create five additional files, all named with the prefix genome.fa (.sa; .ann; .amb; .pac; .bwt).

Now I'm trying to make a genome.fa.fai file but this is much harder than advertised.

ADD REPLYlink written 23 months ago by oars160
1

if the following command shows different length of lines.

awk '/^>/ {print;next;} {print length($0);}'  genome.fa  | uniq

then you'll have to reformat your fasta sequence with NormalizeFasta

ADD REPLYlink modified 23 months ago • written 23 months ago by Pierre Lindenbaum126k

First - many thanks for your reply!

now, the results...

30
50
21
>chr2
50
23
>chr3
50
30
>chr4
50
26
>chr5
50
10
>chr6
50
17
>chr7
50
13
>chr8
50
22
>chr9
50
31
>chr10
50
47
>chr11
50
16
>chr12
50
45
>chr13
50
28
>chr14
50
40
>chr15
50
42
>chr16
50
3
>chr17
50
10
>chr18
50
48
>chr19
50
33
>chr20
50
20
>chr21
50
45
>chr22
50
16
>chrX
50
10
>chrY
50
16
ADD REPLYlink written 23 months ago by oars160

It seems your >chr1 header is corrupted, here is what I get:

>chr1
50
22

So the genome is formatted with 50 bases per line, with variations in the last line for each chromosome - the differences between my results and yours for these last lines is probably due to different versions of the genome.

What is the output of head -n2 genome.fa?

ADD REPLYlink written 23 months ago by h.mon29k
>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
ADD REPLYlink modified 23 months ago • written 23 months ago by oars160
1

So why the output of awk '/^>/ {print;next;} {print length($0);}' genome.fa | uniq is:

30
50
21

It should be:

>chr1
50
21
ADD REPLYlink modified 23 months ago • written 23 months ago by h.mon29k

oars : As an aside, if your genome.fa is corrupt then your bwa indexes are no good either.

ADD REPLYlink written 23 months ago by genomax78k

I don't know what went wrong, here is my commands:

$ wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
$ tar xvfz chromFa.tar.gz
$ for x in {1..22} X Y;do cat chr$x.fa;done > genome.fa
ADD REPLYlink written 23 months ago by oars160
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1296 users visited in the last hour