Question

Bedtools getfasta: error

0

Entering edit mode

6 weeks ago

ummeswaiba • 0

Hello everyone,

I am using .txt file, converted to input.bed, which contains coordinates to the gene sequences (around 1507 in total). To fetch these sequences I'm using this command:

bedtools getfasta -fi hg38.fa.align.gz -bed input.bed -fo output.fast

As you can see, I'm using hg38.fa.align.gz as my reference genome. However, when I run it, I get this error.

index file hg38.fa.align.gz.fai not found, generating...
ERROR: mismatched line lengths at line 3 within sequence 
File not suitable for fasta index generation.

I did try to fix issue with reference genome file, but to no avail. Can anyone suggest any fix, I would be grateful.

getfasta bedtools • 355 views

ADD COMMENT • link updated 6 weeks ago by GenoMax 141k • written 6 weeks ago by ummeswaiba • 0

0

Entering edit mode

align

Is there a significance of that word in file name? You need to use a plain fasta sequence file as reference. Not an aligned fasta format file.

ADD REPLY • link 6 weeks ago by GenoMax 141k

0

Entering edit mode

There is no specific significance of align, and I have shifted to plain fasta file. The code runs but I get this warning for majority of the chr.

WARNING. chromosome (chr6) was not found in the FASTA file. Skipping.
WARNING. chromosome (chr6) was not found in the FASTA file. Skipping.
WARNING. chromosome (chr6) was not found in the FASTA file. Skipping.
Differing number of BED fields encountered at line: 1128.  Exiting...

I have rechecked my bed file as as well. It is in Bed6 format. both fasta and .bed file contain chrn. what could cause this problem?

ADD REPLY • link updated 6 weeks ago by GenoMax 141k • written 6 weeks ago by ummeswaiba • 0

0

Entering edit mode

Your chromosome names need to match in BED and the reference file. You also appear to have some problem with the BED file at the line noted above.

Can you show us output of:

grep "^>" refererence_file | head -4
head -5 bed_file

ADD REPLY • link 6 weeks ago by GenoMax 141k

0

Entering edit mode

Sure:

 grep "^>" hg38.repeats.cleaned.fa | head -4

>chr1|-|11485|11676|LINE|L1|L1ME3G_3end|3|32.68
>chr1|-|11505|11675|LINE|L1|L1MC5a_3end|3|24.83
>chr1|-|11678|11780|DNA|hAT-Charlie|MER5B|4|37.80
>chr1|-|15265|15355|SINE|MIR|MIR|5|29.95

head -5 input.bed

chr1    3718559 3719552 HERVK13-int 6199.000000 -
chr1    4242679 4242934 HERVH-int   34034.000000    -
chr1    5110790 5111366 HERVH-int   28994.000000    +
chr1    6623059 6623614 HERVFH21-int    6639.000000 -
chr1    8956488 8957043 HERV9-int   19583.000000    +

ADD REPLY • link updated 6 weeks ago by GenoMax 141k • written 6 weeks ago by ummeswaiba • 0

0

Entering edit mode

As you can see your fasta files have headers that seem to have extraneous information beyond just >chr1. So that is reason bedtools can't match the two. It also looks like you don't have a contiguous sequence for entire chromosome. In order for getfasta to be able to find an interval e.g. chr1 3718559 3719552 the entire sequence will need to be present as a single record.

I am not sure what the best solution is going to be in this case and what exact sequence you are looking to extract. You may want to use your BED file with the original reference genome if you just want to extract intervals present in your BED.

ADD REPLY • link 6 weeks ago by GenoMax 141k