Question

twoBitToFa retrieving longer sequence than expected in hg19

0

Entering edit mode

4.1 years ago

biosol ▴ 170

Hi all!

I'm using twoBitToFa to retrieve a specific region of a chromosome in hg19. The command I'm using is:

twoBitToFa http://hgdownload.cse.ucsc.edu/gbdb/hg19/hg19.2bit myfile.fa -seq=chr1 -start=891021  -end=125030866

I would expect there to be 125030866 - 891021 = 124139845 nucleotides. However, when I count the number of nucleotides with a bash command there seem to be 60 more positions.

So, my question is: how can I find those extra positions and delete them? Or are they supposed to be there and I'm doing something wrong/ not having something into account??

Thanks in advance!

twoBitToFa hg19 • 1.0k views

ADD COMMENT • link updated 4.1 years ago by wm ▴ 550 • written 4.1 years ago by biosol ▴ 170

1

Entering edit mode

Are there any N's?

ADD REPLY • link 4.1 years ago by GenoMax 141k

0

Entering edit mode

Thanks to both of you, you were both right!! It turns out that my bash command was counting both the N's and the header line of the sequence!!

ADD REPLY • link 4.1 years ago by biosol ▴ 170

score 1 · Answer 1 · 2020-03-18

You can post your bash code here. the error could be from your script.

The output of twoBitToFa is Correct for my test.

$ twoBitToFa  hg19.2bit out.fa -seq=chr1 -start=891021  -end=125030866
$ wc -l out.fa
2482798 out.fa
$ head -n 2 out.fa 
>chr1:891021-125030866
GAAAATCTGAGAGCAAAGGCAGCTGCACACGCCACACCTTCTGAGACTCT

Use bash command to count the base (expect: 124139845)

$ sed '1d' out.fa | awk '{s+=length($1)}END{print s}'
124139845

Use seqkit command to stat

$ seqkit stat 
file    format  type  num_seqs      sum_len      min_len      avg_len      max_len
out.fa  FASTA   DNA          1  124,139,845  124,139,845  124,139,845  124,139,845