Question: Get part of sequence from genome, given a start and stop position with Java.
15 months ago by
ahclugtenberg0 wrote:

I've got VCF-like files with start, stop, REF and ALT columns. I need to check that the REF allele of each variant matches the genome at that position, to confirm that both come from the same build. I also need the nucleotides surrounding the given position. Some of the REF columns are empty, so the files are not valid VCF.

I've got a FASTA file which has the genome for chromosome 1, and I was wondering if there's a library available to get a part of the genome in nucleotides, given a start and stop position. For example, for the genome AACCGGTT, a start position of 1 and a stop position of 4 would return AACC. I could write such a parser myself, but I'd rather use a library which has the edge cases covered.
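The coordinate convention in that example (1-based, with both start and stop inclusive) can be sketched in plain Java as below. The class name `SubsequenceSketch` and the method `slice` are hypothetical and only illustrate the off-by-one handling; the sketch assumes the whole sequence is already in memory as a String, which is not practical for a real chromosome.

```java
/** Hypothetical helper: 1-based, inclusive slicing of a sequence held in memory. */
final class SubsequenceSketch {

    /**
     * Returns the bases from start to stop, both 1-based and inclusive.
     * Java's substring is 0-based with an exclusive end, hence start - 1 and stop.
     */
    static String slice(String sequence, int start, int stop) {
        if (start < 1 || stop < start || stop > sequence.length()) {
            throw new IllegalArgumentException(
                "Invalid range " + start + "-" + stop
                    + " for sequence of length " + sequence.length());
        }
        return sequence.substring(start - 1, stop);
    }
}
```

With this convention, `SubsequenceSketch.slice("AACCGGTT", 1, 4)` gives AACC, matching the example above.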

I'd rather have something local than use the NCBI API, which also makes this possible.

java vcf genome • 326 views
modified 11 months ago by Biostar ♦♦ 20 • written 15 months ago by ahclugtenberg0

Hi, you can use bedtools getfasta.


written 15 months ago by Titus910

samtools faidx, pyfaidx, and bedtools getfasta can all retrieve part of a FASTA sequence given a start and stop. While not Java libraries, they may be an option to consider.

@Pierre has his jvarkit, which may have something that will work (if you must use Java).

modified 15 months ago • written 15 months ago by genomax87k

If it's anything like BioPython and you absolutely must use Java, there's no doubt something in BioJava which you could use.

I know less than nothing about Java specifically though so can't offer any practical code for this.

written 15 months ago by Joe17k
15 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum129k wrote:

Use the htsjdk library and the class IndexedFastaSequenceFile:

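// htsjdk.samtools.reference.IndexedFastaSequenceFile; needs a .fai index next to the FASTA (e.g. from samtools faidx)
IndexedFastaSequenceFile faidx = new IndexedFastaSequenceFile(fastaFile);
// start and stop are 1-based and inclusive
String sub = faidx.getSubsequenceAt("chr1", 10, 20).getBaseString();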
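
Once the bases are available as a String (for instance from getBaseString()), the REF check and the surrounding nucleotides asked about in the question could look roughly like the sketch below. The class `RefCheckSketch` and its methods are hypothetical, not part of htsjdk. The sketch assumes the sequence string starts at position 1 of the chromosome, compares case-insensitively because soft-masked FASTA files use lowercase bases, and treats an empty REF as "cannot be checked".

```java
/** Hypothetical helper for comparing a REF allele with reference bases. */
final class RefCheckSketch {

    /**
     * True if refAllele equals the reference bases starting at the 1-based
     * position pos. An empty or missing REF cannot be verified and gives false.
     */
    static boolean refMatches(String chrom, int pos, String refAllele) {
        if (refAllele == null || refAllele.isEmpty()) {
            return false;
        }
        int from = pos - 1;                    // convert to 0-based
        int to = from + refAllele.length();    // exclusive end
        if (from < 0 || to > chrom.length()) {
            return false;                      // allele runs off the sequence
        }
        return chrom.substring(from, to).equalsIgnoreCase(refAllele);
    }

    /**
     * Returns the bases from start to stop (1-based, inclusive) plus up to
     * flank bases on each side, clipped at the ends of the sequence.
     */
    static String context(String chrom, int start, int stop, int flank) {
        int from = Math.max(1, start - flank);
        int to = Math.min(chrom.length(), stop + flank);
        return chrom.substring(from - 1, to);
    }
}
```

A high rate of mismatches across many variants would suggest a different build; a single mismatch on its own says little.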
modified 15 months ago • written 15 months ago by Pierre Lindenbaum129k

Yes, thank you! I was just looking at this library, but couldn't find the right function.

written 15 months ago by ahclugtenberg0

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted.


written 15 months ago by Pierre Lindenbaum129k