Extract Sequences From Chr In Fasta Via Java
10.6 years ago
bhonsai • 0

Hi,

I have a small problem. I am trying to retrieve many subsequences from chr21. The positions come from an ENCODE table, and I want to seek quickly through the chromosome file in Java. I do get sequences, but not the same ones the Genome Browser shows for those positions. Using BLAT I found that, for example, the sequence I read at position 15,000,000 actually lies at 15,049,142, and I can't explain the difference. Any help would be appreciated.

    StringBuffer finalSeq = new StringBuffer();
    try {
        RandomAccessFile chr21 = new RandomAccessFile ("/scratch/fbh/Homo_sapiens.GRCh37.72.dna.chromosome.21.fa", "r");
        /* skipping top line for calculating the right pointer coordinates
         *  from start- and endpositions from the sequence of interest */
        chr21.readLine();
        long begin = (51*(startpos/50) + startpos%50 + chr21.getFilePointer());
        long end = (51*(endpos/50) + endpos%50 + chr21.getFilePointer());

        chr21.seek(begin);                                        // setting pointer at begin of sequence
        char charAtPos;

        for ( int i = 0; i < (end - begin)+1; i++){

            if ((charAtPos = (char) chr21.read()) != '\n'){            // getting sequence without newlines
                finalSeq.append(charAtPos);
            }
        }
        finalSeq.trimToSize();
        chr21.close();
    } catch (IOException e2){}                                     // catching throws
fasta java sequence • 2.5k views
10.6 years ago

Suggestions:

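  • check the versions of the builds: are both really GRCh37?
  • check whether all lines really have the same length of 50 bases
  • check whether the lines end with CRLF instead of LF (a two-byte terminator changes the number of bytes per line)
  • have a look at the SAM/Picard code for faidx-indexed FASTA files here: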

byte[] target = new byte[length];
ByteBuffer targetBuffer = ByteBuffer.wrap(target);

final int basesPerLine = indexEntry.getBasesPerLine();
final int bytesPerLine = indexEntry.getBytesPerLine();
final int terminatorLength = bytesPerLine - basesPerLine;

long startOffset = ((start-1)/basesPerLine)*bytesPerLine + (start-1)%basesPerLine;
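
To make the offset formula above concrete, here is a minimal sketch that measures the line geometry from the file itself instead of hard-coding 50 bases and 51 bytes per line. The class and method names (`FastaRegionReader`, `byteOffset`, `readRegion`) are made up for illustration and are not part of Picard. The sketch assumes a FASTA file with a single record, sequence lines of uniform length (except possibly the last), and 1-based inclusive coordinates.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper, not Picard's API: reads a region from a single-record FASTA file.
final class FastaRegionReader {

    /** Byte offset of a 1-based base position, given the header size and line geometry. */
    public static long byteOffset(long headerBytes, long pos1, int basesPerLine, int bytesPerLine) {
        long zeroBased = pos1 - 1;
        return headerBytes
                + (zeroBased / basesPerLine) * bytesPerLine  // full lines, terminators included
                + zeroBased % basesPerLine;                  // offset within the target line
    }

    /** Returns bases start..end (1-based, inclusive). */
    public static String readRegion(String path, long start, long end) throws IOException {
        try (RandomAccessFile fasta = new RandomAccessFile(path, "r")) {
            fasta.readLine();                                // skip the '>' header line
            long headerBytes = fasta.getFilePointer();

            // Measure the first sequence line: readLine() strips the terminator,
            // while the file pointer advances past it (LF or CRLF).
            String firstLine = fasta.readLine();
            if (firstLine == null || firstLine.isEmpty()) {
                throw new IOException("No sequence data after header in " + path);
            }
            int basesPerLine = firstLine.length();
            int bytesPerLine = (int) (fasta.getFilePointer() - headerBytes);

            fasta.seek(byteOffset(headerBytes, start, basesPerLine, bytesPerLine));

            long wanted = end - start + 1;
            StringBuilder seq = new StringBuilder();
            int b;
            // Collect exactly 'wanted' bases, skipping line terminators on the way.
            while (seq.length() < wanted && (b = fasta.read()) != -1) {
                if (b != '\n' && b != '\r') {
                    seq.append((char) b);
                }
            }
            return seq.toString();
        }
    }
}
```

A call such as `FastaRegionReader.readRegion(path, 15000000L, 15000100L)` would then use whatever line length the file actually has. Counting collected bases rather than bytes also avoids having to compute a separate end offset.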

I really had checked the versions, but every version gave different results, so I was very confused. I have now checked the bytes per line as well as the versions, and it works. I wish I had asked earlier. It was a stupid mistake, but I'm very grateful for your help. You even gave me a more effective way to store my result. Thanks a lot.
