BioJava FastaReaderHelper reads only 2814 of 11374 ProteinSequence entries
8.4 years ago

Hi everyone:

I have been trying to read a FASTA file containing 11374 protein sequences, but only the first 2814 sequences are read. I am using biojava-4.1.0, and the line that reads the sequences is:

LinkedHashMap<String, ProteinSequence> entries = FastaReaderHelper.readFastaProteinSequence(file);

I know that all sequences are different because I created the file with different values. My Fasta file is here

I previously tried biojava3-core-3.0.8 and biojava3-core-3.1.0 with the same reading code and got the same number of sequences.

Any help is welcome.

EDIT: Any other alternative for reading and writing Fasta files is also accepted.

sequence biojava reader fasta • 1.9k views
8.4 years ago

I can replicate the issue. It appears biojava doesn't like the pipe char | in the sequence name. Or rather, I suspect it trims the sequence name to the first |. So when a duplicate name is found, the existing entry in the HashMap is silently replaced.
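
To illustrate the suspected mechanism without BioJava, here is a minimal plain-Java sketch. The helper `keyBeforeFirstPipe` is hypothetical: it only mimics the behaviour I suspect (keeping the text before the first |) and is not BioJava's actual code. The headers and sequences are made up.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PipeKeyCollision {

    // Hypothetical stand-in for the suspected behaviour:
    // keep only the part of the header before the first '|'.
    static String keyBeforeFirstPipe(String header) {
        int pipe = header.indexOf('|');
        return pipe < 0 ? header : header.substring(0, pipe);
    }

    // Each record is {header, sequence}. put() replaces an existing
    // value for the same key without any warning.
    static Map<String, String> load(List<String[]> records) {
        Map<String, String> entries = new LinkedHashMap<>();
        for (String[] rec : records) {
            entries.put(keyBeforeFirstPipe(rec[0]), rec[1]);
        }
        return entries;
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
                new String[] {"AMP|seq1", "MKV"},
                new String[] {"AMP|seq2", "GLF"},
                new String[] {"OTHER|seq3", "AAK"});
        Map<String, String> entries = load(records);
        // prints: 3 records, 2 map entries
        System.out.println(records.size() + " records, " + entries.size() + " map entries");
        // prints: {AMP=GLF, OTHER=AAK}
        System.out.println(entries);
    }
}
```

The second AMP record overwrites the first one, so the map ends up smaller than the file, which matches the symptom of reading 2814 entries out of 11374.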

If you remove the pipe chars from the sequence names, then your code will read all the sequences:

sed 's/|/_/g' sequences_87394380194.fasta > tmp.fa

Then this will do:

public static void main(String[] args) throws IOException {
    File file = new File("/Users/berald01/Downloads/tmp.fa");
    LinkedHashMap<String, ProteinSequence> entries = FastaReaderHelper.readFastaProteinSequence(file);
    System.out.println(entries.size()); // 11374
}

It strikes me that no warning or anything is issued though!

If interested, here's my implementation to read fasta files BioJava/FASTA file help
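
Separately from the linked implementation, here is a minimal plain-Java sketch of a reader that avoids the silent overwrite. It assumes that using the full header line as the key is acceptable and that a duplicate header should be an error; the class name `SimpleFastaReader` and the sample data are made up for illustration, and it returns plain strings rather than ProteinSequence objects.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

public class SimpleFastaReader {

    // Reads FASTA records keyed by the full header line (without the leading '>').
    // Throws instead of silently replacing an entry when a header repeats.
    static Map<String, String> read(Reader source) throws IOException {
        Map<String, StringBuilder> buffers = new LinkedHashMap<>();
        StringBuilder current = null;
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                if (line.startsWith(">")) {
                    String header = line.substring(1);
                    current = new StringBuilder();
                    if (buffers.putIfAbsent(header, current) != null) {
                        throw new IOException("Duplicate FASTA header: " + header);
                    }
                } else if (current == null) {
                    throw new IOException("Sequence data before first header");
                } else {
                    current.append(line); // join multi-line sequences
                }
            }
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : buffers.entrySet()) {
            result.put(e.getKey(), e.getValue().toString());
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        String fasta = ">AMP|seq1\nMKV\nLLA\n>AMP|seq2\nGLF\n";
        Map<String, String> entries = read(new StringReader(fasta));
        System.out.println(entries.size());          // prints: 2
        System.out.println(entries.get("AMP|seq1")); // prints: MKVLLA
    }
}
```

For a real file, pass `java.nio.file.Files.newBufferedReader(path)` as the source instead of the StringReader.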

8.4 years ago

I finally solved it. The problem was in how I created the file: because of the issue described above, headers that shared the same first token collided. So I now rewrite each header to start with a unique first token, like this:

File tmpFolder = new File(System.getProperty("user.dir"), "tmp");
tmpFolder.mkdirs();

long h = 0;
int i = 1;
for (Map.Entry<String, ProteinSequence> entry : uniqueSequences.entrySet()) {                
    StringTokenizer st = new StringTokenizer(entry.getValue().getOriginalHeader(), "|");
    //st.nextToken(); //Ignore first because it is going to be replaced

    StringBuilder sb = new StringBuilder();
    sb.append(String.format("AMP_%d|", i++));

    while (st.hasMoreElements()) {
        sb.append(st.nextToken()).append(st.hasMoreElements()?"|":"");
    }

    entry.getValue().setOriginalHeader(sb.toString());
    h += entry.getKey().hashCode();
}

File sequencesFile = new File(tmpFolder, "sequences_" + h + FASTA_EXT);

boolean notExists = sequencesFile.createNewFile();
if (notExists) FastaWriterHelper.writeProteinSequence(sequencesFile, uniqueSequences.values());

Thank you @dariober for your reply.
