Question

Java fastq String sequence reader

0

Entering edit mode

8.5 years ago

priess1991 • 0

Hi!

I'm not schooled very much in bioinformatics that's why I'm hoping there is a better way to do this.

I want to test an index-based compression method which highly scales with the similarity between a sequence and a reference-sequence. In order to do that I need some genomes from the 1000 human genome project. I downloaded some of them from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/1000genomes.sequence.index as .fastq files.

Around 2.6 gb each. They seem to be split in 2 .fastq files each.

All I need is the raw sequence in form of one huge String. In order to do that I thought the best way might be to write a little program for this. The problem is that one .fastq file seems to have around 112000000 lines and after one hour of compilation time I have checked around 2000000 lines of the .fastq file. There needs to be a better way to get the raw String data of an human genome.

That's the code I'm using.

Thank you very much for help!

import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.LineNumberReader;

public class ReadFastq {

    public String readFastq() throws IOException{
        int lineCount;
        String sequence = "";
        String line = "";

        LineNumberReader lnr = new LineNumberReader(
                new FileReader(
                        new File(
                                "C:\\Users\\Patrick\\Documents\\Bachelorarbeit\\genom\\ERR012616_2.fastq")));
        System.out.println("File opened successful!");
        lnr.skip(Long.MAX_VALUE);
        lineCount = lnr.getLineNumber();
        System.out.println("Input file contains " + lineCount
                + " lines of information");

        lnr.close();

        BufferedReader input = new BufferedReader(
                new FileReader(
                        new File(
                                "C:\\Users\\Patrick\\Documents\\Bachelorarbeit\\genom\\ERR012616_2.fastq")));
        System.out.println("File opened successful!");
        String firstChar;
        for (int x = 0; x < lineCount; x++) {

            line = input.readLine();
            firstChar = ""+line.charAt(0);
            //System.out.println(firstChar);
            if(firstChar.contains("N") || firstChar.contains("A") || firstChar.contains("G") || firstChar.contains("T") || firstChar.contains("C")){
                sequence = sequence +line;
                System.out.println("Checked line "+x+" and added to Sequence");
            }
        }
        System.out.println("Sequence length: "+sequence.length());
        return sequence;

    }
}

sequencing • 3.1k views

ADD COMMENT • link updated 19 months ago by Ram 43k • written 8.5 years ago by priess1991 • 0

0

Entering edit mode

"All I need is the raw sequence in form of one huge String."

you cannot do this. For a common Fastq file it won't fit in memory and why would you do this?

In order to do that i thought the best way might be to write a little program for this. The problem is that one .fastq file seems to have around 112,000,000 lines and

"after one hour of compilation time"

no compilation time, you meant execution.

sequence = sequence +line; ## this is the worst way to concatenate string. see java.util.StringReader or java.lang.StringBuilder.

why do you read the file twice when you could use only one loop using a java.io.BufferedReader?

BufferedReader in=(...); while((line=in.readLine())!=null) { genome.append(line);}

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

At first,

thanks for your answers!

I never studied this, but have to deal with it now in connection with my bachelor thesis, so all my knowledge is self-induced. Sorry for that.

I was aware of how the .fastq-files are structured, but I thought if if this two files contain all the raw reads of the sequenced genome, the entire sum would be the genomes sequence data.

Unfortunately I need to use the 1000 genome project data in order to reproduce some results of an important related work.

This question might seem a bit stupid but what's the difference between the raw reads and an "real" genome data?

I aware of that every letter of the raw data has a quality value which determines how exact this letter is.

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by priess1991 • 0

1

Entering edit mode

This is not a forum. This is Q&A. If you have new questions they go in new threads.

I want to say there are no stupid questions; you're just new to the field and the technologies. Need to do some research on how "next gen sequencing" works, Read up on sequence alignment and why it's a thing that matters. There might be some questions about that here already asked and answered, but since they're foundational questions rather than technical ones, you might find better answers at wikipedia.

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by karl.stamm 4.1k

Ram · Answer 1 · 2015-10-08

1

Entering edit mode

8.5 years ago

Joseph Pearson ▴ 480

If you aren' wedded to Java, this command-line should work (I think):

zcat MyFastq.fastq.gz | grep -A 1 "^@" --no-group-separator| grep -v "^@" | tr -d '\n' > MyLongSequence.txt

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by Joseph Pearson ▴ 480

0

Entering edit mode

zcat decompresses your fastq file

the first grep will find lines starting with '@', and print that line and the following line (containing your sequence)

the second grep will find lines starting with '@', but will NOT print those lines, but prints all other lines (your sequence)

tr will delete newlines, writing into your file

However, this is not taking into account available memory. You could break your fastq into chunks, and append your MyLongSequence.txt file.

ADD REPLY • link 8.5 years ago by Joseph Pearson ▴ 480

Ram · Answer 2 · 2015-10-08

Paired fastq files do not contain the human genome. They contain raw reads, which include headers, quality scores, and so forth. Turning them into giant strings would not be useful for any purpose. I suggest you look at the contents of the files before trying to use them.

Then, if you are interested in compression, download some bacterial genomes (not raw reads). For example, E.coli has many sequenced strains available at places like NCBI.