Question: Java fastq String sequence reader
3.9 years ago, priess1991 wrote:

Hi!

I don't have much formal training in bioinformatics, so I'm hoping there is a better way to do this.

I want to test an index-based compression method whose performance scales strongly with the similarity between a sequence and a reference sequence. To do that I need some genomes from the 1000 Genomes Project. I downloaded some of them from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/1000genomes.sequence.index as .fastq files.

They are around 2.6 GB each, and each genome seems to be split into two .fastq files.

All I need is the raw sequence in form of one huge String. In order to do that I thought the best way might be to write a little program for this. The problem is that one .fastq file seems to have around 112,000,000 lines, and after one hour of compilation time I have checked only around 2,000,000 lines of the .fastq file. There must be a better way to get the raw String data of a human genome.

This is the code I'm using.

Thank you very much for your help!

 

import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.LineNumberReader;

public class ReadFastq {

	public String readFastq() throws IOException{
		int lineCount;
		String sequence = "";
		String line = "";
		
		LineNumberReader lnr = new LineNumberReader(
				new FileReader(
						new File(
								"C:\\Users\\Patrick\\Documents\\Bachelorarbeit\\genom\\ERR012616_2.fastq")));
		System.out.println("File opened successful!");
		lnr.skip(Long.MAX_VALUE);
		lineCount = lnr.getLineNumber();
		System.out.println("Input file contains " + lineCount
				+ " lines of information");
		
		lnr.close();
		
		
		
		BufferedReader input = new BufferedReader(
				new FileReader(
						new File(
								"C:\\Users\\Patrick\\Documents\\Bachelorarbeit\\genom\\ERR012616_2.fastq")));
		System.out.println("File opened successful!");
		String firstChar;
		for (int x = 0; x < lineCount; x++) {

			line = input.readLine();
			firstChar = ""+line.charAt(0);
			//System.out.println(firstChar);
			if(firstChar.contains("N") || firstChar.contains("A") || firstChar.contains("G") || firstChar.contains("T") || firstChar.contains("C")){
				sequence = sequence +line;
				System.out.println("Checked line "+x+" and added to Sequence");
			}
		}
		System.out.println("Sequence length: "+sequence.length());
		return sequence;
		
	}
}

 


"All I need is the raw sequence in form of one huge String. "

You cannot do this. A typical FASTQ file won't fit in memory, and why would you want to do this anyway?

"In order to do that I thought the best way might be to write a little program for this. The problem is that one .fastq file seems to have around 112,000,000 lines and after one hour of compilation time"

That is not compilation time; you mean execution time.

sequence = sequence +line;

This is the worst way to concatenate strings, because every iteration copies the whole string built so far. See java.lang.StringBuilder.

Why do you read the file twice when a single loop over a java.io.BufferedReader would do?

BufferedReader in=(...); while((line=in.readLine())!=null) { genome.append(line);}
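For illustration, the single-loop idea could be sketched roughly as below. This is only a sketch under assumptions: the class and method names (FastqSequences, concatenate) are made up for this example, and it assumes every record is exactly four lines (header, sequence, "+", quality) with no wrapped sequence lines. It picks the sequence by line position rather than by first character, because quality lines can also start with A, C, G, T or N.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class FastqSequences {

    /**
     * Reads FASTQ text once and concatenates the sequence line of each record.
     * Assumption: each record is exactly 4 lines, so the sequence is the
     * 2nd line of every group of 4.
     */
    public static String concatenate(Reader source) throws IOException {
        StringBuilder genome = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            long lineNumber = 0;
            while ((line = in.readLine()) != null) {
                if (lineNumber % 4 == 1) {   // sequence line of the record
                    genome.append(line);     // amortised O(1), no full copy
                }
                lineNumber++;
            }
        }
        return genome.toString();
    }

    public static void main(String[] args) throws IOException {
        // Tiny in-memory demo; a real file would use new FileReader(path).
        String demo = "@read1\nACGT\n+\nIIII\n@read2\nNNAC\n+\n@@@@\n";
        System.out.println(concatenate(new StringReader(demo))); // prints ACGTNNAC
    }
}
```

Note that a StringBuilder (like a String) cannot hold more than about 2^31 characters, so this only works for inputs well below that size; it does not make the "one huge String" idea viable for a whole sequencing run.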

 

written 3.9 years ago by Pierre Lindenbaum

First of all,

thanks for your answers!

I never studied this formally, but I have to deal with it now for my bachelor thesis, so all my knowledge is self-taught. Sorry for that.

I was aware of how .fastq files are structured, but I thought that if these two files contain all the raw reads of the sequenced genome, their combined content would be the genome's sequence data.

Unfortunately I need to use the 1000 Genomes Project data in order to reproduce some results of an important related work.

This question might seem a bit stupid, but what's the difference between the raw reads and "real" genome data?

I am aware that every letter of the raw data has a quality value which indicates how reliable that letter is.
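As a small illustration of what such a quality value means, here is a sketch that turns one quality character into a Phred score Q and an error probability p = 10^(-Q/10). It assumes the common Phred+33 (Sanger) encoding, which may not match every file, and the class and method names are made up for this example.

```java
public class PhredQuality {

    /** Phred score of one quality character, assuming Phred+33 encoding. */
    public static int phredScore(char qualityChar) {
        return qualityChar - 33;
    }

    /** Estimated probability that the base call is wrong: 10^(-Q/10). */
    public static double errorProbability(char qualityChar) {
        return Math.pow(10.0, -phredScore(qualityChar) / 10.0);
    }

    public static void main(String[] args) {
        // 'I' is ASCII 73, so Q = 40 under the Phred+33 assumption.
        char q = 'I';
        System.out.println("Q = " + phredScore(q)
                + ", error probability = " + errorProbability(q));
    }
}
```

So a base with quality character 'I' would be wrong roughly once in 10,000 calls, while '!' (Q = 0) carries no confidence at all.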

 

 

written 3.9 years ago by priess1991

This is not a forum; it is a Q&A site. New questions belong in new threads.

I want to say there are no stupid questions; you're just new to the field and the technologies. You need to do some research on how "next-gen sequencing" works. Read up on sequence alignment and why it matters. Some questions about that may already have been asked and answered here, but since they're foundational rather than technical questions, you might find better answers on Wikipedia.

 

written 3.9 years ago by karl.stamm

3.9 years ago, Joseph Pearson (UNC Chapel Hill) wrote:

If you aren't wedded to Java, this command line should work (I think):

zcat MyFastq.fastq.gz | grep -A 1 "^@" --no-group-separator| grep -v "^@" | tr -d '\n' > MyLongSequence.txt


zcat decompresses your fastq file.

The first grep finds lines starting with '@' and prints each such line plus the line after it (which contains your sequence).

The second grep drops the lines starting with '@' and prints all other lines (your sequence).

tr deletes the newlines, and the result is written into your file.

However, this does not take available memory into account. You could break your fastq into chunks and append each chunk's output to your MyLongSequence.txt file.
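If you do want to stay in Java, the same idea can be sketched as a stream that writes each sequence line straight to the output, so only one line is held in memory at a time. This is a sketch under assumptions: FastqStreamer and writeSequences are hypothetical names, and records are assumed to be exactly four lines each. It selects the sequence by line position instead of matching "^@", since '@' is also a valid quality character and a quality line can start with it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

public class FastqStreamer {

    /**
     * Copies the sequence line of every 4-line FASTQ record to 'out'
     * without newlines and returns the number of bases written.
     * The caller owns both streams and is responsible for closing them.
     */
    public static long writeSequences(Reader fastq, Writer out) throws IOException {
        BufferedReader in = new BufferedReader(fastq);
        long bases = 0;
        long lineNumber = 0;
        String line;
        while ((line = in.readLine()) != null) {
            if (lineNumber % 4 == 1) {   // 2nd line of each record
                out.write(line);
                bases += line.length();
            }
            lineNumber++;
        }
        out.flush();
        return bases;
    }

    public static void main(String[] args) throws IOException {
        // In-memory demo; note the quality line of read1 starts with '@'.
        String demo = "@read1\nACGT\n+\n@III\n@read2\nGG\n+\nII\n";
        StringWriter out = new StringWriter();
        long n = writeSequences(new StringReader(demo), out);
        System.out.println(n + " bases: " + out); // prints 6 bases: ACGTGG
    }
}
```

For a gzipped file, the Reader could be built by wrapping a FileInputStream in java.util.zip.GZIPInputStream and an InputStreamReader, and the Writer could be a BufferedWriter around a FileWriter.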

written 3.9 years ago by Joseph Pearson

3.9 years ago, Brian Bushnell (Walnut Creek, USA) wrote:

Paired fastq files do not contain the human genome. They contain raw reads, which include headers, quality scores, and so forth. Turning them into giant strings would not be useful for any purpose. I suggest you look at the contents of the files before trying to use them.

Then, if you are interested in compression, download some bacterial genomes (assembled genomes, not raw reads). For example, E. coli has many sequenced strains available at places like NCBI.
