How to find newline characters in a FASTQ file? Is it essential to remove them?
2.5 years ago
Fizzah ▴ 30

Hello there, I am a beginner in data analysis and want to ask:

  1. How do I find newline characters in a FASTQ file?
  2. Is it essential to remove them? If not, what impact can they have on data analysis?
FASTQ • 1.7k views
2.5 years ago
Mensur Dlakic ★ 27k

Newline characters are, for practical purposes, invisible. As the name implies, they sit at the end of each line and are the equivalent of hitting the Enter/Return key on a keyboard, or pushing the lever on a typewriter to advance the paper by one line. Just as you don't see newline characters in MS Word when you type (unless you turn that particular function on), they are invisible to the eye in most file types. In Linux, the ASCII character numbered 10 (0a hexadecimal, line feed) is the newline character, while under Windows it is actually two characters: carriage return followed by line feed (13 + 10, or 0d + 0a).

I don't know why you would need to find them or remove them, since they are interpreted by most programs exactly as they should be.
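If you do want to look at them, dumping the bytes of a file makes the newlines visible. This is a minimal sketch, assuming a POSIX shell with `od`, `tr` and `wc` available; `demo.fastq` is a hypothetical two-record file created only for illustration.

```shell
# Build a tiny two-record FASTQ (hypothetical demo file) with Unix line endings.
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCCA\n+\nIIIII\n' > demo.fastq

# Print the first record byte by byte; each newline shows up as \n.
# A file saved under Windows would show \r \n at the end of each line instead.
head -n 4 demo.fastq | od -c

# Count line feeds (LF) and carriage returns (CR) in the whole file.
nl_count=$(tr -cd '\n' < demo.fastq | wc -c | tr -d ' ')
cr_count=$(tr -cd '\r' < demo.fastq | wc -c | tr -d ' ')
echo "LF: $nl_count  CR: $cr_count"
```

A CR count of zero means the file has plain Unix line endings, and an LF count equal to four times the number of reads means there are no extra newlines inside the sequence or quality strings.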


Well, you might want to remove newlines internal to the sequences and quality strings, but files with those are rather uncommon.


Actually, when I run the base-count command with and without removing newline characters, it gives me different output.

Initially I used this command for the count (from the post "Counting Number Of Bases In A Fastq File"):

zcat file.fq.gz | paste - - - - | cut -f2 | wc -c

After that I ran the following command, which removes the newline characters (also from "Counting Number Of Bases In A Fastq File"):

cat test.fastq | paste - - - - | cut -f 2 | tr -d '\n' | wc -c

The outputs of the two commands were different, which made me assume that my FASTQ file probably does contain these characters.


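I think you are stuck on this unnecessarily because you are using the wrong tool to count bases. wc -c counts ALL characters, which includes newlines. The number of newlines needs to be subtracted from the total number of characters, and that is what I suggested in the other post you started. The command below takes every fourth line (the quality string, which has the same length as its sequence) and subtracts the line count from the character count: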

awk '{if (NR % 4 == 0) print $0}' myfile.fastq | wc | awk '{print ($3-$1)}'

But really you should use a proper tool for counting bases, one that ignores newline characters, unlike wc.
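One way to sidestep the newline question without wc is to let awk add up the length of each sequence line, since awk strips the line terminator before measuring. This is only a sketch, assuming a standard four-line-per-record FASTQ; `demo.fastq` is a hypothetical file created for illustration.

```shell
# Hypothetical two-record FASTQ: 4 bases in the first read, 5 in the second.
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCCA\n+\nIIIII\n' > demo.fastq

# Line 2 of every 4-line record is the sequence.
# sub() drops a trailing carriage return in case the file has Windows line endings.
bases=$(awk 'NR % 4 == 2 { sub(/\r$/, ""); total += length($0) } END { print total + 0 }' demo.fastq)
echo "bases: $bases"
```

For a gzipped file the same awk program can read from `zcat file.fq.gz` through a pipe instead of taking a file name.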


I am a beginner in this domain, which is why I got confused when people suggested three different commands: wc -c, wc -l, or the one you suggested. I didn't know which would give me the best output, as I was getting different outputs for a single file. That's why I asked, to understand the actual reason behind it. Thank you for patiently clearing up my concepts about it.


You assume you have internal newlines? Did you look? Most fastqs don't have them, except where they belong, at the end of each element of the fastq entry.


It is not a matter of internal newlines. wc counts newlines at the end as well, and I don't think the poster was using the wc command properly. It shouldn't be used for counting bases unless one knows the internal workings of that particular command. There are better commands for that particular task, which I have outlined in a different post.
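The discrepancy between the two counts from earlier in the thread can be reproduced on a small file: the version without `tr -d '\n'` is larger by exactly one character per read. A sketch, assuming a POSIX shell and a hypothetical `demo.fastq` built for illustration:

```shell
# Hypothetical two-record FASTQ: 4 + 5 = 9 bases in total.
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCCA\n+\nIIIII\n' > demo.fastq

# wc -c on the sequence column counts the bases plus one newline per read.
with_nl=$(paste - - - - < demo.fastq | cut -f 2 | wc -c | tr -d ' ')

# Deleting the newlines first leaves only the bases.
without_nl=$(paste - - - - < demo.fastq | cut -f 2 | tr -d '\n' | wc -c | tr -d ' ')

# Number of reads = number of lines divided by 4.
reads=$(( $(wc -l < demo.fastq) / 4 ))

echo "with newlines: $with_nl, without: $without_nl, reads: $reads"
```

So a difference between the two numbers says nothing about stray newlines inside the reads; it is simply the read count.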
