Question

How to find newline characters in a FASTQ file? Is it essential to remove them?

2.5 years ago

Fizzah ▴ 30

Hello there. I am a beginner in data analysis and want to ask:

How can I find newline characters in a FASTQ file?
Is it essential to remove them? If not, what impact can they have on data analysis?

FASTQ • 1.7k views

updated 7 months ago by Ram 43k • written 2.5 years ago by Fizzah ▴ 30

score 1 · Answer 1 · 2021-10-13

2.5 years ago

Mensur Dlakic ★ 27k

Newline characters are, for practical purposes, invisible. As the name implies, they sit at the end of each line and are the equivalent of hitting the Enter/Return key on a keyboard, or pushing the lever on a typewriter to move the cylinder up by one line. Just as you don't see newline characters in MS Word when you type (unless you turn on the option that shows them), they are invisible to the eye in most file types. In Linux, the newline is a single ASCII character, number 10 (0a hexadecimal, the line feed). Under Windows it is actually two characters: a carriage return followed by a line feed (13 + 10, or 0d + 0a).

I don't know why you would need to find them or remove them, since they are interpreted by most programs exactly as they should be.
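For readers who still want to see the line endings for themselves, here is a minimal sketch using standard tools. The file names and the `count_cr_lines` helper are made up for illustration; the demo files are tiny one-read FASTQs written with Unix and Windows line endings respectively.

```shell
# Work in a scratch directory; both demo files are hypothetical examples.
tmpdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' > "$tmpdir/demo_unix.fastq"
printf '@read1\r\nACGT\r\n+\r\nIIII\r\n' > "$tmpdir/demo_dos.fastq"

# Dump the first two lines byte by byte: od -c prints a line feed as \n
# and a carriage return as \r, so the otherwise invisible characters show up.
head -n 2 "$tmpdir/demo_dos.fastq" | od -c

# Count the lines that end in a carriage return (Windows-style endings).
# awk splits on the line feed, so any \r is left at the end of the line.
count_cr_lines() {
    awk '/\r$/ { n++ } END { print n + 0 }' "$1"
}

count_cr_lines "$tmpdir/demo_unix.fastq"
count_cr_lines "$tmpdir/demo_dos.fastq"
```

A count of zero means the file uses plain Unix newlines, which is what most tools expect. A non-zero count points to Windows-style endings, which some tools treat as an extra character at the end of each line.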

2.5 years ago by Mensur Dlakic ★ 27k

Well, you might want to remove newlines internal to the sequences and quality strings... but files with those are rather uncommon.

2.5 years ago by swbarnes2 14k

Actually, when I run the read-count command with and without removing newline characters, it gives me different output.

Initially I used this command for the read count (from the thread "Counting Number Of Bases In A Fastq File"):

cat file.fq.gz | paste - - - - | cut -f2 | wc -c

After that I ran the following command, which is used to remove newline characters (from the same thread):

cat test.fastq | paste - - - - | cut -f 2 | tr -d '\n' | wc -c

The outputs of the two commands were different, which made me assume that my FASTQ file probably does contain these characters.

2.5 years ago by Fizzah ▴ 30

I think you are stuck on this unnecessarily because you are using the wrong tool to count bases. wc -c counts ALL characters, which includes the newline at the end of every line. The number of newlines needs to be subtracted from the total number of characters, and that is what I suggested in the other post you started:

awk '{if (NR % 4 == 0) print $0}' myfile.fastq | wc | awk '{print ($3-$1)}'

But really you should be using proper tools to count bases which ignore newline characters, unlike wc.
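As a rough illustration of the point about wc, the sketch below counts bases by summing the length of each sequence line, so newlines never enter the total. It assumes the usual four-line FASTQ records (no wrapped sequences). The demo file and the `count_bases` and `count_reads` helpers are hypothetical names, not part of any existing tool.

```shell
# Hypothetical two-read FASTQ: 4 bases + 6 bases.
tmpdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n' > "$tmpdir/demo.fastq"

# Sum the length of line 2 of every 4-line record (the sequence line).
# A trailing carriage return, if present, is stripped before measuring.
count_bases() {
    awk 'NR % 4 == 2 { sub(/\r$/, ""); total += length($0) } END { print total + 0 }' "$1"
}

# Number of reads = number of lines divided by 4.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

count_bases "$tmpdir/demo.fastq"
count_reads "$tmpdir/demo.fastq"

# For comparison: wc -c on the same sequence lines also counts one
# newline per read, so it exceeds count_bases by the number of reads.
awk 'NR % 4 == 2' "$tmpdir/demo.fastq" | wc -c
```

The gap between the wc -c figure and the base count equals the number of reads, which is the same gap seen between a pipeline with and without tr -d '\n'. For a gzipped file, the same awk program can be fed from gzip -dc instead of reading the file directly.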

2.5 years ago by Mensur Dlakic ★ 27k

I am a beginner in this domain, which is why I got confused when people suggested three different approaches: wc -c, wc -l, or the command you suggested. I didn't know which would give me the best output, as I was getting different outputs for a single file. That's why I asked, to understand the actual reason behind it... Thank you for patiently clearing up my concepts about it.

2.5 years ago by Fizzah ▴ 30

You assume you have internal newlines? Did you look? Most fastqs don't have them, except where they belong, at the end of each element of the fastq entry.
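One possible way to "look" is sketched below, assuming the standard layout of four lines per record: header starting with @, sequence, separator starting with +, quality string of the same length as the sequence. A sequence or quality string broken by an internal newline shifts every later line out of position, so a positional check will flag it. The `check_fastq_layout` function and the demo files are hypothetical.

```shell
tmpdir=$(mktemp -d)
# A well-formed record, and one whose sequence and quality are wrapped
# over two lines each (i.e. they contain internal newlines).
printf '@r1\nACGT\n+\nIIII\n' > "$tmpdir/good.fastq"
printf '@r1\nAC\nGT\n+\nII\nII\n' > "$tmpdir/wrapped.fastq"

# Report every line that does not fit the 4-line layout.
# Exit status is 0 if the file looks fine and 1 otherwise.
check_fastq_layout() {
    awk '
        { sub(/\r$/, "") }   # ignore Windows carriage returns
        NR % 4 == 1 && substr($0, 1, 1) != "@" { printf "line %d: expected @ header\n", NR; bad = 1 }
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 3 && substr($0, 1, 1) != "+" { printf "line %d: expected + separator\n", NR; bad = 1 }
        NR % 4 == 0 && length($0) != seqlen { printf "line %d: quality length differs from sequence length\n", NR; bad = 1 }
        END {
            if (NR % 4 != 0) { printf "line count %d is not a multiple of 4\n", NR; bad = 1 }
            exit bad + 0
        }
    ' "$1"
}

check_fastq_layout "$tmpdir/good.fastq" && echo "good.fastq: layout looks fine"
check_fastq_layout "$tmpdir/wrapped.fastq" || echo "wrapped.fastq: layout problems found"
```

The check relies on line position rather than on the first character alone, because a quality string may legitimately begin with @ or +.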

2.5 years ago by swbarnes2 14k

It is not a matter of internal newlines. wc counts newlines at the end as well, and I don't think the poster was using the wc command properly. It shouldn't be used for counting bases unless one knows the internal workings of that particular command. There are better commands for that particular task, which I have outlined in a different post.

2.5 years ago by Mensur Dlakic ★ 27k