How to find newline characters in a FASTQ file? Is it essential to remove them?
1
0
1 day ago
Fizzah • 0

Hello there. I am a beginner in data analysis and want to ask:

1. How can I find newline characters in a FASTQ file?
2. Is it essential to remove them? If not, what impact can they have on data analysis?
character newline • 295 views
1
1 day ago
Mensur Dlakic ★ 14k

Newline characters are, for practical purposes, invisible. As the name implies, they sit at the end of each line and are the equivalent of hitting the Enter/Return key on a keyboard, or pushing the lever on a typewriter to move the cylinder up by one line. Just as you don't see newline characters in MS Word when you type (unless you turn that particular function on), they are invisible to the eye in most file types. In Linux, the ASCII character numbered 10 (0a hexadecimal, line feed) is the newline character, while under Windows a line ends with two characters: a carriage return followed by a line feed (13 + 10, or 0d + 0a).
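If you do want to see them, one option is to dump the bytes of a file. What follows is only a minimal sketch, assuming a POSIX shell with `printf`, `od`, `tr` and `wc`; the file names `unix.txt` and `windows.txt` and the helper `count_cr` are made up for the illustration.

```shell
# Build two tiny demo files (hypothetical names) with known line endings.
tmpdir=$(mktemp -d)
printf 'ACGT\n'   > "$tmpdir/unix.txt"      # Linux style: LF only
printf 'ACGT\r\n' > "$tmpdir/windows.txt"   # Windows style: CR then LF

# od -c prints every byte; a newline is shown as \n, a carriage return as \r.
od -c "$tmpdir/unix.txt"
od -c "$tmpdir/windows.txt"

# Count carriage-return bytes: keep only \r characters, then count them.
# The final tr strips the padding some wc implementations add.
count_cr() {
    tr -cd '\r' < "$1" | wc -c | tr -d ' '
}

count_cr "$tmpdir/unix.txt"      # 0
count_cr "$tmpdir/windows.txt"   # 1
```

A non-zero result from `count_cr` on a FASTQ file would suggest Windows-style line endings, which is the one case where some tools may need the file converted first.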

I don't know why you would need to find them or remove them, since they are interpreted by most programs exactly as they should be.

0

Well, you might want to remove newlines internal to the sequences and quality strings... but files with those are rather uncommon.
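A rough way to check for such internal newlines is to test whether the file keeps the usual four-lines-per-record layout. This is only a sketch under that assumption; `check_fastq_layout` and the two demo files are invented for the illustration, and it is a heuristic rather than a full FASTQ validator.

```shell
# Demo data (made up): one well-formed record, and one record whose
# sequence and quality strings are each wrapped over two lines.
tmpdir=$(mktemp -d)
printf '@r1\nACGTACGT\n+\nIIIIIIII\n' > "$tmpdir/good.fastq"
printf '@r1\nACGT\nACGT\n+\nIIII\nIIII\n' > "$tmpdir/wrapped.fastq"

# Print every line that breaks the 4-line layout; prints nothing if all is well.
check_fastq_layout() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { print "line " NR ": expected @ header" }
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 3 && substr($0, 1, 1) != "+" { print "line " NR ": expected + separator" }
        NR % 4 == 0 && length($0) != seqlen   { print "line " NR ": quality and sequence lengths differ" }
        END { if (NR % 4 != 0) print "line count " NR " is not a multiple of 4" }
    ' "$1"
}

check_fastq_layout "$tmpdir/good.fastq"      # no output
check_fastq_layout "$tmpdir/wrapped.fastq"   # reports the misplaced lines
```

If this prints nothing for your file, the only newlines are the ones that belong there, at the end of each of the four lines of a record.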

0

Actually, when I run the read-count command with and without removing newline characters, it gives me different output.

Initially I used this command for the count (from the post "Counting Number Of Bases In A Fastq File"):

cat file.fq.gz | paste - - - - | cut -f2 | wc -c

After that I ran the following command, which is used to remove newline characters (from the same post):

cat test.fastq | paste - - - - | cut -f 2 | tr -d '\n' | wc -c

The outputs from the two commands were different, which made me assume that my FASTQ file probably does contain these characters.

1

I think you are stuck on this unnecessarily because you are using the wrong tool to count bases. wc -c counts ALL characters, which includes the newline at the end of every line. The number of newlines needs to be subtracted from the total number of characters, and that is what I have suggested in the other post you started (the command below reads every fourth line, the quality string, which has the same length as the sequence):

awk '{if (NR % 4 == 0) print $0}' myfile.fastq | wc | awk '{print ($3-$1)}'


But really you should be using proper tools to count bases which ignore newline characters, unlike wc.
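To make the difference concrete, here is a small sketch on a made-up two-read file. The names `demo.fastq`, `seq_lines` and `count_bases` are invented for this illustration, and it assumes four-line records with Unix line endings.

```shell
# Two reads of 4 and 6 bases (hypothetical demo data).
tmpdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n' > "$tmpdir/demo.fastq"

# Sequence lines only (line 2 of every 4-line record).
seq_lines() { awk 'NR % 4 == 2' "$1"; }

# wc -c also counts the newline ending each sequence line: 10 bases + 2 newlines.
seq_lines "$tmpdir/demo.fastq" | wc -c                # 12

# Deleting the newlines first leaves only the bases.
seq_lines "$tmpdir/demo.fastq" | tr -d '\n' | wc -c   # 10

# Summing line lengths in awk never counts the newlines in the first place.
count_bases() { awk 'NR % 4 == 2 { n += length($0) } END { print n + 0 }' "$1"; }
count_bases "$tmpdir/demo.fastq"                      # 10
```

The two `wc -c` results differ by exactly the number of reads, one newline per sequence line, so a difference between them says nothing about extra newlines inside the reads.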

0

I am a beginner in this domain, which is why I got confused when people suggested three different commands: wc -c, wc -l, or the one you suggested. I didn't know which would give me the right output, as I was getting different outputs for a single file. That's why I asked, to understand the actual reason behind it. Thank you for patiently clearing up my concepts.

0

You assume you have internal newlines? Did you look? Most fastqs don't have them, except where they belong, at the end of each element of the fastq entry.

0

It is not a matter of internal newlines. wc counts newlines at the end as well, and I don't think the poster was using the wc command properly. It shouldn't be used for counting bases unless one knows the internal workings of that particular command. There are better commands for that particular task, which I have outlined in a different post.