6.8 years ago
louis.gil007
Hi, I'm trying to concatenate FastQ files. My doubt is: is there an end-of-line character at the end of every FastQ file? Is it even necessary, or do all programs just use regular expressions to know which read is which (matching on the @ character)?
Files made in Windows and ancient Mac versions pose unique challenges. Linux uses \n for end of line. Windows uses \r\n because hey, why not bloat everything? And the ancient Mac format was \r only, because hey, why bother with standards? Modern Mac computers use the Linux convention. Windows... well, it will never be compliant with modern uses, so software just has to support it as a special case, which makes parsing everything slower.
So, as long as your files were all generated on the same platform, just concatenate them. If some of them were generated on Windows or ancient Mac versions, I suggest you reformat them to Unix standard prior to concatenation.
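A minimal sketch of that reformat-then-concatenate step, in case dos2unix or similar tools are not at hand. The function names `normalize_newlines` and `concat_fastq` are made up for illustration, and the sketch assumes uncompressed files small enough to read into memory in one go.

```python
def normalize_newlines(data: bytes) -> bytes:
    # Convert Windows (CR LF) and old Mac (CR) line endings to Unix (LF).
    # The CR LF replacement must run first so it is not split into two newlines.
    data = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    # Ensure the content ends with a newline, so the first header of the
    # next file cannot end up glued to the last quality line of this one.
    if data and not data.endswith(b"\n"):
        data += b"\n"
    return data


def concat_fastq(paths, out_path):
    # Hypothetical helper: write every input file, normalized, into out_path.
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as handle:
                out.write(normalize_newlines(handle.read()))
```

Working on bytes rather than text avoids Python's own newline translation, so the output contains exactly the line endings written here.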
What is the actual problem you're having? You can concatenate FastQ files like any other text file with
cat *.fastq > all_fastqs.fastq
I know how to concatenate; my doubts are about aligners and other tools that read these FastQ files. Sometimes FastQ files end with a newline character and sometimes they don't. When you use cat and there is no \n at the end of the first file, the first line of the next file gets appended to the same line. How do programs read these files? Do they use that delimiter, or do they just use regular expressions?
I've never encountered aligners/assemblers etc that had issues with concatenated fastqs. I'm pretty sure they just ignore the EOF character. Your mileage may vary though. If you're using programs which are less robust, you may find you need to get your hands dirty removing those kinds of invisibles.
Thanks. The people who need these files concatenated are, I believe, going to use the STAR aligner (not sure). I won't be doing that myself; I just wanted to ensure the data was not being corrupted by concatenating the files.
This probably just means that a standard aligner treats the "@" character as both the start of one record and the end of the previous one, except for the last record in the file, where it uses "@" and the end of the file.
I am afraid you need to brush up on the following concepts before moving any further: 1) FastQ is a text file: each line is delimited by a newline, and usually an EOF <end-of-file> marker is present, which says the file is finished. 2) Regular expressions: http://www.regular-expressions.info/
You really need to read and understand the gory details of FastQ format https://en.wikipedia.org/wiki/FASTQ_format
FastQ is a plain-text format, and as such each line of a record is delimited by a newline.
Using just @ (at the beginning of a line) is not enough. It is safer to use part of the machine serial number (e.g. @M0123) since a quality score line can also start with @, a valid Q score.
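To illustrate why "@" alone is ambiguous, here is a rough sketch of a reader that relies on line position instead. It assumes every record is exactly four lines (header, sequence, "+" line, quality), which is the usual layout from current sequencers but is not guaranteed by the format; `read_fastq` is a name invented for this example and is not how any particular aligner is known to work.

```python
import io
from itertools import islice


def read_fastq(handle):
    # Yield (name, sequence, quality) tuples, taking four lines per record.
    while True:
        record = [line.rstrip("\r\n") for line in islice(handle, 4)]
        if not record:
            return  # clean end of file
        if len(record) < 4:
            raise ValueError("truncated record at end of file")
        header, seq, plus, qual = record
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


# The first quality string starts with "@", a valid quality character.
example = io.StringIO("@read1\nACGT\n+\n@III\n@read2\nTTGA\n+\nIIII\n")
records = list(read_fastq(example))
```

Because the reader counts lines, the quality line "@III" is never mistaken for a header. A file whose first header got glued to the previous quality line during concatenation would fail one of the checks rather than being silently misread.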