My annual question here: I received 17 large FASTQ files (each between 5.27 and 7.89 GB) of transcriptome data from a collaborator. What they sent me has no header information for any of the sequences in the files. Here are the first two (truncated) sequences of the first file:
@No name
CAAGCGTGTCACCTATACCCCTCCGCCGGGGCAAAA
+
????????DDBDDDDDFFFFF9CEEHCECHHHBFHFH
@No name
CCAACTGCTGTTCACACGGAACCTTTCCCCACTTCAG
+
BBBDDBDDDDDFFFFEFIIIIHIIIHHIIHHHHIIIF
Before I QC and do any downstream analyses, I would like to give each sequence a number so I can reference each one later (e.g., controls and treatments). I want something like this:
@1_1
CAAGCGTGTCACCTATACCCCTCCGCCGGGGCAAAA
+
????????DDBDDDDDFFFFF9CEEHCECHHHBFHFH
@1_2
CCAACTGCTGTTCACACGGAACCTTTCCCCACTTCAG
+
BBBDDBDDDDDFFFFEFIIIIHIIIHHIIHHHHIIIF
and so on for the millions of sequences. I need to make sure every sequence gets its own header name that corresponds to that particular sequence, so when I rename the next file its headers need a prefix unique to that file.
This shouldn't be that hard, but I searched the forum and elsewhere and can't find a solution. I've tried to grep the files and I get a memory error. I tried to use bioawk (with the -c flag) on the giant gzipped file, but the renaming is erratic.
Admittedly, I really need to work on my awk skills. Can anyone provide a quick awk one-liner to rename these headers in the unzipped files that would use minimal memory? Thanks so much for your help.
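
For reference, here is a minimal sketch of the kind of thing being asked for, not a tested solution for these particular files. It assumes every record is exactly four lines (header, sequence, "+", quality) with no wrapped sequences, and it picks header lines by position rather than by a leading "@", since a quality line can also start with "@". The function name rename_fastq_headers and the file names in the usage comments are hypothetical. awk reads one line at a time, so memory use should stay small regardless of file size.

```shell
# Hypothetical helper: renumber FASTQ headers read from stdin.
# $1 is a per-file prefix (e.g. 1 for the first file, 2 for the second).
# Assumes strict 4-line FASTQ records.
rename_fastq_headers() {
    awk -v prefix="$1" '
        # Line 1 of every 4-line record is the header: replace it with @prefix_count
        NR % 4 == 1 { print "@" prefix "_" (++n); next }
        # Sequence, "+", and quality lines pass through unchanged
        { print }
    '
}

# Example usage (file names are placeholders):
#   rename_fastq_headers 1 < file1.fastq > file1.renamed.fastq
#   gzip -dc file2.fastq.gz | rename_fastq_headers 2 | gzip > file2.renamed.fastq.gz
```

Changing the prefix for each file keeps the headers unique across all 17 files, and piping through gzip avoids having to unzip the files on disk first.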