split fastq by @SEQID
6.6 years ago
2nelly ▴ 310

Hi all,

I have a couple of FASTQ files containing reads whose names start with different prefixes, like:

@HWI-ST865:463:C7C8KACXX:2:2316:21016:100943 1:N:0:TAAGGCGA
@HWI-ST1178:227:C7C95ACXX:7:1101:1581:2125 1:N:0:TAAGGCGA

My question is: how can I split them into two parts? I tried some tools such as fastx_toolkit, but I could not create a proper barcode file. Is there an easy way to do this, such as a grep command? I also tried grep, but the output contained only the first line of each read and missed the other three.

sequencing next-gen • 2.2k views

So you want to split on the "HWI-STXXX" bit? Or every unique ID should be a different output file?


Probably a mixed data set. Of late some submitters have been merging data from multiple flowcells/machines into one file for SRA submission (beats me why they do it) and this could be a case of that sort.


Yes, this is exactly the case, but it was done accidentally. Two different people sequenced the same sample on two different sequencers without being aware of each other, and then they decided to merge the outputs.


Hahahah :D Awesome. Was it the exact same biological sample? If so, is the data publicly available? If so, I'd be interested in looking at the QC data, to see how much of an effect the sequencing machine etc. really has on the downstream statistics.


Yes, exactly the same sample, but different capture processes with the same kit (exome sequencing). Unfortunately the data are not publicly available... my boss will kill me if I do that! Sorry... hahaha. Anyway, I suppose this will affect the analysis anyhow, because the capture process was different even though they used the same kit and protocol. You know, sometimes things work almost 100% and sometimes not.


No worries man - there's more than enough data to go around :) And yeah, maybe a different capture process will highlight different exons better, who knows, it might not be a waste at all!


Split them by "HWI-STXXX"

6.6 years ago
Ram 37k

You can use either Heng Li's bioawk or grep -A 3. The former is a wrapper on awk to make it work with separators used in biological data formats, and the latter is a grep that picks up the matching line plus the 3 lines that follow it, i.e. a complete 4-line FASTQ record.
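For anyone landing here later, the grep -A 3 approach could look roughly like the sketch below. It assumes GNU grep, an uncompressed FASTQ with exactly 4 lines per record, and a hypothetical helper name (split_by_instrument) and file names that are not from the original post. One thing to watch: when matches are not adjacent, GNU grep prints a "--" line between groups, so the sketch passes --no-group-separator to keep the output a valid FASTQ.

```shell
# Hypothetical helper (name is illustrative, not from the thread).
# Assumes GNU grep and a FASTQ with exactly 4 lines per record.
# $1 = instrument prefix without "@" (e.g. HWI-ST865)
# $2 = merged input FASTQ
# $3 = output FASTQ for reads from that instrument
split_by_instrument() {
    # Anchor at line start and include the trailing ":" so that
    # HWI-ST1 cannot accidentally match HWI-ST1178.
    # -A 3 prints each matching header plus the 3 lines after it;
    # --no-group-separator (GNU grep) suppresses the "--" lines that
    # would otherwise appear between non-adjacent matches.
    grep -A 3 --no-group-separator "^@$1:" "$2" > "$3"
}
```

With a merged file called, say, merged.fastq, this would be run once per instrument: split_by_instrument HWI-ST865 merged.fastq HWI-ST865.fastq, then again with HWI-ST1178. Counting lines in the outputs (they should add up to the input and be divisible by 4) is a cheap sanity check.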


I did not know about the -A flag, awesome! Thank you Ram :)


You're welcome. There are also the -B (before) and -C (around) flags.


Ram, you're the best! grep -A 3 worked, quickly and accurately.

It was such a simple addition: just the -A parameter in my grep command.

Sequencing God bless you!


You're welcome. It is good that you were on the right track with grep. You may benefit from reading man grep and other such manuals when you have time - UNIX commands have a ton of features that are not evident at the outset.