Question

split fastq by @SEQID

1

8.1 years ago

2nelly ▴ 310

Hi all,

I have a couple of FASTQ files containing reads whose names start with different prefixes, for example:

@HWI-ST865:463:C7C8KACXX:2:2316:21016:100943 1:N:0:TAAGGCGA
@HWI-ST1178:227:C7C95ACXX:7:1101:1581:2125 1:N:0:TAAGGCGA

My question is: how can I split them into two parts? I tried some tools such as fastx_toolkit, but I could not create a proper barcode file. Is there an easy way to do this, such as a grep command? I also tried grep, but the output contained only the first line of each read and missed the other three.

Thank you in advance!

sequencing next-gen • 2.8k views

ADD COMMENT • link updated 8.1 years ago by Ram 43k • written 8.1 years ago by 2nelly ▴ 310

0

So you want to split on the "HWI-STXXX" bit? Or every unique ID should be a different output file?

ADD REPLY • link 8.1 years ago by John 13k

2

Probably a mixed data set. Of late some submitters have been merging data from multiple flowcells/machines into one file for SRA submission (beats me why they do it) and this could be a case of that sort.

ADD REPLY • link 8.1 years ago by GenoMax 141k

1

Yes, this is exactly the case, but it was done accidentally. Two different people sequenced the same sample on two different sequencers without being aware of it, and then they decided to merge the outputs.

ADD REPLY • link 8.1 years ago by 2nelly ▴ 310

0

Hahahah :D Awesome. Was it the exact same biological sample? If so, is the data publicly available? If so, I'd be interested in looking at the QC data, to see how much of an effect the sequencing machine etc. really has on the downstream statistics.

ADD REPLY • link 8.1 years ago by John 13k

1

Yes, exactly the same sample, but different capture processes with the same kit (exome sequencing). Unfortunately the data are not publicly available... my boss will kill me if I do that! Sorry... hahaha... Anyway, I suppose this will affect the analysis anyhow, because the capture process was different even though they used the same kit and protocol. You know, sometimes things work almost 100% and sometimes not.

ADD REPLY • link 8.1 years ago by 2nelly ▴ 310

0

No worries man - there's more than enough data to go around :) And yeah, maybe a different capture process will highlight different exons better, who knows, it might not be a waste at all!

ADD REPLY • link 8.1 years ago by John 13k

0

Split them by "HWI-STXXX"

ADD REPLY • link 8.1 years ago by 2nelly ▴ 310

Ram · Accepted Answer · 2016-03-04

4

8.1 years ago

Ram 43k

You can use either Heng Li's bioawk or grep -A 3. The former is an extension of awk that understands common biological data formats such as FASTA/FASTQ, and the latter is grep with an option that prints each matching line plus the 3 lines that follow it.
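A minimal sketch of the grep approach, assuming GNU grep and strict four-line FASTQ records (no wrapped sequence lines). The file names and the toy reads are placeholders; only the two instrument prefixes come from the question. With -A, GNU grep prints a `--` line between non-adjacent groups of matches, so the GNU-specific `--no-group-separator` option is used here to keep the output valid FASTQ. A plain awk variant (not bioawk) is included as a record-aware alternative that only inspects every fourth line.

```shell
# Build a tiny merged FASTQ so the sketch is self-contained (placeholder data).
workdir=$(mktemp -d)
cat > "$workdir/merged.fastq" <<'EOF'
@HWI-ST865:463:C7C8KACXX:2:2316:21016:100943 1:N:0:TAAGGCGA
ACGT
+
IIII
@HWI-ST1178:227:C7C95ACXX:7:1101:1581:2125 1:N:0:TAAGGCGA
TTGA
+
FFFF
@HWI-ST865:463:C7C8KACXX:2:2316:21017:100950 1:N:0:TAAGGCGA
GGCC
+
@@@@
EOF

# grep approach: print each matching header plus the 3 lines after it.
# --no-group-separator (GNU grep only) suppresses the "--" lines that -A
# would otherwise print between non-adjacent matches.
grep -A 3 --no-group-separator '^@HWI-ST865:' "$workdir/merged.fastq" \
    > "$workdir/HWI-ST865.fastq"
grep -A 3 --no-group-separator '^@HWI-ST1178:' "$workdir/merged.fastq" \
    > "$workdir/HWI-ST1178.fastq"

# Record-aware alternative in plain awk: only lines 1, 5, 9, ... (the headers)
# choose the output file, so a quality line that starts with "@" is never
# mistaken for a header. The ".awk.fastq" suffix is just to keep the two
# sets of outputs apart in this sketch.
awk -v dir="$workdir" '
    NR % 4 == 1 { split($0, f, ":"); out = dir "/" substr(f[1], 2) ".awk.fastq" }
    { print > out }
' "$workdir/merged.fastq"
```

On a real file, a quick sanity check is that each output has a line count divisible by 4 and that the counts add up to that of the merged file.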

ADD COMMENT • link 8.1 years ago by Ram 43k

1

I did not know about the -A flag, awesome! Thank you Ram :)

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 8.1 years ago by John 13k

2

You're welcome. There are also the -B (before) and -C (around) flags.
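As a small illustration of those flags on FASTQ, here is a sketch that looks a record up by its sequence rather than by its header. The read names and file names are made up for the example, and four-line records are assumed.

```shell
# Toy two-record FASTQ (hypothetical read names).
workdir=$(mktemp -d)
printf '%s\n' '@read1' 'ACGT' '+' 'IIII' \
              '@read2' 'TTGA' '+' 'FFFF' > "$workdir/demo.fastq"

# -B 1 adds the line before the match (the header) and -A 2 the two lines
# after it, so matching on a sequence line returns the whole record.
grep -B 1 -A 2 '^TTGA$' "$workdir/demo.fastq" > "$workdir/record.fastq"

# -C 1 is shorthand for -B 1 -A 1: one line of context on each side.
grep -C 1 '^TTGA$' "$workdir/demo.fastq" > "$workdir/context.txt"
```

Here record.fastq holds the full @read2 record, while context.txt holds only the header, the sequence, and the "+" line.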

ADD REPLY • link 8.1 years ago by Ram 43k

0

Ram, you're the best!!!! grep -A 3 worked, quickly and accurately.

It was such a simple addition of the -A parameter to my grep command.

Sequencing God bless you!

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 8.1 years ago by 2nelly ▴ 310

0

You're welcome. It is good that you were on the right track with grep. You may benefit from reading man grep and other such manuals when you have time - UNIX commands have a ton of features that are not evident at the outset.

ADD REPLY • link 8.1 years ago by Ram 43k