Question

How to tell if a FASTQ file is a concatenation of 2 separate Illumina runs?

0

6.5 years ago

landrjos ▴ 20

Hi All,

I have a FASTQ file left behind by a student, and I suspect it is a concatenation of a HiSeq run and a smaller MiSeq run. How can I determine whether this is the case? All reads are 101 bp long, so the read-length distribution does not distinguish the two.

sequence • 2.3k views

ADD COMMENT • link updated 6.5 years ago by GenoMax 141k • written 6.5 years ago by landrjos ▴ 20

1

Were the headers edited, or did the student retain the original ones?

ADD REPLY • link 6.5 years ago by lakhujanivijay 5.8k

1

Can you run "head my.fastq" and "tail my.fastq" on your file and paste the results here? One should be able to make a fair guess based on that.

Strictly speaking, however, it is not possible to determine this in all situations. FASTQ is a terrible file format for metadata.

ADD REPLY • link 6.5 years ago by John 13k

1

FASTQ is a terrible file format

Fixed that for you.

ADD REPLY • link 6.5 years ago by WouterDeCoster 47k

0

Hahah, hey man :)

ADD REPLY • link 6.5 years ago by John 13k

score 1 · Answer 1 · 2017-10-28

1

6.5 years ago

GenoMax 141k

Since you are referring to data from two different sequencer types, the following should work. Every flowcell carries a unique barcode (the flowcell ID), so reads from two different sequencers will carry two different flowcell barcodes in their headers. The following also assumes that the FASTQ headers have not been modified in any way.

Put the following code in a file (barcode.awk):

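# Split header lines on ":" so that $3 is the flowcell barcode
BEGIN { FS = ":" }

# Every 4th line, starting at line 1, is a read header
(NR % 4) == 1 { barcodes[$3]++ }

# Print each flowcell barcode with its read count
END {
    for (bc in barcodes) {
        print bc ": " barcodes[bc]
    }
}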

Then run it like this: zcat your.fastq.gz | awk -f barcode.awk. It should report each flowcell barcode present in the file, along with the number of reads for each. If your data is not compressed, use cat your.fastq | awk -f barcode.awk instead.
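For anyone who prefers Python, the same count can be sketched as below. This is only an illustration, not part of the original answer: it assumes unmodified Illumina (Casava 1.8+) headers, where the first colon-delimited field is the instrument ID and the third is the flowcell ID, and it assumes strict 4-line FASTQ records. The names open_fastq and count_flowcells are made up for this example.

```python
import gzip
from collections import Counter


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def count_flowcells(lines):
    """Count reads per (instrument ID, flowcell ID) pair.

    Assumes 4-line records and colon-delimited Illumina headers,
    e.g. @INSTRUMENT:RUN:FLOWCELL:LANE:TILE:X:Y ...
    """
    counts = Counter()
    for i, line in enumerate(lines):
        if i % 4 != 0:  # headers are lines 1, 5, 9, ... of the file
            continue
        fields = line.rstrip("\n").lstrip("@").split(":")
        if len(fields) < 3:
            # Header does not look like an Illumina header; tally it separately
            counts[("unparsed", "unparsed")] += 1
            continue
        counts[(fields[0], fields[2])] += 1
    return counts


# Example usage (the file name is a placeholder):
# with open_fastq("your.fastq.gz") as handle:
#     for (instrument, flowcell), n in count_flowcells(handle).items():
#         print(instrument, flowcell, n)
```

Reporting the instrument ID next to the flowcell can help, since two flowcells alone could also come from two runs on the same machine. Instrument IDs often hint at the model, but treat that as a convention rather than a guarantee.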

ADD COMMENT • link 6.5 years ago by GenoMax 141k

0

Thanks a lot,

This script reports the flowcell for each run. Is there a modification that would make it report the barcode or index sequence instead?

ADD REPLY • link 6.5 years ago by landrjos ▴ 20

0

If you replace $3 above with $10, it will report index sequences. You can't tell from this information alone whether you have data from multiple sequencers unless you have some prior knowledge of the contents of the pools.
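As a rough sketch of combining the two fields, the Python below tallies index sequences per flowcell, which shows which indexes came from which run. It rests on the same assumptions as the awk approach: unmodified Illumina headers in which splitting on ":" puts the flowcell in field 3 and the index sequence in field 10, and 4-line records. The function name count_indexes_by_flowcell is hypothetical.

```python
from collections import Counter


def count_indexes_by_flowcell(lines):
    """Count reads per (flowcell ID, index sequence) pair.

    Mirrors awk's $3 and $10 with FS=":"; headers with fewer than
    10 colon-delimited fields are skipped.
    """
    counts = Counter()
    for i, line in enumerate(lines):
        if i % 4 != 0:  # only the first line of each 4-line record is a header
            continue
        fields = line.rstrip("\n").split(":")
        if len(fields) < 10:
            continue  # no index field in this header
        counts[(fields[2], fields[9])] += 1
    return counts


# Example usage (the file name is a placeholder):
# with open("your.fastq") as handle:
#     for (flowcell, index), n in sorted(count_indexes_by_flowcell(handle).items()):
#         print(flowcell, index, n)
```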

ADD REPLY • link 6.5 years ago by GenoMax 141k

0

I think that with both the flowcell and barcode information from the FASTQ files I can figure it out. Thanks for your help.

ADD REPLY • link 6.5 years ago by landrjos ▴ 20