Question

Sorting fq data into groups with missing barcode info?

0

Entering edit mode

5.5 years ago

jorja • 0

10/17/18 clean up file names and added reads per file

Use "header" instead of reads. each Header means each @M70287:XXXX...

Dear all,

I am trying to split a pooled bacteria Illumina sequencing data by samples.

i.e., we have 20 samples. The results from the 20 samples are grouped in one fq file. We are tring to spit it into 20 fq files, one for each sample.

We also have a separate barcode file.

And here are examples of every file I got from them:

Demultiplex_sheet: (total 20 sampleID)

SampleID,   BarcodeSequence,    LinkerPrimerSequence,   ReversePrimer,  Description

W.1860  ATTGCCCAGATG    GGACTACHVGGGTWTCTAAT    GTGCCAGCMGCCGCGGTAA W.1860

Merged_Reads fq: (total header: 776721)

@M70287:117:000000000-B5B22:1:2110:5816:15958 1:N:0:
TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGACTGTCAAGTCAGCGGTAAAATACGGGGGCTCAACCTCCGCCCGCCGTTGAAACTGACGGTCTTGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCCCGACTGACGCTGAAGCACGAAAGCGTGGGTATCGAACAGG
+
CCCCBABCCFFFGGGGGEGGGGHFGFGGHGHHHGFEGGHHGHHGHGGGGEGGHGGGGGHHHFGHHHFHHGGEFGDGHHHHGGGGGFHHGGHHHHGGJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJBBBB

Merged_Barcodes.fq: (total header: 63169)

@M70287:117:000000000-B5B22:1:1101:25927:6491 2:N:0:
GGTTAACAGGAA
+
CCBBCFFFFCFF

Raw_read1.fq (total header: 1157939)

@M70287:117:000000000-B5B22:1:1101:14478:1753 1:N:0:
TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGTCTGTTAAGTCAGCGGTCAAATCCCGGGGCTCAACCCCGGCCCGCCGTTGAAACTGGCAGTCTCGAGTTGGAGAGAAGTATGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACCCCGATTGCGAAGGCAGCCTGCCAAGCCATGACTGACGCTGATGCACGAAAGCGTGGGGATCAAACA
+
AAA>11>11CCFA10E0EEAGFHC00BEHEGH11/BFFEDD11/B/E//AA/?>EEEE0FGDE1GF2@@//E>EEF2B>F////>CCGCFCAA///>@//@-AC..>1F=1/</.CCD0/0.DE0/0//..//C0C::A--9CFFFB??EGGGFB@-B??BBFFBFFF/BFBFFBF9-9-9/;9AB-=@FE/B----99:AEFFFABB-9A-B/;9FB/9-9A9-999BBE-:?B9ABF###########

Raw_read2_Barcode.fq (total header: 1157939)

@M70287:117:000000000-B5B22:1:1101:14478:1753 1:N:0:
TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGTCTGTTAAGTCAGCGGTCAAATCCCGGGGCTCAACCCCGGCCCGCCGTTGAAACTGGCAGTCTCGAGTTGGAGAGAAGTATGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACCCCGATTGCGAAGGCAGCCTGCCAAGCCATGACTGACGCTGATGCACGAAAGCGTGGGGATCAAACA
+
AAA>11>11CCFA10E0EEAGFHC00BEHEGH11/BFFEDD11/B/E//AA/?>EEEE0FGDE1GF2@@//E>EEF2B>F////>CCGCFCAA///>@//@-AC..>1F=1/</.CCD0/0.DE0/0//..//C0C::A--9CFFFB??EGGGFB@-B??BBFFBFFF/BFBFFBF9-9-9/;9AB-=@FE/B----99:AEFFFABB-9A-B/;9FB/9-9A9-999BBE-:?B9ABF###########

Raw_read3.fq (total header: 1157939)

@M70287:117:000000000-B5B22:1:1101:14478:1753 3:N:0:
CCTGTTTGATCCCCACGCTTTCGTGCATCAGCGTCAGTCATGGCTTGTCTGGCTTCCTTCTCCATCTTGGTTCTTCCTTCTTTCTTTTCCTTTCCCCTCTCCACCCCGCATTCCTCCTACTTCTCTCCCACTCCATACTTCCCGTTTCCACTGCCGGCCCGCTTTGTTTCCCCCCCCCTTCCCCCCCCCCTTTCCCCCCCCCTCCCCCCCCCTTCCTCCCCCCTCTTCCCCTCCCCTTCCCTCTCCCTCC
+
AAAAAFFBFFFDGEGE?EGFGGEAACGGFFDFBEEF2GB3EA3A222135D3B21AFAFG3355555D53AB5A5@555@B5D@D5@@53BFH@3BE2BB3B3??1/////?4F433B333B?F343B3?00B01120121B0//@FF######################################################################################################

Centroid info:

'W.1509_90;Unc01fwp;size=668;\n'
'TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGACTGTCAAGTCAGCGGTAAAATACGGGGGCTCAACCTCCGCCCGCCGTTGAAACTGACGGTCTTGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCCCAACTGACGCTGAAGCACGAAAGCGTGGGTATCGAACAGG\n'

from Read_QC

Raw Stats
Count: 21
Total: 1157985
Min: 8
Max: 107330
Mean: 55142.1428571429
Med: 65102
StdDev: 24050.1290805084
CoV: 0.436147886795318

Merged Stats:
Count: 21
Total: 521908
Min: 1
Max: 47063
Mean: 24852.7619047619
Med: 26797
StdDev: 10628.1543033441
CoV: 0.42764479634385

Mapped Stats:
Count: 20
Total: 368435
Min: 9379
Max: 35605
Mean: 18421.75
Med: 19744.5
StdDev: 7332.66925392793
CoV: 0.398044119257287

My feeling is that I cannot do the job because the company that sequencing our data has deleted the index portion in the sequence headers.

But is this so? If I am wrong (hopefully), can you point to me the right way to do it?

I am thinking of using the XY coordinate in the header file and the barcode file to align the two. But the reads do not seem to match? Any suggestions on how should I proceed?

Thank you!

Jorja

next-gen sequencing • 1.9k views

ADD COMMENT • link updated 5.5 years ago by Istvan Albert 100k • written 5.5 years ago by jorja • 0

2

Entering edit mode

You don't say if these are custom barcodes or regular Illumina barcodes. Also, as the headers from the barcode and reads do not match, either something went really wrong, or the files were processed somehow and you didn't tell us.

If these are regular Illumina barcodes, ask the sequencing provider to demultiplex the samples for you. If these are custom barcodes, explain in detail how should they work.

ADD REPLY • link 5.5 years ago by h.mon 35k

0

Entering edit mode

These are regular Illumina barcodes. We are running this by ourselves because we cannot reach the provider any more... So I am basically on my own.

When you said that the barcode and reads do not match, do you mean their header line does not match?

This is my bad. I did not paste the matching one (the barcode and the sequence are in different sequence. Here I pasted the header of the first line of each file)

But in fact they are matching, I can find identical ones.

My original thought is I use this part: @M70287:117:000000000-B5B22:1:1101:25927:6491

to match the barcode to read, then match back to the info file that link barcode with each sample.

Base on your thoughts and what genomax suggested me to read, I believe this is how people matching them?

However, the number of reads find by this method does not match the QC file that they gave us.

For example, one sample is said to have 27000 reads, but mine can be 30000, while the other said to be 10000, mine can be 40000. They do not match in number and not even in the same ratio either... That makes me wonder what I did wrong...

ADD REPLY • link 5.5 years ago by jorja • 0

1

Entering edit mode

No, most people don't break out their fastqs into 20 individual ones by the headers. Most people use a sample sheet to tell bcl2fastq what sample goes with what sequencing index sequence, and the demultiplexing is done as the fastqs are generated using the data in the index read. There are non-standard library preps that might put the sample barcode in read 2 instead of the usual index position, we have no idea if you did that. The "2" in the header of your "barcode.fq" suggests that you did, but we can't be sure if you don't tell us.

ADD REPLY • link 5.5 years ago by swbarnes2 14k

0

Entering edit mode

Thank you, swbarnes2!

So I was only given this pile of data without knowing what the sequencing service did...

Assume what you said is true (non-standard library preps that might put the sample barcode in read 2 instead of the usual index position), how should I proceed?

ADD REPLY • link 5.5 years ago by jorja • 0

0

Entering edit mode

You need to have some basic information about what was done before anyone can help you. I can't even tell from your question how many sequence reads you have, and how many index reads you have. Find that out, and put that in your question.

ADD REPLY • link 5.5 years ago by swbarnes2 14k

0

Entering edit mode

I just did! And I put the first read of every file that I can find in the folder. Hope that helps.

And also I put the QC file they generated for us too.

ADD REPLY • link 5.5 years ago by jorja • 0

2

Entering edit mode

Take a look at the answer in: A: Demultiplexing Illumina data

I don't know what centroid info refers to but I have a suspicion that it is not going to be useful here.

ADD REPLY • link 5.5 years ago by GenoMax 141k

0

Entering edit mode

Thank you, genomax!

I agree centroid info seems not important. Put it here just in case... Nice to confirm.

Your link is very useful. Apparently, it means I cannot simply separate files based on their header...

I will try the package in the link!

ADD REPLY • link 5.5 years ago by jorja • 0

0

Entering edit mode

I am very sad to see that there are 89 views and no answer yet...

Can someone maybe point to me how I can improve my post to make it clearer to the readers?

Or what additional info I need to put in?

Thanks!

ADD REPLY • link 5.5 years ago by jorja • 0

0

Entering edit mode

Your barcoding must be non-standard, so how can anyone understand it if you don't explain it?

ADD REPLY • link 5.5 years ago by swbarnes2 14k

0

Entering edit mode

What do you me by non-standard? This is my first time saw this kind of files, so not much to compare with....

ADD REPLY • link 5.5 years ago by jorja • 0

score 1 · Answer 1 · 2018-10-17

1

Entering edit mode

5.5 years ago

Istvan Albert 100k

If you have all three files then you can match back the barcodes by sequence ID. I believe that the id of the sequences will match from barcode to reads since physically they are all on the same location (coordinate).

Depending how many reads you have you may need to either write a simple program to load up one file in memory or you need an approach re-pair your reads, for example, linearize both so that the data fits on one column, then sort and join them from command line.

Once you paired the reads with barcodes you will have your data restored.

ADD COMMENT • link 5.5 years ago by Istvan Albert 100k

0

Entering edit mode

I have already linked a tool that will do that here: C: Sorting fq data into groups with missing barcode info?

ADD REPLY • link 5.5 years ago by GenoMax 141k

0

Entering edit mode

I was thinking that since the barcodes are not missing, perhaps a simple approach could work, something like this

cat reads.fq | bioawk -c fastx  '{ print $name, $seq }' | sort -k1,1 > a
cat barcodes.fq | bioawk -c fastx  '{ print $name, $seq }' | sort -k1,1 > b
join a b > success.txt

then parse the success.txt and make a fastq out of it :-)

ADD REPLY • link 5.5 years ago by Istvan Albert 100k

0

Entering edit mode

Thank you, Albert!

Glad to find someone suggesting the same thing I tried too!

But things might not be that easy... Because what I got by this method does not match the QC report they gave us (the read numbers). Makes me wonder what I did was wrong...

ADD REPLY • link 5.5 years ago by jorja • 0

1

Entering edit mode

Forget about the QC report, it is not clear what it contains.

What matters is whether your read sequence ids are among the barcode sequence ids. If that is the case then you assign the barcodes to sequences.

ADD REPLY • link 5.5 years ago by Istvan Albert 100k

0

Entering edit mode

Super! That is very nice to know.

Thank you, Albert!

I will do it this way then!

ADD REPLY • link 5.5 years ago by jorja • 0

0

Entering edit mode

Hi genomax,

I believe you mean using deML right?

Based on reading the document of the deML, I believe it using a method that is not a simple location matching, right?

It suggested maximum likelihood method that is similar to sequence mapping? I believe you have the right point, (and this answer to Istvan Albert why I am suspicious about my sorting results based on the method you suggested)

The first thing I tried was the xy coordination method like Istvan Albert suggested (the 14478:1753 in the header line), ie. using XY to find barcode, use barcode seq to sort samples. But this approach has two problems: 1) the sample groups that I got had different size comparing to QC file (not in ratio either); 2) in barcode files, there are way more than 20 types of sequences (I do not remember, but somewhere around 60).

If we are using max likelyhood then it make sense because there are barcode reads that were wrong. So we need to somehow find the "best match" to the sequence, not the "exact match", in order to sort every reads in the file.

Do I got it right?

(deML is using Linux but I am using mac... so I am more leaning towards understanding the method and use python to do it, but if it too complicated, I will add an ubuntu to my mac and try it....)

ADD REPLY • link 5.5 years ago by jorja • 0

1

Entering edit mode

deML should work on macOS since it has its origins in BSD unix.

That said. I don't understand the data you posted above. Raw_read2_Barcode.fq does not look like any index read data I have seen so far. On other hand Merged_Barcodes.fq does look like a real Illumina index read sequences file. What does Merged_Reads fq refer to? Is that R1/R2 reads concatenated end to end?

the sample groups that I got had different size comparing to QC file

If you are doing something of your own I would not worry about the QC file. As long as you understand what you are doing you can trust your results.

ADD REPLY • link 5.5 years ago by GenoMax 141k

0

Entering edit mode

These are the data I got and that basically this is it... My guess is like your guess that they are two ends, but does not seems so...

Let me search the sequence coordinate and compare the of the merged one vs raw read and report to you!

ADD REPLY • link 5.5 years ago by jorja • 0