10/17/18: cleaned up file names and added reads per file.
Use "header" instead of reads. Each header means one @M70287:XXXX... line.
Dear all,
I am trying to split pooled bacterial Illumina sequencing data by sample.
i.e., we have 20 samples. The results from the 20 samples are grouped in one fq file. We are trying to split it into 20 fq files, one for each sample.
We also have a separate barcode file.
And here are examples of every file I got from them:
Demultiplex_sheet: (total 20 sampleID)
SampleID, BarcodeSequence, LinkerPrimerSequence, ReversePrimer, Description
W.1860 ATTGCCCAGATG GGACTACHVGGGTWTCTAAT GTGCCAGCMGCCGCGGTAA W.1860
Merged_Reads fq: (total header: 776721)
@M70287:117:000000000-B5B22:1:2110:5816:15958 1:N:0:
TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGACTGTCAAGTCAGCGGTAAAATACGGGGGCTCAACCTCCGCCCGCCGTTGAAACTGACGGTCTTGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCCCGACTGACGCTGAAGCACGAAAGCGTGGGTATCGAACAGG
+
CCCCBABCCFFFGGGGGEGGGGHFGFGGHGHHHGFEGGHHGHHGHGGGGEGGHGGGGGHHHFGHHHFHHGGEFGDGHHHHGGGGGFHHGGHHHHGGJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJBBBB
Merged_Barcodes.fq: (total header: 63169)
@M70287:117:000000000-B5B22:1:1101:25927:6491 2:N:0:
GGTTAACAGGAA
+
CCBBCFFFFCFF
Raw_read1.fq (total header: 1157939)
@M70287:117:000000000-B5B22:1:1101:14478:1753 1:N:0:
TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGTCTGTTAAGTCAGCGGTCAAATCCCGGGGCTCAACCCCGGCCCGCCGTTGAAACTGGCAGTCTCGAGTTGGAGAGAAGTATGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACCCCGATTGCGAAGGCAGCCTGCCAAGCCATGACTGACGCTGATGCACGAAAGCGTGGGGATCAAACA
+
AAA>11>11CCFA10E0EEAGFHC00BEHEGH11/BFFEDD11/B/E//AA/?>EEEE0FGDE1GF2@@//E>EEF2B>F////>CCGCFCAA///>@//@-AC..>1F=1/</.CCD0/0.DE0/0//..//C0C::A--9CFFFB??EGGGFB@-B??BBFFBFFF/BFBFFBF9-9-9/;9AB-=@FE/B----99:AEFFFABB-9A-B/;9FB/9-9A9-999BBE-:?B9ABF###########
Raw_read2_Barcode.fq (total header: 1157939)
@M70287:117:000000000-B5B22:1:1101:14478:1753 1:N:0:
TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGTCTGTTAAGTCAGCGGTCAAATCCCGGGGCTCAACCCCGGCCCGCCGTTGAAACTGGCAGTCTCGAGTTGGAGAGAAGTATGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACCCCGATTGCGAAGGCAGCCTGCCAAGCCATGACTGACGCTGATGCACGAAAGCGTGGGGATCAAACA
+
AAA>11>11CCFA10E0EEAGFHC00BEHEGH11/BFFEDD11/B/E//AA/?>EEEE0FGDE1GF2@@//E>EEF2B>F////>CCGCFCAA///>@//@-AC..>1F=1/</.CCD0/0.DE0/0//..//C0C::A--9CFFFB??EGGGFB@-B??BBFFBFFF/BFBFFBF9-9-9/;9AB-=@FE/B----99:AEFFFABB-9A-B/;9FB/9-9A9-999BBE-:?B9ABF###########
Raw_read3.fq (total header: 1157939)
@M70287:117:000000000-B5B22:1:1101:14478:1753 3:N:0:
CCTGTTTGATCCCCACGCTTTCGTGCATCAGCGTCAGTCATGGCTTGTCTGGCTTCCTTCTCCATCTTGGTTCTTCCTTCTTTCTTTTCCTTTCCCCTCTCCACCCCGCATTCCTCCTACTTCTCTCCCACTCCATACTTCCCGTTTCCACTGCCGGCCCGCTTTGTTTCCCCCCCCCTTCCCCCCCCCCTTTCCCCCCCCCTCCCCCCCCCTTCCTCCCCCCTCTTCCCCTCCCCTTCCCTCTCCCTCC
+
AAAAAFFBFFFDGEGE?EGFGGEAACGGFFDFBEEF2GB3EA3A222135D3B21AFAFG3355555D53AB5A5@555@B5D@D5@@53BFH@3BE2BB3B3??1/////?4F433B333B?F343B3?00B01120121B0//@FF######################################################################################################
Centroid info:
'W.1509_90;Unc01fwp;size=668;\n'
'TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGACTGTCAAGTCAGCGGTAAAATACGGGGGCTCAACCTCCGCCCGCCGTTGAAACTGACGGTCTTGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCCCAACTGACGCTGAAGCACGAAAGCGTGGGTATCGAACAGG\n'
from Read_QC
Raw Stats
Count: 21
Total: 1157985
Min: 8
Max: 107330
Mean: 55142.1428571429
Med: 65102
StdDev: 24050.1290805084
CoV: 0.436147886795318
Merged Stats:
Count: 21
Total: 521908
Min: 1
Max: 47063
Mean: 24852.7619047619
Med: 26797
StdDev: 10628.1543033441
CoV: 0.42764479634385
Mapped Stats:
Count: 20
Total: 368435
Min: 9379
Max: 35605
Mean: 18421.75
Med: 19744.5
StdDev: 7332.66925392793
CoV: 0.398044119257287
My feeling is that I cannot do the job because the company that sequenced our data has deleted the index portion of the sequence headers.
But is this so? If I am wrong (hopefully), can you point me to the right way to do it?
I am thinking of using the X/Y coordinates in the read headers and in the barcode file headers to align the two, but the headers do not seem to match. Any suggestions on how I should proceed?
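Whether the headers in two of these files actually line up can be checked directly. Below is a minimal sketch, assuming plain 4-line FASTQ records (no wrapped sequence lines) and that the read ID is the header text before the first space. The function names `read_ids` and `compare_ids` are made up for illustration.

```python
from itertools import islice


def read_ids(fastq_path):
    """Yield the read ID of each 4-line FASTQ record.

    The ID is the header text before the first space, without the leading '@'.
    Reading in blocks of four lines avoids mistaking a quality line that
    starts with '@' for a header.
    """
    with open(fastq_path) as handle:
        while True:
            record = list(islice(handle, 4))
            if len(record) < 4:
                break
            yield record[0].split()[0][1:]


def compare_ids(reads_path, barcodes_path):
    """Count distinct read IDs in each file and how many occur in both."""
    read_set = set(read_ids(reads_path))
    barcode_set = set(read_ids(barcodes_path))
    return {
        "reads": len(read_set),
        "barcodes": len(barcode_set),
        "shared": len(read_set & barcode_set),
    }
```

Calling `compare_ids` on the merged reads file and the merged barcodes file would show how many headers the two files have in common. A small `shared` count would mean the two files cannot be paired by header alone.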
Thank you!
Jorja
You don't say if these are custom barcodes or regular Illumina barcodes. Also, as the headers from the barcode and reads do not match, either something went really wrong, or the files were processed somehow and you didn't tell us.
If these are regular Illumina barcodes, ask the sequencing provider to demultiplex the samples for you. If these are custom barcodes, explain in detail how they are supposed to work.
These are regular Illumina barcodes. We are running this by ourselves because we cannot reach the provider any more... So I am basically on my own.
When you said that the barcode and reads do not match, do you mean their header line does not match?
This is my bad. I did not paste matching records (the barcodes and the reads are in a different order in their files, and here I pasted the first record of each file).
But in fact they do match; I can find identical headers.
My original thought was to use this part of the header: @M70287:117:000000000-B5B22:1:1101:25927:6491
to match each barcode to its read, and then match the barcode back to the info file that links each barcode to a sample.
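The header-matching idea can be sketched as follows. This is only an illustration under several assumptions: 4-line FASTQ records, a sheet whose first line is a header and whose first two columns are SampleID and BarcodeSequence, and exact barcode matches with no mismatch tolerance. The names `fastq_records`, `load_sheet`, `demultiplex`, and the `demux_` output prefix are hypothetical.

```python
from collections import Counter
from itertools import islice


def fastq_records(path):
    """Yield (read_id, four_lines) for each 4-line FASTQ record."""
    with open(path) as handle:
        while True:
            lines = list(islice(handle, 4))
            if len(lines) < 4:
                break
            # Read ID: header text before the first space, without '@'.
            yield lines[0].split()[0][1:], lines


def load_sheet(path):
    """Map barcode sequence -> sample ID.

    Assumes the first line is a header and that columns are separated by
    whitespace and/or commas, with SampleID first and BarcodeSequence second.
    """
    barcode_to_sample = {}
    with open(path) as handle:
        next(handle, None)  # skip the header line
        for line in handle:
            fields = line.replace(",", " ").split()
            if len(fields) >= 2:
                barcode_to_sample[fields[1]] = fields[0]
    return barcode_to_sample


def demultiplex(reads_path, barcodes_path, sheet_path, out_prefix="demux_"):
    """Write one FASTQ per sample and return the number of reads per sample."""
    barcode_to_sample = load_sheet(sheet_path)
    # Read ID -> barcode sequence, taken from the barcode FASTQ.
    id_to_barcode = {
        rid: lines[1].strip() for rid, lines in fastq_records(barcodes_path)
    }
    counts = Counter()
    outputs = {}
    try:
        for rid, lines in fastq_records(reads_path):
            barcode = id_to_barcode.get(rid)
            # Reads with no barcode record or an unknown barcode are kept apart.
            sample = barcode_to_sample.get(barcode, "unassigned")
            if sample not in outputs:
                outputs[sample] = open(f"{out_prefix}{sample}.fq", "w")
            outputs[sample].writelines(lines)
            counts[sample] += 1
    finally:
        for handle in outputs.values():
            handle.close()
    return counts
```

The returned counts can be compared against the per-sample numbers in the QC file. The size of the `unassigned` bin shows how many reads had no usable barcode under these assumptions.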
Based on your thoughts and on what genomax suggested I read, I believe this is how people match them?
However, the number of reads found by this method does not match the QC file that they gave us.
For example, one sample is said to have 27000 reads, but I get something like 30000, while another is said to have 10000 and I get something like 40000. They do not match in number, and not even in ratio... That makes me wonder what I did wrong...
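One thing that may be worth ruling out when per-sample counts disagree is barcode orientation: some pipelines list barcodes in the sheet as the reverse complement of what appears in the barcode reads. Nothing in this thread confirms that this is the cause here; the sketch below is only a diagnostic under that assumption, and `reverse_complement` and `orientation_tally` are made-up names.

```python
from collections import Counter

# Translation table for complementing unambiguous DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def orientation_tally(observed_barcodes, sheet_barcodes):
    """Count observed barcodes by how they match the sheet.

    Each observed barcode is classified as matching a sheet barcode as
    written, matching the reverse complement of one, or matching neither.
    """
    forward = set(sheet_barcodes)
    reverse = {reverse_complement(b) for b in sheet_barcodes}
    tally = Counter()
    for barcode in observed_barcodes:
        if barcode in forward:
            tally["as_written"] += 1
        elif barcode in reverse:
            tally["reverse_complement"] += 1
        else:
            tally["no_match"] += 1
    return tally
```

If most barcodes land in `reverse_complement`, the sheet and the barcode reads are probably in opposite orientations. If most land in `no_match`, the barcode file likely does not hold these barcodes at all.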
No, most people don't break out their fastqs into 20 individual ones by the headers. Most people use a sample sheet to tell bcl2fastq which sample goes with which index sequence, and the demultiplexing is done as the fastqs are generated, using the data in the index read. There are non-standard library preps that might put the sample barcode in read 2 instead of the usual index position; we have no idea if you did that. The "2" in the header of your "barcode.fq" suggests that you did, but we can't be sure if you don't tell us.
Thank you, swbarnes2!
So I was only given this pile of data without knowing what the sequencing service did...
Assuming what you said is true (a non-standard library prep that put the sample barcode in read 2 instead of the usual index position), how should I proceed?
You need to have some basic information about what was done before anyone can help you. I can't even tell from your question how many sequence reads you have, and how many index reads you have. Find that out, and put that in your question.
I just did! And I put the first read of every file that I can find in the folder. Hope that helps.
And also I put the QC file they generated for us too.
Take a look at the answer in: A: Demultiplexing Illumina data
I don't know what centroid info refers to but I have a suspicion that it is not going to be useful here.
Thank you, genomax!
I agree the centroid info does not seem important. I put it here just in case... Nice to have that confirmed.
Your link is very useful. Apparently, it means I cannot simply separate files based on their header...
I will try the package in the link!
I am very sad to see that there are 89 views and no answer yet...
Can someone maybe point out how I can improve my post to make it clearer to readers?
Or what additional info I need to put in?
Thanks!
Your barcoding must be non-standard, so how can anyone understand it if you don't explain it?
What do you mean by non-standard? This is the first time I have seen this kind of file, so I do not have much to compare it with....