I am analyzing some public Drop-Seq data that has not been demultiplexed.
When I download the data, I get two FASTQ files - the 'R1' file has barcode sequences, which are in-line. The 'R2' file has the actual sequencing data. I don't have the original Illumina BaseCalls directory. I only have these two files and a list of the barcodes. The barcodes, which are 6 bp in length, aren't necessarily at a specific location in the R1 reads; they are often in the middle of the read.
The barcodes are like this, where each barcode corresponds to a sample:
AAAACT
AAAGTT
AAATTG
AAGATT
AATACA
I'm providing the first few lines of each file as an example:
R1:
@HISEQ:284:C9JKFANXX:1:1101:1202:1999 1:N:0:
NTATTGCACTAAGGTA
+
#3=ABGGGEGGGCGFG
@HISEQ:284:C9JKFANXX:1:1101:1274:1979 1:N:0:
NAAACTTACGTGCTTT
+
#=AABFGCGGGGEGGG
@HISEQ:284:C9JKFANXX:1:1101:1406:1981 1:N:0:
NGCGGGACAGTGTGCC
R2:
@HISEQ:284:C9JKFANXX:1:1101:1202:1999:3:N:0:
ATCCAGGAGAATGGCTCTTTGGTTGAAATCCGAAATTTCTTGGGTGAAA
+
3>3<>;>;F@BFE1CFG11;F1EB>:1=FGG/>>/:EC1C1100880:0B
@HISEQ:284:C9JKFANXX:1:1101:1274:1979 3:N:0:
TCCTTCTTGGGTATGGAATCCTGTGGCATCCATGAAACTACATTCACTTC
+
BBBBBGDEGG0F11F1;=DG1FGGGBFDGGGCFE@DFGGFGGGG>C0=:
@HISEQ:284:C9JKFANXX:1:1101:1406:1981 3:N:0:
I am stumped on how to proceed, and any help would be greatly appreciated!
Perhaps tools designed specifically for Drop-seq would be the way to go: https://github.com/broadinstitute/Drop-seq
There are also suggestions in this thread: "Tools for demultiplexing a large fastq file based on random in-line barcodes"
How would you like the data demultiplexed? Should each barcode get its own individual FASTQ file?
I have solutions (such as my own software, splitcode) that can handle tasks like this, and I can show you how to use them if that's what you're looking for.
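In case it helps to see the logic written out, here is a minimal Python sketch of this kind of demultiplexing. It is not splitcode and not the Drop-seq tools, just an illustration under several assumptions: each barcode gets its own FASTQ file, barcodes must match exactly (no mismatches, and an `N` never matches), the two files are uncompressed and list the reads in the same order, and the read name is the first whitespace-delimited token of the header. A read whose R1 contains zero or more than one of the listed barcodes is treated as unassigned. All function names and the `demux_` output prefix are made up for this example.

```python
from contextlib import ExitStack
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-lines-per-record FASTQ."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:  # end of file or incomplete trailing record
            return
        header, seq, _plus, qual = record
        yield header, seq, qual


def assign_sample(r1_seq, barcodes):
    """Return the one barcode found anywhere in r1_seq, else None.

    Assumption: exact substring match; ambiguous reads (more than one
    barcode present) are not assigned.
    """
    hits = [bc for bc in barcodes if bc in r1_seq]
    return hits[0] if len(hits) == 1 else None


def demultiplex(r1_handle, r2_handle, barcodes):
    """Yield (sample_key, r2_record) pairs, reading R1 and R2 in lockstep."""
    pairs = zip(read_fastq(r1_handle), read_fastq(r2_handle))
    for (h1, s1, _q1), (h2, s2, q2) in pairs:
        # Assumption: mates share the first whitespace-delimited header token.
        if h1.split()[0] != h2.split()[0]:
            raise ValueError(f"R1/R2 out of sync: {h1!r} vs {h2!r}")
        key = assign_sample(s1, barcodes) or "unassigned"
        yield key, (h2, s2, q2)


def write_per_barcode(r1_path, r2_path, barcodes, prefix="demux_"):
    """Write one R2 FASTQ per barcode, plus one for unassigned reads."""
    with ExitStack() as stack:
        r1 = stack.enter_context(open(r1_path))
        r2 = stack.enter_context(open(r2_path))
        outs = {
            key: stack.enter_context(open(f"{prefix}{key}.fastq", "w"))
            for key in list(barcodes) + ["unassigned"]
        }
        for key, (header, seq, qual) in demultiplex(r1, r2, barcodes):
            outs[key].write(f"{header}\n{seq}\n+\n{qual}\n")
```

Calling `write_per_barcode("R1.fastq", "R2.fastq", ["AAAACT", "AAAGTT", "AAATTG", "AAGATT", "AATACA"])` (file names hypothetical) would produce `demux_AAAACT.fastq` and so on. With 6 bp barcodes allowed anywhere in the read, chance matches are likely, so a dedicated tool that handles mismatches and positional constraints is probably the better choice for a real analysis.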