I have recently started learning about metagenomic analysis. My supervisor gave me some data from one of her past research projects so that I can familiarise myself with data processing.
I downloaded the FASTQ files from NCBI and managed to clean off the tags but I cannot manage to do barcode matching.
This is because I am still confused about what a barcode really is, where it is typically located, how it differs from the 'tag', and whether or not it is supposed to be within the primer sequences.
For instance, there are sequence bits that stay consistent for each read of a single SRA experiment - and this is in the very beginning of each read. This sequence bit is different for every SRA experiment, but it is consistent for every read of an individual SRA experiment.
Here is some information from the article about the amplicon sequencing:
- ...Multiplexed experiments that include Illumina adapters,
- sequencing primers,
- a 12 bp barcode sequence,
- a heterogeneity spacer to mitigate the low sequence diversity amplicon issue,
- and 16S rRNA gene universal primers...
Additional information --
My supervisor sent me an Excel sheet containing the primer sequences for each sample. The only difference between these primer sequences is the bit towards the end. Here is an example:
CAAGCAGAAGACGGCATACGAGATGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT CCTAAACTACGG GTGBCAGCMGCCGCGGTAA
The bold bit (the 12-base segment set off by spaces, CCTAAACTACGG in this example) is the part that changes for every sequence, and the rest of the sequence is the same for every primer.
I tried to look for these bits (one by one) in one SRA experiment's full raw read, but none of the primers had a sequence bit that was present in the sample.
How am I supposed to find the barcodes in this situation? I need to do barcode matching to proceed with QIIME v2 analysis.
Assuming this is Illumina sequencing:
Indexes/tags are what the library kit adapters provide. They are sequenced independently of the actual sequence reads and thus are never part of the actual read. They are automatically transferred to the read headers by the demultiplexing programs (bcl2fastq or bcl-convert).
Barcodes are oligos/sequences that people incorporate into their constructs using PCR etc. These are sequenced as part of the main read, so it is the user's responsibility to demultiplex on them and remove them.
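As a rough illustration of what demultiplexing on an in-read barcode involves, here is a minimal Python sketch. It assumes the barcode sits within the first few bases of the read (allowing for a short spacer of unknown length in front of it) and also tries the reverse complement, since a barcode listed on a primer sheet is sometimes read from the opposite strand. The function names, the `max_offset` parameter and the sample name are made up for illustration; for real data a dedicated tool such as q2-cutadapt is the better choice.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a plain A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def assign_barcode(read, barcodes, max_offset=8):
    """Return (sample, offset, orientation) for the first barcode found
    near the start of the read, or None if nothing matches.

    barcodes:   dict mapping sample name -> barcode sequence
    max_offset: assumed maximum number of bases in front of the barcode
    """
    read = read.upper()
    for sample, barcode in barcodes.items():
        barcode = barcode.upper()
        candidates = (("forward", barcode),
                      ("revcomp", reverse_complement(barcode)))
        for orientation, query in candidates:
            # Only accept matches that start within the first max_offset bases.
            pos = read.find(query, 0, max_offset + len(query))
            if pos != -1:
                return sample, pos, orientation
    return None

# Hypothetical example: a 2-base spacer, then the barcode from the sheet.
barcodes = {"sampleA": "CCTAAACTACGG"}
print(assign_barcode("TG" + "CCTAAACTACGG" + "GTGCCAGCAGCCGCGGTAA", barcodes))
```

If this returns `None` for essentially every read and every barcode on the sheet, in both orientations, that is a hint that the barcodes are not in the uploaded reads at all.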
You may be able to follow this tutorial to get things going: https://forum.qiime2.org/t/demultiplexing-and-trimming-adapters-from-reads-with-q2-cutadapt/2313
If the 'Data Type' on NCBI says 'Raw Reads', do you think there is still a possibility that the data was minimally preprocessed (i.e. the barcodes were removed prior to uploading)?
I have a sheet containing the oligos for each sample. However, no part of the forward or reverse oligos that I have is present in the corresponding SRA experiment's sequence reads. I am thinking of proceeding with my data analysis, as the samples that I have contain no part of the corresponding oligos.
Many thanks
If the samples were demultiplexed as a part of barcode removal then that is a possibility. Don't assume anything unless the submission notes clearly say what was done.
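One way to check rather than assume is to count how many reads still contain the 16S primer. Note that the primer on your sheet contains degenerate IUPAC bases (B and M in GTGBCAGCMGCCGCGGTAA), so a plain text search for it can never match a real read, which only contains A, C, G, T or N. Below is a minimal Python sketch that expands the IUPAC codes into a regular expression first. It assumes an uncompressed FASTQ file with four lines per record; the function names and the file name in the usage note are placeholders.

```python
import re
from itertools import islice

# Standard IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(primer):
    """Compile a degenerate primer into a regex that matches real bases."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))

def read_sequences(fastq_path, limit=10000):
    """Yield up to `limit` sequence lines from an uncompressed FASTQ file."""
    with open(fastq_path) as handle:
        # The sequence is the 2nd line of every 4-line record.
        for line in islice(handle, 1, limit * 4, 4):
            yield line.strip().upper()

def fraction_with_primer(sequences, primer):
    """Return the fraction of sequences that contain the primer anywhere."""
    pattern = iupac_to_regex(primer)
    total = hits = 0
    for seq in sequences:
        total += 1
        if pattern.search(seq):
            hits += 1
    return hits / total if total else 0.0
```

Usage would look like `fraction_with_primer(read_sequences("reads.fastq"), "GTGBCAGCMGCCGCGGTAA")`. A fraction close to 0 suggests the primer, and probably the barcode in front of it, was trimmed before upload; a fraction close to 1 suggests the reads are untrimmed and the barcode should be somewhere upstream of the primer match. It is worth repeating the check with the reverse complement of the primer as well.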