How to strip barcodes from demultiplexed data
4 months ago
MboiTui • 0

Dear BioStars community,

I am working for the first time with 'raw' sequencing data (fastq files). The data is single-end GBS data produced with two restriction enzymes.

The sequencing centre provided the data already demultiplexed, but with the barcodes still present in line at the start of the read.

Here are the first two lines from one fastq file:

@HISEQ:658:CDPMCANXX:6:1101:8843:1997 1:N:0:
NACAGCAGACAGTGCAGTTTTACCTCAGAAACCACATATGCATGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGAT


The metadata file provides a barcode9I (GACAGCAGACAGTGC) and a barcode (GACAGCAGACAG) for this individual.

First of all, what is the difference between the two?

Furthermore, how can I remove the barcodes as part of the process_radtags module, considering that the data has already been demultiplexed (i.e., one fastq file per individual)?

Cheers

stacks • 197 views

Are you following the Stacks manual (LINK)?


Hello GenoMax,

Thanks for your answer. I apologize if my question was very broad and made it look like I was asking for someone to do my work. That was not my intention.

I have been looking at the manual, but I was a bit confused. It said not to include the barcode file if the data had already been demultiplexed, but leaving it out meant the barcodes were still in the reads after QC (I am not even sure how well QC was being performed, if at all).

I struggled for a bit, and finally ended up with the following:

process_radtags -p ./rawdata/ -o ./samples/ -b ./metadata/barcodes_file -c -q -r --disable_rad_check --inline_null


I used the --disable_rad_check option because otherwise all sequences would be discarded, despite good phred scores. When searching for the cut site sequences in my data, I could not find them, thus I initially believed they were already removed by the sequencing company.

I then read a few blog posts (e.g., https://groups.google.com/g/stacks-users/c/LQ6cyOruXh8?pli=1) and it made me think that I am doing something wrong.

I believe the barcode9I sequence contains part of the remainder of the cut site sequence. So I will now try with the barcode sequence (instead of the barcode9I sequence) and retain the restriction enzyme information (--renz_1 pstI --renz_2 sphI).
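
One way to check that hunch is to compare the two barcode strings directly against the example read. The sketch below is only illustrative: it uses the sequences quoted earlier in the thread and assumes the standard PstI recognition site CTGCAG, which leaves TGCAG at the start of the insert after digestion. The helper name `extra_bases` is made up for this example.

```python
# Sequences copied from the thread.
BARCODE = "GACAGCAGACAG"
BARCODE_9I = "GACAGCAGACAGTGC"
READ = ("NACAGCAGACAGTGCAGTTTTACCTCAGAAACCACATATGCATGAGATCGGAAGAGC"
        "GGTTCAGCAGGAATGCCGAGACCGAT")

# Assumption: PstI recognises CTGCAG, so a digested fragment starts with TGCAG.
PSTI_REMAINDER = "TGCAG"


def extra_bases(long_barcode, short_barcode):
    """Return the bases long_barcode carries beyond short_barcode.

    Returns None if long_barcode is not an extension of short_barcode.
    """
    if not long_barcode.startswith(short_barcode):
        return None
    return long_barcode[len(short_barcode):]


extra = extra_bases(BARCODE_9I, BARCODE)
# Bases that follow the short barcode in the example read.
after_barcode = READ[len(BARCODE):len(BARCODE) + len(PSTI_REMAINDER)]

print(extra)                              # TGC
print(after_barcode)                      # TGCAG
print(PSTI_REMAINDER.startswith(extra))   # True
```

If this holds for the other samples too, barcode9I would be the barcode plus the first three bases of the PstI remainder, which would explain why the cut site check failed when barcode9I was supplied as the barcode.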


I now ran the following code:

process_radtags -p ./rawdata_try/ -o ./samples/ -b ./metadata/DFr19-4488_Barcodes2.txt -c -q -r --inline_null --renz_1 pstI --renz_2 sphI


It returned the following message. I believe the module is now running correctly, but I will inspect the outputs to confirm that.

Processing single-end data.
Using Phred+33 encoding for quality scores.
Found 1 input file(s).
Searching for single-end, inlined barcodes.
Will attempt to recover barcodes with at most 1 mismatches.
Processing file 1 of 1 [1872148.FASTQ.gz]
Closing files, flushing buffers...

1558043 total sequences
17815 low quality read drops (1.1%)


EDIT: All retained sequences now start with TGCAG, which I believe is part of the pstI cut site. Not sure why that would be the case.

When I ran the command with the barcode9I barcodes and the --disable_rad_check option, that was not the case :/
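
A quick way to put a number on "all retained sequences start with TGCAG" is to count it directly in the output file. This is a rough sketch, assuming standard 4-line FASTQ records; the function names and the file path in the usage comment are hypothetical, not part of Stacks.

```python
import gzip


def read_sequences(path):
    """Yield the sequence line of each 4-line FASTQ record (plain or gzipped)."""
    opener = gzip.open if str(path).lower().endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # second line of every record
                yield line.strip()


def fraction_with_prefix(path, prefix):
    """Return the fraction of reads in the file that start with prefix."""
    total = 0
    hits = 0
    for seq in read_sequences(path):
        total += 1
        if seq.startswith(prefix):
            hits += 1
    return hits / total if total else 0.0


# Usage (hypothetical path to one process_radtags output file):
# print(fraction_with_prefix("./samples/sample_name.fq.gz", "TGCAG"))
```

Running the same check on the raw file with the barcode plus TGCAG as the prefix would show whether the cut site remainder was present before processing.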


I linked the manual just to make sure you had seen it and were following the procedure described.

The sequencing centre provided the data already demultiplexed, but with the barcodes still present in line at the start of the read.

Looking at the fastq header you posted in the original question, it would appear that your data is not demultiplexed as far as Illumina indexes go. Is that correct? Normally there would be an index sequence at the end of the header, so it would look like 1:N:0:ATGCGTA.
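
To illustrate what is meant by the index field, here is a small sketch that pulls the last colon-separated field out of the comment part of a header. It assumes the `read:filtered:control:index` layout shown above; the function name is made up for the example.

```python
def index_from_header(header):
    """Return the index sequence from a FASTQ header, or '' if none is present.

    Assumes the part after the first space looks like read:filtered:control:index.
    """
    parts = header.strip().split(" ", 1)
    if len(parts) < 2:
        return ""  # no comment section at all
    fields = parts[1].split(":")
    return fields[3] if len(fields) >= 4 else ""


# Header from the original question: the index field is empty.
print(repr(index_from_header("@HISEQ:658:CDPMCANXX:6:1101:8843:1997 1:N:0:")))
# ''

# Same header with an index sequence appended, as it would normally look.
print(repr(index_from_header("@HISEQ:658:CDPMCANXX:6:1101:8843:1997 1:N:0:ATGCGTA")))
# 'ATGCGTA'
```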


Being so new to these pipelines, I am not sure. This is what was stated when downloading the fastq files from the sequencing centre:

Files are provided demultiplexed and have been named by target ID. No filtering has been applied. The demultiplexing barcodes have not been stripped.

All sequences within one fastq file share the same inline barcode, and each fastq file is named according to sample ID. I received as many fastq files as I have sampled individuals.
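
For anyone wanting to confirm this on their own files, tallying the leading bases of the reads is enough. A minimal sketch follows, assuming a 12-base barcode as in the metadata above; the first toy read is the start of the read from the original question, and the other two are invented purely for illustration.

```python
from collections import Counter


def leading_kmer_counts(sequences, k):
    """Count how often each k-base prefix occurs among the given reads."""
    return Counter(seq[:k] for seq in sequences if len(seq) >= k)


reads = [
    "NACAGCAGACAGTGCAGTTTTACC",  # start of the read quoted in the question
    "GACAGCAGACAGTGCAGAAACCAC",  # invented example read
    "GACAGCAGACAGTGCAGCCGTTAA",  # invented example read
]

counts = leading_kmer_counts(reads, 12)
print(counts.most_common(2))
# [('GACAGCAGACAG', 2), ('NACAGCAGACAG', 1)]
```

A single dominant prefix, with a few one-base variants such as the leading N, is what a file holding one sample with its inline barcode still attached would be expected to show.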