I have two separate fastq files, one file for each mate of a pair. Each file has a corresponding .txt file, but it only offers a summary of quality information. I believe my paired-end reads are multiplexed, but I have no way of identifying them uniquely. From my understanding I am supposed to use a barcode.txt file to merge this information into my fastq files, but I simply do not have such a file. I only discovered that this was most likely the problem because all of my downstream analysis failed. I am able to map the data and create BAM files etc., but I always run into the same problem: there is no read group ID information in the header, or in other words no unique identifier associated with my reads that would let them be distinguished from one another. So my questions are ...
Is a barcode.txt file typically provided? Or is it strictly just generated automatically when a fastQC report is conducted?
Is it possible to add fake barcodes to fastq files for the sake of continuing on in a given pipeline?
Is there something that I am missing because I am fairly new to bioinformatics? Here is an example of my fastq headers after joining the reads into mate pairs. Does it look like this is in the correct format?
I AM DESPERATE! please help if you can
@SN996:194:H5V7HBCXY:1:1108:1872:2028 1:N:0:TCTCGCGC
NTATTTCATAGCATACTTTTCCGGGCTCGCCGGGCCTAAGAAAGTTGCAAAAATTTTTCAATCGAAATACAAATGAAATTAAAACCTACGCGCGTGTGTGGGCCGGCGGCAGTTTGTGCATTGCTTTTGAAGTGGCAACAATTTCGCCACGATTCTCTTGGTCTTTCTTCGGTTGCTGTTGCTGGAGGAGCCTCCATTATTC
what is this "TCTCGCGC" in the read name?
The TCTCGCGC or other sequence following '1:N:0:' in the fastq header should be the Illumina barcode. The reads are usually demultiplexed by the sequencing center, and that process adds the barcode to the read header line.
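To make that concrete, here is a minimal Python sketch that splits a header like the one posted into its fields and builds a read group line from them. It assumes the header follows the usual Illumina (Casava 1.8+) layout of seven colon-separated fields, a space, then four more. The function names and the sample name "sample1" are made up for illustration, and the ID/PU layout (flowcell.lane and flowcell.lane.barcode) is just one common convention, not something your data requires.

```python
def parse_illumina_header(header):
    """Split a Casava 1.8+ style FASTQ header into named fields.

    Assumes the layout
    @instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
    """
    name, _, comment = header.lstrip("@").partition(" ")
    instrument, run, flowcell, lane, tile, x, y = name.split(":")
    read, filtered, control, index = comment.split(":")
    return {
        "instrument": instrument,
        "run": run,
        "flowcell": flowcell,
        "lane": lane,
        "read": read,
        "index": index,  # the barcode added during demultiplexing
    }


def read_group_line(fields, sample):
    """Build an @RG line; `sample` is a name you choose yourself."""
    rg_id = f"{fields['flowcell']}.{fields['lane']}"
    return "\t".join([
        "@RG",
        f"ID:{rg_id}",
        f"SM:{sample}",
        "PL:ILLUMINA",
        f"PU:{rg_id}.{fields['index']}",
    ])


header = "@SN996:194:H5V7HBCXY:1:1108:1872:2028 1:N:0:TCTCGCGC"
fields = parse_illumina_header(header)
print(fields["index"])                     # TCTCGCGC
print(read_group_line(fields, "sample1"))  # tab-separated @RG line
```

If every read in the file carries the same barcode, the file has already been demultiplexed and holds a single sample, so one read group for the whole file is probably all you need.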
Are you sure that is the correct way? I don't know exactly what you mean by "joining the reads into mate pairs".
I have been researching different pipelines for handling fastq files and almost everything that I have read suggests preprocessing the fastq data prior to mapping the reads via bowtie2 or BWA etc. So by "joining the reads into mate pairs" I meant that my fastq files are currently separate files where file_1.fastq are the forward reads and file_2.fastq are the reverse reads. Most preprocessing pipelines suggest joining the two files before performing any manipulations of the fastq data such as removing duplicates or trimming adapter sequences.
What type of data do you have, what do you expect the insert length to be, and what type of analysis are you planning to do?
Most analyses on paired-end data are done with the reads in separate files, one file for the forward reads and a second file for the reverse reads. You might want to merge the two reads of a pair if the insert length is shorter than twice the read length and you expect your forward and reverse reads to overlap.
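As a rough illustration of that rule of thumb, the sketch below checks whether two mates would be expected to overlap. It assumes "insert length" means the full fragment length covered by both reads, and that both reads have the same length. The function names and the numbers in the example are hypothetical and not taken from your data.

```python
def reads_overlap(insert_length, read_length):
    """True if the forward and reverse reads are expected to overlap."""
    return insert_length < 2 * read_length


def expected_overlap(insert_length, read_length):
    """Number of bases shared by both mates (0 if they do not overlap)."""
    return max(0, 2 * read_length - insert_length)


# Hypothetical example: 2 x 250 bp reads from a 400 bp fragment
print(reads_overlap(400, 250))     # True
print(expected_overlap(400, 250))  # 100

# Same reads from a 600 bp fragment: no overlap, so keep the files separate
print(reads_overlap(600, 250))     # False
```

If the check comes out False for your library, merging the mates makes little sense and the two files can go to the aligner as they are.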
That's odd. It's not impossible, but it's not the most common workflow to my knowledge. If there is one set of guidelines that you should care about, then it's the GATK best practices. I recommend following these.