Merging two sets of NGS reads
0
0
Entering edit mode
3 months ago
Lora ▴ 10

Hi everyone,

for my project, I'm trying to merge two sets of NGS reads (for each, I want to merge R1 and R2 reads), sequenced on an Illumina NextSeq 500 instrument. One set of reads is coming from enriched genomic libraries and the other from "non-enriched"/regular libraries.

First, I've tried to do that using a simple command for 'cat' option: cat file1.fq.gz file2.fq.gz > mergedfile.fq.gz After mapping and sorting etc., this seemed to cause a problem in the structure of .vcf file, so I wasn't able to use it for further analysis. I assume that the reason for the improper merging was the sequence identifiers and their structure. Now, I'm wondering if this is even possible to do and if yes, how?

A few examples of sequence identifiers of my fastq files:

Sample1 (Non-enriched), R1:
>A00801:156:HKYC2DSX2:1:1101:20808:13855 1:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:20410:13886 1:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:16441:13933 1:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:24252:13933 1:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:24288:13933 1:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:30092:13933 1:N:0:AGCGATAGAT+AGGCTATAGT

Sample1 (Non-enriched), R2:
>A00801:156:HKYC2DSX2:1:1101:20808:13855 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:20410:13886 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:16441:13933 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:24252:13933 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:24288:13933 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:30092:13933 2:N:0:AGCGATAGAT+AGGCTATAGT

Sample1 (Enriched), R1:
>A00801:156:HKYC2DSX2:2:1101:30056:27837 1:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1101:12445:27868 1:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1101:23167:27868 1:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1101:14136:27884 1:N:0:AGCGCTAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1101:17607:27884 1:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1101:32542:27884 1:N:0:AGCGATAGAT+AGGCTATAGT

Sample1 (Enriched), R2:
>A00801:156:HKYC2DSX2:2:1103:18801:28948 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1103:3423:28964 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1103:11297:28980 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1103:21450:28995 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1103:1714:29011 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1103:19985:29027 2:N:0:AGCGATAGAT+AGGCTATAGT


Just to clarify based on the example above: I want to merge Sample1 (Non-enriched) R1 with Sample1 (Enriched) R1 and Sample1 (Non-enriched) R2 with Sample1 (Enriched) R2. Hope that's clear.. Also: I'm performing my analysis on Linux Ubuntu.

NGS Linux fastq • 494 views
0
Entering edit mode

After mapping and sorting etc., this seemed to cause a problem in the structure of .vcf file,

what kind of "problem" ?

0
Entering edit mode

Hi, thank you for your reply! First one was RStudio saying: "Your file appears to have 76 header elements and 77 columns in the body. This should never happen!". I believe there were no mistakes in my commandlines for mapping, sorting, creating .vcf files, because they work fine with other, non-merged reads (also produce "normal" .vcf file). After that, I deleted an extra column in the .vcf file and now, when importing it to RStudio, the software crashes (R session aborted, R encountered a fatal error ..) after I type in the command line for reading the file (vcfR package). The command starts running but finally crashes when creating a character matrix gt.

0
Entering edit mode

merge two sets of NGS reads

What you are describing is not strictly merging. It is simply concatenating two files end to end, such that contents of file 2 are tacked at the end of file 1.

True read merging will overlap a read from R1 and its corresponding mate from R2 to create a single (if the reads can overlap in the middle) read.

If you wanted to do the former then carry on, but if it is latter then you need software like bbmerge.sh to do this. Your reads need to overlap as well.

0
Entering edit mode

I don't want to merge overlapping paired-end reads, but join/merge the two R1 reads from different files (but same specimen) into one file, which I can further use for mapping.. and the same with R2 reads ofc.

0
Entering edit mode

Your simple cat approach sounds valid. Can you explain what you did with the data after merging and what was the problem with the VCF?
Regardless, since you have two libraries generated with different protocols, I'm not sure merging them is a good idea...

0
Entering edit mode

Regarding the fact that the two sets of reads are based two different protocols, I understand the concern, but I also assume that the enrichment process is not "perfect" and that those reads will still contain some of non-enriched sequences, which I could maybe use for a better coverage. I think it's maybe worth a try..

Here is a rough desrciption of my command lines to create a vcf file:

# indexing the reference sequence

bwa index reference.fasta


# mapping the reads to the ref sequence

for i in *_R1_joined.fq.gz
do
SAMPLE=$(echo${i} | sed "s/_R1_\joined\.fq.gz//")
echo ${SAMPLE}_R1_joined.fq.gz${SAMPLE}_R2_joined.fq.gz
bwa mem reference.fasta ${SAMPLE}_R1_joined.fq.gz${SAMPLE}_R2_joined.fq.gz > ${SAMPLE}.sam done  # filtering, sam to bam, sorting for i in *.sam ; do pattern=${i%%.sam} ; samtools view -F 0x4 -S -h -o ${pattern}filt.sam$i ; done ;
for i in *filt.sam ; do pattern=${i%%.sam} ; samtools view -F 0x4 -S -b -o${pattern}.bam $i ; done ; for i in *.bam ; do pattern=${i%%.bam} ; samtools sort $i -o$i.sort ; done


# creating one mpileup file from 68 samples

samtools mpileup -B -f reference.fasta \$(printf '%s ' *.bam.sort) > myfile.mpileup


# variant calling and vcf file creation (added text file with sample species names)

varscan mpileup2cns myfile.mpileup min-var-freq 0.01 --min-avg-qual 30 --min-freq-for-hom 0.9 --min-coverage 4 --output-vcf --vcf-sample-list NameList.txt --strand-filter 0  > myfileCNS.vcf

1
Entering edit mode

I'm pretty sure the problem is not with the fastq merging, as all your commands seem fine to me. Might be something with the pileup or variant calling steps. At what point did you get an error message, and what was it. Also, maybe give bcftools call a go - see if it does better than varscan.

0
Entering edit mode

I will indeed try bfctools! Thanks :)

Also, I've checked the joined reads in Geneious. Here, it is visible that at the break of one set of reads into another, the identifier sequence changes and no longer has a consecutive descending number, indicating the end of one set and beginning of the other one (see attached photo). Could this be the reason the vcf file is not properly created for RStudio to read it?