Question

Reformatting Merged and Unmerged Paired-End Files

0

Entering edit mode

6 weeks ago

young_bioinformatician ▴ 230

Hello everyone,

I had paired-end data which I merged using fastp and performed several prepocessing analysis on them. As a result, I ended up with merged, unmerged_paired_1, and unmerged_paired_2 files.

Now, I want to reformat these processed files back into paired-end format. How can I do this correctly?

So assuming for one sample:

merged_reads : 1,200,000
unmerged_reads_1 : 1,800,000
unmerged_reads_2 : 1,800,000

I want to get paired-end format again.

reformat.sh in=merged_reads out1=forward.reads out2=reverse.reads
cat forward.reads unmerged_reads_1  > final_paired_1.reads
cat forward.reads unmerged_reads_2  > final_paired_2.reads

Is that technically correct? If not, do you have any suggestions to do it in a correct way. Thank you.

illumina fastp fastqc bbmap • 262 views

ADD COMMENT • link updated 6 weeks ago by GenoMax 141k • written 6 weeks ago by young_bioinformatician ▴ 230

score 0 · Answer 1 · 2024-03-12

0

Entering edit mode

6 weeks ago

GenoMax 141k

In theory this is correct. I would add a repair.sh step at the end to make sure reads are in sync (in terms of order) across the two final files.

ADD COMMENT • link 6 weeks ago by GenoMax 141k

0

Entering edit mode

Thank you GenoMax for your answer. I want to add some details concerning this puzzling and extend the topic. Maybe It is a bit out of the topic but related to my question anyway.

Here, I follow the procedure what I explained and I mapped two different format to a reference using bowtie2 and salmon. Here are the results for different format for the real data.

For merged and unmerged format:
bowtie2 -x reference.fa -U merged.reads -S merged.sam 
bowtie2 -x reference.fa -1 unmerged_reads_1 -2 unmerged_reads_2 -S unmerged_paired.sam
samtools sort -o merged_sorted.bam merged.sam
samtools sort -o paired_sorted.bam paired.sam
bamtools -in merged_sorted.bam -in paired_sorted.bam -out merged_paired_sorted.bam

flagstat results of merged_paired_sorted.bam:
97540614 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
52284671 + 0 mapped (53.60% : N/A)
65027076 + 0 paired in sequencing
32513538 + 0 read1
32513538 + 0 read2
24309546 + 0 properly paired (37.38% : N/A)
31709744 + 0 with itself and mate mapped
2677439 + 0 singletons (4.12% : N/A)
3827400 + 0 with mate mapped to a different chr
1828106 + 0 with mate mapped to a different chr (mapQ>=5)

and here same reformat paired end reads I used for salmon-mapper. I know bowtie2 and salmon are totally different aligner but I am confused at that point. Here only one paired was considered: 48770307

salmon quant -i reference.index --meta --libType IU -1 final_paired_1.reads -2 final_paired_2.reads -o output

Log File:

Observed 48770307 total fragments (48770307 in most recent round)   
[2024-01-18 19:57:45.215] [jointLog] [info] Computed 2,960,309 rich equivalence classes for further processing
[2024-01-18 19:57:45.215] [jointLog] [info] Counted 26,417,982 total reads in the equivalence classes
[2024-01-18 19:57:45.257] [jointLog] [warning] 0.000508506% of fragments were shorter than the k used to build the index.
If this fraction is too large, consider re-building the index with a smaller k.
The minimum read size found was 15.


[2024-01-18 19:57:45.257] [jointLog] [info] Number of mappings discarded because of alignment score : 30,255,045
[2024-01-18 19:57:45.257] [jointLog] [info] Number of fragments entirely discarded because of alignment score : 12,287,142
[2024-01-18 19:57:45.257] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 0
[2024-01-18 19:57:45.257] [jointLog] [info] Number of fragments discarded because they have only dovetail (discordant) mappings to valid targets : 1,348,537
[2024-01-18 19:57:45.257] [jointLog] [info] Mapping rate = 54.1682%

In bowtie2 result, when I divided total reads,97540614/2, I get the same number as Salmon says. So, why should I expect Salmon not give me 97540614 for initial reads but bowtie2 does? That's why I was afraid I made a mistake about reformating the reads but I think It is not about reformatting issue. What do you think?

ADD REPLY • link 6 weeks ago by young_bioinformatician ▴ 230

0

Entering edit mode

This is indeed off topic for original question. Programs may account of input reads in different ways.

I suggest that you verify that reads are in sync in your files first. If they are not then the entire results above is null and void.

ADD REPLY • link 6 weeks ago by GenoMax 141k