Question

How to extract unique paired end reads from a merged BAM file?

0

Entering edit mode

6.0 years ago

bioinfo89 ▴ 50

Hello All,

I am working on RNA-Seq paired end data (C.elegans). I wanted to extract unmapped reads, so i mapped the reads to reference genome using tophat2 and STAR aligner. I then merged the two BAM files and used picard remove duplicates. Then I extracted the paired end reads using bamToFastq. Now when check the read IDs, I still find them in multiples rather then just once when I do grep | sort | uniq -c. Both the forward and reverse reads are included in extracted R1 file and R2 file.

eg.

 <<<<<<<<<<<<<<<<<@SRR504339.999983/1>>>>>>>>>>>>>>

@SRR504339.999983/1
TCAGATCCGAGAGGTCTCGATCGCTCCACCGCGGCGTGCGCCGTGTAAGCATAGGGCGCAACTTCGACCAGTATCAGGGCCCGTGATT
+
HEEG9C@?FFGFHHFBDDF<FFHIGBGG>@BBEF<@=B8@5@BD@B9:>CAC>CCB9>5-5&+:>?<?BDB@@DCA@CB8?>>9@0?A

@SRR504339.999983/1
TCAGATCCGAGAGGTCTCGATCGCTCCACCGCGGCGTGCGCCGTGTAAGCATAGGGCGCAACTTCGACCAGTATCAGGGCCCGTGATT
+
HEEG9C@?FFGFHHFBDDF<FFHIGBGG>@BBEF<@=B8@5@BD@B9:>CAC>CCB9>5-5&+:>?<?BDB@@DCA@CB8?>>9@0?A

@SRR504339.999983/1
GGTTTAAGTCTAGGACCGGACAAACGGTACGATTACGTCTTAGCATCAAGTGGTGCCTGATCTCGTGATAGACAACGTGAGAGTATTT
+
FHFHICACHFHBGGGHEFH:?@GGGFF7=@BHIG;@EHGHD@DE@DCD@CECCDDCCDC<C@@@BDD5>:ACCDCB@?B8?@A@3>BC

@SRR504339.999983/1
GGTTTAAGTCTAGGACCGGACAAACGGTACGATTACGTCTTAGCATCAAGTGGTGCCTGATCTCGTGATAGACAACGTGAGAGTATTT
+
FHFHICACHFHBGGGHEFH:?@GGGFF7=@BHIG;@EHGHD@DE@DCD@CECCDDCCDC<C@@@BDD5>:ACCDCB@?B8?@A@3>BC

<<<<<<<@SRR504339.999983/2>>>>>>>>>>>>

@SRR504339.999983/2
TCAGATCCGAGAGGTCTCGATCGCTCCACCGCGGCGTGCGCCGTGTAAGCATAGGGCGCAACTTCGACCAGTATCAGGGCCCGTGATT
+
HEEG9C@?FFGFHHFBDDF<FFHIGBGG>@BBEF<@=B8@5@BD@B9:>CAC>CCB9>5-5&+:>?<?BDB@@DCA@CB8?>>9@0?A

@SRR504339.999983/2
TCAGATCCGAGAGGTCTCGATCGCTCCACCGCGGCGTGCGCCGTGTAAGCATAGGGCGCAACTTCGACCAGTATCAGGGCCCGTGATT
+
HEEG9C@?FFGFHHFBDDF<FFHIGBGG>@BBEF<@=B8@5@BD@B9:>CAC>CCB9>5-5&+:>?<?BDB@@DCA@CB8?>>9@0?A

@SRR504339.999983/2
GGTTTAAGTCTAGGACCGGACAAACGGTACGATTACGTCTTAGCATCAAGTGGTGCCTGATCTCGTGATAGACAACGTGAGAGTATTT
+
FHFHICACHFHBGGGHEFH:?@GGGFF7=@BHIG;@EHGHD@DE@DCD@CECCDDCCDC<C@@@BDD5>:ACCDCB@?B8?@A@3>BC

@SRR504339.999983/2
GGTTTAAGTCTAGGACCGGACAAACGGTACGATTACGTCTTAGCATCAAGTGGTGCCTGATCTCGTGATAGACAACGTGAGAGTATTT
+
FHFHICACHFHBGGGHEFH:?@GGGFF7=@BHIG;@EHGHD@DE@DCD@CECCDDCCDC<C@@@BDD5>:ACCDCB@?B8?@A@3>BC

I am not sure how should I remove the duplicates and retain only single entry for reach read.

Any suggestions?

Thanks!

RNA-Seq • 2.7k views

ADD COMMENT • link updated 5.9 years ago by lakhujanivijay 5.8k • written 6.0 years ago by bioinfo89 ▴ 50

1

Entering edit mode

So many questions here :

Why did you use Tophat2 ? Authors of Tophat paper recommend not to use the software anymore.

Why did you use 2 aligners ? One is not enought ?

Would you please show us the picard command line you used ?

ADD REPLY • link 6.0 years ago by Bastien Hervé 5.3k

0

Entering edit mode

Hi @Bastien,

I used more than one aligner to compare and extract maximum unmapped reads!

I used the following picard command:

$picard_dup I=unmapped_merged.sorted.bam O=unmapped_merged.sorted.rmdup.bam M=unmapped_merged_rmdup REMOVE_DUPLICATES=true AS=true VALIDATION_STRINGENCY=LENIENT

ADD REPLY • link updated 5.9 years ago by lakhujanivijay 5.8k • written 6.0 years ago by bioinfo89 ▴ 50

0

Entering edit mode

That's a very strange way to do :) I would rather use STAR only, with setting up some hard filter on STAR options (scoring, alignment & seeding...), if you want something specific.

To answer your question, did you check in your bam files at your reads's flags ? if there are unmapped maybe this is the answer : https://github.com/broadinstitute/picard/pull/1018

Your duplicate reads are actually not only duplicate but have the same names (like a copy/paste, due to your bam merge). I don't know how picard deal with this kind of reads.

ADD REPLY • link 6.0 years ago by Bastien Hervé 5.3k

0

Entering edit mode

Yes I checked the flags, they are 133, 69, 141 and 77!

ADD REPLY • link 6.0 years ago by bioinfo89 ▴ 50

0

Entering edit mode

Well, use this tool : http://www.samformat.info/sam-format-flag

And you'll see that your reads are unmapped

ADD REPLY • link 6.0 years ago by Bastien Hervé 5.3k

0

Entering edit mode

Yes, I know they are unmapped reads, I checked it before as well. But I am looking for an answer to why Picard remove duplicates isn't removing the duplicates in this case.

ADD REPLY • link 6.0 years ago by bioinfo89 ▴ 50

0

Entering edit mode

What append to duplicate mapped reads ?

ADD REPLY • link 6.0 years ago by Bastien Hervé 5.3k

1

Entering edit mode

You can check this link:https://github.com/alvaralmstedt/Tutorials/wiki/Separating-mapped-and-unmapped-reads-from-libraries

ADD REPLY • link 6.0 years ago by bioinfo89 ▴ 50

2

Entering edit mode

Hey bioinfo89, you should have posted this as a comment to your other answer. By posting this as a new answer, it breaks the flow of the thread and makes it somewhat confusing for people arriving here. Leave it for now, though. Another Moderator may see my comment and then move this answer to the top-level.

ADD REPLY • link 6.0 years ago by Kevin Blighe 87k

1

Entering edit mode

Did that Kevin.

ADD REPLY • link 5.9 years ago by lakhujanivijay 5.8k

1

Entering edit mode

Thanks for sharing the link @ bioinfo89

ADD REPLY • link 6.0 years ago by cpad0112 21k

1

Entering edit mode

Hello bioinfo89,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

Thank you!

ADD REPLY • link 5.9 years ago by lakhujanivijay 5.8k

lakhujanivijay · Accepted Answer · 2018-05-03

1

Entering edit mode

6.0 years ago

bioinfo89 ▴ 50

A found a simple solution! Used seqtk subseq extract all the unique unmapped reads!

ADD COMMENT • link updated 5.9 years ago by lakhujanivijay 5.8k • written 6.0 years ago by bioinfo89 ▴ 50

1

Entering edit mode

could you please post seqtk solution for future reference?

ADD REPLY • link 6.0 years ago by cpad0112 21k

2

Entering edit mode

summary of the solution is:

Use samtools to get mapped and unmapped sams using -F and -f tags
Get read names from the first column of sams
Use read names (from above step) to extract reads from fastq files, for mapped and unmapped. For this seqtk subseq can be used.

ADD REPLY • link 6.0 years ago by cpad0112 21k

0

Entering edit mode

You may find seqkit useful ..

refer the becnhmarking - https://bioinf.shenwei.me/seqkit/#benchmark

ADD REPLY • link 5.7 years ago by lakhujanivijay 5.8k