Question: Extracting Read Pairs From Region So That All Pairs Remain Intact
1
gravatar for Davy
6.1 years ago by
Davy360
United States
Davy360 wrote:

Hi all, I have some whole exome data and I am only interested in the exons of 4 genes, which I have in a bed file.

I ,perhap naively, tried extracting only those reads using this bed file with samtools

samtools view -b -L 4genes.bed myBAM.bam > myBAM.4genes.bam

which works fine, except when I go to convert the BAM file back to fastq with pircard it throws an error because of reads with missing mates.

java -Xmx4g -jar /path/to/picard/SamToFastQ.jar I=myBAM.4genes.bam O=my4genes.fastq VALIDATION_STRNGENCY=LENIENT

Runtime.totalMemory()=2029715456
To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
Exception in thread "main" net.sf.picard.PicardException: Found 2638 unpaired mates
    at net.sf.picard.sam.SamToFastq.doWork(SamToFastq.java:190)
    at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
    at net.sf.picard.sam.SamToFastq.main(SamToFastq.java:121)

I also tried intersectBed for this but it had the same result. I guess what I am after is all the reads in my regions plus any mates of those reads that fall outside those regions.

I am hoping there is a tool and or option in samtools or something that I can use for this, but if I must I will program something to do this using python and pysam. I'm just being lazy and trying to find out if anyone else has done this already?

Cheers, Davy

B02V0ACXX110721:2:2205:9893:149433    163    1    10024    0    76M    =    10056    107 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA
ACCCTAACCCTAA    CCCFFFFFHHHHGJJIJJJIJJJJJHGGIIIHHIJIJGIGJJJHIJJIJIHIJJJGECHJHHGGFFFFECEEDDD=    MD:Z:76    PG:Z:bwa.15    RG:Z:B02V0.2    AM:i:0    NM:i:0    SM:i:0    MQ:i:17    UQ:i:0
B02V0ACXX110721:4:1108:21203:4558    163    1    10039    0    66M10S    =    10188    183    ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC
CTAACCCTAACCC    B@B7D3BD+AC;?3C??A@G:AAF@?CFF;)?DHBFDFDDGG3DH@F3=B@28)).6=>E76CE@###########    MD:Z:66    PG:Z:bwa.2    RG:Z:B02V0.4    AM:i:0    NM:i:0    SM:i:0    MQ:i:0UQ:i:0
B02FEACXX110730:8:2308:14858:37039    675    1    10042    0    63M13S    =    10184    176    CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA
ACCCTAACCCTAA    ;?>@?BC<9BC?E<>>D=BB5BEDCB?CEF<B(AF;EC:<C+6A8AA>DD>CF7ED>?,<<C##############    MD:Z:63    PG:Z:bwa.8    RG:Z:B02FE.8    AM:i:0    NM:i:0    SM:i:0    MQ:i:0OQ:Z:@<@D?BB;?FF>D9E?F;BC9AEA@@FGGG8?)?F9DB>;B)0?9B?8=C=@F2@FA;(66?##############    UQ:i:0
fastq picard samtools bam • 3.9k views
ADD COMMENTlink modified 12 months ago by cmdcolin1.2k • written 6.1 years ago by Davy360

Try: bamtools filter -isMateMapped true -in myBAM.4genes.bam -out myBAM.4genes.bothmapped.bam

ADD REPLYlink written 6.1 years ago by skymningen330

Thanks, but this gives the same result.

ADD REPLYlink written 6.1 years ago by Davy360

The lazy and slow way is to do the samtools filtering, then extract the read names that it gave, then grep in the original bam file for those read names. It will pull out each read that mapped in the region, or its mate, assuming the names are the same (which they should be). It will be quite slow as the bam grows, but on the other hand it is trivially parallelizable if you want to split up the bam.

ADD REPLYlink modified 6.1 years ago • written 6.1 years ago by matted7.1k

Can you show us some sample reads? I'm thinking the read names might be goofy and that's throwing off some of your tools. This happened to me when working with TCGA BAM files.

ADD REPLYlink written 6.1 years ago by Dan D6.8k

Reads added above. Thanks for the assistance.

ADD REPLYlink written 6.1 years ago by Davy360

Could you figure out the solution to this problem? I am trying to do something similar: I want to extract the reads mapping to my region of interest and also extract it's pair even if is mapped out of region or unmapped. What I am trying to do is:

(i) extracting the read names which map to the region of interest samtools view coordinate-sorted-WXS.bam chr2:29907-400156 chr2:33030-33050 | awk '{print $1}' | sort | uniq > mapped_reads.txt

(ii) extracting the reads by name: For this I found these two tools but both of them are giving me bam files which have the same issue described above–not both pairs are included in bam file. These tools are:

ExtractReads: http://timoast.github.io/2015/10/12/ExtractReads/

SamGrep https://github.com/lindenb/jvarkit/wiki/SamGrep/

If anyone of you have resolved this please let me know.

Thanks, Arun

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by akhattri0
0
gravatar for Pierre Lindenbaum
6.1 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum123k wrote:

filter with samtools view and the required flag 'read mapped in proper pair'

or fix myBAM.4genes.bam with (? I think that should work) picard FixMateInformation http://picard.sourceforge.net/command-line-overview.shtml#FixMateInformation

ADD COMMENTlink written 6.1 years ago by Pierre Lindenbaum123k

So I tried FixMateInformation, and that didn't work. I also tried (emphasis on tried) filtering with samtools view -f (tried 2 and 3 here) but I didn't really get anywhere. Samtools flagstat reports 0 singletons but picard now says "Found 2587 unpaired mates". Any other suggestions, let me know, but for the meantimes I think I will begin investigating how to do this in pysam.

ADD REPLYlink written 6.1 years ago by Davy360
0
gravatar for Ashutosh Pandey
6.1 years ago by
Philadelphia
Ashutosh Pandey11k wrote:

I am not sure if I understood you correctly but why can't you extract those reads using two steps:

1) Get all the reads that are mapped with in a region that you specified. I think you have already solved it.

2) Now you can get all the unique read ids from the new bam file and extract both forward and reverse reads corresponding to that read id from the original bam file. You can use a single grep command with multiple strings or read ids. Of course , this is not an efficient way but it should work.

ADD COMMENTlink written 6.1 years ago by Ashutosh Pandey11k

Would love a good script that handles step 2

ADD REPLYlink written 12 months ago by cmdcolin1.2k
0
gravatar for cmdcolin
12 months ago by
cmdcolin1.2k
United States
cmdcolin1.2k wrote:

A new software called Bazam is built for doing this type of task https://github.com/ssadedin/bazam

It is designed to output a FASTQ for realignment and doesn't re-output the original BAM data as currently implemented I think but I made a request to maybe enable BAM/SAM output of the tool. Preprint here https://www.biorxiv.org/content/early/2018/10/04/433003

ADD COMMENTlink written 12 months ago by cmdcolin1.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 705 users visited in the last hour