Question: extracting reads from sam/bam file
0
gravatar for KVC_bioinfo
3.0 years ago by
KVC_bioinfo510
Boston
KVC_bioinfo510 wrote:

Hello All,

I have sam and bam file from alignment of reads (single-end) obtained from Oxford nanopore technology(human genome as a reference) I want to extract reads that aligned to given coordinates. This should only include reads that aligned and not the clipped reads. I have used following command which gives me reads with clipping:

samtools file.bam "chr1:region1-region2" > file_out.bam

Can someone help me here? Thank you in advance

sam bam • 1.7k views
ADD COMMENTlink modified 3.0 years ago by Pierre Lindenbaum131k • written 3.0 years ago by KVC_bioinfo510
2
gravatar for Pierre Lindenbaum
3.0 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum131k wrote:

using samjdk:

samtools view -bu  in.bam "contig:1-100" | \
java -jar dist/samjdk.jar -e 'return record.getReadUnmappedFlag() || record.getCigar()==null || !record.getCigar().isClipped();'
ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by Pierre Lindenbaum131k

Thank you. I really appreciate it. But I am absolutely unfamiliar to java.

Is there any other way of doing it?

ADD REPLYlink written 3.0 years ago by KVC_bioinfo510
1

sure, you can try awk:

samtools view -h  in.bam "ref:1-100" |\
awk -F '\t' '/^@/ {print;next;} (!($6 ~ /[HS]/ )){print;} '
ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by Pierre Lindenbaum131k

Thanks! what does ref:1-100 mean here?

ADD REPLYlink written 3.0 years ago by KVC_bioinfo510
1

it's your "chr1:region1-region2"

ADD REPLYlink written 3.0 years ago by Pierre Lindenbaum131k

I am using the command you suggested:

samtools view -h sorted.bam "chr1:region1-region2" | awk -F '\t' '/^@/ {print;next;} (!($6 ~ /[HS]/ )){print;} '

However, this just gives me names such as:

@HD VN:1.3 SO:coordinate

@SQ SN:chr1 LN:248956422

@SQ SN:chr2 LN:242193529

@SQ SN:chr3 LN:198295559

@SQ SN:chr4 LN:190214555

@SQ SN:chr5 LN:181538259

@SQ SN:chr6 LN:170805979

I want to extract the reads that mapped in that region but I don't want the soft clipped reads. I assume the alignment bam file already excludes the hard clipped dreads.

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by KVC_bioinfo510
2

-h includes the header. The rest of your reads should be after the header. The important bit in that command is !($6 ~ /[HS]/ , which is going to remove any reads where the CIGAR string indicates clipping.

ADD REPLYlink written 3.0 years ago by swbarnes29.2k

Thank you very much. I am not getting any reads in resulting file. I am interested in obtaining reads that mapped to the reference where I do not want the clipped reads in it.

Suppose, original read length is 2000bp and only 50 map to the reference rest of the reads are clipped. Then I am interested in those 50 reads. I guess the above command removes entire 2000bp read.

ADD REPLYlink written 3.0 years ago by KVC_bioinfo510

Also, I am not sure why the header includes all chr when I have specified only chr1 in my command.

ADD REPLYlink written 3.0 years ago by KVC_bioinfo510
1

if your region is really "chr1:region1-region2" then, please, try to understand what you're doing...

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by Pierre Lindenbaum131k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2249 users visited in the last hour