Extracting reads from BAM file based on partial read name
1
0
Entering edit mode
2.9 years ago

Hi there,

I am trying to use the FilterSamReads tool from picard to subset reads from BAM files. My BAM files have their reads labelled in the following way:

TRINITY_DN9898_c0_g1_i1

The "_i1" at the end of this name represents a differentially expressed version of the read "TRINITY_DN9898_c0_g1". I need to subset based on the first part of the name ("TRINITY_DN9898_c0_g1"), so that the output BAM will contain all differentially expressed versions of that read (I.e. could be "i1" or "i3" etc...).

This is just one example of a list of 500 differentially expressed reads I am trying to extract from whole-transcriptome libraries. I ran the FilterSamReads tool but the output returned all of the reads unfiltered - I suppose it did not detect the partial read names I gave it. Does anyone know how I can accomplish my task?

Thanks.

BAM CLI picard RNA-Seq • 1.5k views
ADD COMMENT
1
Entering edit mode

If you have a file of patterns you're looking for, a little script in python or perl would allow you to query read names for those patterns. For instance, using samtools to pipe the reads to your program, you could then pass on lines beginning with "#", take the read name from other lines and query whether the suffix is found in a an array of your patterns.

ADD REPLY
0
Entering edit mode

Thanks for your reply. I am unfortunately pretty unfamiliar with coding and am new to bioinformatics, do you happen to have a link to an example code of something similar?

ADD REPLY
1
Entering edit mode

This would be a great excuse to learn some things like awk or python, because it involves some simple but powerful steps (read some input, parse it for a pattern, print it back out). However, shell scripting can also be very powerful. Pierre supplied an answer below that would be easy to incorporate using a shell script and "grep" (a linux command that finds patterns). If you can place the 500 names you're looking for in a file, you can convert it into a shell script that would do what Pierre suggested for each of the patterns you're looking for. This would amount to reading your input BAM file 500 times, but it would get the job done, and you'd learn something about scripting. Use small toy examples to get started. You simply have to know how to write a script. See the example below for just 3 patterns. Google bash script for clues on how to run it. Short version is: place commands in a text file with the first line #!/bin/bash, make sure the file is executable (chmod 755 myscript.sh), then you can run it by typing it's name (./myscript.sh). Google and toy examples are your friend when trying to figure out the pieces.

#!/bin/bash

PATTERNS="TRINITY_DN9898_c0_g1
TRINITY_DN9899_c0_g1
TRINITY_DN7823_c0_g2"

for p in $PATTERNS
do
  outname="${p}.sam"
  echo "parsing for $outname"
  samtools view -h in.bam | grep -E '^(@|$p)' > $outname
done
ADD REPLY
2
Entering edit mode
2.9 years ago
java -jar /path/to/picard.jar FilterSamReads --JAVASCRIPT_FILE <( echo 'record.getReadName().startsWith("TRINITY_DN9898_c0_g1");') -I in.bam --FILTER includeJavascript -O out.bam

(must be a java supporting javascript e.g: oracle/java. I don't know if openjdk/java supports javascript )

Otherwise just use grep ?

samtools view -h in.bam | grep -E '^(@|TRINITY_DN9898_c0_g1)' > out.sam
ADD COMMENT

Login before adding your answer.

Traffic: 3269 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6