Extract specific nucleotide AND total read information (tags) from BAM
1
0
Entering edit mode
3.9 years ago
Roman Hillje ▴ 40

Hi guys,

my question is quite similar to some other questions that have been asked here before. Unfortunately, none of them answer it all the way which is why I'll make another (this) post.

• What I have is a BAM file with ~100 million reads.
• What I need is to extract the reads that have information about a certain genomic position along with a tag of the reads.

The reason why I'm saying "read that has information about a certain genomic position" is that with samtools view in.bam chrX:22222 I also extract all reads that only stretch along this position but don't actually overlap it. These reads are useless for me. Ideally, I'd like to only get the information in that position instead of the whole read.

Additionally, I need to carry over a barcode that is saved as a tag in the read to link this information together. It would also be fine to just keep all the tags and then parse it later.

Does anybody know of a way to do this? To me it looks like I gotta write my own little script to do this but I'd like to avoid reinventing the wheel if this already exists somewhere. Also samtools mpileup is helping in that it uses only the informative reads but it returns only the nucleotides in that position and throws away the read tags.

Thanks a lot!

bam samtools • 1.3k views
1
Entering edit mode

my question is quite similar to some other questions that have been asked here before.

cool, and you got some answers, ( e.g: TCGA: Stratify breast cancer cases based on presence of Her2 mutation and get transcription data? ), please validate your previous questions using the green mark on the left.

0
Entering edit mode

My bad, done! Don't know why I didn't do it before.

0
Entering edit mode

I also extract all reads that only stretch along this position but don't actually overlap it.

forgive my french: what is the difference ?

These reads are useless for me

no clear: so by 'extract' do you want to keep them or to remove them ?

0
Entering edit mode

Again, sorry, could've made it a bit more clear. Some of the reads in my BAM file are cut in pieces with one piece aligning before the position of interest and the other piece after the position of interest. The CIGAR string then looks like this: 12M1441N39M. samtools view also reports these reads even though they do not align to the position of interest. Does that make any more sense? Not sure this explains it any better.

Regarding the second question: With useless reads I was referring to the reads I just described above. They overlap/stretch the position of interest but don't contain information about it. I would like to only have the reads from which I can extract information about the nucleotide position of interest.

2
Entering edit mode
3.9 years ago

I'm not sure I understand but here is my code using bioalcidaejdk http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html , where looking for the base rotavirus:200 and for the tag "XS"

samtools view -h input.bam "rotavirus:200-200" | \
java -jar dist/bioalcidaejdk.jar -F SAM -e 'stream().forEach(record->{final String contig= "rotavirus"; final int mutpos = 200; if(record.getReadUnmappedFlag()) return ;if(!record.getContig().equals(contig)) return ;if(record.getEnd() < mutpos) return ;if(record.getStart() > mutpos) return ; int readpos = record.getReadPositionAtReferencePosition(mutpos); if(readpos<1) return ; readpos--; final byte[] bases= record.getReadBases();println(record.getReadName()+" "+record.getContig()+" "+record.getStart()+" "+record.getEnd()+" base["+mutpos+"]="+(char)bases[readpos]+" XS="+record.getAttribute("XS"));});'


output:

(...)
rotavirus_149_661_3:0:0_6:0:0_30c rotavirus 149 213 base[200]=G XS=0
rotavirus_149_673_4:0:0_3:0:0_14d rotavirus 149 218 base[200]=G XS=0
rotavirus_149_688_8:0:0_8:0:0_3bb rotavirus 149 218 base[200]=G XS=0
rotavirus_150_554_9:0:0_6:0:0_2d rotavirus 150 207 base[200]=G XS=0
rotavirus_150_666_2:0:0_4:0:0_ef rotavirus 150 219 base[200]=G XS=0
rotavirus_150_567_5:0:0_1:0:0_2cb rotavirus 150 219 base[200]=G XS=0
rotavirus_150_689_6:0:0_2:0:0_11a rotavirus 150 219 base[200]=T XS=0
rotavirus_141_689_9:0:0_7:0:0_18 rotavirus 151 210 base[200]=G XS=0
rotavirus_151_679_3:0:0_6:0:0_31b rotavirus 151 220 base[200]=G XS=0
rotavirus_148_587_6:0:0_7:0:0_358 rotavirus 151 217 base[200]=G XS=0
rotavirus_151_689_6:0:0_8:0:0_353 rotavirus 151 220 base[200]=G XS=0
rotavirus_152_642_9:0:0_7:0:0_2d3 rotavirus 152 221 base[200]=G XS=0
rotavirus_137_587_7:0:0_5:0:0_351 rotavirus 152 206 base[200]=G XS=0
rotavirus_152_719_6:0:0_7:0:0_dd rotavirus 152 221 base[200]=T XS=0
rotavirus_152_621_5:0:0_6:0:0_36e rotavirus 152 212 base[200]=G XS=0
rotavirus_152_722_7:0:0_9:0:0_209 rotavirus 152 221 base[200]=G XS=0
rotavirus_149_639_8:0:0_4:0:0_286 rotavirus 153 214 base[200]=G XS=0
rotavirus_153_708_4:0:0_7:0:0_3c rotavirus 153 222 base[200]=G XS=0
rotavirus_153_609_5:0:0_8:0:0_1f8 rotavirus 153 222 base[200]=G XS=0
rotavirus_154_610_2:0:0_8:0:0_98 rotavirus 154 223 base[200]=G XS=0
rotavirus_154_713_5:0:0_7:0:0_140 rotavirus 154 223 base[200]=G XS=0
rotavirus_154_721_5:0:0_0:0:0_300 rotavirus 154 223 base[200]=G XS=0
rotavirus_155_558_4:0:0_5:0:0_81 rotavirus 155 224 base[200]=T XS=0
rotavirus_155_683_8:0:0_6:0:0_247 rotavirus 155 207 base[200]=G XS=0
rotavirus_155_741_6:0:0_6:0:0_8b rotavirus 155 224 base[200]=G XS=0
rotavirus_155_768_9:0:0_5:0:0_205 rotavirus 155 208 base[200]=G XS=0
rotavirus_155_632_6:0:0_5:0:0_302 rotavirus 155 224 base[200]=G XS=0
rotavirus_155_663_8:0:0_8:0:0_357 rotavirus 155 224 base[200]=G XS=0
rotavirus_156_621_8:0:0_8:0:0_46 rotavirus 156 225 base[200]=G XS=0
rotavirus_147_599_10:0:0_4:0:0_4b rotavirus 156 216 base[200]=T XS=0

0
Entering edit mode

Just tested it and this is exactly what I needed. Thank you so much!

0
Entering edit mode

Hey Pierre! Sorry to bother you again but I was wondering if it's possible to modify the script so that it also outputs reads that have a deletion in the given position? I've been spending some time trying to figure it out myself but haven't been able to. Best would be if it could write something like "-" instead of the nucleotide. I hate having to ask again since I would prefer to find out myself and I greatly appreciate all your efforts, but if could help me out once more you would make me very happy :)

Thank you!