What Is The Most Direct Way To Extract The Splice Junctions From A Sam File?
2
1
Entering edit mode
11.0 years ago
Geparada ★ 1.5k

Hi!

I need to get the introns coordinates (chr:strat-end and the strand) of the spliced reads wich are in SAM file.

I have no experience with this kind of format, so I plan to transform the SAMs files to BED12 files with this pipe line:

samtools view -bS -o file.bam file.sam && bamToBed -bed12 -i file.bam > file.bed12

And then, use galaxy or own python script (which I haven't wrote yet) to extract the introns from the spliced reads.

The problem with this method is weight of the files... it'll take some time to convert to BED12 and then extract the introns...

Do you know a more direct way to extract the introns coordinates from the SAMs files?

sam intron rna • 11k views
ADD COMMENT
0
Entering edit mode

The intron coordinates aren't really there, if anything, they are implicit in (many of) the alignments. Are you sure you want to roll your own instead of running tophat or mgene or similar?

ADD REPLY
0
Entering edit mode

Yes, I know there are implicit in bed12... But I'm not worry about that, because I wrote a scrip to extract introns coordinates from psl alignment format, so I'll try to adapt to bed12 inputs, or do as brentp proposes...

ADD REPLY
16
Entering edit mode
11.0 years ago
brentp 24k

It's not clear exactly what you want to do, but you can extract only the spliced alignments with:

samtools view -S file.sam | awk '($6 ~ /N/)'

and you may be able to pipe that command directly to bamToBed as:

samtools view -hS file.sam | awk '($6 ~ /N/)' \
   | samtools view -bS - \
   | bamToBed -bed12 > file.bed12

From there, you can get introns by subtracting exons:

so:

bed12ToBed6 -i file.bed12 > file.bed6
subtractBed -a file.bed12 -b file.bed6 -s | cut -f 1-6 > file.introns.bed
ADD COMMENT
1
Entering edit mode

Geparada, I updated the post to get you all the way to a file with the introns.

ADD REPLY
0
Entering edit mode

ha thanks Brent, I didn't know this syntax for awk.

ADD REPLY
0
Entering edit mode

Hi brentp!... I not only want to get the spliced reads... I want to get the genomics coordinates of the introns in those reads, I mean where they are in the genome (chromosome, start, end and strand). From a bed12 you can easily extract this information, but I wonder if are a script to extract direct from the SAM file.

ADD REPLY
0
Entering edit mode

and I should add, in many cases, you'll want to infer exon junctions from paired-end reads. This obviously wont help you there.

ADD REPLY
0
Entering edit mode

I tried convert the SAMs files to bed12 as you propose and I noticed that the samtools view -bS - need the SAM's header... so the -h flag must be present in samtools view -S file.sam to to keep the headers... thanks you so much for your help!

ADD REPLY
0
Entering edit mode

I tried convert the SAMs files to bed12 as you propose and I noticed that the [samtools view -bS -] need the SAM's header... so the -h flag must be present in [samtools view -S file.sam] to to keep the headers... thanks you so much for your help!!

ADD REPLY
0
Entering edit mode

I added the -h as you suggested.

ADD REPLY
1
Entering edit mode
11.0 years ago
Malcolm.Cook ★ 1.3k

if you swing with R, a slight change to my post to account for strandedness should get you there...

ADD COMMENT

Login before adding your answer.

Traffic: 1191 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6