How to extract reads/fragments aligned to intergenic regions from bam/sam file?
1
0
Entering edit mode
2.2 years ago
MACRODER • 0

Hi. I'm new to RNA-seq data analysis, and I'm trying to extract all reads/fragments aligned in intergenic regions from a sam/bam file. I have a .GTF file with genomic features. I'm working with an organism that does not have introns, so, I would like to use the end position of each gene as a start position of my region of interest, and the start position of the following gene as the end position of my roi. I would like the output to be a bam file with only the reads that map to intergenic regions. Can anyone help me? Sorry if it is a basic question.

And I would also like to construct a GTF file with this intergenic regions as if they were features, but my skills in programming languages are not really good.

Any help will be appreciated. Thank you

RNA-Seq alignment sequencing bam intergenic • 1.3k views
1
Entering edit mode

You should be able to do it with bedtools intersect see here: https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html

2
Entering edit mode
2.2 years ago

extract the gene positions and convert as bed.

awk -F '\t' '($3=="gene") {printf("%s\t%d\t%s\n",$1,int($4)-1,$5);}' in.gtf | sort -t $'\t' -k1,1 -k2,2n | bedtools merge > genes.bed  get the complement bed with bedtools complement extract the bam: samtools view -L intergenic in.bam  ADD COMMENT 0 Entering edit mode Pretty sure bedtools needs -i - to read from stdin ;-) ADD REPLY 0 Entering edit mode I don't thing 'merge' needs it. ADD REPLY 0 Entering edit mode Thank you so much. I tried your code with minor modifications and I think it worked perfectly. If anyone ever has the same problem, I used the following: 1-Convert genes.gtf to genes.bed awk -F '\t' '($3=="gene") {printf("%s\t%d\t%s\n",$1,int($4)-1,$5);}' genes.gtf | sort -t$'\t' -k1,1 -k2,2n | bedtools merge > genes.bed


2-Extract intergenic regions from genes.bed

bedtools complement -i genes.bed -g file.genome > intergenic.bed


file.genome is a file containing the length of the chromosomes/contigs

3-Output alignments in a bam file overlaping the intergenic.bed file

samtools view -b -L intergenic.bed input.bam | samtools sort > intergenic.bam

0
Entering edit mode

I don't think you need samtools sort if input.bam was already sorted.

Please check the green mark on the left to validate and close the quesion.

0
Entering edit mode

Yes, you are right, my input bam was already sorted. I supposed samtools sort was not necessary, but just in case I added it. Green mark checked. Thanks again