Tool to retrieve regions with reads?
5
1
Entering edit mode
8.9 years ago
manekineko ▴ 150

Hi,

Is there a tool for retrieving genomic regions occupied with reads as in the picture (reads are ~17-25nt)? I was thinking to use assembly on the reads to reconstruct the regions covered with reads but maybe they are not so dense/overlapped, so using velvet with kmer 15 I got some regions reconstructed but very very small part? So any idea will help, probably there is more smart method to do what I want?

RNA-Seq • 2.1k views
ADD COMMENT
0
Entering edit mode

No, it is a plant and yes they can be TE regions expressing small RNAs, but I'm interested in them too. So want to collect such regions and have the sequence that was occupied by such cloud of reads.

ADD REPLY
2
Entering edit mode
8.9 years ago
James Ashmore ★ 3.4k

If I understand your question correctly, you can retrieve the regions using bedtools. First make a coverage track of reads along the genome (only recording regions with coverage > 0). Then merge any regions which are book-ended (You can change this limit, to merge regions which are within N bp of each other). Then get the FASTA sequence of the merged regions.

bedtools genomcov -ibam sample.bam -bg > sample.bedgraph
sort -k1,1 -k2,2n sample.bedgraph > sample.sorted.bedgraph
bedtools merge -i sample.sorted.bedgraph > sample.sorted.merged.bedgraph
bedtools getfasta -fi genome.fasta -bed sample.sorted.merged.bedgraph -fo regions.fasta
ADD COMMENT
0
Entering edit mode

Thanks , seems what I need probably, will try it!

ADD REPLY
0
Entering edit mode

If I want to make GFF of the regions and not get their sequence how should I proceed?

Particulary I want to see in sample.sorted.merged.bedgraph where are intersections with another GFF?

Can I do it just by merging all my mapped reads with close to each other within Nbp? so skipping genomecov?

ADD REPLY
1
Entering edit mode

A simple approach would be to change the GFF file into a BED file using awk:

awk -v OFS="\t" '{print $1, $4 - 1, $5}' input.gff > output.bed

Subtract 1 from the start position to make it 0-based (thanks to dariober for pointing this out.)

Then use bedtools to look for intersections between the two BED files:

bedtools intersect -a first.bed -b second.bed
ADD REPLY
0
Entering edit mode

Shouldn't it be

awk -v OFS="\t" '{print $1, $4 - 1, $5}' ...

Since start position in gff is 1-based while in bed is 0-based?

ADD REPLY
0
Entering edit mode

Ah yes you are correct, I'll edit my comment, thank you

ADD REPLY
0
Entering edit mode
8.9 years ago

It sounds like you're after stringTie, or cufflinks, or similar. Note that if you have a well annotated genome (e.g., human or mouse), then these regions are likely just repeat regions (e.g., LTRs).

ADD COMMENT
0
Entering edit mode
8.9 years ago

As I understand it, you have reads (RNAseq?) mapped to a reference genome and you want to find clusters of reads, right? If so, you could try to scan your bam file with a peak caller devised for ChIP-Seq data, e.g. macs. Alternatively, you could divide your genome in overlapping windows of desired length, say 100bp windows, count reads in each window and consider as clusters windows with read count above a threshold (in the end that's what peak callers do). for this see the bedtools suite.

Once you have the positions of the clusters you can obtain the sequence with a tool like fastaFromBed (in bedtools).

ADD COMMENT
0
Entering edit mode

Not RNA-seq, it is smallRNA-seq, but probably finding the regions of clusters will be better with some peak caller rather than trying to assemble reads...

ADD REPLY
0
Entering edit mode
8.9 years ago
Malcolm.Cook ★ 1.5k

There are so many variants of your question, it is hard to know exactly what you desire to do.

Trying to literally understand your word "genomic regions occupied with reads", in your coverage map picture thre appear to be six such regions, separated from each other by short regions of zero coverage ("regions NOT occupied by reads").

It that is what you want, then bedtools merge (which works with bam and bed files) will directly answer this question, and identify the coordinates of the six regions.

Docs: http://bedtools.readthedocs.org/en/latest/content/tools/merge.html

ADD COMMENT
0
Entering edit mode
8.9 years ago
A. Domingues ★ 2.7k

If you think these might be piRNA clusters, running proTrac on your data might help you:

proTRAC predicts and analyzes genomic piRNA clusters based on mapped piRNA sequence reads. proTRAC 2.0 and later versions apply a sliding window approach to detect loci that exhibit high sequence read coverage. Subsequently, sequences mapped to these loci are analyzed with respect to typical piRNA and piRNA cluster characteristics to ensure high specificity.

You can find all the scripts needed here.

ADD COMMENT

Login before adding your answer.

Traffic: 1951 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6