Align ChIP-seq data to specific sequences not in the reference genome
3.8 years ago

Hi all,

I'm very new to bioinformatics, but have many years of coding experience. I am studying parts of the genome that are missing from the reference, i.e. satellite repeats that aren't included in reference genomes. I want to align raw ChIP-seq reads to a specific set of sequences. Basically, I want to build my own reference genome from a small set of sequences and use it for the ChIP-seq analysis.

I want to create a small custom genome, rather than add to an existing genome, so that I can save computational time.

Can anyone give me some pointers of where to get started? Am I thinking about this the right way?

Any tips/info/thoughts would be greatly appreciated!

ChIP-Seq alignment • 768 views
3.8 years ago
GenoMax 141k

Am I thinking about this the right way?

That is debatable. We understand you want to do this because you are interested in un-referenced parts of the genome and want to save computational time. If your sample comes from the entire genome and you align that data to a reduced representation of the genome (like the one you want), there is always a possibility that the aligner, which tries its best to place every read, will align reads to locations where they did not originate in the first place.

If you still want to do this, create a multi-FASTA file with the sequences you are interested in, build a suitable index with the aligner you want to use, and align away. Remember the point about reduced representation, and keep the chance of multi-mapping reads (if your sequences contain repeats and you have short reads) in mind when you look at the results.
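The steps above could be sketched roughly as follows. This is only an illustration: the thread does not name an aligner, so Bowtie 2 (`bowtie2-build`, then `bowtie2 -x/-U/-S` for single-end reads) is an assumption, and the sequence names, sequences, and file names are all made up. The code only formats the FASTA text and prints the commands; it does not run them.

```python
import shlex


def format_fasta(sequences, width=60):
    """Return multi-FASTA text for a {name: sequence} mapping.

    Each record is a '>name' header followed by the sequence
    wrapped at `width` characters per line.
    """
    lines = []
    for name, seq in sequences.items():
        lines.append(f">{name}")
        for start in range(0, len(seq), width):
            lines.append(seq[start:start + width])
    return "\n".join(lines) + "\n"


def bowtie2_commands(fasta, index_prefix, fastq, sam_out):
    """Return (index-building, alignment) commands as argument lists.

    Assumes Bowtie 2 and single-end reads; swap in the equivalent
    commands for whichever aligner is actually used.
    """
    build = ["bowtie2-build", str(fasta), str(index_prefix)]
    align = ["bowtie2", "-x", str(index_prefix),
             "-U", str(fastq), "-S", str(sam_out)]
    return build, align


# Hypothetical satellite sequences, for illustration only.
custom = {
    "satellite_A": "ACGTTGCA" * 10,
    "satellite_B": "GGATCCTA" * 12,
}

fasta_text = format_fasta(custom)
# To save it: open("custom.fa", "w").write(fasta_text)

build_cmd, align_cmd = bowtie2_commands(
    "custom.fa", "custom_idx", "chip_reads.fastq", "chip_vs_custom.sam"
)
print(shlex.join(build_cmd))
print(shlex.join(align_cmd))
```

The printed commands can be pasted into a shell or passed to `subprocess.run`. Reads that map equally well to several repeat copies typically get a low mapping quality in the output, so it is worth inspecting MAPQ before interpreting any enrichment.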

Creating a custom genome that adds the bits missing from the reference may be the best option.


Agreeing with GenoMax here. Aligning to a subset of regions is always problematic because reads from off-target effects, unspecific pulldown, and random DNA that somehow found its way into the library can come from regions not included in the custom reference. The aligner will still try to find the best matches in the given reference, and this leads to false-positive alignments. Better to do as suggested: add your custom sequences to a reference genome (just append them to the genome FASTA file as separate sequences), make a new index, and align against that.
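Appending the custom sequences can be as simple as concatenating two FASTA files. A minimal sketch in Python is shown here, with one extra check that is my own addition rather than something from the thread: it refuses custom records whose names collide with existing reference sequence names, since duplicate names make alignments ambiguous. The function names and file names are hypothetical.

```python
def record_name(header):
    """Return the sequence name: the first word after '>' in a FASTA header."""
    fields = header[1:].split()
    return fields[0] if fields else ""


def append_custom(reference_lines, custom_lines):
    """Yield the reference FASTA lines followed by the custom records.

    Raises ValueError if a custom record reuses a name already seen.
    Works on any iterable of lines, so open file handles stream through
    without loading the whole genome into memory.
    """
    seen = set()
    for source, check in ((reference_lines, False), (custom_lines, True)):
        for line in source:
            if not line.strip():
                continue  # drop blank lines
            if line.startswith(">"):
                name = record_name(line)
                if check and name in seen:
                    raise ValueError(f"duplicate sequence name: {name}")
                seen.add(name)
            # Normalise line endings so the two files join cleanly.
            yield line.rstrip("\n") + "\n"


def combine_files(reference_path, custom_path, out_path):
    """Write reference + custom sequences to a new FASTA file."""
    with open(reference_path) as ref, \
            open(custom_path) as custom, \
            open(out_path, "w") as out:
        out.writelines(append_custom(ref, custom))
```

For example, `combine_files("genome.fa", "satellites.fa", "genome_plus_satellites.fa")` would produce the combined file, which then needs a fresh index from whichever aligner is used before aligning against it.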
