Reduce poly-N regions longer than 200bp back to 200bp in a reference genome FASTA
13 months ago
William ★ 5.3k

I have a reference genome with some large chromosomes with long poly-N regions.

Unfortunately, these long poly-N regions (i.e. size-estimated gaps in the reference genome) make the chromosomes longer than some downstream bioinformatics tools accept.

Collapsing these long poly-N regions to a maximum of e.g. 200bp would likely bring the chromosomes within the maximum chromosome size those tools accept. And these size-estimated gaps are of course not used for read mapping or SNP calling.

I thought about using the Linux command tr -s 'N':

https://www.gnu.org/software/coreutils/manual/html_node/Squeezing-and-deleting.html

But that would reduce all poly-N sequences in the reference genome to just 1 N.

And I would like to reduce only poly-N regions longer than e.g. 200bp back to e.g. 200bp.

Is there a good way to do this?

FASTA poly-N • 648 views
13 months ago

E.g. for chr22, linearize the sequence and use sed to replace every run of 1000 or more N with a few N:

samtools faidx ref.fasta "chr22" |\
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}'  |\
sed -r 's/N{1000,}/NNNNNNNNNNNNNNNN/g' |\
tr "\t" "\n" |\
fold -w 100 

Something like this looks like it should work. Unfortunately, sed also does not like the long chromosomes:

sed: regex input buffer length larger than INT_MAX

So this probably needs to be changed to work on a multi-line FASTA text stream rather than on one very long line per chromosome.
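One way to do that, as a minimal and untested-at-scale sketch: let awk read the multi-line FASTA as it is and carry the length of the current N run from one line to the next, so no chromosome is ever held in a single buffer. The function name cap_poly_n and its two arguments (maximum run length, output line width) are made up for this example and are not part of any tool.

```shell
# Hypothetical helper: cap every run of N at MAX bases while streaming a multi-line FASTA.
# Usage: cap_poly_n MAX WIDTH < in.fasta > out.fasta
cap_poly_n() {
  awk -v max="${1:-200}" -v width="${2:-60}" '
    # Print the buffered sequence as full lines of "width" bases;
    # with final=1 also print the shorter last line of a record.
    function emit(final) {
      while (length(buf) >= width) {
        print substr(buf, 1, width)
        buf = substr(buf, width + 1)
      }
      if (final && length(buf) > 0) { print buf; buf = "" }
    }
    /^>/ {                          # header: finish previous record, reset run counter
      emit(1); run = 0; print; next
    }
    {
      line = $0
      while (length(line) > 0) {
        if (match(line, /^[Nn]+/)) {
          keep = max - run          # Ns still allowed in the current run
          if (keep > RLENGTH) keep = RLENGTH
          if (keep > 0) buf = buf substr(line, 1, keep)
          run += RLENGTH            # the run may continue on the next line
        } else {
          match(line, /^[^Nn]+/)    # copy the non-N stretch unchanged
          buf = buf substr(line, 1, RLENGTH)
          run = 0
        }
        line = substr(line, RLENGTH + 1)
      }
      emit(0)
    }
    END { emit(1) }
  '
}
```

For the case in the question this would be called as cap_poly_n 200 60 < ref.fasta > ref.capped.fasta. Lowercase n (soft-masked gaps) is treated like N here, which is an assumption. Any annotation or index tied to the old coordinates would no longer match the collapsed reference.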


An easier solution may be to write a multi-line FASTA with a fixed sequence line length of e.g. 1000bp, and simply replace every line that consists entirely of N (a 1000bp poly-N line) with a 100bp poly-N line. Then re-wrap the FASTA to a fixed width, and maybe run this iteratively.
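A rough sketch of one pass of this line-based idea, using only awk. The function names rewrap and shrink_n_lines and their arguments are invented for illustration. Note that this does not give an exact cap: a poly-N run that straddles line boundaries without filling a whole line (up to 2 x 1000 - 2 bases for 1000bp lines) is left untouched, and longer runs shrink by an amount that depends on where they start.

```shell
# Hypothetical helpers; names and arguments are illustrative only.

# rewrap WIDTH: re-fold the sequence of every record to WIDTH bases per line.
rewrap() {
  awk -v width="$1" '
    function emit(final) {
      while (length(buf) >= width) {
        print substr(buf, 1, width)
        buf = substr(buf, width + 1)
      }
      if (final && length(buf) > 0) { print buf; buf = "" }
    }
    /^>/ { emit(1); print; next }   # header: finish previous record first
    { buf = buf $0; emit(0) }
    END { emit(1) }
  '
}

# shrink_n_lines FROM TO WIDTH: fold to FROM-wide lines, swap each full line
# made only of N for TO Ns, then fold the result back to WIDTH-wide lines.
shrink_n_lines() {
  rewrap "$1" |
    awk -v from="$1" -v to="$2" '
      BEGIN { for (i = 0; i < to; i++) repl = repl "N" }
      !/^>/ && length($0) == from && /^[Nn]+$/ { print repl; next }
      { print }
    ' |
    rewrap "$3"
}
```

With the numbers from the comment this would be shrink_n_lines 1000 100 60 < ref.fasta > ref.shrunk.fasta. Running it again on its own output is the iterative step, and each pass shortens the remaining long runs further.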

