Hi there,
I have a soft-masked reference genome for a plant species and need to extract the soft-masked regions, for all chromosomes, as coordinates in a separate BED file.
While searching, I found a few posts that mention seqkit, which I tried, but the results did not look correct. For instance, the first 191 bases of my sequence are soft-masked, whereas the tool reports something different...
Is there another, more consistent way to do this, that is, to extract the soft-masked regions per chromosome and store them in a BED file? Let me know, thanks in advance!
I fully agree with this approach, only I would personally replace them with X or similar, so as not to confuse them with the Ns that might already be there to denote gaps in the assembly.
@lieven.sterck, how can I do that?
Picard seems to accept only {N,ACGT,BOTH}... You can do it using two sed commands:
sed '/^[^>]/s/[ATGCN]/A/g'
followed by
sed '/^[^>]/s/[^ATGCN]/N/g'
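If a single pass is preferred over two sed calls, the same recoding can be sketched in Python. This is only an illustration and not something taken from the posts here; `recode_line` is a hypothetical helper name. Like the two sed patterns, it turns every A, T, G, C or N on a sequence line into A and every other character (the soft-masked lowercase bases, but also any other IUPAC code) into N, and it leaves header lines untouched.

```python
def recode_line(line: str) -> str:
    """Recode one FASTA line the way the two sed substitutions would.

    Header lines (starting with '>') are returned unchanged.
    On sequence lines, A/T/G/C/N become 'A' and anything else becomes 'N'.
    """
    if line.startswith(">"):
        return line
    body = line.rstrip("\n")
    newline = line[len(body):]  # keep the original line ending, if any
    recoded = "".join("A" if base in "ATGCN" else "N" for base in body)
    return recoded + newline
```

Applied line by line to the genome, this should give the same output as piping the file through the two sed commands, so the masked stretches end up as runs of N and everything else, original gap Ns included, as A.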
@Pierre Lindenbaum, I thought about two consecutive sed calls as well; however, I wondered whether there was a way to do it in a single command. Possibly that's the best way anyway, thanks again!
Never mind, just follow what Pierre Lindenbaum says (after all these years I should know not to contradict Pierre :) )
(I just realised that he's doing the opposite of what I had in mind: it replaces all the uppercase letters, gap Ns included, leaving the lowercase ones, and then, following the linked post, you can get your lowercase regions, so there is no risk at all of confusing them with the gap Ns.)
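For those who would rather skip the recoding step, the lowercase runs can also be read straight from the FASTA and written as BED. The following is a minimal sketch under a few assumptions: the genome is a plain (uncompressed) FASTA, soft-masking means lowercase letters, and the chromosome name is the first word of the header. `softmasked_intervals` and `write_bed` are hypothetical names, not part of any tool mentioned in this thread. Coordinates are 0-based and half-open, as BED expects, so a sequence whose first 191 bases are masked would be reported as start 0, end 191.

```python
import re
from typing import Iterable, Iterator, Optional, TextIO, Tuple

LOWER_RUN = re.compile(r"[a-z]+")  # one soft-masked stretch within a line


def softmasked_intervals(lines: Iterable[str]) -> Iterator[Tuple[str, int, int]]:
    """Yield (name, start, end) for each lowercase run, 0-based half-open."""
    name = ""
    offset = 0  # bases of the current sequence seen so far
    pending: Optional[Tuple[int, int]] = None  # run that may continue on the next line
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if pending is not None:
                yield name, pending[0], pending[1]
                pending = None
            fields = line[1:].split()
            name = fields[0] if fields else ""
            offset = 0
            continue
        for match in LOWER_RUN.finditer(line):
            start = offset + match.start()
            end = offset + match.end()
            if pending is not None and pending[1] == start:
                # The run continues across a line break: extend it.
                pending = (pending[0], end)
            else:
                if pending is not None:
                    yield name, pending[0], pending[1]
                pending = (start, end)
        offset += len(line)
    if pending is not None:
        yield name, pending[0], pending[1]


def write_bed(lines: Iterable[str], out: TextIO) -> None:
    """Write the soft-masked intervals as tab-separated BED lines."""
    for name, start, end in softmasked_intervals(lines):
        out.write(f"{name}\t{start}\t{end}\n")
```

Passing an open file handle for the genome and `sys.stdout` (or an open output file) to `write_bed` writes one line per masked region. Uppercase Ns are ignored, so assembly gaps are not mixed up with soft-masked sequence, and counting the regions per chromosome is then just a matter of counting BED lines per name.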