Question: Modifying fastq base at specific reference location on different length reads
yryan wrote, 5 weeks ago:

Hi folks,

I'm interested in using Oxford Nanopore's taiyaki tool to train a new basecaller for modified bases at a known position. To train the new model I need to modify the fastq (or the sam, then convert back) for each fast5 file so that it marks this modified base. However, I have around 10k reads of different lengths, and combined with the MinION's inherent error rate, this isn't something I can edit with a regex as far as I know.

Does anyone know of a method or script that can use a sam file aligned to a consensus where I can modify the base at a specific location which would get around the previous issues?

Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 5 weeks ago:

"... or script that can use a sam file aligned to a consensus where I can modify the base at a specific location"

See the Biostars post "How to introduce artificial mutation in bam".
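
For readers who want to see the idea behind this kind of edit, here is a minimal Python sketch. It is not the code of the linked tool. It assumes you have the POS, CIGAR and SEQ fields of one SAM record, walks the CIGAR to find which read base is aligned to a 1-based reference position, and replaces that base. The function names `query_index_at` and `set_base_at` are made up for this example.

```python
import re

# CIGAR operations that consume read bases and/or reference positions (SAM spec).
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REF = set("MDN=X")


def query_index_at(pos, cigar, ref_pos):
    """Return the 0-based index in SEQ aligned to 1-based ref_pos, or None.

    pos is the 1-based leftmost mapping position (SAM column 4).
    None means the read does not cover ref_pos, or has a deletion/skip there.
    """
    q = 0    # 0-based index into SEQ
    r = pos  # 1-based reference coordinate
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        in_q = op in CONSUMES_QUERY
        in_r = op in CONSUMES_REF
        if in_r and r <= ref_pos < r + length:
            # ref_pos falls inside this operation.
            return q + (ref_pos - r) if in_q else None
        if in_q:
            q += length
        if in_r:
            r += length
    return None


def set_base_at(pos, cigar, seq, ref_pos, new_base):
    """Return seq with the base aligned to ref_pos replaced by new_base.

    The sequence is returned unchanged when the read has no base at ref_pos.
    """
    i = query_index_at(pos, cigar, ref_pos)
    if i is None:
        return seq
    return seq[:i] + new_base + seq[i + 1:]


# Read starts at 4600 with 2 soft-clipped bases and a 1-base insertion.
print(set_base_at(4600, "2S3M1I4M", "GGACGTAATC", 4605, "Y"))
```

Because the lookup goes through the alignment, soft clips, insertions and deletions in individual reads do not shift the edit, which is what makes a plain regex unsuitable here. Note that SAM stores SEQ on the reference strand, so reads aligned to the reverse strand need to be reverse-complemented again when converting back to fastq.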

that looks like just the thing, thanks!

Reply written 5 weeks ago by yryan

please flag the question as answered if it fulfills your needs (green tick on the left)

Reply written 5 weeks ago by Pierre Lindenbaum

I was wondering if I could get a bit more help... When I run the command:

java -jar /bioinformatics_tools/jvarkit/dist/biostar404363.jar -o modified.bam -p basecalled.vcf original.bam

The output is only partially converted: the T's become N's for the first 30 or so entries, while the remainder (~6k) are unchanged. This happens even with no AF value in the VCF (below), which I'd assumed would convert all T's to N's.

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##samtoolsVersion=1.9+htslib-1.9
##samtoolsCommand=samtools mpileup -v -f reads.fasta basecalled/basedcalled_sorted.bam
##reference=file://reads.fasta
##contig=<ID=X,length=6000>
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency among genotypes, for each ALT allele, in the same order as listed">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
X   4605    .   T   N   .   .   .

Using the samtools tview command from the linked post, I can see that only a small proportion of reads are being converted to N's, and these are the reads at the end of the terminal output; all of those at the beginning are unchanged. Is there anything I can do to change this?

Also, I realise this may be a bit much to ask, but would it be possible to allow non-canonical bases, say Y, in this workflow? This would be very useful for creating a training set for nanopore basecalling of novel modifications.

Reply written 4 weeks ago by yryan (modified 4 weeks ago by genomax)

Hard to answer without seeing the BAM and the VCF. Please open an issue at https://github.com/lindenb/jvarkit/issues and attach a BAM narrowed to the region around the position.
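
Narrowing is usually done with samtools view and a region on an indexed BAM. As a rough illustration of what "narrowing around the position" means, here is a small Python sketch that works on SAM text instead (for example the output of samtools view -h). It keeps the header plus every alignment whose reference span overlaps the position. The names `reference_end` and `narrow_sam` and the `flank` parameter are assumptions made for this example, not part of samtools or jvarkit.

```python
import re


def reference_end(pos, cigar):
    """Return the 1-based inclusive end of an alignment on the reference."""
    span = sum(
        int(n)
        for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if op in "MDN=X"  # operations that consume reference positions
    )
    return pos + span - 1


def narrow_sam(lines, contig, ref_pos, flank=0):
    """Yield header lines plus alignments overlapping ref_pos +/- flank."""
    lo, hi = ref_pos - flank, ref_pos + flank
    for line in lines:
        if line.startswith("@"):
            yield line  # keep the header so the output is still valid SAM
            continue
        fields = line.rstrip("\n").split("\t")
        rname, pos, cigar = fields[2], int(fields[3]), fields[5]
        if rname != contig or cigar == "*":
            continue  # other contig, or unmapped read
        if pos <= hi and reference_end(pos, cigar) >= lo:
            yield line
```

With the VCF from this thread, `narrow_sam(lines, "X", 4605, flank=50)` would keep only reads touching positions 4555 to 4655 on contig X. A read that spans the position with a deletion is still kept, since it overlaps the region even though it has no base there.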

Reply written 4 weeks ago by Pierre Lindenbaum