Question

align PacBio data using unaligned bam file

1

Entering edit mode

2.1 years ago

pingu77 ▴ 20

Hello!

I don’t have much experience with PacBio data so I have a very simple question. I would like to align the reads and I have the unaligned bam file (subread.bam). I saw that this tool called pbmm2 (https://github.com/PacificBiosciences/pbmm2/blob/develop/README.md) can align PacBio data using the unaligned bam file directly. However, I am wondering if the HiFi reads (>=Q20) will be ‘extracted’ during the alignment phase or If I have to extract them using for example the extracthifi tool and then I have to use that file for the alignment phase. I tried to align the reads using pbmm2 tool without applying any other analysis but looking at the alignment, I saw that several reads have a MAPQ = 60 but a base Q equals to 0

Thank you in advance for your help.

PacBio pbmm2 HiFi • 3.9k views

ADD COMMENT • link 2.0 years ago by pingu77 ▴ 20

score 2 · Answer 1 · 2022-03-18

2

Entering edit mode

2.1 years ago

Istvan Albert 100k

I believe you first need to produce a consensus sequences with the css consensus caller

then you can either convert that to fasta/fastq or use extracthifi tool to extract reads with Q>=20

then you can use the resulting files as input to other aligners.

ADD COMMENT • link 2.1 years ago by Istvan Albert 100k

0

Entering edit mode

thanks a lot for your reply. I thought those steps were not necessary since the method description states 'Its purpose is to support native PacBio', so I wasn't sure if that means the tool automatically extract the reads with Q >=20

ADD REPLY • link 2.1 years ago by pingu77 ▴ 20

3

Entering edit mode

pbmm2 aligns reads and supports native PacBio file formats, e.g. starting with a uBAM and producing an aBAM. Other aligners like minimap2 require you to start with FASTQ or FASTA input. pbmm2 is still just an aligner. Like described above, the workflow is:

subreads.bam
 |   ccs (without the --all parameter)
 v
hifi_reads.bam
 |   pbmm2
 v
aligned.hifi_reads.bam

If you provide the ccs --all flag, you will generate a reads.bam instead of a hifi_reads.bam, and there's the added step of running extracthifi to extract only the HiFi (>=Q20) reads before alignment.

ccs documentation extract hifi mention

If you're working with PacBio data, you always have the option of emailing support@pacb.com for more information.

ADD REPLY • link 2.1 years ago by William Rowell ▴ 280

1

Entering edit mode

pbmm2 supports aligning both newer PacBio HiFi data, as well as the older CLR data that used to be produced by default PacBio instruments. It doesn't call consensus and generate "HiFi" reads for you, which is the first step that needs to occur after getting your *.subreads.bam file. Often times the service provider will do this part for you, but not all of the time, especially if it's older data.

Subreads are just that, subreads, not Circular Consensus Sequences (CCS that Istvan meant to say). subreads need to be fed into the CCS algorithmm to generate CCS sequences, from which HiFi data can be extracted.

Please see the CCS how-to here: https://ccs.how/

ADD REPLY • link 2.1 years ago by gconcepcion ▴ 410

0

Entering edit mode

Hi all! I have another question.. I am wondering if I have to create the consensus (CSS) even if I don't want to extract hifi reads. I would like to use all reads but if I run the CSS I have to specify the --min-rq parameters and therefore some reads will be discarded.

ADD REPLY • link 2.0 years ago by pingu77 ▴ 20

2

Entering edit mode

If you want Circular Consensus Sequences, (CCS), but not the HiFi subset of CCS, you still need to run the algorithm, but you don't need to explicitly extract the HiFi reads, you can just use the entire *.reads.bam file. If you don't want Circular Consensus Sequences at all and you would prefer to work with the native CLR subreads, then no, you don't need to run the CCS algorithm.

ADD REPLY • link 2.0 years ago by gconcepcion ▴ 410

0

Entering edit mode

Thanks a lot for your help! It's not clear to me in which context it's recommend to work with CCS and in which with CLR subreads. Does it depend on the biological question/analysis? Sorry, I am new to these types of data and I am trying to understand.. in my case, after the alignment, I would like to look at the length of the repeat motifs. Would it be better to work with CSS or with CLR? and why? Thank you in advance!

ADD REPLY • link 2.0 years ago by pingu77 ▴ 20

2

Entering edit mode

Yes, whether or not you work with CCS or CLR is entirely dependent on the question you're trying to answer. In most circumstances you are going to want to work with the CCS data as it's much higher quality data which is important, especially in the case of repetitive sequence. What kind of repeats are you looking at? There are transposon-derived repeats, processed pseudogenes which can occur like repeats, simple sequence repeats, and blocks of tandemly repeated sequences. For transposon derived repeats & pseudogenes you might be able to get away with using CLR data, but for simple sequence repeats and blocks of tandemly repeated sequences, you are probably going to want to use the CCS reads, and more specifically the HiFi reads as they are going to be much higher quality and won't have as many errors in these highly repetitive regions, especially if you are trying to look at repeat lengths. If you don't have enough HiFi reads, you can still use just the larger batch of CCS reads, but you should be careful about interpreting results as the data will not look as clean as if you were using purely HiFi data. there are 1 & 2 pass CCS reads included in the bulk CCS reads will have a higher error rate than the CCS reads with more passes, introducing significant ambiguity into your results.

ADD REPLY • link 2.0 years ago by gconcepcion ▴ 410

0

Entering edit mode

Thanks a lot!!! I am looking at TA repeats motif.. The problem is that if try to get the consensus, most of the reads are filtered out, this is an example of the report I got:

ZMWs input : 6592544

ZMWs pass filters : 7489 (0.11%)

ZMWs fail filters : 6585055 (99.89%)

ZMWs shortcut filters : 0 (0.00%)

ZMWs with tandem repeats : 1256619 (19.06%)

ADD REPLY • link 2.0 years ago by pingu77 ▴ 20

1

Entering edit mode

It could be that the library wasn't created with CCS in mind so therefore the data isn't appropriate for the CCS algorithm, or it could be poor quality DNA. Was this a dataset that you generated / commissioned? or is it an older CLR dataset that you found online somewhere?

At any rate, you can also interrogate the *.json files for the failure reason:

$ zcat chunk99.CCS.zmw_metrics.json.gz | jq '.zmws[] | select(.has_tandem_repeat== true) | {status,zmw}'
{
  "status": "POOR_QUALITY",
  "zmw": "m54006_210727_195308/10748493"
}
{
  "status": "POOR_QUALITY",
  "zmw": "m54006_210727_195308/19071596"
}
{
  "status": "SUCCESS",
  "zmw": "m54006_210727_195308/20578660"
}
{
  "status": "SUCCESS",
  "zmw": "m54006_210727_195308/21954705"
}

At this point it's sounding like a question for PacBio tech support - so I would refer you to support@pacb.com to help you troubleshoot your dataset.

Good Luck

ADD REPLY • link 2.0 years ago by gconcepcion ▴ 410

0

Entering edit mode

Thanks!! It's this dataset https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4664302

ADD REPLY • link 2.0 years ago by pingu77 ▴ 20

1

Entering edit mode

Ahh yes, you're never going to get good CCS reads from that dataset. You can still use it as CLR data, but realize that the error rate will be significantly higher. If coverage is deep enough, you might get enough subreads over your region of interest to still be able to work with though

ADD REPLY • link 2.0 years ago by gconcepcion ▴ 410

0

Entering edit mode

how can you say that I can't get a good CCS reads from that dataset? Is there anything in particular that can help me understand when I can or cannot get a good CCS? PS. I checked the json files and in most cases I get poor quality as status. Thank you again! you are really helping me a lot!

ADD REPLY • link 2.0 years ago by pingu77 ▴ 20

3

Entering edit mode

CLR datasets are generated by creating a library with very long (generally unsheared) DNA so that the resulting subreads are as long as possible. Ideally you'll have a single subread for each ZMW which will be very long. Sometimes you get smaller molecules inserted in the SMRTBell, and because SMRTBells are circular, you will get several passes of the subread which in theory could be used to generate a CCS sequence, but because you are selecting for a long library, ZMWs with multiple subread passes in them will be in the minority. There are very few CLR datasets that would be useful for generating any significant quantity of CCS data

CCS datasets on the other hand are generated by shearing and size selecting the DNA to be in a specific size range( e.g. ~20kb). Because the polymerase read for each ZMW can extend out to hundreds of kb, there will be many subread passes for each individual ZMW, thus the majority of the ZMWs can subsequently be converted into CCS data.

In short - to generate CCS / HiFi data, it all depends on the upstream sample prep and how the SMRTBells were initially size selected / created.