Question: BWA-MEM using long PacBio reads
0
gravatar for Roxane Boyer
17 months ago by
Roxane Boyer560
Auvergne
Roxane Boyer560 wrote:

Hello again everyone,

I'm struggling with the outputs of the tool BWA-MEM on my assembly alignement.

I have two input files : - pb_268_p_ctg.fasta : an assembly of my genome, that I want to be polish, that's why I need an alignement - filtered-reads.fastq : my raw reads from PacBio, a almost 90 Gb file

I tried to align my raw reads on my assembly using BWA-MEM in the following way :

./bwa index /path/pb_268_p_ctg.fasta
./bwa mem -x pacbio -t 20 /path/pb_268_p_ctg.fasta /path/filtered_subreads.fastq > /path/falcon_aligned_reads.sam

But the results are very strange. After indexing the sam file, IGV show a very strange message :

File: /media/loutre/SUZUKII/assembly_tries/falcon_aligned_reads.sam does not contain any sequence names which match the current genome.  File:      null, Genome: 000000F, 000001F, 000002F, 000003F, ...

The sequences name seems to be the same within the assembly file and the sam header file...

Did I missed something ?

Cheers,

Roxane

pacbio sam alignement bwa-mem • 2.2k views
ADD COMMENTlink modified 17 months ago by Brian Bushnell15k • written 17 months ago by Roxane Boyer560

Can you post a few sequence headers from pb_268_p_ctg.fasta? Are there spaces in fasta headers?

ADD REPLYlink modified 17 months ago • written 17 months ago by genomax49k

Ah ! And here some sequence headers from pb_268_p_ctg.fasta :

Loutre:~$ head -1 '/media/loutre/SUZUKII/assembly_tries/pb_268_p_ctg.fasta' 
>000000F 002170011:B~002650282:E~001494317:E~001208944:E ctg_linear 20318800 109944874
ADD REPLYlink written 17 months ago by Roxane Boyer560

Here is an example of the lines contained within the sam file pro

Loutre:~$ head '/media/loutre/SUZUKII/assembly_tries/falcon_aligned_reads.sam' 
@SQ SN:000000F  LN:20279970
@SQ SN:000001F  LN:15392955
@SQ SN:000002F  LN:12197397
@SQ SN:000003F  LN:8269814
@SQ SN:000004F  LN:7009319
@SQ SN:000005F  LN:6740481
@SQ SN:000006F  LN:5550981
@SQ SN:000007F  LN:4663438
@SQ SN:000008F  LN:4675199
@SQ SN:000009F  LN:4040179
ADD REPLYlink written 17 months ago by Roxane Boyer560

Which genome do you load for igv? Are the chromosome identifiers in that genome the same as those you used for the alignment using bwa mem?

ADD REPLYlink written 17 months ago by WouterDeCoster29k

Within igv, I load pb_268_p_ctg.fasta, so yes the identifiers are supposed to be the same, but I think that something went wrong with identifiers

ADD REPLYlink written 17 months ago by Roxane Boyer560

You should either truncate the fasta headers after the first space (as long as they are unique) or convert the spaces to an "_" (I would go with the former). That should fix the IGV issues. You may need to realign the data.

ADD REPLYlink written 17 months ago by genomax49k

Okay, so within the assembly file I change spaces with "_" or troncate them, and then I realign using, right ? Let's try this, I'll tell you. Thanks for your quick answers!

ADD REPLYlink written 17 months ago by Roxane Boyer560

Go for the truncation (as long as ID's are unique). Those "chromosome" names would get ugly if you change them with an "_".

ADD REPLYlink modified 17 months ago • written 17 months ago by genomax49k

Sorry for the late reply, it was quite long to re run. Well, sorry as I wasn't totally sure that the ID's were uniq, I opt for the ugly version with the "_". But in fact, it didn't solved my issue, the problem remains the same.

head -2 '/media/loutre/SUZUKII/assembly_tries/falcon_aligned_reads.sam' 
@SQ SN:000000F_002170011:B~002650282:E~001494317:E~001208944:E_ctg_linear_20318800_109944874    LN:20279970
@SQ SN:000001F_000757637:B~002089426:B~001707116:B~002616577:E_ctg_linear_15392828_83151272 LN:15392955

head -1 '/media/loutre/SUZUKII/assembly_tries/pb_268_p_ctg_s.fasta' 
>000000F_002170011:B~002650282:E~001494317:E~001208944:E_ctg_linear_20318800_109944874

The message from IGV remains the same

ADD REPLYlink written 17 months ago by Roxane Boyer560

I actually suspect my raw reads file... Because when I try to type head -1 my_raw_read_file.fastq, everything start is shown on my screen...

An example of strange line :

,&**..+-/.+*+".%*"."-&-,+*-..$/,%+&-'#,//,-)%%%+.-++,*.,*+'.--./,-",*/%/+#..$/&/-/.*+-/+-&,.,-+--/+*(,.+,++)+-+#*'%.,&#-'-#-+*.%".&-.*%$*.,*(/,,-.,/*#-&-+%"'+*.,-'.+./"-.-+-)./**/.-)./*,//--//#-//$&$+%./+#+,.)*"$)..,"/-.,//)+,+&),,*-../)'('.&*.$.+"..+,&'(.',&#.,//.-',)/+,/,-%(./)-/(-$.%,..(*-$%*%%)+"#,.,--++*,#.)%+,./*-(+,"-/,,,&.,+.".+)*$"&,)+-(',#.&.(-/.+')/.+"/&.*(+$//)-,---/.-)+.,--,.++$./,(..-,/'.*/-."-.."//.,../.')/,%/''-//.".../-)/-//,.&%.,".-&*-,&)($+,*--+%.+.-.)./.(/,*/-'.'&..(&,),//.*.&-)/./,)&@m160112_141523_42237_c100966892550000001823200405121626_s1_p0/398/13732_27193TTTGACATAGGCTATCGGAAGGCTTCTGGGTAATATATCTTAGCACAGAACTCCTAACCTAATTCGATAGTGCTGCGTAATTGAGTAGATTGTCATATATCTGTTATGTGATCTTTCGCCTCCGTGGCACTCAAGCTTTGTGCGCCTATGAGGCATGCATTTTTTGCATGGTTTAATAGCCTTTAACCGCGCATATAAGGCTCCAAACGATTAAACGTCTTCAGGGATCCCACTTAAGATGTTAGGAATACAACTGAACCTGATTCTGAAATCAACTTATATTTTTCGCAGCGATAAAGGTATTTCTTTTATATAATACCATTTTTAATGTAAGTTAAGGAATTAAGTCGATTTTTATGATACCATTTAATATTGGTAAAAATGCGATTTGACAAAAAAAGCAACAGAAATCTGTTTGCTTAACGAAATAGTTATTAAAAAACAAATGTCAAA

And so, everything is on one line... Myabe the problem is here. Because I was first tickled when bwa mem said that he read only 1 read...

I'm going to try to fix this file now.

ADD REPLYlink written 17 months ago by Roxane Boyer560

Actually, I have absolutely no idea of how to seperate the different lines...

Should I use awk ? I like also perl scripts even if it's a bit old school...

ADD REPLYlink written 17 months ago by Roxane Boyer560

How did that happen? If the file was corrupted somewhere in the process you would be working with tarnished data.

Does the true "original" also look like this?

ADD REPLYlink written 17 months ago by genomax49k

Sadly, this is the original file...

A collaborator from Munich gave it to me, I downloaded it from a server, and I never modified it.

I even used this file to perform an assembly using canu, and Canu worked fine with it... This is very strange.

Should I rerun hdf5 tools on it ? Or should I manually clean the file ?

ADD REPLYlink written 17 months ago by Roxane Boyer560

What do you feel about this? What any of us think may be less important since you would be the one presenting/defending this data/analysis, which seems to have some flaw in the raw data.

ADD REPLYlink modified 17 months ago • written 17 months ago by genomax49k

Well, I'm still a student, that's why I'm not always sure about my opinions, and ask a lot of questions. But you guys were right, I don't know how and when, but it seems that the file was corrupted. The last modification date is the 10 January... I didn't even used this file at this moment. I really don't know how it could happen. Thankfully, I can download it again from a distant server. It will take time, but maybe one day I will be able to successfully polish my data ! Thanks again for all the answers and the support !

ADD REPLYlink modified 17 months ago • written 17 months ago by Roxane Boyer560

I wouldn't expect that Canu would work with a corrupted file like that and assume that the corruption happened later...

ADD REPLYlink written 17 months ago by WouterDeCoster29k
0
gravatar for Charles Warden
17 months ago by
Charles Warden5.0k
Duarte, CA
Charles Warden5.0k wrote:

BWA is OK to use with CCS reads (preferably with at least 5 or 10 cycles), but that .fastq file seems quite large (and, unfortunately, unless you get a more raw data format, you can't use PacBio functions to define CCS reads).

You could also use BLASR with a .fasta file, but the recommendation is to use an unaligned .bam file for aligning subreads with BLASR (which includes additional information about each read). Also, some downstream tools designed for Illumina data might have issues with some differences in the PacBio .bam file format used by BLASR.

There could also be issues with the chromosome name formatting, but it looks like you've got a lot of good comments addressing that issue.

ADD COMMENTlink written 17 months ago by Charles Warden5.0k

Hi Charles ! Thanks for your answer. Indeed, I already know Blasr, but I have a lot of trouble with it, the last version is not working anymore, because blasr do not support --bam or --sam option anymore.. That's why I was looking for an alternative ! Bwa mem seems to be the solution

ADD REPLYlink written 17 months ago by Roxane Boyer560

Last time I checked, the acceptable output formats depended upon the input format - for example, I remember the change to not allow direct .sam file creation if the input is a .bam file. However, I have a version of BLASR installed in this Docker image that will at least work with the following command:

blasr [unaligned.bam] [ref.fa] --bam --bestn 1 --nproc [2] --out [aligned.bam]
ADD REPLYlink written 17 months ago by Charles Warden5.0k

Well, me and some other people are encountering the same problem with blasr. Just look at this issue in github : https://github.com/PacificBiosciences/blasr/issues/324 It indeed seems that it accept only "valid" (whatever it means...) input files. Even if my files come directly from PacBio tools, I'm still struggling with blasr... Well, I think I'll just do it with bwa-mem then ! At least, it works

ADD REPLYlink written 17 months ago by Roxane Boyer560
0
gravatar for Brian Bushnell
17 months ago by
Walnut Creek, USA
Brian Bushnell15k wrote:

BBMap works well with raw PacBio data, depending on your needs. It does break reads longer than 6kbp into 6kbp segments, but it's capable of handling raw PacBio's high error rates with extremely high accuracy.

Usage:

mapPacBio.sh in=reads.fq out=mapped.sam ref=ref.fa maxlen=6000
ADD COMMENTlink written 17 months ago by Brian Bushnell15k
1

BBMap was covered in an earlier thread that Roxane Boyer had started : C: Alternative to BLASR ?

ADD REPLYlink modified 17 months ago • written 17 months ago by genomax49k

Ah! Thanks.

Roxane Boyer, did you encounter a problem with it?

ADD REPLYlink written 17 months ago by Brian Bushnell15k

I think she thought only first 6kb would be mapped and decided to do bwa mem first.

ADD REPLYlink written 17 months ago by genomax49k

Ah, that's indeed what I first thought. Maybe I'll try it later to compare my results with BWA-MEM !

ADD REPLYlink written 17 months ago by Roxane Boyer560

Hello all,

I know that this is an old thread but I have a large amount of raw PacBio reads that I wish to map to an existing mitochondrial sequence (only 90kbp) to extract unmapped reads to use in assembly.

Can I use the above script with bbmap? I can't find this usage in the manual. It seems to take only one input (reads.fq)?

many thanks for your help.

ADD REPLYlink written 4 months ago by peri.tobias0

You should be able to. Specify outu= to capture reads that do not map to that file.

ADD REPLYlink modified 4 months ago • written 4 months ago by genomax49k

Thanks, I saw this in the manual. What I am unclear about is the inputs. I will try this to see how it runs:

bbmap.sh ref=AP_MT.fa
bbmap/mapPacBio.sh in=/project/Assembly/*.fasta outu=unmapped.fasta outm=mapped.fasta ref=AP_MT.fa maxlen=6000
ADD REPLYlink written 4 months ago by peri.tobias0

The above script does not work. For mutiple fasta inputs you need to list them all. *.fasta is not recognised. The following seems to work.

bbmap.sh ref=AP_MT.fa
mapPacBio.sh in=reads1.fasta reads2.fasta reads3.fasta outu=unmapped.fasta outm=mapped.fasta maxlen=6000
ADD REPLYlink modified 4 months ago • written 4 months ago by peri.tobias0

another way to list all the fasta files is:

mapPacBio.sh in=`echo ./foo/*.fasta` outu=unmapped.fasta outm=mapped.fasta maxlen=6000
ADD REPLYlink written 4 months ago by peri.tobias0

Just wanted to add that bbmap only seems able to take one in=file. My above trials at listing files does not work. Perhaps you could concatenate but my fasta files are huge so not going to risk. Will run them individually and output multiple unmapped and mapped .fasta files.

ADD REPLYlink written 4 months ago by peri.tobias0

Good to know that mapPacBio takes in fasta files. Do you not have fastq format files for this data? It would be useful to have the Q-scores when BBMap does the alignment.

ADD REPLYlink written 4 months ago by genomax49k

Thanks very much for your comment, genomax. I initially used dextract to output fasta, quiver and arrow files from bam and hdf5. https://dazzlerblog.wordpress.com/tag/arrow/ Quiver files are fastq-like but not exactly fastq format - though a simple conversion is probably possible. I intend to polish the assembly using arrow files once complete so was not too concerned about using fasta inputs at this stage. Perhaps I should though?

ADD REPLYlink written 4 months ago by peri.tobias0

You could use bam2fastx utility from PacBio to get the fastq format files.

ADD REPLYlink written 4 months ago by genomax49k

thanks, appreciated!

ADD REPLYlink written 4 months ago by peri.tobias0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1516 users visited in the last hour