Question: convert paired fastq files to fasta with EMBOSS:seqret or FASTX-toolkit:fastq_to_fasta
0
gravatar for marongiu.luigi
12 months ago by
Germany, Mannheim, UMM
marongiu.luigi420 wrote:

Dear all,

I have converted a BAM file into paired fastq files with:

bamToFastq --bam {file}.bam -fq {file}_1.fq -fq2 {file}_2.fq

Now I need to obtain a fasta file from these. I can see that EMBOSS' seqret and FASTX-Toolkit's fastq_to_fasta can do that; the manual for the operation are given for instance here:

seqret -sequence reads.fastq -outseq reads.fasta
fastq_to_fasta ... -i {file}.fq -o {file}.fa

However, the syntax os for a single fastq file input. How do I apply them on paired fastq files?

ADD COMMENTlink modified 12 months ago by finswimmer12k • written 12 months ago by marongiu.luigi420
2
gravatar for genomax
12 months ago by
genomax74k
United States
genomax74k wrote:

Convert them independently/individually.

Otherwise use reformat.sh from BBMap suite.

reformat.sh in=file.fq out=file.fa

I have not tried it but reformat.sh in1=R1.fq in2=R2.fq out1=R1.fa out2=R2.fa may work.

ADD COMMENTlink modified 12 months ago • written 12 months ago by genomax74k

Won't a loose information by doing it independently?

ADD REPLYlink written 12 months ago by marongiu.luigi420

Won't a loose information by doing it independently?

You are only stripping away the quality scores not touching the sequence. If your files are not in sync then you would want to use repair.sh first.

ADD REPLYlink written 12 months ago by genomax74k
1
gravatar for finswimmer
12 months ago by
finswimmer12k
Germany
finswimmer12k wrote:

Hello marongiu.luigi ,

you can use samtools fasta to convert bam directly to fasta. To do this you need to sort the bam file by readname:

$ samtools sort -n input.bam > namesorted.bam
$ samtools fasta -1 1.fa -2 2.fa namesorted.bam

fin swimmer

ADD COMMENTlink written 12 months ago by finswimmer12k

I tried that but, in my hands, this gave me a pile of lines where simply the header @<something> was converted into ><something>; the result was, therefore, a multi-fasta. That is:

$ head 1.fa
>DHGDKXP1:247:C3MF6ACXX:5:1101:1166:95391
ATTGATAAAGCTGTTGCCGATACTTGGAACAATTTCTGGAAAGACGGTAAAGCTGATGGTATTGGCTCTAATTTGTCTAGGAAATAACCGTCAGGATTGAC
>DHGDKXP1:247:C3MF6ACXX:5:1101:1234:39592
GCCAGTTAAATAGCTTGCAAAATACGTGGCCTTATGGTTACAGTATGCCCATCGCAGTTCGCTACACGCAGGACGCTTTTTCACGTTCTGGTTGGTTGTGG

I need instead to merge all the fastq lines into a single stretch to generate a single-entry fasta file.

ADD REPLYlink written 12 months ago by marongiu.luigi420

I need instead to merge all the fastq lines into a single stretch to generate a single-entry fasta file.

That is not what you were asking in original question. Two programs you had originally mentioned only convert fastq to fasta format. Nothing else. This is a completely different need.

What does this mean? You want to merge individual R1/R2 reads and then make a longer representation in fasta format? Or you want to assemble all that sequence data?

ADD REPLYlink modified 12 months ago • written 12 months ago by genomax74k

Or maybe get the consensus sequence?

ADD REPLYlink written 12 months ago by finswimmer12k

That would be great but the methods I have used don't work. For instance I have used:

samtools mpileup -d8000 -f {ref.fa} {bam_output_srt} |  bcftools call -c \
| vcfutils.pl vcf2fq | seqtk seq -aQ64 [-q20] -n N > {outfile.fa}

and

samtools mpileup -AuDS -Q 10 -q 10 -d 10000 -f {ref} {bam_output_srt} \
| bcftools view -cg - | vcfutils.pl vcf2fq | awk \
'BEGIN{P=1}{if(P==1||P==2){gsub(/^[@]/,">");print}; if(P==4)P=0; P++}' \
| sed -n '/+/q;p' > {out_file.fa}

But the output is empty. Somethign worng with the syntax perhaps?

ADD REPLYlink written 12 months ago by marongiu.luigi420
1

Which version of samtools/bcftools are you using? The second one looks like 0.1.19 (which is outdated for 5 years now).

To get the consensus sequence you have to do variant calling first to incoorperate the variants into the reference sequence to get the consensus sequence.

Update to the current samtools/bcftools version (v1.9) first.

$ bcftools mpileup -f reference.fa input.bam|bcftools call -m -v -Ov -o variants.vcf
$ bcftools consensus -f reference.fa variants.vcf > consensus.fa

fin swimmer

ADD REPLYlink modified 12 months ago • written 12 months ago by finswimmer12k
$ samtools --version
samtools 1.9
Using htslib 1.9
Copyright (C) 2018 Genome Research Ltd.
$ bcftools --version
bcftools 1.7
Using htslib 1.7-2
Copyright (C) 2016 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
ADD REPLYlink written 12 months ago by marongiu.luigi420

I updated bcftools

$ bcftools --version
bcftools 1.9
Using htslib 1.9
Copyright (C) 2018 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

but I don't have a vcf for the reference file. When I launched the first command I got:

[E::hts_open_format] Failed to open file 504-147_variants.vcf
Failed to open 504-147_variants.vcf: No such file or directory
[mpileup] 1 samples in 1 input files

Since no vcf file was created in this step, I could not launch the second command.

ADD REPLYlink written 12 months ago by marongiu.luigi420

Oops, sorry. There was in -o missing before the output file name. Edited it.

ADD REPLYlink written 12 months ago by finswimmer12k

So, I could create the vcf file but then bcftools consensus wanted it bgzipped so I had to use bgzip 504-147_variants.vcf with bgzip installed with sudo apt install tabix. However, bcftools created a fasta of the reference, not the bam file:

$ bcftools consensus -f ~/refSeq/virChr.fa 504-147_variants.vcf.gz > 504-147.fa
$ head -n 3 504-147.fa
>V dna:chromosome chromosome:GRCh38:V:1:353664355:1 REF
TAGTATTACCCTCGGCACCTTGGAACCAGGAACCACCTTGGAACCCACCCGAGATGGCGG
AGGCTGAAGGCCCGGCGGGGGGAAAGACGCGGCGGCCCCGTGAAGCACCAGCGGCCCGCT

and this is not a multi-fasta with reference plus query since there is only one sequence:

$ grep '>' 504-147.fa
>V dna:chromosome chromosome:GRCh38:V:1:353664355:1 REF

Here, 504-147 is the subset of the original BAM file I am investigating and V dna is the custom genome I am using.

ADD REPLYlink modified 12 months ago • written 12 months ago by marongiu.luigi420

Hello marongiu.luigi ,

I would suggest we go back to start. Please describe what data you have and what you like to find out. Be careful: This is not the same as describing how you think you can solve the problem.

These things should always be in your initial post:

  • Describing the data you have (if possible with examples)
  • Describing the goal of the investigation of this data
  • What you have tried so far and where you have problems

Doing so it is much easier for people to help you, because now one can also tell you, if the way you are trying to go, is the right one.

fin swimmer

ADD REPLYlink written 12 months ago by finswimmer12k

I agree that this post has been split in two. The question I asked was How do I apply them [EMBOSS and/or FASTX] on paired fastq files [both at the same time]?. The answer to that is no. For this, I simply needed the syntax of the command, so I deemed not relevant to give actual data. The second half of the post, the part pointed out by you, that is get the consensus sequence is more relevant for this previous post of mine. Sorry for any inconveniences.

ADD REPLYlink written 12 months ago by marongiu.luigi420

Hello again marongiu.luigi ,

the answer on a "How do I?"-question can never be no. If you think the the get the consensus sequence isn't that relevant here, then what is your question? Or better, which task are you trying to solve?

fin swimmer

ADD REPLYlink written 12 months ago by finswimmer12k

My question was if I could use either of these two tools to process the paired files at once. As genomax pointed out, they Convert fastq to fasta format. Nothing more or less., which I understand meaning no, they can't be used to process both paired files at once. Hence the next question is: how do I process both paired files at once? whose answer is -- among others -- by doing a consensus fasta. But this is now another question that I don't think belongs here anymore.

ADD REPLYlink written 12 months ago by marongiu.luigi420

I thought I defined the task with the line

Now I need to obtain a fasta file from these.

I know that they convert a single fastq into a single fasta, that is why I was asking whether I can instead

apply them [these tools] on paired fastq files?

Sorry I wasn't clear: I have a pair of fastq file, how do I make a single fasta? I know there might be a gap between the two paired fastq files, this is something I can live with. If it not possible to merge paired fastq into a single fasta, that would be all right.

ADD REPLYlink modified 12 months ago • written 12 months ago by marongiu.luigi420

Actually, also seqret and fastq_to_fasta do the same thing. This is the output of the fasta file generated by fastq_to_fasta:

$ head -n 5 F2F.fa
>DHGDKXP1:247:C3MF6ACXX:5:1101:1166:95391/1
ATTGATAAAGCTGTTGCCGATACTTGGAACAATTTCTGGAAAGACGGTAAAGCTGATGGTATTGGCTCTAATTTGTCTAGGAAATAACCGTCAGGATTGAC
>DHGDKXP1:247:C3MF6ACXX:5:1101:1234:39592/1
GCCAGTTAAATAGCTTGCAAAATACGTGGCCTTATGGTTACAGTATGCCCATCGCAGTTCGCTACACGCAGGACGCTTTTTCACGTTCTGGTTGGTTGTGG
>DHGDKXP1:247:C3MF6ACXX:5:1101:1320:15306/1
GGTTAATCGTGCCAAGAAAAGCGGCATGGTCAATATAACCAGTAGTGTTAACAGTCGGGAGAGGAGTGGCATTAACACCATCCTTCATGAACTTAATCCAC

and this is that of seqret:

$ head -n 4 SQT.fa
>DHGDKXP1:247:C3MF6ACXX:5:1101:1166:95391/1
ATTGATAAAGCTGTTGCCGATACTTGGAACAATTTCTGGAAAGACGGTAAAGCTGATGGT
ATTGGCTCTAATTTGTCTAGGAAATAACCGTCAGGATTGAC
>DHGDKXP1:247:C3MF6ACXX:5:1101:1234:39592/1
GCCAGTTAAATAGCTTGCAAAATACGTGGCCTTATGGTTACAGTATGCCCATCGCAGTTC
GCTACACGCAGGACGCTTTTTCACGTTCTGGTTGGTTGTGG
ADD REPLYlink written 12 months ago by marongiu.luigi420

That is because those tools are designed to do just that. Convert fastq to fasta format. Nothing more or less.

ADD REPLYlink written 12 months ago by genomax74k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1777 users visited in the last hour