Question: Is there a way to convert a FASTQ file to a FASTA file?
as2779 wrote (6 weeks ago):

Hello.

I am trying to use RepeatModeler to identify transposable elements in a genome of C. remanei. I have a FASTQ file that came from a genome analysis. I'm trying to convert the FASTQ file to a FASTA file with the following format:

> name
ACGCTGCGT..... (sequence)

When I looked around on this site, I saw commands that convert FASTQ to FASTA. However, I used two such commands and got the same output from both. For example, the first few lines of my input are:

@NB551191:275:HMT7LBGX7:1:11101:1614:1054 1:N:0:ATCACG
TAAATNAGATCATTTTTGTAGAGAAAAANGANGGCTTNCGAATGGTATGAAAATCTCTGTGATCCGTCAAAAACTGACTGAGTTCTGATAAAAAATGTATTGGCAGAAAATACCACTTGGACCAAATCTCAAAAATTGACGGAAATGTCAC
+
AAAAA#EEEEEEEEEEEEEEEEEAEEEE#EE#EEEEE#EEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEE/AAAEEEEEAEEEAA<EEAEEEEAEEEEEEEAAEEEE

@NB551191:275:HMT7LBGX7:1:11101:18472:1054 1:N:0:ATCACG
TTTCCNGAAAACGCATCCAGCATTGTTTNACNTCATTNGAGAGCTGAAAATTTTCAAACCTGTATTTTCCAATCGCATAATAACTCGTGTCTCCTTCTCCATAATCCGTGGGAAGCTTTCAACTCAATAAATTTTAGGAAAAAAGTTTATT
+
AAAAA#EEEE6EEE/AEEEEEEEEEEEE#AE#EEEEE#EEEEEEEEEEEE/EEEAEE/EEEEEEEEEEEAE/EEEAAAEAEEEEEEEE/EEEEEEEEEAEE/EE<<EEEAEEEAEE<<<EA/EEAA</AEEEAAEAEEEA/EEEA/EAEAA

> (ad infinitum)

And when I use the command to convert to FASTA, I get this output:

>NB551191:275:HMT7LBGX7:1:11101:1614:1054 1:N:0:ATCACG
TAAATNAGATCATTTTTGTAGAGAAAAANGANGGCTTNCGAATGGTATGAAAATCTCTGTGATCCGTCAAAAACTGACTGAGTTCTGATAAAAAATGTATTGGCAGAAAATACCACTTGGACCAAATCTCAAAAATTGACGGAAATGTCAC
>NB551191:275:HMT7LBGX7:1:11101:18472:1054 1:N:0:ATCACG
TTTCCNGAAAACGCATCCAGCATTGTTTNACNTCATTNGAGAGCTGAAAATTTTCAAACCTGTATTTTCCAATCGCATAATAACTCGTGTCTCCTTCTCCATAATCCGTGGGAAGCTTTCAACTCAATAAATTTTAGGAAAAAAGTTTATT
> (ad infinitum)

This is not the format I want; I want a FASTA file that only contains 1 description and the rest of the file be the sequence. From a FASTQ file, is it possible to obtain this, and if so how do I do so? If not, how should I run the data through RepeatModeler? Thank you for your help!


Unfortunately, what you want to do is not correct. The FASTQ file holds the sequencing reads from your genome, which means the genome was fragmented into these short sequences. I guess what you want is to first assemble your reads into contigs and then use those contigs to predict/detect repetitive elements.
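
To see why the reads cannot simply be joined end to end, here is a small toy illustration in Python. The genome string, read length, and step size are all invented for the example, and the sketch ignores things real reads typically have, such as sequencing errors and reads from both strands.

```python
# Toy illustration only: this "genome" and the read layout are made up.
genome = "ACGTTGCAAGGCTTACCGATGCATTAGCTA"

READ_LEN = 10
STEP = 4  # neighbouring reads overlap by READ_LEN - STEP bases

# Cut overlapping reads out of the genome, roughly what sequencing yields.
reads_in_genome_order = [
    genome[i:i + READ_LEN]
    for i in range(0, len(genome) - READ_LEN + 1, STEP)
]

# A FASTQ file does not store reads in genome order. Reversing the list is a
# deterministic stand-in for that arbitrary order.
reads_in_file_order = reads_in_genome_order[::-1]

# Joining the reads is what "one header, then all the sequence" amounts to.
concatenated = "".join(reads_in_file_order)

print(len(genome), len(concatenated))
print(concatenated == genome)
```

Even if the reads happened to be in genome order, the joined string would still repeat every overlapping base, which is why an assembler has to merge the overlaps instead.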

written 6 weeks ago by JC

Ah, that makes sense. Is there some sort of tool to assemble contigs from FASTQ files? I'm also moderately proficient in Python and Java if there are some simple lines of code that I can write to do this.

written 6 weeks ago by as2779

No. Do not reinvent the wheel. Google for programs that will do what you want. You can likely find answers on biostars that are relevant.

written 6 weeks ago by swbarnes2

Are you sure this is what you need to do?

If so, search is your friend. See, for example: "HOw to merge multifasta sequence into a single sequence having only one header?"

written 6 weeks ago by Brice Sarver

I want a FASTA file that only contains 1 description and the rest of the file be the sequence.

You could simply drop all lines that start with > by piping your file through grep -v "^>" and then add the header you want at the top.

But these would still be individual reads and would not represent the sequence of the genome, which I assume is what you ultimately want to use with RepeatModeler?
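
Since you mentioned Python, the same header-dropping step can be sketched there as well. This is only an illustration: the function name merge_fasta_records and the header "all_reads" are made up for the example, and the result is still just reads joined under one header, not a genome.

```python
def merge_fasta_records(fasta_text, header):
    """Drop every '>' header line and put a single header on top.

    Hypothetical helper for illustration; the output is concatenated
    reads, not an assembled genome.
    """
    sequence_lines = [
        line.strip()
        for line in fasta_text.splitlines()
        if line.strip() and not line.startswith(">")
    ]
    return ">" + header + "\n" + "".join(s + "\n" for s in sequence_lines)


# Tiny made-up input with two records.
example = ">read1\nACGT\n>read2\nTTGA\n"
print(merge_fasta_records(example, "all_reads"), end="")
```

For a real file you would read the text from disk first and write the returned string back out.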

written 6 weeks ago by genomax

Yes you are correct. I want the sequence of the whole genome (as a FASTA file) given a FASTQ file. How can I obtain this?

written 6 weeks ago by as2779
swbarnes2 wrote (6 weeks ago):

The procedure you were given does convert a fastq to fasta. So that's not actually what you want to do.
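
For reference, that conversion amounts to roughly the following. This is a minimal Python sketch, not the command you ran: it assumes plain four-line FASTQ records (header, sequence, "+", quality), skips blank lines, and the function name fastq_to_fasta is made up for the example.

```python
def fastq_to_fasta(fastq_text):
    """Convert four-line FASTQ records to FASTA text.

    Assumes each record is exactly: @header, sequence, + line, quality.
    Blank lines between records are ignored.
    """
    lines = [ln.strip() for ln in fastq_text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of 4 non-blank lines")

    out = []
    for i in range(0, len(lines), 4):
        header, seq, plus, _quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed record near line group %d" % (i // 4 + 1))
        out.append(">" + header[1:])  # swap the leading '@' for '>'
        out.append(seq)               # the quality line is discarded
    return "".join(line + "\n" for line in out)


# Two short made-up records.
example = "@r1 first\nACGTN\n+\nAAAA#\n\n@r2\nTTGCA\n+\nEEEEE\n"
print(fastq_to_fasta(example), end="")
```

Each read keeps its own header, which is exactly the output you got, so the conversion itself worked.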

The reads in a fastq are almost certainly unplaced. You can't just string them together in order and get a sequence that makes sense. I think you want a consensus sequence, so you need to look up how to do that. You can either do de novo assembly, or align to a reference and make a consensus sequence, taking into account the points where your reads differ from the reference.


Hi, thank you for the response! I'm still new to the field of computational biology. How do I make a consensus sequence? Will doing so let me create a FASTA file?

written 6 weeks ago by as2779