How to convert a FASTQ file to a FASTA file without the read numbers
3 months ago

Hello. I converted my FASTQ file to FASTA using the tool seqtk. However, the FASTA file I got as output also shows the read numbers in it. I wanted a continuous FASTA file with no reads in it. I am attaching a picture of the FASTA output with my query. How can I get a FASTA file with no read numbers in it?

FASTA FASTQ Seqtk • 1.1k views

No, I just want a clean FASTA file with no read numbers, like the ones we get from NCBI.

3 months ago
Mensur Dlakic ★ 21k

The reads in a FASTQ file are distinct sequencing events, and the order in which they appear in the file says nothing about how they relate to each other. To make a single sequence (or at least several long sequences) from them, one has to assemble them first. So getting a "clean" FASTA file, as you call it, is not a simple matter of converting from FASTQ and removing all headers. The reads need to be assembled, and that is usually done from FASTQ files, as they contain base quality information, which FASTA doesn't. Eventually, the assembly will result in a contiguous FASTA file such as those "we get from NCBI."
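To make the point concrete, here is a minimal Python sketch of what a plain FASTQ-to-FASTA conversion does. It is not seqtk's implementation, just an illustration; the function names and the two example reads are made up, and it assumes the common 4-line FASTQ layout. Each read simply becomes its own FASTA record, with the quality line dropped, so the output has as many headers as there are reads.

```python
from typing import Iterable, Iterator, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (read_name, sequence) pairs, assuming 4-line FASTQ records."""
    # Drop blank lines and trailing newlines; fine for a small illustration,
    # but this loads the whole input into memory.
    records = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(records) % 4 != 0:
        raise ValueError("FASTQ input is not a multiple of 4 lines")
    for i in range(0, len(records), 4):
        header, seq, plus, _quality = records[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near line {i + 1}")
        yield header[1:], seq  # the quality string is discarded


def fastq_to_fasta(lines: Iterable[str]) -> str:
    """Return FASTA text with one '>' record per read."""
    return "".join(f">{name}\n{seq}\n" for name, seq in parse_fastq(lines))


# Two made-up reads, purely for illustration.
example_fastq = [
    "@read1\n", "ACGT\n", "+\n", "IIII\n",
    "@read2\n", "TTGA\n", "+\n", "IIII\n",
]
fasta_text = fastq_to_fasta(example_fastq)
print(fasta_text, end="")
```

The two reads come out as two separate records. Nothing in the conversion joins them, which is why assembly is a separate step.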

How do I assemble the reads then? Is there any way to do it?

Your original question betrays your level of understanding of this topic, so I am not inclined to spend a lot of time explaining it. On balance, it is not likely to be productive for either one of us. Let's just say that assembly is a straightforward but not a trivial process, so I think you need to read up on that topic before attempting to do it.

I am assuming that you are doing de novo assembly, so here is a short list of assemblers:

https://en.wikipedia.org/wiki/De_novo_sequence_assemblers

So you do want to assemble your reads? Have you tried googling?

3 months ago
tomas4482 ▴ 360

If you only wish to remove that number, you can use the following command:

sed -r -e '1~2s/[ ][0-9]{1,}//g' test.txt

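Here 1~2 is a GNU sed address meaning "start at line 1 and step by 2", so the substitution is applied only to the odd-numbered lines (the headers, assuming each sequence occupies a single line). [ ] matches a space, and [0-9]{1,} matches one or more digits.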

Test file:

>chr1 1:
aaaaa
>chr1_part1 20:
aaaaa
>chr1_part2 31:
aaaaa
>chr1_part3 3:
aaaaa
>chr1_part4 32:
aaaaa
>chr1_part5 3:
aaaaa
>chr1_part6 3:
aaaaa
>chr1_part7 3:
aaaaa
>chr1_part8 3:
aaaaa


Result:

>chr1:
aaaaa
>chr1_part1:
aaaaa
>chr1_part2:
aaaaa
>chr1_part3:
aaaaa
>chr1_part4:
aaaaa
>chr1_part5:
aaaaa
>chr1_part6:
aaaaa
>chr1_part7:
aaaaa
>chr1_part8:
aaaaa
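
If GNU sed is not available, the same header cleanup can be sketched in Python. This is only a rough equivalent of the command above; the function name is made up for illustration. One deliberate difference: it picks header lines by their leading ">" instead of by line parity, so it should also cope with sequences wrapped over several lines.

```python
import re
from typing import Iterable, List


def strip_header_numbers(fasta_lines: Iterable[str]) -> List[str]:
    """Remove every ' <digits>' run from FASTA header lines only."""
    cleaned = []
    for line in fasta_lines:
        if line.startswith(">"):
            # Same pattern as the sed command: a space followed by 1+ digits.
            line = re.sub(r" [0-9]+", "", line)
        cleaned.append(line)
    return cleaned


# First two records of the test file above.
test_lines = [">chr1 1:", "aaaaa", ">chr1_part1 20:", "aaaaa"]
result = strip_header_numbers(test_lines)
print("\n".join(result))
```

As with the sed version, this only tidies the headers. It does not merge the records into one sequence.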


But like Mensur said, maybe you should try de novo assembly, for example with Trinity (note that Trinity assembles transcriptomes from RNA-seq reads, not genomes). If there is a reference, you can use HISAT or STAR to align the reads rather than converting FASTQ to FASTA.

My sequence is newly synthesized. Where could I get a reference genome for it?

As I said, de novo assembly does not require a reference. Trinity, a commonly used tool, can do this job. I can't help much more unless I know what your purpose is and how you got your FASTQ files.

Actually, I was given a new FASTA sequence of chromosome 1 of a cow, but I was not provided with its FASTQ files, as they were destroyed when the server was formatted. So I built a FASTQ file with a quality score of 30 using a tool from the BBMap suite. I then ran FastQC on the file and found many over-represented sequences in it (N's). So I trimmed the reads containing N using Trimmomatic (SLIDINGWINDOW option). This gave me a new, improved FASTQ file with no N's in it. I wanted to convert this altered FASTQ file to FASTA format (with no reads, just like NCBI) for the sake of learning. Now I do not know how to achieve this: should I assemble the reads in the FASTQ file using Trinity, or is the FASTA file I made using seqtk enough?

Let me get this straight first, because I'm a little confused.

1. You already had the FASTA sequence of cow chromosome 1 at the very beginning? How did you get it?

2. If you already have that FASTA, why bother re-assembling the cow genome? Besides, cow is a common species. Ensembl or some other database must have the reference. You can directly download the genome FASTA and its annotation file for learning. For example here

3. It seems you have some corrupted FASTQ files. Did you rebuild the FASTQ from the corrupted FASTQ? The quality control steps are fine for alignment. But if the altered FASTQ was created from corrupted FASTQ, proceeding to genome assembly is discouraged. You may see a very low alignment rate against the reference. If you use these corrupted data for de novo assembly, the assembly may not be correct.
