Question: fastq file format error
hafiz.talhamalik230 wrote:

My fastq file looks like this. What's wrong with it? Usually a FASTQ record starts with an @ sign, followed by the sequence, then a + sign, and then the quality scores. Could someone tell me which format this is and how I can convert it to a normal FASTQ file?

A00183:232:H5Y5JDSXX:3:1101:25437:1016 1:N:0:CGGATTGC+GAGTTAGC  +       ENST00000482771.1       262  GGGAAAAGCAGCCACCACATGATGCGGGAGAACCCAGAGCTGGTGGAGGGCCGTGACCTGCTGAGCTGCACCAGCTCTGAGCCTCTGACCCTCTGAGAGATGATGTCCTGCCCAGGCCCGATGGCCACTAGGACCCTGCAAGCAACTCTG  FFFFFFFFFFFFFFFFF:FFFFFF:FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF,F:FFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  3
A00183:232:H5Y5JDSXX:3:1101:4444:1016 1:N:0:CGGATTGC+GAGTTAGC   +       ENST00000354694.11      692     GGGCAGCCCATCGTGTGGATCACTCCCTATGCCTTCTCCCATGACCACCCGACAGACGTGGACTACAGGGTCATGGCCACCTTCACCGAGTTCTACACCACGCTGCTGGGCTTTGTCAACTTCCGCCTTTACCAGTTGCTCAACCTCCAC  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF,FFF:FFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF,FFFFFFFFFFFFF:FFFFFFFFF:FF:FFFFFFFFFFFFF  4
fastq illumina • 508 views
modified 12 months ago by Pierre Lindenbaum128k • written 12 months ago by hafiz.talhamalik230

What's wrong with this ?

answer: it's not fastq

written 12 months ago by Pierre Lindenbaum128k

Yeah, I know. Actually, my file extension says it's a fastq file; that's why I asked. Do you know which format this is? Any idea?

written 12 months ago by hafiz.talhamalik230

File extensions don't carry any actual meaning, particularly in the Unix environment. A FASTQ file could just as easily be called .txt, .py, or anything else, or have no extension at all. Extensions exist only by convention.

The content of the file is what determines what type it is.

The real question isn't what's wrong with your file, it's "How has it ended up like this?" Where has the file come from? Has any upstream processing happened to it that you know of? It certainly looks like it could have been a FASTQ file, once upon a time...
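Since the content decides the type, a quick content check is more reliable than the extension. The following is only a minimal sketch: looks_like_fastq is a hypothetical helper name, and it inspects just the first four lines, so it can be fooled by a file that is damaged further down.

```shell
# looks_like_fastq FILE (hypothetical helper): prints "fastq" or "not fastq".
# Checks only the first record: line 1 starts with "@", line 3 starts with "+",
# and the sequence (line 2) and quality string (line 4) have the same length.
looks_like_fastq() {
    head -n 4 "$1" | awk '
        NR == 1 { ok = (substr($0, 1, 1) == "@") }
        NR == 2 { seqlen = length($0) }
        NR == 3 { ok = ok && (substr($0, 1, 1) == "+") }
        NR == 4 { ok = ok && (length($0) == seqlen) }
        END     { print ((ok && NR == 4) ? "fastq" : "not fastq") }'
}
```

Run as looks_like_fastq test.fastq. A file whose first line does not begin with @, like the one in the question, is reported as "not fastq" whatever its name says.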

modified 12 months ago • written 12 months ago by Joe16k

Actually, I don't know how it ended up like this. A friend of mine processed that file a few months ago, and now he doesn't remember what he did or how this happened. It looks to me somewhat like a .sam file without a header. Most probably he renamed it or wrote the output under the wrong name, and the fastq file got overwritten.

written 12 months ago by hafiz.talhamalik230

That is bad in terms of reproducibility. Always save the code and log files of a job. If you have the original data, it is better to re-run the entire analysis.
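One way to make that a habit is a small wrapper that never touches the original and writes down what was run. This is only a sketch under stated assumptions: run_logged, run.log and result.txt are hypothetical names, and the wrapped command is assumed to take its input file as its last argument and to write its result to standard output.

```shell
# run_logged INPUT OUTDIR CMD... (hypothetical wrapper)
# Copies INPUT into OUTDIR, records the date and the exact command in
# OUTDIR/run.log, then runs CMD on the copy. Standard output goes to
# OUTDIR/result.txt and standard error is appended to the log.
run_logged() {
    input=$1; outdir=$2; shift 2
    mkdir -p "$outdir" || return 1
    work="$outdir/$(basename "$input")"
    cp -- "$input" "$work" || return 1      # the original is only read
    {
        printf 'date: %s\n' "$(date)"
        printf 'command: %s %s\n' "$*" "$work"
    } > "$outdir/run.log"
    "$@" "$work" > "$outdir/result.txt" 2>> "$outdir/run.log"
}
```

For example, run_logged reads.fastq run_01 wc -l counts the lines of a copy of reads.fastq and leaves the copy, the log and the result in run_01.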

written 12 months ago by ATpoint34k

This is the real problem. He used the original file without making a copy of it.

written 12 months ago by hafiz.talhamalik230

If it is a SAM file, it's probably best just to convert it back to FASTQ and start over. That SAM is not going to be much use to you anyway if you don't know what reference it was built around, etc.

written 12 months ago by Joe16k

Looks like part of a SAM file. You have alignment information in there (strand, transcript ID, and position).

modified 12 months ago • written 12 months ago by ATpoint34k

input:

$ cat test.fastq 
A00183:232:H5Y5JDSXX:3:1101:25437:1016 1:N:0:CGGATTGC+GAGTTAGC  +       ENST00000482771.1       262  GGGAAAAGCAGCCACCACATGATGCGGGAGAACCCAGAGCTGGTGGAGGGCCGTGACCTGCTGAGCTGCACCAGCTCTGAGCCTCTGACCCTCTGAGAGATGATGTCCTGCCCAGGCCCGATGGCCACTAGGACCCTGCAAGCAACTCTG  FFFFFFFFFFFFFFFFF:FFFFFF:FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF,F:FFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  3
A00183:232:H5Y5JDSXX:3:1101:4444:1016 1:N:0:CGGATTGC+GAGTTAGC   +       ENST00000354694.11      692     GGGCAGCCCATCGTGTGGATCACTCCCTATGCCTTCTCCCATGACCACCCGACAGACGTGGACTACAGGGTCATGGCCACCTTCACCGAGTTCTACACCACGCTGCTGGGCTTTGTCAACTTCCGCCTTTACCAGTTGCTCAACCTCCAC  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF,FFF:FFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF,FFFFFFFFFFFFF:FFFFFFFFF:FF:FFFFFFFFFFFFF 4

output:

$ sed 's/\s\+/\t/2g' test.fastq | awk -F "\t" -v OFS="\n" '{print "@"$1,$5,"+",$6}'
@A00183:232:H5Y5JDSXX:3:1101:25437:1016 1:N:0:CGGATTGC+GAGTTAGC
GGGAAAAGCAGCCACCACATGATGCGGGAGAACCCAGAGCTGGTGGAGGGCCGTGACCTGCTGAGCTGCACCAGCTCTGAGCCTCTGACCCTCTGAGAGATGATGTCCTGCCCAGGCCCGATGGCCACTAGGACCCTGCAAGCAACTCTG
+
FFFFFFFFFFFFFFFFF:FFFFFF:FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF,F:FFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00183:232:H5Y5JDSXX:3:1101:4444:1016 1:N:0:CGGATTGC+GAGTTAGC
GGGCAGCCCATCGTGTGGATCACTCCCTATGCCTTCTCCCATGACCACCCGACAGACGTGGACTACAGGGTCATGGCCACCTTCACCGAGTTCTACACCACGCTGCTGGGCTTTGTCAACTTCCGCCTTTACCAGTTGCTCAACCTCCAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF,FFF:FFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF,FFFFFFFFFFFFF:FFFFFFFFF:FF:FFFFFFFFFFFFF

output stats:

$ sed 's/\s\+/\t/2g' test.fastq | awk -F "\t" -v OFS="\n" '{print "@"$1,$5,"+",$6}' | seqkit stats
file  format  type  num_seqs  sum_len  min_len  avg_len  max_len
-     FASTQ   DNA          2      300      150      150      150
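Both example reads are on the + strand, so their sequences can be copied as they are. If the file also contains - records, and if it follows the SAM convention of storing reverse-strand reads as the reverse complement of the original read (an assumption, since nobody knows which tool wrote this file), those records need to be flipped back. A minimal sketch of that, with aln_to_fastq as a hypothetical helper name, and assuming every read name is followed by a space and a comment such as 1:N:0:...:

```shell
# aln_to_fastq FILE (hypothetical helper) for the layout shown above:
# name, comment, strand, reference, offset, sequence, qualities, count.
# ASSUMPTION: "-" records hold the reverse complement of the original read
# (as in SAM), so they are reverse-complemented and their qualities reversed.
aln_to_fastq() {
    awk '
    function revcomp(s,    i, c, out) {
        out = ""
        for (i = length(s); i >= 1; i--) {
            c = substr(s, i, 1)
            out = out ((c in comp) ? comp[c] : "N")   # unknown symbols become N
        }
        return out
    }
    function reverse(s,    i, out) {
        out = ""
        for (i = length(s); i >= 1; i--) out = out substr(s, i, 1)
        return out
    }
    BEGIN { comp["A"] = "T"; comp["C"] = "G"; comp["G"] = "C"; comp["T"] = "A"; comp["N"] = "N" }
    NF < 8 { print "skipping line " NR > "/dev/stderr"; next }
    {
        # Default whitespace splitting: $1 name, $2 comment, $3 strand,
        # $6 sequence, $7 qualities.
        s = $6; q = $7
        if ($3 == "-") { s = revcomp(s); q = reverse(q) }
        print "@" $1 " " $2
        print s
        print "+"
        print q
    }' "$1"
}
```

Usage would be aln_to_fastq test.fastq > restored.fastq. For files with only + records this gives the same four lines per read as the one-liner above.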
modified 12 months ago • written 12 months ago by cpad011213k

If you have a newer version of samtools, just use samtools fastq infile > outfile.fastq

You may need to convert the sam to a bam first.

written 12 months ago by Joe16k

cpad0112 That is more elegant of course. Good job

modified 12 months ago • written 12 months ago by 2nelly180

You can restore your FASTQ file using the code below. I split the commands into a step-by-step pipe so you can easily see what each command is doing. I am not sure whether the FFFFFFFs you get are base qualities from the sequencer or alignment qualities, but either way this will not affect the realignment.

sed "s/^/@/" file | sed "s/ /_/" | sed "s/[[:space:]]\+/\t/g" | awk -F "\t" -v OFS="\n" '{print $1,$5,"+",$6}' > output

sed "s/^/@/" adds @ at the beginning of every line (the sequence ID always starts with @).

sed "s/ /_/" replaces the first space with _, because I see that 1:N:0****** is part of the name of your reads.

sed "s/[[:space:]]\+/\t/g" replaces every run of spaces or tabs with a single tab, so the columns can then be selected by number.

awk -F "\t" -v OFS="\n" '{print $1,$5,"+",$6}' finally prints the read name, the sequence, a + line and the quality string, each on its own line, which is the order FASTQ expects.

Your FASTQ file should then be ready.

modified 12 months ago • written 12 months ago by 2nelly180