Question: fastq file format error
hafiz.talhamalik60 wrote:

My FASTQ file looks like this. What's wrong with it? Usually a FASTQ record starts with an @ sign, and after the sequence there is a + sign and then the quality scores. Could someone tell me which format this is and how I can convert it to a normal FASTQ file?

A00183:232:H5Y5JDSXX:3:1101:25437:1016 1:N:0:CGGATTGC+GAGTTAGC  +       ENST00000482771.1       262  GGGAAAAGCAGCCACCACATGATGCGGGAGAACCCAGAGCTGGTGGAGGGCCGTGACCTGCTGAGCTGCACCAGCTCTGAGCCTCTGACCCTCTGAGAGATGATGTCCTGCCCAGGCCCGATGGCCACTAGGACCCTGCAAGCAACTCTG  FFFFFFFFFFFFFFFFF:FFFFFF:FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF,F:FFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  3
A00183:232:H5Y5JDSXX:3:1101:4444:1016 1:N:0:CGGATTGC+GAGTTAGC   +       ENST00000354694.11      692     GGGCAGCCCATCGTGTGGATCACTCCCTATGCCTTCTCCCATGACCACCCGACAGACGTGGACTACAGGGTCATGGCCACCTTCACCGAGTTCTACACCACGCTGCTGGGCTTTGTCAACTTCCGCCTTTACCAGTTGCTCAACCTCCAC  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF,FFF:FFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF,FFFFFFFFFFFFF:FFFFFFFFF:FF:FFFFFFFFFFFFF  4
fastq illumina • 134 views
modified 7 days ago by Pierre Lindenbaum120k • written 7 days ago by hafiz.talhamalik60

What's wrong with this ?

answer: it's not fastq

written 7 days ago by Pierre Lindenbaum120k

Yeah, I know. Actually, my file extension says it's a FASTQ file, which is why I asked. Do you know which format this is? Any idea?

written 7 days ago by hafiz.talhamalik60

File extensions don't carry any actual meaning, particularly in the Unix environment. A FASTQ file could just as easily be named .txt, .py, or anything else, or have no extension at all. Extensions exist only by convention.

The content of the file is what determines what type it is.
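
A quick way to act on this is to look at the first record rather than the file name. The sketch below is a rough heuristic, not a full validator. The function name looks_like_fastq is made up for illustration, and it assumes an uncompressed file with four lines per record.

```shell
# looks_like_fastq FILE
# Hypothetical helper: prints "fastq-like" if the first record has the usual
# four-line layout (@header, sequence, +, qualities), otherwise "not fastq".
looks_like_fastq() {
    head -n 4 "$1" | awk '
        NR == 1 { ok = (substr($0, 1, 1) == "@") }          # header line
        NR == 2 { seqlen = length($0) }                     # sequence line
        NR == 3 { ok = ok && (substr($0, 1, 1) == "+") }    # separator line
        NR == 4 { ok = ok && (length($0) == seqlen) }       # one quality per base
        END     { print ((NR == 4 && ok) ? "fastq-like" : "not fastq") }
    '
}
```

Called as looks_like_fastq test.fastq, this reports "not fastq" for a tab-separated file like the one in the question, whatever its extension says.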

The real question isn't what's wrong with your file, it's "How has it ended up like this?". Where has the file come from? Has any upstream processing happened to it that you know of? It certainly looks like it could have been a FASTQ file, once upon a time...

modified 7 days ago • written 7 days ago by jrj.healey12k

Actually, I don't know how it ended up like this. A friend of mine processed that file a few months ago, and now he doesn't remember what he did or how this happened. It looks to me somewhat like a .sam file without a header. Most probably he renamed it or wrote the output under the wrong name, and the FASTQ file got overwritten.

written 7 days ago by hafiz.talhamalik60

That is bad in terms of reproducibility. Always save the code and log files of a job. If you have the original data, it is better to re-run the entire analysis.

written 7 days ago by ATpoint16k

This is the real problem: he used the original file without making a copy of it.

written 7 days ago by hafiz.talhamalik60

If it is a SAM file, it's probably best just to convert it back to FASTQ and start over. That SAM is not going to be much use to you anyway if you don't know what reference it's built around, etc.

written 7 days ago by jrj.healey12k

Looks like part of a SAM file. You have alignment information in there.

modified 7 days ago • written 7 days ago by ATpoint16k

input:

$ cat test.fastq 
A00183:232:H5Y5JDSXX:3:1101:25437:1016 1:N:0:CGGATTGC+GAGTTAGC  +       ENST00000482771.1       262  GGGAAAAGCAGCCACCACATGATGCGGGAGAACCCAGAGCTGGTGGAGGGCCGTGACCTGCTGAGCTGCACCAGCTCTGAGCCTCTGACCCTCTGAGAGATGATGTCCTGCCCAGGCCCGATGGCCACTAGGACCCTGCAAGCAACTCTG  FFFFFFFFFFFFFFFFF:FFFFFF:FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF,F:FFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  3
A00183:232:H5Y5JDSXX:3:1101:4444:1016 1:N:0:CGGATTGC+GAGTTAGC   +       ENST00000354694.11      692     GGGCAGCCCATCGTGTGGATCACTCCCTATGCCTTCTCCCATGACCACCCGACAGACGTGGACTACAGGGTCATGGCCACCTTCACCGAGTTCTACACCACGCTGCTGGGCTTTGTCAACTTCCGCCTTTACCAGTTGCTCAACCTCCAC  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF,FFF:FFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF,FFFFFFFFFFFFF:FFFFFFFFF:FF:FFFFFFFFFFFFF 4

output:

$ sed 's/\s\+/\t/2g' test.fastq | awk -F "\t" -v OFS="\n" '{print "@"$1,$5,"+",$6}'
@A00183:232:H5Y5JDSXX:3:1101:25437:1016 1:N:0:CGGATTGC+GAGTTAGC
GGGAAAAGCAGCCACCACATGATGCGGGAGAACCCAGAGCTGGTGGAGGGCCGTGACCTGCTGAGCTGCACCAGCTCTGAGCCTCTGACCCTCTGAGAGATGATGTCCTGCCCAGGCCCGATGGCCACTAGGACCCTGCAAGCAACTCTG
+
FFFFFFFFFFFFFFFFF:FFFFFF:FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF,F:FFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00183:232:H5Y5JDSXX:3:1101:4444:1016 1:N:0:CGGATTGC+GAGTTAGC
GGGCAGCCCATCGTGTGGATCACTCCCTATGCCTTCTCCCATGACCACCCGACAGACGTGGACTACAGGGTCATGGCCACCTTCACCGAGTTCTACACCACGCTGCTGGGCTTTGTCAACTTCCGCCTTTACCAGTTGCTCAACCTCCAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF,FFF:FFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF,FFFFFFFFFFFFF:FFFFFFFFF:FF:FFFFFFFFFFFFF

output stats:

$ sed 's/\s\+/\t/2g' test.fastq | awk -F "\t" -v OFS="\n" '{print "@"$1,$5,"+",$6}' | seqkit stats
file  format  type  num_seqs  sum_len  min_len  avg_len  max_len
-     FASTQ   DNA          2      300      150      150      150
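
One caveat: both example reads are on the + strand, and the conversion does not look at the strand column. In SAM and similar aligner outputs, reads aligned to the reverse strand are normally stored reverse-complemented, with the quality string reversed. Whether that applies to this file is an assumption, since the thread does not establish which tool wrote it. Under that assumption, a sketch that restores the original read orientation could look like this (table_to_fastq is a made-up name, and the 2g flag needs GNU sed):

```shell
# table_to_fastq FILE
# Hypothetical helper. Assumed columns after splitting: 1 read name (with one
# internal space), 2 strand, 3 reference, 4 offset, 5 sequence, 6 qualities.
table_to_fastq() {
    # Keep the first space (inside the read name); turn later whitespace runs into tabs.
    sed 's/\s\+/\t/2g' "$1" | awk -F '\t' '
        # Reverse a string character by character.
        function rev(s,    i, out) {
            out = ""
            for (i = length(s); i >= 1; i--) out = out substr(s, i, 1)
            return out
        }
        # Reverse-complement a DNA string; N and other symbols are left unchanged.
        function revcomp(s,    i, c, j, out) {
            out = ""
            for (i = length(s); i >= 1; i--) {
                c = substr(s, i, 1)
                j = index("ACGTacgt", c)
                out = out (j ? substr("TGCAtgca", j, 1) : c)
            }
            return out
        }
        {
            seq = $5; qual = $6
            # Assumption: minus-strand rows hold the reverse complement of the read.
            if ($2 == "-") { seq = revcomp(seq); qual = rev(qual) }
            printf "@%s\n%s\n+\n%s\n", $1, seq, qual
        }
    '
}
```

Usage would be table_to_fastq test.fastq > restored.fastq. If the file turns out to store reads in their original orientation, the strand branch should be removed.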
modified 7 days ago • written 7 days ago by cpad011211k

If you have a newer version of samtools, just use samtools fastq infile > outfile.fastq

You may need to convert the sam to a bam first.

written 7 days ago by jrj.healey12k

cpad0112: That is more elegant, of course. Good job.

modified 7 days ago • written 7 days ago by 2nelly150

You can restore your FASTQ file using the code below. I split the commands into a step-by-step pipe so you can easily see what each command is doing. I am not sure whether the FFFFFFFs you get are base qualities from the sequencer or alignment qualities, but in any case this will not affect the realignment.

sed "s/^/@/" file | sed "s/ /_/" | sed "s/ \+/\t/g" | awk -F "\t" -v OFS="\n" '{print $1, $5, "+", $6}' > output

sed "s/^/@/" adds @ at the beginning of every line (the sequence ID always starts with @)

sed "s/ /_/" replaces the first space with _, because I see that 1:N:0****** is part of the name of your reads

sed "s/ \+/\t/g" replaces every run of one or more spaces with a tab, so the columns can then be picked by number

awk -F "\t" -v OFS="\n" '{print $1, $5, "+", $6}' finally prints the read name, the sequence, a + line and the quality string, each on its own line. Note that cut -f 1,2,5,6 cannot be used for this step, because cut always outputs columns in file order and the + would end up before the sequence.

Your FASTQ file should now be ready.
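
Before realigning, it may be worth checking the result mechanically. The sketch below is a minimal structural check, assuming uncompressed records of exactly four lines each. The name check_fastq is made up for illustration, and the check says nothing about whether the quality values themselves are meaningful.

```shell
# check_fastq FILE
# Hypothetical helper: prints "OK: N records" or the first structural problem,
# and returns non-zero on failure.
check_fastq() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { print "line " NR ": header does not start with @"; bad = 1; exit }
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 3 && substr($0, 1, 1) != "+" { print "line " NR ": separator does not start with +"; bad = 1; exit }
        NR % 4 == 0 && length($0) != seqlen    { print "line " NR ": quality length differs from sequence length"; bad = 1; exit }
        END {
            if (bad) exit 1
            if (NR % 4 != 0) { print "truncated record: " NR " lines"; exit 1 }
            print "OK: " (NR / 4) " records"
        }
    ' "$1"
}
```

Running check_fastq output on the restored file should report the number of records; any other message points at the first line that breaks the four-line layout.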

modified 7 days ago • written 7 days ago by 2nelly150