Question: NCBI SRA toolkit: how to change the FASTQ header format? I need the filter information
2.4 years ago, claude.loverdo wrote:

I want to use data from NCBI. The classic FASTQ format is:

@SRR2192406.1 HWI-ST0860_91:8:1101:1925:1972 length=100
GTTTGNACCATCTTGACAGACTTCAAAAATTGGCTGGGGTCTAAATTGTTCCCCAAGCTGCCCGGCCTCCCCTTCATCTCTTGTCAAGATCGGAAGAGCA
+SRR2192406.1 HWI-ST0860_91:8:1101:1925:1972 length=100
CBCFF#2ACFHFHJJIJIJIJJJJJJIGIGHHIJJI??D?@FGIJJJJIIGFHGHIIIEIIIFHHFEEE2?>ABCACDDACCCAAC@>@AA8<22922?C
@SRR2192406.1 HWI-ST0860_91:8:1101:1925:1972 length=100
TGACAAGAGATGAAGGGGAGGCCGGGCAGCTTGGGGAACAATTTAGACCCCAGCCAATTTTTGAAGTCTGTCAAGATGGTGCAAACAGATCGGAAGAGCG
+SRR2192406.1 HWI-ST0860_91:8:1101:1925:1972 length=100
CCCFFFFFHGHHHGJIJJIJJIJEHHIIJJJGGHIJGIJJIJIJJJHHHGEDEFDEEEDCCDDDDCDDECDDDDDDDDDCCDACDDDDDDDDDBDBDDDB

which I can get either by downloading directly in FASTQ format from NCBI or by using the SRA toolkit, with the command:

fastq-dump --split-spot -X 5 -Z SRR2192406 > test.fastq

(the -X 5 is there only to avoid downloading everything while I'm trying to figure out how to get the format I want).

The issue is that I would like one more piece of information. In the Illumina FASTQ format, the first line looks like this (the Wikipedia example):

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

The Y means the read is filtered (it did not pass the filter); it is N otherwise. From what I understand, this has to do with the proximity of the spots (if some spots are too close together, they are flagged Y). I need to filter those reads out, because for my application I really want to reduce all sources of noise.
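
For a FASTQ file that does carry this Illumina-style header, the filtering described here could be sketched with awk. This is only a minimal sketch: it assumes strict 4-line records and a second header word laid out as read:is_filtered:control:index, and the function name keep_passing is made up for illustration.

```shell
# Hypothetical helper (not part of the SRA toolkit): keep only records whose
# header marks the read as not filtered (is_filtered = N).
# Assumes strict 4-line FASTQ records read from stdin.
keep_passing() {
  awk '
    NR % 4 == 1 {
      # The second whitespace-separated word looks like 1:N:18:ATCACG
      n = split($2, f, ":")
      keep = (n >= 2 && f[2] == "N")
    }
    keep { print }
  '
}

# Usage: keep_passing < reads.fastq > reads.pass.fastq
```

Records whose header has no second word, or a second word without the flag, are dropped by this sketch, so it is only useful on files that really use the Illumina layout.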

I am still quite new to the NCBI archive, but I hope the original format used for downloading from this archive retains all the information.

My question is: how do I get this information? Is there any way to ask for a different FASTQ format? Or should I use a different route to access the NCBI data? Thanks!


You are looking for the second word of the description line as found in FASTQ files from Illumina instruments. I have never seen this in data downloaded from the Sequence Read Archive, neither in SRA files from NCBI nor in FASTQ files from EBI. I guess that it is removed during submission. See also "Should I use reads with good quality but failed-vendor flag?"

written 2.4 years ago by piet

Thanks for your replies! The submitters didn't use the FASTQ format. I found that illumina-dump -X 5 -x -Z SRR2192406 gives me the data I want. The format is a bit different and I still have to figure out how to process it, but at least I have the information I need.

written 2.4 years ago by claude.loverdo

Actually, this gets closer to what I want: fastq-dump -R pass --split-spot -X 5 -Z SRR2192406. It writes only the spots that pass the filter.
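
One way to see how many reads the filter removes would be to count the records in the dump with and without -R pass. This is a minimal sketch, assuming 4-line FASTQ records; count_records is a hypothetical helper, and the fastq-dump calls in the comments are the ones already used in this thread.

```shell
# Hypothetical helper: count FASTQ records on stdin (4 lines per record).
count_records() {
  awk 'END { print int(NR / 4) }'
}

# Usage (needs the SRA toolkit and network access):
#   fastq-dump --split-spot -X 5 -Z SRR2192406 | count_records
#   fastq-dump -R pass --split-spot -X 5 -Z SRR2192406 | count_records
```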

written 2.4 years ago by claude.loverdo

2.4 years ago, genomax wrote:

Use the -F (or --origfmt) option with fastq-dump to get the original Illumina FASTQ headers.

Generally, reads that fail the Illumina software filter (flagged Y in the header) should not be in the data people use or submit to SRA, and you can verify that once you dump them in the original format.

If the original submitters had changed the header (from the standard illumina one) then that is what you will get back with the -F option. SRA can only give you what they received from the submitters.
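
To check what the submitters actually provided, the headers dumped with -F could be scanned for the Illumina-style second word. This is a minimal sketch under the same assumptions as before (4-line records, second word of the form read:is_filtered:control:index); count_flagged_headers is a hypothetical helper name.

```shell
# Hypothetical helper: report how many header lines carry an Illumina-style
# second word such as 1:N:18:ATCACG. Assumes 4-line FASTQ records on stdin.
count_flagged_headers() {
  awk '
    NR % 4 == 1 {
      total++
      if ($2 ~ /^[0-9]+:[YN]:[0-9]+:/) flagged++
    }
    END { printf "%d of %d headers carry a Y/N filter flag\n", flagged, total }
  '
}

# Usage (needs the SRA toolkit and network access):
#   fastq-dump -F --split-spot -X 5 -Z SRR2192406 | count_flagged_headers
```

If the count of flagged headers is zero, the flag was most likely not part of what SRA stores for that run, and -F cannot bring it back.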
