I want to use data from NCBI. The classic FASTQ format is:
@SRR2192406.1 HWI-ST0860_91:8:1101:1925:1972 length=100
GTTTGNACCATCTTGACAGACTTCAAAAATTGGCTGGGGTCTAAATTGTTCCCCAAGCTGCCCGGCCTCCCCTTCATCTCTTGTCAAGATCGGAAGAGCA
+SRR2192406.1 HWI-ST0860_91:8:1101:1925:1972 length=100
CBCFF#2ACFHFHJJIJIJIJJJJJJIGIGHHIJJI??D?@FGIJJJJIIGFHGHIIIEIIIFHHFEEE2?>ABCACDDACCCAAC@>@AA8<22922?C
@SRR2192406.1 HWI-ST0860_91:8:1101:1925:1972 length=100
TGACAAGAGATGAAGGGGAGGCCGGGCAGCTTGGGGAACAATTTAGACCCCAGCCAATTTTTGAAGTCTGTCAAGATGGTGCAAACAGATCGGAAGAGCG
+SRR2192406.1 HWI-ST0860_91:8:1101:1925:1972 length=100
CCCFFFFFHGHHHGJIJJIJJIJEHHIIJJJGGHIJGIJJIJIJJJHHHGEDEFDEEEDCCDDDDCDDECDDDDDDDDDCCDACDDDDDDDDDBDBDDDB
which I can get either by downloading directly in FASTQ format from NCBI, or by using the SRA Toolkit with:
fastq-dump --split-spot -X 5 -Z SRR2192406 > test.fastq
(the -X 5 is there only to avoid downloading everything while I figure out how to get the format I want).
The issue is that I need one more piece of information. In the Illumina FASTQ format, the first line looks like this (the Wikipedia example):
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
The Y means the read is filtered (it did not pass the filter); N means it passed. From what I understand, the filter has to do with the proximity of the spots (spots that are too close together get flagged as filtered). I need to remove these reads, because for my application I really want to reduce every source of noise.
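To make concrete what I am after, here is a minimal Python sketch of the filtering I would run if the headers carried the Illumina-style comment field (read:filtered:control:index). The function names (parse_fastq, is_filtered, keep_passing) and the two example records are made up for illustration, and I am assuming plain four-line FASTQ records. With the NCBI headers shown above there is no such field, so the flag comes back as None, which is exactly my problem.

```python
import io
from typing import Iterator, NamedTuple, Optional, TextIO


class Read(NamedTuple):
    header: str
    sequence: str
    quality: str


def parse_fastq(handle: TextIO) -> Iterator[Read]:
    """Yield reads from a FASTQ stream, assuming strict four-line records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or a blank line)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield Read(header, sequence, quality)


def is_filtered(header: str) -> Optional[bool]:
    """Return True for flag Y, False for flag N, None if the flag is absent."""
    parts = header.split(" ", 1)
    if len(parts) < 2:
        return None
    fields = parts[1].split(":")
    if len(fields) < 2 or fields[1] not in ("Y", "N"):
        return None  # e.g. an NCBI-style header without the Illumina comment
    return fields[1] == "Y"


def keep_passing(handle: TextIO) -> Iterator[Read]:
    """Yield only the reads explicitly flagged N (not filtered)."""
    for read in parse_fastq(handle):
        if is_filtered(read.header) is False:
            yield read


# Made-up records for illustration only (not real data).
example = io.StringIO(
    "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG\n"
    "ACGT\n+\nIIII\n"
    "@EAS139:136:FC706VJ:2:2104:15343:197394 1:N:18:ATCACG\n"
    "TTGA\n+\nHHHH\n"
)
kept = list(keep_passing(example))
ncbi_flag = is_filtered("@SRR2192406.1 HWI-ST0860_91:8:1101:1925:1972 length=100")
```

Here kept would contain only the second record, while ncbi_flag is None because the NCBI header has nothing to parse.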
I am still quite new to the NCBI archive, but I hope the original format used for submission to this archive retains all the information.
My question is: how can I get this information? Is there any way to ask for a different FASTQ format? Or should I use a different route to access the NCBI data? Thanks!