FastQ Screen Parameters on CyVerse
3.7 years ago
phamh ▴ 30

Hi,

I want to run FastQ Screen on CyVerse, but I don't quite understand the parameters of this app, even after looking them up on Google. The parameters include:

  1. 'Aligner' (BOWTIE2, BOWTIE, or BWA)
  2. 'Subset' (you're supposed to type in a number. Default = 100000)
  3. Check box 'Illumina v1.3 format (Older format)'

I don't get #2 and #3. Here is the link to the instructions provided by CyVerse on how to run FastQ Screen using sample data: https://cyverse.atlassian.net/wiki/spaces/DEapps/pages/241881853/Fastq-screen-0.11.1 Their explanations of parameters #2 and #3 are not very clear to me, and I could not find any other source that covers the FastQ Screen parameters without using the command line.

Can someone please help me with this?

Thank you.

fastqscreen parameters • 1.1k views
3.7 years ago

Hi,

subset: Don't use the whole sequence file; instead, create a temporary dataset of the specified number of reads. The dataset created will be approximately (within a factor of 2) this size. If the real dataset is smaller than twice the specified size, the whole dataset will be used. Subsets are taken evenly from throughout the original dataset. By default, FastQ Screen runs with this parameter set to 100000. To process an entire dataset, set --subset to 0.
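To make the subsetting rule above concrete, here is a minimal Python sketch. It is not FastQ Screen's actual code; the function name `choose_read_indices` and the exact rounding are assumptions chosen only to match the behaviour described in the documentation (evenly spaced reads, whole file if it is smaller than twice the subset size, 0 means everything).

```python
def choose_read_indices(total_reads, subset):
    """Return the indices of the reads that would be screened.

    Hypothetical illustration of the documented behaviour,
    not FastQ Screen's real implementation.
    """
    # subset == 0 means "use every read"; a file smaller than
    # twice the requested size is also used in full.
    if subset == 0 or total_reads < 2 * subset:
        return list(range(total_reads))

    # Otherwise take every `interval`-th read, so the sample is
    # spread evenly from the start to the end of the file.
    interval = total_reads // subset
    return list(range(0, total_reads, interval))


# A 1,000,000-read file with the default subset of 100,000:
# every 10th read is kept.
sample = choose_read_indices(1_000_000, 100_000)
print(len(sample), sample[:3])
```

With this rule the number of reads kept is always at least the requested subset and less than twice that number, which is where the "within a factor of 2" wording comes from.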

illumina1_3: Assume that the quality values are encoded in Illumina v1.3 format. Defaults to Sanger format if this flag is not specified.

source


Yes I did read this in CyVerse instructions, but I still don't get it. Do you mind explaining it in a different way?


Which part sounds confusing? If you know what a FASTQ file is, #2 and #3 are pretty self-explanatory.


1) As I understand it, FastQ Screen aligns reads to reference genomes so that we can check whether any fragments come from sources other than the genome of interest. So basically, it's alignment. But HISAT2, which is also an aligner, has no 'subset' parameter. So why does FastQ Screen have one?

2) How do I know what to set for 'subset' parameter? What do I base it on? Is there a reference point for this, like you should leave it as default if your library contains this many reads...

3) The check box option would depend on the instrument I used to sequence the reads, right?

I'm still new to bioinformatics, so I apologize if my questions bother you.


FastQ Screen actually uses bwa/bowtie to do the alignments under the hood. Running this kind of analysis on your entire dataset would take a long time, so sampling a certain number of reads (the subset) speeds things up. Since the reads (and thus any contamination) should be randomly distributed in your dataset, a sample should be enough to capture and highlight it.

If you only have a small number of reads then you could omit subsetting.

The type of instrument does not have an impact on the quality encoding. If your data was produced in the last 5+ years, it is Sanger encoded. Older data may use other encodings. You can read more about the encodings that have been used at one time or another in the past; all current data is in Sanger FASTQ format.
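If you are unsure which encoding a file uses, a rough check is to look at the lowest character found on the quality lines. The sketch below is only a heuristic and is not part of FastQ Screen: the function names are made up for illustration, and it assumes plain four-line FASTQ records. Characters below ASCII 59 only occur in Sanger (Phred+33) data, while Illumina v1.3 (Phred+64) data never goes below ASCII 64.

```python
from itertools import islice


def quality_strings(fastq_lines):
    """Yield the quality line (every 4th line) of each FASTQ record."""
    for line in islice(fastq_lines, 3, None, 4):
        yield line.rstrip("\n")


def guess_phred_offset(qualities):
    """Guess the quality offset: 33, 64, or None if undecidable.

    Heuristic only; scan many reads, because a handful of very
    high-quality Sanger reads can look like Phred+64.
    """
    lowest = min((ord(ch) for q in qualities for ch in q), default=None)
    if lowest is None:
        return None  # no quality characters seen
    if lowest < 59:
        return 33    # Sanger / Illumina 1.8+: leave the box unchecked
    if lowest >= 64:
        return 64    # likely Illumina v1.3-1.7: the "older format" box
    return None      # 59-63: possibly old Solexa scores, ambiguous


# Tiny in-memory example; with a real file, pass the open file handle.
record = ["@read1\n", "ACGT\n", "+\n", "II#I\n"]
print(guess_phred_offset(quality_strings(record)))
```

For a real file you would pass an open text handle, e.g. `guess_phred_offset(quality_strings(open(path)))`, ideally after limiting it to the first few thousand records.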

All that said, FastQ Screen should only come into play if a large portion of your data is NOT aligning to the expected genome.


Thank you for your detailed response!
