Question

Clarification about HiSeq files SE/PE and 8kb/3kb/

7.2 years ago

celine.petitjean ▴ 30

Hi all,

I am new to this forum, and to genome sequencing, assembly, and first-pass genome analysis, so I ask for your indulgence! I have received a new eukaryote genome, and I am having difficulty understanding what each of my files corresponds to.

In particular, it seems that I have one set of files corresponding to a HiSeq run: there is a forward and a reverse file, and for each of them there is a PE and an SE file. These files are in a folder named "cleaned", and I also have a folder named "raw" containing only two files, a forward and a reverse. I would guess that PE and SE stand for "paired-end" and "single-end", but in that case I don't understand why I have both forward and reverse files for the single-end sequencing. I may have completely misunderstood what my data are...

I also have the same types of files for what seems to be a MiSeq run.

Besides, I also have two folders containing LJD (Long Jumping Distance) files, named "processed" and "processed_cleaned". They hold two sets of files, "3kb" and "8kb", and for each set there is a forward, a reverse, and a singleton file. I am not sure whether LJD is another kind of HiSeq sequencing that provides longer reads, or a type of assembly. In any case, I don't understand the difference between 3kb and 8kb. Are these files meant to be used together, or are they different treatments of the same sequences?

I am fully aware that my questions may be very naive, so I will be grateful for any help! Thank you in advance!!

next-gen Assembly sequencing • 1.7k views

ADD COMMENT • link updated 7.2 years ago by h.mon 35k • written 7.2 years ago by celine.petitjean ▴ 30

score 2 · Answer 1 · 2017-02-14

In particular, it seems that I have one set of files corresponding to a HiSeq run: there is a forward and a reverse file, and for each of them there is a PE and an SE file.

This is a bit unclear. Paired-end (PE) data comes as two files, with R1 and R2 in the names to indicate which end of the sequenced fragments each file holds. I am not sure why there is single-end (SE) data unless the samples were also sequenced separately.

These files are in a folder named "cleaned", and I also have a folder named "raw" containing only two files, a forward and a reverse. I would guess that PE and SE stand for "paired-end" and "single-end", but in that case I don't understand why I have both forward and reverse files for the single-end sequencing.

The raw folder should hold the untrimmed data. The second part (reverse and forward files for single-end sequencing) does not make sense.

I also have the same types of files for what seems to be a MiSeq run.

Nothing here specifically indicates that the sequencing was done on a MiSeq unless the reads are 150+ bp long; read lengths of up to 300 bp are only possible on MiSeq. It is also possible to identify data as coming from a MiSeq by looking at the barcode of the flowcell, which should be in every read header.
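To look at the flowcell identifier in the read headers, the header can be split into its fields. This is only a sketch: it assumes the Casava 1.8+ header layout (`@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`), which older pipelines do not use, and the example header is invented for illustration.

```python
def parse_illumina_header(header):
    """Split a Casava 1.8+ style FASTQ header into named fields.

    Assumed layout (check it against your own files):
    @instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
    """
    if not header.startswith("@"):
        raise ValueError("FASTQ header lines start with '@'")
    # The header has two space-separated halves; the second is optional here.
    parts = header[1:].strip().split(" ", 1)
    left = parts[0].split(":")
    if len(left) != 7:
        raise ValueError("unexpected header layout: " + header)
    fields = dict(zip(
        ("instrument", "run", "flowcell", "lane", "tile", "x", "y"), left))
    if len(parts) == 2:
        right = parts[1].split(":")
        if len(right) == 4:
            fields.update(
                zip(("read", "filtered", "control", "index"), right))
    return fields


# Hypothetical header, made up for this example.
example = "@M01234:55:000000000-A1B2C:1:1101:15589:1332 1:N:0:ATCACG"
info = parse_illumina_header(example)
print(info["instrument"], info["flowcell"], info["lane"])
```

Running this on the first header of each file (for example from `zcat file.fastq.gz | head -1`) should show whether files share an instrument, flowcell, and lane.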

Besides, I also have two folders containing LJD (Long Jumping Distance) files, named "processed" and "processed_cleaned". They hold two sets of files, "3kb" and "8kb", and for each set there is a forward, a reverse, and a singleton file. I am not sure whether LJD is another kind of HiSeq sequencing that provides longer reads, or a type of assembly. In any case, I don't understand the difference between 3kb and 8kb. Are these files meant to be used together, or are they different treatments of the same sequences?

These could be mate-pair libraries with longer inserts. There may be two insert lengths.

Overall, if all of this is for a single sample, then someone seems to have gone to great lengths to create comprehensive libraries and sequencing data, which would be useful for doing assemblies.

score 1 · Answer 2 · 2017-02-14

7.2 years ago

h.mon 35k

You should definitely ask these (and other) questions of the people who gave you the data. FASTQ files carry very little metadata, and what you can guess from file and folder names is very limited. Anyway:

1) You have not received a genome; you received sequencing reads from a genome.

2) Cleaning (removing contaminants, adapters, and low-quality bases) may leave orphaned reads whose mate was discarded. These are typically placed in a single SE file, but may be placed in two SE files, one corresponding to R1 and the other to R2. You have to ask how the files were cleaned.

3) How do you know the reads come from HiSeq and MiSeq? Folder names?

4) 3kb and 8kb are the expected insert sizes for the LJD or mate-pair libraries, useful for scaffolding genomes.
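Point 2 can be illustrated with a small sketch. It is not the procedure used on your files (the cleaning tool is unknown); it only shows how filtering each mate independently produces paired output plus orphaned R1 and R2 reads. The `long_enough` filter and the reads are hypothetical.

```python
def split_pairs(pairs, passes):
    """Sort read pairs into paired and orphaned (single-end) output.

    pairs  : iterable of (r1, r2), each read a (name, sequence) tuple
    passes : function returning True when a read survives cleaning
    """
    pe_r1, pe_r2, se_r1, se_r2 = [], [], [], []
    for r1, r2 in pairs:
        ok1, ok2 = passes(r1), passes(r2)
        if ok1 and ok2:
            # Both mates survive: they stay paired.
            pe_r1.append(r1)
            pe_r2.append(r2)
        elif ok1:
            # Only the forward mate survives: orphan from R1.
            se_r1.append(r1)
        elif ok2:
            # Only the reverse mate survives: orphan from R2.
            se_r2.append(r2)
        # If neither mate survives, the pair is dropped.
    return pe_r1, pe_r2, se_r1, se_r2


def long_enough(read, min_len=5):
    """Hypothetical cleaning rule: keep reads of at least min_len bases."""
    return len(read[1]) >= min_len


pairs = [
    (("a/1", "ACGTACGT"), ("a/2", "TTGGCCAA")),  # both pass
    (("b/1", "ACGTAC"), ("b/2", "TT")),          # only R1 passes
    (("c/1", "A"), ("c/2", "GGCCAATT")),         # only R2 passes
    (("d/1", "AC"), ("d/2", "G")),               # neither passes
]
pe_r1, pe_r2, se_r1, se_r2 = split_pairs(pairs, long_enough)
print(len(pe_r1), len(pe_r2), len(se_r1), len(se_r2))
```

This would explain a "cleaned" folder with forward and reverse files in both PE and SE flavours, while the "raw" folder has only two files.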

ADD COMMENT • link 7.2 years ago by h.mon 35k

Thank you too, h.mon!! Especially for the correction of my terminology; I will do my best to use it properly! :)

About the SE sequences, that could be it, because the distinction between SE and PE appears only in the "cleaned" folder, but the problem of the R1 and R2 remains...

For the HiSeq and MiSeq, as I said to genomax2, I have a document that seems to describe the features of the files. One of the things I understood from it was this difference, and it is noted in the file names (which I guess is also visible in the read headers). The MiSeq files seem to have "L001" in their tags and the HiSeq files "L002"...

About the 3kb and 8kb, I am not sure I understand what you mean, so I will read the mate-pair library link and might come back to you.

In any case, thank you very much for your help!!

ADD REPLY • link 7.2 years ago by celine.petitjean ▴ 30

The MiSeq files seem to have "L001" in their tags and the HiSeq files "L002"...

The L00* number designates the lane in which your sample ran. MiSeq has only one lane, so MiSeq data will always have L001. On HiSeq, depending on the flowcell type, there may be 2 or 8 lanes, so you will see numbers between L001 and L008 for HiSeq data.
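If the file names follow the common Illumina pattern `<sample>_S<n>_L<lane>_R<read>_001.fastq.gz`, the lane and read number can be pulled out with a regular expression. This is a sketch under that assumption; the file names below are invented, and the data provider may have renamed the files.

```python
import re

# Matches the "_L002_R1_" part of an Illumina-style file name (assumed layout).
PATTERN = re.compile(r"_L(?P<lane>\d{3})_R(?P<read>[12])_")


def lane_and_read(filename):
    """Return (lane, read) as integers, or None if the name does not match."""
    match = PATTERN.search(filename)
    if match is None:
        return None
    return int(match.group("lane")), int(match.group("read"))


# Hypothetical file names for illustration.
names = [
    "gDNA_S1_L002_R1_001.fastq.gz",
    "gDNA_S1_L002_R2_001.fastq.gz",
    "readme.pdf",
]
for name in names:
    print(name, lane_and_read(name))
```

Keep in mind that the lane number alone does not identify the instrument: a file from lane 1 could come from either machine.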

ADD REPLY • link 7.2 years ago by GenoMax 141k

Hmm... So the MiSeq files are annotated L001, which makes sense given what you are telling me. And the HiSeq files are all annotated L002, for the three sets of files labelled with this method (and lane, as is indeed written in the readme PDF I have): gDNA, 3kb and 8kb. So I guess that they were all sequenced together... Thank you!!

ADD REPLY • link 7.2 years ago by celine.petitjean ▴ 30

So I guess that they were all sequenced together.

It is possible to pool multiple samples together by using sample-specific "tags". The reads for each sample will have the tag sequence in the headers of the FASTQ files (and perhaps in the file name, depending on which version of the Illumina post-processing software was used).

ADD REPLY • link 7.2 years ago by GenoMax 141k

As genomax2 pointed out, the only easy way to tell MiSeq from HiSeq data is read length: if you have reads 150 bp or longer, they came from MiSeq. HiSeq machines have lanes 1 to 8, so a bunch of files with L001 in their names may just as well have originated from HiSeq.
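A quick way to check read lengths is to tally them over the first reads of a file. The sketch below assumes plain four-line FASTQ records (no wrapped sequence lines); the demo records are invented. Run it on the files in the "raw" folder, since cleaned reads have been trimmed and will be shorter.

```python
import gzip
import io
from collections import Counter


def read_length_counts(handle, max_reads=100000):
    """Tally sequence lengths from a four-line-per-record FASTQ handle."""
    counts = Counter()
    for i, line in enumerate(handle):
        if i // 4 >= max_reads:
            break
        if i % 4 == 1:  # second line of each record is the sequence
            counts[len(line.strip())] += 1
    return counts


def open_fastq(path):
    """Open a FASTQ file as text, gzip-compressed or not."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path)


# Tiny in-memory demo with made-up reads.
demo = io.StringIO("@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGT\n+\nIIII\n")
counts = read_length_counts(demo)
print("longest read:", max(counts), "counts:", dict(counts))
```

For a real file, something like `read_length_counts(open_fastq("some_R1.fastq.gz"))` would give the same tally; the path here is a placeholder.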

Is there any reason you cannot talk with the data providers?

ADD REPLY • link 7.2 years ago by h.mon 35k