Question: Demultiplexing SRA data
Adrian Pelin (Canada) wrote, 3.4 years ago:

Hello,

I am looking at this experiment

It looks like the authors uploaded a multiplexed dataset with no information on the barcodes. Is there any way to guess them? Just by running FastQC you can sort of guess what some barcodes may be, but is there a better way?

I downloaded the SRA file and used --split-3, which gave me two files for the paired-end reads.

Here is what the file sorta looks like:

@SRR2046220.999996 HISEQ:108:C24D5ACXX:4:1101:5712:37823 length=102
AAGGATGCAATTCCTGGTGGTGCCATGGAGGTAAAGTCATAGTATTTTTATGATTTATATTTACATATTTTTACACTTCATAGTCATTTTTATAAAACTTTN
+SRR2046220.999996 HISEQ:108:C24D5ACXX:4:1101:5712:37823 length=102
CCCFFFFFHHHHHJJJIFGGFHIIHIIJAHIEGHJHFGGFIIBGIIJJJJJIJJIJJJJIJJJJJJIJIIJJJIIHHIIJIHHHFHHEFFFFFFEDDECEE#
@SRR2046220.999997 HISEQ:108:C24D5ACXX:4:1101:5536:37823 length=102
CAGGACGAAAATGAAGGTTTGGTTTTAACATTTGATCTGAGTTTATAGTATAGAAAGAGATCTATATTGACTCAGCTTTGCATATAAATCATACATTCTAGN
+SRR2046220.999997 HISEQ:108:C24D5ACXX:4:1101:5536:37823 length=102
######################################################################################################
@SRR2046220.999998 HISEQ:108:C24D5ACXX:4:1101:5653:37824 length=102
TAACTCTCTATTCACGAAAATCTGATCAATTGGATGACGGCTCGAAGAGCTTGATTCTACCAGATAGTACAGTTACATCAGGATGAAGTGCAGAAACGCTTN
+SRR2046220.999998 HISEQ:108:C24D5ACXX:4:1101:5653:37824 length=102
0;8@##################################################################################################
@SRR2046220.999999 HISEQ:108:C24D5ACXX:4:1101:5739:37825 length=102
CTCAATTCAATTCGGAGCTTCGTCCCCTACAGGACCTCACCCTTCGATCAAACTAAATTATTATTCTTTTTCCAATATTACAATATCAACAATATGTACGTN
+SRR2046220.999999 HISEQ:108:C24D5ACXX:4:1101:5739:37825 length=102
1++44=BDF>FFFEEG1CFCGIDGHFH@G>FGEHDGGIIIIID;;FFHEG8@FG@AGHH;CAEFFEHHHEEEDDE>CCEDCA5>>CDEC?@?CCCC>@CB<#
@SRR2046220.1000000 HISEQ:108:C24D5ACXX:4:1101:5569:37827 length=102
TCTCTTACAATTCCAAAAGATATAGATAAGGCAATTTATTGGTATGAAGAATCTGCTAAACAAGGAAATCAAGGTGCACAAAATAGTTTAGAAGGACTTCAN
+SRR2046220.1000000 HISEQ:108:C24D5ACXX:4:1101:5569:37827 length=102
+++22?@A+?CCCCBBBCBCABBBBBCBBBBCCBBBBBBBBBABABBBBBBBBBBBBBBBBBBBBBABBBBABB>=ABAAAA<>>@?@@B>@@@@==;???#
illumina multiplex ngs sra • 1.4k views
modified 9 months ago by RamRS • written 3.4 years ago by Adrian Pelin

Based on the SRA entry, this is ddRAD-Seq data. ENA also has only the paired-end FASTQ files. These must be EcoRI-MspI digested fragments.

modified 10 months ago by RamRS • written 3.4 years ago by genomax

How can I detect whether a given SRA RNA-seq FASTQ file has been demultiplexed?

written 7 weeks ago by Arindam Ghosh

Check the index sequences in the FASTQ headers. If there is more than one, the file is likely not demultiplexed.
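A rough sketch of that header check, in case it helps. It assumes original Illumina-style headers where the index is the last colon-separated field (for example `1:N:0:ACGTAC`); `count_header_indexes` is just an illustrative name. Headers rewritten by SRA, like the ones in the question above, usually no longer carry that field, so the tally would not be informative for them.

```shell
# Tally the last colon-separated field of every FASTQ header line.
# Assumption: Illumina-style headers such as "@M1:1:FC:1:1:1:1 1:N:0:ACGTAC",
# where that last field is the index sequence.
# Hypothetical usage: count_header_indexes reads_1.fq
count_header_indexes() {
    awk 'NR % 4 == 1 {                 # header is line 1 of each 4-line record
             n = split($0, f, ":")
             count[f[n]]++
         }
         END { for (i in count) print count[i], i }' "$@" |
        sort -k1,1nr -k2,2             # most frequent index first
}
```

More than one well-populated index in the output would point to a file that was not demultiplexed.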

written 7 weeks ago by genomax

Istvan Albert (University Park, USA) wrote, 3.4 years ago:

I have a semi-empirical "method": cut the first N bases from the first 100,000 sequences, then see how many unique prefixes you get. Something like this (untested, just typed it in):

# 100,000 FASTQ records = 400,000 lines
head -n 400000 data.fq |\
    bioawk -c fastx '{ print substr($seq, 1, 8) }' | sort |\
    uniq -c | sort -rn

Then keep raising the 8 to 9, 10, etc. As long as you are still reading the index, you'll have clear groups with approximately the same number of sequences.
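For anyone without bioawk, here is a minimal sketch of the same idea with plain awk, wrapped in a function so the prefix length is easy to vary. It assumes plain 4-line FASTQ records; `prefix_counts` and the file name in the comments are made up for illustration.

```shell
# Print the TOP most common length-N read prefixes from FASTQ on stdin.
# Assumption: 4 lines per record (sequences are not wrapped).
# Hypothetical usage, stepping through several prefix lengths:
#   for n in 6 8 10 12; do
#       echo "N=$n"; head -n 400000 data_1.fq | prefix_counts "$n" 10
#   done
prefix_counts() {
    awk -v n="$1" 'NR % 4 == 2 { count[substr($0, 1, n)]++ }   # line 2 holds the sequence
                   END { for (p in count) print count[p], p }' |
        sort -k1,1nr -k2,2 | head -n "${2:-10}"
}
```

The first argument is N and the optional second argument is how many groups to show (default 10).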

modified 10 months ago by RamRS • written 3.4 years ago by Istvan Albert

Very, very cool idea, love it! I tried it on 1 million.

So, at 8 I got:

1490 TCAATATC
1376 TAAAAAAA
1351 GGCATATC
1144 TGATCGCC
1063 GGACTCAC
1055 GGATATAC
1026 GGCAAGGC
1006 GGCCATCC

A low-complexity sequence like TAAAAAAA is probably a false positive.
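One way to screen those out automatically, as a rough sketch: drop any prefix in which a single base (or N) makes up most of the sequence. The function name and the 0.75 cutoff are arbitrary choices for illustration, not anything established in this thread.

```shell
# Filter "count prefix" lines (e.g. from uniq -c), dropping prefixes where
# one symbol accounts for at least MAX_FRAC of the bases.
# MAX_FRAC defaults to 0.75, an arbitrary illustrative threshold.
drop_low_complexity() {
    awk -v max_frac="${1:-0.75}" 'NF >= 2 && length($2) > 0 {
        worst = 0
        for (b = 1; b <= 5; b++) {
            base = substr("ACGTN", b, 1)
            tmp = $2
            hits = gsub(base, "", tmp)      # gsub returns the number of matches removed
            if (hits > worst) worst = hits
        }
        if (worst / length($2) < max_frac) print
    }'
}
```

With the counts above, TAAAAAAA (7 of 8 bases are A) is dropped, while TCAATATC and GGCATATC pass.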

At 10:

1465 TCAATATCAA
1347 GGCATATCAA
1130 TGATCGCCAA
1039 GGATATACAA
 996 GGCAAGGCAA
 984 GGCCATCCAA
 965 GGACTCACAA
 907 GAACTTGCAA

Interesting that all end in AA.

At 15:

457 TAANNNNNNNNNNNN
187 TCAATATCAATTCAA
182 TCAATATCAATTCTT
166 GGCATATCAATTCCT
157 TCAATATCAATTCAT
149 TGATCGCCAATTCAA
143 GGCATATCAATTCAA
142 GGCAAGGCAATTCAA
134 GGCCATCCAATTCAA
133 GGATATACAATTCAA
128 GGATATACAATTCTT

Is everything before AA the index sequence?
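One check that might help here, given the note above that these are EcoRI-MspI fragments: an EcoRI cut leaves AATTC at the start of the insert, so if the reads carry an inline barcode, AATTC should pile up at one fixed offset. The sketch below counts, for each offset, how many reads have the site starting right after that many bases. That the leading bases are a barcode is only an assumption, the thread does not confirm it, and `site_offsets` is a made-up name.

```shell
# For each offset 0..MAX_OFF, count reads whose sequence has SITE starting
# immediately after that many leading bases. Output: offset, hits, total reads.
# Assumptions: 4-line FASTQ on stdin; SITE defaults to AATTC (EcoRI overhang).
# Hypothetical usage: head -n 400000 data_1.fq | site_offsets AATTC 12
site_offsets() {
    awk -v site="${1:-AATTC}" -v max_off="${2:-12}" 'NR % 4 == 2 {
        reads++
        for (off = 0; off <= max_off; off++)
            if (substr($0, off + 1, length(site)) == site) hits[off]++
    }
    END { for (off = 0; off <= max_off; off++) print off, hits[off] + 0, reads + 0 }'
}
```

A single offset holding most of the hits would suggest a barcode of that length in front of the cut site. A barcode that itself contains AATTC will also add hits at a smaller offset.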

modified 10 months ago by RamRS • written 3.4 years ago by Adrian Pelin

Normally there should be a big drop in the groupings. But make sure to run it on the first file: only that one will have the index, the other mate won't.

written 3.4 years ago by Istvan Albert

Yeah, I am using it on _1 from SRA. Will try on _2. A big drop in grouping would make sense, but I don't think I see it. Could it be that the file was demultiplexed, the barcodes removed, and then all groups merged back together? That hardly makes sense, but it may be possible.

written 3.4 years ago by Adrian Pelin

Yeah, your data does not seem to indicate the presence of common sequences at the start.

written 3.4 years ago by Istvan Albert