US20230376788A1 - Nucleic acid-based data storage
- Publication number
- US20230376788A1 (application US18/230,385)
- Authority
- US
- United States
- Prior art keywords
- nucleic acid
- identifier
- identifiers
- components
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 150000007523 nucleic acids Chemical class 0.000 title claims abstract description 426
- 102000039446 nucleic acids Human genes 0.000 title claims abstract description 213
- 108020004707 nucleic acids Proteins 0.000 title claims abstract description 213
- 238000013500 data storage Methods 0.000 title description 13
- 238000000034 method Methods 0.000 claims abstract description 169
- 108091028043 Nucleic acid sequence Proteins 0.000 claims abstract description 88
- 238000003752 polymerase chain reaction Methods 0.000 claims description 53
- 108010091086 Recombinases Proteins 0.000 claims description 47
- 102000018120 Recombinases Human genes 0.000 claims description 47
- 238000006243 chemical reaction Methods 0.000 claims description 47
- 101710163270 Nuclease Proteins 0.000 claims description 17
- 102000004190 Enzymes Human genes 0.000 claims description 14
- 108090000790 Enzymes Proteins 0.000 claims description 14
- 230000015556 catabolic process Effects 0.000 claims description 11
- 238000006731 degradation reaction Methods 0.000 claims description 11
- 238000012217 deletion Methods 0.000 claims description 11
- 230000037430 deletion Effects 0.000 claims description 11
- 238000013507 mapping Methods 0.000 claims description 7
- 102000003960 Ligases Human genes 0.000 claims description 6
- 108090000364 Ligases Proteins 0.000 claims description 6
- 238000012986 modification Methods 0.000 claims description 5
- 230000004048 modification Effects 0.000 claims description 5
- 230000001351 cycling effect Effects 0.000 claims description 4
- 230000035772 mutation Effects 0.000 claims description 4
- 125000006850 spacer group Chemical group 0.000 claims description 4
- 108010017070 Zinc Finger Nucleases Proteins 0.000 claims description 3
- 238000000638 solvent extraction Methods 0.000 claims description 3
- 238000010459 TALEN Methods 0.000 claims description 2
- 108010043645 Transcription Activator-Like Effector Nucleases Proteins 0.000 claims description 2
- 238000007858 polymerase cycling assembly Methods 0.000 claims description 2
- 108020004414 DNA Proteins 0.000 abstract description 50
- 102000053602 DNA Human genes 0.000 abstract description 50
- 230000015572 biosynthetic process Effects 0.000 abstract description 16
- 238000003786 synthesis reaction Methods 0.000 abstract description 15
- 230000002255 enzymatic effect Effects 0.000 abstract description 2
- 239000000523 sample Substances 0.000 description 57
- 238000012163 sequencing technique Methods 0.000 description 57
- 238000005192 partition Methods 0.000 description 40
- 230000000153 supplemental effect Effects 0.000 description 40
- 239000000047 product Substances 0.000 description 39
- 238000003860 storage Methods 0.000 description 34
- 238000009396 hybridization Methods 0.000 description 32
- 230000000875 corresponding effect Effects 0.000 description 28
- 239000000126 substance Substances 0.000 description 24
- 230000000295 complement effect Effects 0.000 description 23
- 125000003729 nucleotide group Chemical group 0.000 description 22
- 230000015654 memory Effects 0.000 description 20
- 239000002773 nucleotide Substances 0.000 description 18
- 230000003321 amplification Effects 0.000 description 15
- 238000003199 nucleic acid amplification method Methods 0.000 description 15
- 238000003776 cleavage reaction Methods 0.000 description 14
- 230000003287 optical effect Effects 0.000 description 14
- 230000007017 scission Effects 0.000 description 14
- 239000006227 byproduct Substances 0.000 description 12
- 238000013459 approach Methods 0.000 description 10
- 239000003153 chemical reaction reagent Substances 0.000 description 10
- 238000001514 detection method Methods 0.000 description 10
- 108091033409 CRISPR Proteins 0.000 description 9
- 238000004891 communication Methods 0.000 description 9
- 230000008569 process Effects 0.000 description 9
- 239000011324 bead Substances 0.000 description 8
- 230000001404 mediated effect Effects 0.000 description 8
- 230000002194 synthesizing effect Effects 0.000 description 8
- 238000000605 extraction Methods 0.000 description 7
- 230000008439 repair process Effects 0.000 description 7
- 230000002441 reversible effect Effects 0.000 description 7
- 229920002477 rna polymer Polymers 0.000 description 7
- ISAKRJDGNUQOIC-UHFFFAOYSA-N Uracil Chemical compound O=C1C=CNC(=O)N1 ISAKRJDGNUQOIC-UHFFFAOYSA-N 0.000 description 6
- 238000010276 construction Methods 0.000 description 6
- 239000000499 gel Substances 0.000 description 6
- 238000001668 nucleic acid synthesis Methods 0.000 description 6
- 238000005215 recombination Methods 0.000 description 6
- 230000006798 recombination Effects 0.000 description 6
- RWQNBRDOKXIBIV-UHFFFAOYSA-N thymine Chemical compound CC1=CNC(=O)NC1=O RWQNBRDOKXIBIV-UHFFFAOYSA-N 0.000 description 6
- 108010042407 Endonucleases Proteins 0.000 description 5
- 102000004533 Endonucleases Human genes 0.000 description 5
- MTCFGRXMJLQNBG-UHFFFAOYSA-N Serine Natural products OCC(N)C(O)=O MTCFGRXMJLQNBG-UHFFFAOYSA-N 0.000 description 5
- 238000003753 real-time PCR Methods 0.000 description 5
- 238000004458 analytical method Methods 0.000 description 4
- 238000000429 assembly Methods 0.000 description 4
- 230000000712 assembly Effects 0.000 description 4
- 239000002775 capsule Substances 0.000 description 4
- OPTASPLRGRRNAP-UHFFFAOYSA-N cytosine Chemical compound NC=1C=CNC(=O)N=1 OPTASPLRGRRNAP-UHFFFAOYSA-N 0.000 description 4
- UYTPUPDQBNUYGX-UHFFFAOYSA-N guanine Chemical compound O=C1NC(N)=NC2=C1N=CN2 UYTPUPDQBNUYGX-UHFFFAOYSA-N 0.000 description 4
- 238000002493 microarray Methods 0.000 description 4
- 239000000203 mixture Substances 0.000 description 4
- 238000007857 nested PCR Methods 0.000 description 4
- 102000040430 polynucleotide Human genes 0.000 description 4
- 108091033319 polynucleotide Proteins 0.000 description 4
- 239000002157 polynucleotide Substances 0.000 description 4
- 238000012545 processing Methods 0.000 description 4
- 230000002829 reductive effect Effects 0.000 description 4
- 230000008685 targeting Effects 0.000 description 4
- 108010014303 DNA-directed DNA polymerase Proteins 0.000 description 3
- 102000016928 DNA-directed DNA polymerase Human genes 0.000 description 3
- 108020004682 Single-Stranded DNA Proteins 0.000 description 3
- 230000008901 benefit Effects 0.000 description 3
- 230000005540 biological transmission Effects 0.000 description 3
- 230000002596 correlated effect Effects 0.000 description 3
- 238000007847 digital PCR Methods 0.000 description 3
- 238000003780 insertion Methods 0.000 description 3
- 230000037431 insertion Effects 0.000 description 3
- 230000002427 irreversible effect Effects 0.000 description 3
- 239000011541 reaction mixture Substances 0.000 description 3
- 230000010076 replication Effects 0.000 description 3
- 239000000758 substrate Substances 0.000 description 3
- 229940035893 uracil Drugs 0.000 description 3
- RFLVMTUMFYRZCB-UHFFFAOYSA-N 1-methylguanine Chemical compound O=C1N(C)C(N)=NC2=C1N=CN2 RFLVMTUMFYRZCB-UHFFFAOYSA-N 0.000 description 2
- YSAJFXWTVFGPAX-UHFFFAOYSA-N 2-[(2,4-dioxo-1h-pyrimidin-5-yl)oxy]acetic acid Chemical compound OC(=O)COC1=CNC(=O)NC1=O YSAJFXWTVFGPAX-UHFFFAOYSA-N 0.000 description 2
- FZWGECJQACGGTI-UHFFFAOYSA-N 2-amino-7-methyl-1,7-dihydro-6H-purin-6-one Chemical compound NC1=NC(O)=C2N(C)C=NC2=N1 FZWGECJQACGGTI-UHFFFAOYSA-N 0.000 description 2
- OVONXEQGWXGFJD-UHFFFAOYSA-N 4-sulfanylidene-1h-pyrimidin-2-one Chemical compound SC=1C=CNC(=O)N=1 OVONXEQGWXGFJD-UHFFFAOYSA-N 0.000 description 2
- OIVLITBTBDPEFK-UHFFFAOYSA-N 5,6-dihydrouracil Chemical compound O=C1CCNC(=O)N1 OIVLITBTBDPEFK-UHFFFAOYSA-N 0.000 description 2
- ZLAQATDNGLKIEV-UHFFFAOYSA-N 5-methyl-2-sulfanylidene-1h-pyrimidin-4-one Chemical compound CC1=CNC(=S)NC1=O ZLAQATDNGLKIEV-UHFFFAOYSA-N 0.000 description 2
- KDCGOANMDULRCW-UHFFFAOYSA-N 7H-purine Chemical compound N1=CNC2=NC=NC2=C1 KDCGOANMDULRCW-UHFFFAOYSA-N 0.000 description 2
- 108020004635 Complementary DNA Proteins 0.000 description 2
- 108010017826 DNA Polymerase I Proteins 0.000 description 2
- 102000004594 DNA Polymerase I Human genes 0.000 description 2
- 102000004163 DNA-directed RNA polymerases Human genes 0.000 description 2
- 108090000626 DNA-directed RNA polymerases Proteins 0.000 description 2
- KCXVZYZYPLLWCC-UHFFFAOYSA-N EDTA Chemical compound OC(=O)CN(CC(O)=O)CCN(CC(O)=O)CC(O)=O KCXVZYZYPLLWCC-UHFFFAOYSA-N 0.000 description 2
- OUYCCCASQSFEME-QMMMGPOBSA-N L-tyrosine Chemical compound OC(=O)[C@@H](N)CC1=CC=C(O)C=C1 OUYCCCASQSFEME-QMMMGPOBSA-N 0.000 description 2
- HYVABZIGRDEKCD-UHFFFAOYSA-N N(6)-dimethylallyladenine Chemical compound CC(C)=CCNC1=NC=NC2=C1N=CN2 HYVABZIGRDEKCD-UHFFFAOYSA-N 0.000 description 2
- NQTADLQHYWFPDB-UHFFFAOYSA-N N-Hydroxysuccinimide Chemical class ON1C(=O)CCC1=O NQTADLQHYWFPDB-UHFFFAOYSA-N 0.000 description 2
- 108091034117 Oligonucleotide Proteins 0.000 description 2
- 108010006785 Taq Polymerase Proteins 0.000 description 2
- 108010001244 Tli polymerase Proteins 0.000 description 2
- 229960000643 adenine Drugs 0.000 description 2
- OIRDTQYFTABQOQ-KQYNXXCUSA-N adenosine group Chemical group [C@@H]1([C@H](O)[C@H](O)[C@@H](CO)O1)N1C=NC=2C(N)=NC=NC12 OIRDTQYFTABQOQ-KQYNXXCUSA-N 0.000 description 2
- 238000000246 agarose gel electrophoresis Methods 0.000 description 2
- 238000013019 agitation Methods 0.000 description 2
- 238000003556 assay Methods 0.000 description 2
- 238000005251 capillar electrophoresis Methods 0.000 description 2
- 238000004587 chromatography analysis Methods 0.000 description 2
- 229940104302 cytosine Drugs 0.000 description 2
- 230000003247 decreasing effect Effects 0.000 description 2
- 239000007857 degradation product Substances 0.000 description 2
- 238000006911 enzymatic reaction Methods 0.000 description 2
- 238000013467 fragmentation Methods 0.000 description 2
- 238000006062 fragmentation reaction Methods 0.000 description 2
- 230000002068 genetic effect Effects 0.000 description 2
- 229910052739 hydrogen Inorganic materials 0.000 description 2
- 239000001257 hydrogen Substances 0.000 description 2
- FDGQSTZJBFJUBT-UHFFFAOYSA-N hypoxanthine Chemical compound O=C1NC=NC2=C1NC=N2 FDGQSTZJBFJUBT-UHFFFAOYSA-N 0.000 description 2
- 238000010348 incorporation Methods 0.000 description 2
- 239000007788 liquid Substances 0.000 description 2
- 230000007774 longterm Effects 0.000 description 2
- QJGQUHMNIGDVPM-UHFFFAOYSA-N nitrogen group Chemical group [N] QJGQUHMNIGDVPM-UHFFFAOYSA-N 0.000 description 2
- 239000002777 nucleoside Substances 0.000 description 2
- 150000003833 nucleoside derivatives Chemical class 0.000 description 2
- 238000005580 one pot reaction Methods 0.000 description 2
- 230000002093 peripheral effect Effects 0.000 description 2
- 150000008300 phosphoramidites Chemical class 0.000 description 2
- 230000000704 physical effect Effects 0.000 description 2
- BASFCYQUMIYNBI-UHFFFAOYSA-N platinum Chemical compound [Pt] BASFCYQUMIYNBI-UHFFFAOYSA-N 0.000 description 2
- 108091008146 restriction endonucleases Proteins 0.000 description 2
- 238000005070 sampling Methods 0.000 description 2
- 238000006467 substitution reaction Methods 0.000 description 2
- 229940113082 thymine Drugs 0.000 description 2
- 238000012546 transfer Methods 0.000 description 2
- OUYCCCASQSFEME-UHFFFAOYSA-N tyrosine Natural products OC(=O)C(N)CC1=CC=C(O)C=C1 OUYCCCASQSFEME-UHFFFAOYSA-N 0.000 description 2
- WJNGQIYEQLPJMN-IOSLPCCCSA-N 1-methylinosine Chemical compound C1=NC=2C(=O)N(C)C=NC=2N1[C@@H]1O[C@H](CO)[C@@H](O)[C@H]1O WJNGQIYEQLPJMN-IOSLPCCCSA-N 0.000 description 1
- HLYBTPMYFWWNJN-UHFFFAOYSA-N 2-(2,4-dioxo-1h-pyrimidin-5-yl)-2-hydroxyacetic acid Chemical compound OC(=O)C(O)C1=CNC(=O)NC1=O HLYBTPMYFWWNJN-UHFFFAOYSA-N 0.000 description 1
- SGAKLDIYNFXTCK-UHFFFAOYSA-N 2-[(2,4-dioxo-1h-pyrimidin-5-yl)methylamino]acetic acid Chemical compound OC(=O)CNCC1=CNC(=O)NC1=O SGAKLDIYNFXTCK-UHFFFAOYSA-N 0.000 description 1
- XMSMHKMPBNTBOD-UHFFFAOYSA-N 2-dimethylamino-6-hydroxypurine Chemical compound N1C(N(C)C)=NC(=O)C2=C1N=CN2 XMSMHKMPBNTBOD-UHFFFAOYSA-N 0.000 description 1
- SMADWRYCYBUIKH-UHFFFAOYSA-N 2-methyl-7h-purin-6-amine Chemical compound CC1=NC(N)=C2NC=NC2=N1 SMADWRYCYBUIKH-UHFFFAOYSA-N 0.000 description 1
- KOLPWZCZXAMXKS-UHFFFAOYSA-N 3-methylcytosine Chemical compound CN1C(N)=CC=NC1=O KOLPWZCZXAMXKS-UHFFFAOYSA-N 0.000 description 1
- GJAKJCICANKRFD-UHFFFAOYSA-N 4-acetyl-4-amino-1,3-dihydropyrimidin-2-one Chemical compound CC(=O)C1(N)NC(=O)NC=C1 GJAKJCICANKRFD-UHFFFAOYSA-N 0.000 description 1
- MQJSSLBGAQJNER-UHFFFAOYSA-N 5-(methylaminomethyl)-1h-pyrimidine-2,4-dione Chemical compound CNCC1=CNC(=O)NC1=O MQJSSLBGAQJNER-UHFFFAOYSA-N 0.000 description 1
- WPYRHVXCOQLYLY-UHFFFAOYSA-N 5-[(methoxyamino)methyl]-2-sulfanylidene-1h-pyrimidin-4-one Chemical compound CONCC1=CNC(=S)NC1=O WPYRHVXCOQLYLY-UHFFFAOYSA-N 0.000 description 1
- LQLQRFGHAALLLE-UHFFFAOYSA-N 5-bromouracil Chemical compound BrC1=CNC(=O)NC1=O LQLQRFGHAALLLE-UHFFFAOYSA-N 0.000 description 1
- VKLFQTYNHLDMDP-PNHWDRBUSA-N 5-carboxymethylaminomethyl-2-thiouridine Chemical compound O[C@@H]1[C@H](O)[C@@H](CO)O[C@H]1N1C(=S)NC(=O)C(CNCC(O)=O)=C1 VKLFQTYNHLDMDP-PNHWDRBUSA-N 0.000 description 1
- ZFTBZKVVGZNMJR-UHFFFAOYSA-N 5-chlorouracil Chemical compound ClC1=CNC(=O)NC1=O ZFTBZKVVGZNMJR-UHFFFAOYSA-N 0.000 description 1
- KSNXJLQDQOIRIP-UHFFFAOYSA-N 5-iodouracil Chemical compound IC1=CNC(=O)NC1=O KSNXJLQDQOIRIP-UHFFFAOYSA-N 0.000 description 1
- KELXHQACBIUYSE-UHFFFAOYSA-N 5-methoxy-1h-pyrimidine-2,4-dione Chemical compound COC1=CNC(=O)NC1=O KELXHQACBIUYSE-UHFFFAOYSA-N 0.000 description 1
- LRSASMSXMSNRBT-UHFFFAOYSA-N 5-methylcytosine Chemical compound CC1=CNC(=O)N=C1N LRSASMSXMSNRBT-UHFFFAOYSA-N 0.000 description 1
- DCPSTSVLRXOYGS-UHFFFAOYSA-N 6-amino-1h-pyrimidine-2-thione Chemical compound NC1=CC=NC(S)=N1 DCPSTSVLRXOYGS-UHFFFAOYSA-N 0.000 description 1
- VKKXEIQIGGPMHT-UHFFFAOYSA-N 7h-purine-2,8-diamine Chemical compound NC1=NC=C2NC(N)=NC2=N1 VKKXEIQIGGPMHT-UHFFFAOYSA-N 0.000 description 1
- MSSXOMSJDRHRMC-UHFFFAOYSA-N 9H-purine-2,6-diamine Chemical compound NC1=NC(N)=C2NC=NC2=N1 MSSXOMSJDRHRMC-UHFFFAOYSA-N 0.000 description 1
- 229930024421 Adenine Natural products 0.000 description 1
- GFFGJBXGBJISGV-UHFFFAOYSA-N Adenine Chemical compound NC1=NC=NC2=C1N=CN2 GFFGJBXGBJISGV-UHFFFAOYSA-N 0.000 description 1
- 239000002126 C01EB10 - Adenosine Substances 0.000 description 1
- RYGMFSIKBFXOCR-UHFFFAOYSA-N Copper Chemical compound [Cu] RYGMFSIKBFXOCR-UHFFFAOYSA-N 0.000 description 1
- 108050009160 DNA polymerase 1 Proteins 0.000 description 1
- 241000588724 Escherichia coli Species 0.000 description 1
- 241000701533 Escherichia virus T4 Species 0.000 description 1
- GHASVSINZRGABV-UHFFFAOYSA-N Fluorouracil Chemical compound FC1=CNC(=O)NC1=O GHASVSINZRGABV-UHFFFAOYSA-N 0.000 description 1
- 108020005004 Guide RNA Proteins 0.000 description 1
- UGQMRVRMYYASKQ-UHFFFAOYSA-N Hypoxanthine nucleoside Natural products OC1C(O)C(CO)OC1N1C(NC=NC2=O)=C2N=C1 UGQMRVRMYYASKQ-UHFFFAOYSA-N 0.000 description 1
- 229930010555 Inosine Natural products 0.000 description 1
- UGQMRVRMYYASKQ-KQYNXXCUSA-N Inosine Chemical compound O[C@@H]1[C@H](O)[C@@H](CO)O[C@H]1N1C2=NC=NC(O)=C2N=C1 UGQMRVRMYYASKQ-KQYNXXCUSA-N 0.000 description 1
- SGSSKEDGVONRGC-UHFFFAOYSA-N N(2)-methylguanine Chemical compound O=C1NC(NC)=NC2=C1N=CN2 SGSSKEDGVONRGC-UHFFFAOYSA-N 0.000 description 1
- 238000012408 PCR amplification Methods 0.000 description 1
- 108091093037 Peptide nucleic acid Proteins 0.000 description 1
- 108010002747 Pfu DNA polymerase Proteins 0.000 description 1
- 108010010677 Phosphodiesterase I Proteins 0.000 description 1
- 108010021757 Polynucleotide 5'-Hydroxyl-Kinase Proteins 0.000 description 1
- 102000008422 Polynucleotide 5'-hydroxyl-kinase Human genes 0.000 description 1
- 108010019653 Pwo polymerase Proteins 0.000 description 1
- CZPWVGJYEJSRLH-UHFFFAOYSA-N Pyrimidine Chemical compound C1=CN=CN=C1 CZPWVGJYEJSRLH-UHFFFAOYSA-N 0.000 description 1
- 238000003559 RNA-seq method Methods 0.000 description 1
- 108091028664 Ribonucleotide Proteins 0.000 description 1
- 108010020713 Tth polymerase Proteins 0.000 description 1
- 101150071882 US17 gene Proteins 0.000 description 1
- JLCPHMBAVCMARE-UHFFFAOYSA-N [3-[[3-[[3-[[3-[[3-[[3-[[3-[[3-[[3-[[3-[[3-[[5-(2-amino-6-oxo-1H-purin-9-yl)-3-[[3-[[3-[[3-[[3-[[3-[[5-(2-amino-6-oxo-1H-purin-9-yl)-3-[[5-(2-amino-6-oxo-1H-purin-9-yl)-3-hydroxyoxolan-2-yl]methoxy-hydroxyphosphoryl]oxyoxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(5-methyl-2,4-dioxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxyoxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(5-methyl-2,4-dioxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(5-methyl-2,4-dioxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(5-methyl-2,4-dioxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methyl [5-(6-aminopurin-9-yl)-2-(hydroxymethyl)oxolan-3-yl] hydrogen phosphate Polymers 
Cc1cn(C2CC(OP(O)(=O)OCC3OC(CC3OP(O)(=O)OCC3OC(CC3O)n3cnc4c3nc(N)[nH]c4=O)n3cnc4c3nc(N)[nH]c4=O)C(COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3CO)n3cnc4c(N)ncnc34)n3ccc(N)nc3=O)n3cnc4c(N)ncnc34)n3ccc(N)nc3=O)n3ccc(N)nc3=O)n3ccc(N)nc3=O)n3cnc4c(N)ncnc34)n3cnc4c(N)ncnc34)n3cc(C)c(=O)[nH]c3=O)n3cc(C)c(=O)[nH]c3=O)n3ccc(N)nc3=O)n3cc(C)c(=O)[nH]c3=O)n3cnc4c3nc(N)[nH]c4=O)n3cnc4c(N)ncnc34)n3cnc4c(N)ncnc34)n3cnc4c(N)ncnc34)n3cnc4c(N)ncnc34)O2)c(=O)[nH]c1=O JLCPHMBAVCMARE-UHFFFAOYSA-N 0.000 description 1
- 230000002378 acidificating effect Effects 0.000 description 1
- 229960005305 adenosine Drugs 0.000 description 1
- 150000001412 amines Chemical group 0.000 description 1
- 238000000137 annealing Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 229910002056 binary alloy Inorganic materials 0.000 description 1
- 239000007795 chemical reaction product Substances 0.000 description 1
- 239000000470 constituent Substances 0.000 description 1
- 230000008094 contradictory effect Effects 0.000 description 1
- 238000012937 correction Methods 0.000 description 1
- 230000037029 cross reaction Effects 0.000 description 1
- 230000000593 degrading effect Effects 0.000 description 1
- 239000005547 deoxyribonucleotide Substances 0.000 description 1
- 125000002637 deoxyribonucleotide group Chemical group 0.000 description 1
- 238000000151 deposition Methods 0.000 description 1
- 230000029087 digestion Effects 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 239000000839 emulsion Substances 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- LYCAIKOWRPUZTN-UHFFFAOYSA-N ethylene glycol Natural products OCCO LYCAIKOWRPUZTN-UHFFFAOYSA-N 0.000 description 1
- 239000000835 fiber Substances 0.000 description 1
- 239000012530 fluid Substances 0.000 description 1
- 229960002949 fluorouracil Drugs 0.000 description 1
- 230000008014 freezing Effects 0.000 description 1
- 238000007710 freezing Methods 0.000 description 1
- 230000006870 function Effects 0.000 description 1
- 230000014509 gene expression Effects 0.000 description 1
- 238000010438 heat treatment Methods 0.000 description 1
- 238000012165 high-throughput sequencing Methods 0.000 description 1
- 230000002209 hydrophobic effect Effects 0.000 description 1
- WGCNASOHLSPBMP-UHFFFAOYSA-N hydroxyacetaldehyde Natural products OCC=O WGCNASOHLSPBMP-UHFFFAOYSA-N 0.000 description 1
- 230000005847 immunogenicity Effects 0.000 description 1
- 238000011534 incubation Methods 0.000 description 1
- 229960003786 inosine Drugs 0.000 description 1
- 238000011901 isothermal amplification Methods 0.000 description 1
- 230000000670 limiting effect Effects 0.000 description 1
- 238000011068 loading method Methods 0.000 description 1
- 238000007403 mPCR Methods 0.000 description 1
- 238000012423 maintenance Methods 0.000 description 1
- 238000007726 management method Methods 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 239000000463 material Substances 0.000 description 1
- IZAGSTRIDUNNOY-UHFFFAOYSA-N methyl 2-[(2,4-dioxo-1h-pyrimidin-5-yl)oxy]acetate Chemical compound COC(=O)COC1=CNC(=O)NC1=O IZAGSTRIDUNNOY-UHFFFAOYSA-N 0.000 description 1
- 238000002156 mixing Methods 0.000 description 1
- 238000001823 molecular biology technique Methods 0.000 description 1
- 238000007481 next generation sequencing Methods 0.000 description 1
- 230000003647 oxidation Effects 0.000 description 1
- 238000007254 oxidation reaction Methods 0.000 description 1
- 125000002467 phosphate group Chemical group [H]OP(=O)(O[H])O[*] 0.000 description 1
- 230000000865 phosphorylative effect Effects 0.000 description 1
- 229910052697 platinum Inorganic materials 0.000 description 1
- 238000011176 pooling Methods 0.000 description 1
- 239000000843 powder Substances 0.000 description 1
- 230000009257 reactivity Effects 0.000 description 1
- 230000009467 reduction Effects 0.000 description 1
- 238000005057 refrigeration Methods 0.000 description 1
- 239000002336 ribonucleotide Substances 0.000 description 1
- 125000002652 ribonucleotide group Chemical group 0.000 description 1
- 238000007480 sanger sequencing Methods 0.000 description 1
- 231100000241 scar Toxicity 0.000 description 1
- 239000004065 semiconductor Substances 0.000 description 1
- 238000007841 sequencing by ligation Methods 0.000 description 1
- 239000010454 slate Substances 0.000 description 1
- 239000007787 solid Substances 0.000 description 1
- 238000007920 subcutaneous administration Methods 0.000 description 1
- 230000035899 viability Effects 0.000 description 1
- WCNMEQDMUYVWMJ-JPZHCBQBSA-N wybutoxosine Chemical compound C1=NC=2C(=O)N3C(CC([C@H](NC(=O)OC)C(=O)OC)OO)=C(C)N=C3N(C)C=2N1[C@@H]1O[C@H](CO)[C@@H](O)[C@H]1O WCNMEQDMUYVWMJ-JPZHCBQBSA-N 0.000 description 1
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/12—Computing arrangements based on biological models using genetic models
- G06N3/123—DNA computing
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N15/00—Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
- C12N15/09—Recombinant DNA-technology
- C12N15/10—Processes for the isolation, preparation or purification of DNA or RNA
- C12N15/102—Mutagenizing nucleic acids
- C12N15/1031—Mutagenizing nucleic acids mutagenesis by gene assembly, e.g. assembly by oligonucleotide extension PCR
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N15/00—Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
- C12N15/09—Recombinant DNA-technology
- C12N15/10—Processes for the isolation, preparation or purification of DNA or RNA
- C12N15/1034—Isolating an individual clone by screening libraries
- C12N15/1089—Design, preparation, screening or analysis of libraries using computer algorithms
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N9/00—Enzymes; Proenzymes; Compositions thereof; Processes for preparing, activating, inhibiting, separating or purifying enzymes
- C12N9/14—Hydrolases (3)
- C12N9/16—Hydrolases (3) acting on ester bonds (3.1)
- C12N9/22—Ribonucleases RNAses, DNAses
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6806—Preparing nucleic acids for analysis, e.g. for polymerase chain reaction [PCR] assay
-
- C—CHEMISTRY; METALLURGY
- C40—COMBINATORIAL TECHNOLOGY
- C40B—COMBINATORIAL CHEMISTRY; LIBRARIES, e.g. CHEMICAL LIBRARIES
- C40B50/00—Methods of creating libraries, e.g. combinatorial synthesis
- C40B50/06—Biochemical methods, e.g. using enzymes or whole viable microorganisms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2272—Management thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F5/00—Methods or arrangements for data conversion without changing the order or content of the data handled
- G06F5/06—Methods or arrangements for data conversion without changing the order or content of the data handled for changing the speed of data flow, i.e. speed regularising or timing, e.g. delay lines, FIFO buffers; over- or underrun control therefor
- G06F5/08—Methods or arrangements for data conversion without changing the order or content of the data handled for changing the speed of data flow, i.e. speed regularising or timing, e.g. delay lines, FIFO buffers; over- or underrun control therefor having a sequence of storage locations, the intermediate ones not being accessible for either enqueue or dequeue operations, e.g. using a shift register
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11C—STATIC STORES
- G11C11/00—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
- G11C11/56—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using storage elements with more than two stable states represented by steps, e.g. of voltage, current, phase, frequency
- G11C11/5664—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using storage elements with more than two stable states represented by steps, e.g. of voltage, current, phase, frequency using organic memory material storage elements
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11C—STATIC STORES
- G11C13/00—Digital stores characterised by the use of storage elements not covered by groups G11C11/00, G11C23/00, or G11C25/00
- G11C13/0002—Digital stores characterised by the use of storage elements not covered by groups G11C11/00, G11C23/00, or G11C25/00 using resistive RAM [RRAM] elements
- G11C13/0009—RRAM elements whose operation depends upon chemical change
- G11C13/0014—RRAM elements whose operation depends upon chemical change comprising cells based on organic memory material
- G11C13/0019—RRAM elements whose operation depends upon chemical change comprising cells based on organic memory material comprising bio-molecules
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11C—STATIC STORES
- G11C13/00—Digital stores characterised by the use of storage elements not covered by groups G11C11/00, G11C23/00, or G11C25/00
- G11C13/02—Digital stores characterised by the use of storage elements not covered by groups G11C11/00, G11C23/00, or G11C25/00 using elements whose operation depends upon chemical change
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/20—Sequence assembly
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B35/00—ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/30—Data warehousing; Computing architectures
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/40—Encryption of genetic data
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/50—Compression of genetic data
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B99/00—Subject matter not provided for in other groups of this subclass
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/40—Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N2310/00—Structure or type of the nucleic acid
- C12N2310/10—Type of nucleic acid
- C12N2310/20—Type of nucleic acid involving clustered regularly interspaced short palindromic repeats [CRISPRs]
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
Definitions
- Nucleic acid digital data storage is a stable approach for encoding and storing information for long periods of time, with data stored at higher densities than magnetic tape or hard drive storage systems. Additionally, digital data stored in nucleic acid molecules that are stored in cold and dry conditions can be retrieved as long as 60,000 years later or longer.
- nucleic acid molecules may be sequenced.
- nucleic acid digital data storage may be an ideal method for storing data that is not frequently accessed but may have a high volume of information to be stored or archived for long periods of time.
- bit-value information in the presence or absence of unique nucleic acid sequences within a pool, comprising specifying each bit location in a bit-stream with a unique nucleic sequence and specifying the bit value at that location by the presence or absence of the corresponding unique nucleic acid sequence in the pool.
- specifying unique bytes in a byte stream by unique subsets of nucleic acid sequences are also disclosed.
- methods for generating unique nucleic acid sequences without base-to-base synthesis using combinatorial genomic strategies (e.g., assembly of multiple nucleic acid sequences or enzymatic-based editing of nucleic acid sequences).
- the present disclosure provides a method for writing information into nucleic acid sequence(s), comprising: (a) translating the information into a string of symbols; (b) mapping the string of symbols to a plurality of identifiers, wherein an individual identifier of the plurality of identifiers comprises one or more components, wherein an individual component of the one or more components comprises a nucleic acid sequence, and wherein the individual identifier of the plurality of identifiers corresponds to an individual symbol of the string of symbols; and (c) constructing an identifier library comprising at least a subset of the plurality of identifiers.
- each symbol in the string of symbols is one of two possible symbol values.
- one symbol value at each position of the string of symbols may be represented by the absence of a distinct identifier in the identifier library.
- the two possible symbol values are a bit-value of 0 and 1, wherein the individual symbol with the bit-value of 0 in the string of symbols may be represented by an absence of a distinct identifier in the identifier library, wherein the individual symbol with the bit-value of 1 in the string of symbols may be represented by a presence of the distinct identifier in the identifier library, and vice versa.
- each symbol of the string of symbols is one of one or more possible symbol values.
- a presence of an individual identifier in the identifier library corresponds to a first symbol value in a binary string and an absence of the individual identifier corresponds to a second symbol value in a binary string.
- the first symbol value is a bit value of 1 and the second symbol value is a bit value of 0.
- the first symbol value is a bit value of 0 and the second symbol value is a bit value of 1.
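The presence/absence mapping described above can be sketched in software. This is a minimal illustration only: the identifier names (`ID0`, `ID1`, ...) and the function `bits_to_library` are hypothetical stand-ins for physical nucleic acid molecules and are not taken from the disclosure.

```python
def bits_to_library(bits, identifiers, present_value="1"):
    """Return the set of identifiers to construct for a bit string.

    Each bit position is assigned one identifier (assumption: position i
    maps to identifiers[i]). An identifier is included in the library only
    when the bit at its position equals `present_value`.
    """
    if len(bits) > len(identifiers):
        raise ValueError("not enough identifiers for the bit string")
    return {identifiers[i] for i, bit in enumerate(bits) if bit == present_value}


# Hypothetical identifiers, one per bit position.
identifiers = [f"ID{i}" for i in range(8)]

# Bit value 1 represented by presence, 0 by absence.
library = bits_to_library("01101000", identifiers)

# The inverse convention: bit value 0 represented by presence.
inverse_library = bits_to_library("01101000", identifiers, present_value="0")
```

Swapping `present_value` covers both conventions listed above: either bit value may be the one represented by the presence of an identifier.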
- constructing the individual identifier in the identifier library comprises assembling the one or more components from one or more layers and wherein each layer of the one or more layers comprises a distinct set of components.
- the individual identifier from the identifier library comprises one component from each layer of the one or more layers.
- the one or more components are assembled in a fixed order.
- the one or more components are assembled in a random order.
- the one or more components are assembled with one or more partitioning components disposed between two components from different layers of the one or more layers.
- the individual identifier comprises one component from each layer of a subset of the one or more layers.
- the individual identifier comprises at least one component from each of the one or more layers.
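As a rough illustration of layered assembly, the sketch below enumerates every identifier that takes exactly one component from each layer in a fixed order. The component labels and the `-` separator are hypothetical; real components would be nucleic acid sequences joined by one of the assembly chemistries.

```python
from itertools import product

# Hypothetical layers; each layer holds a distinct set of component labels.
layers = [
    ["A1", "A2"],
    ["B1", "B2", "B3"],
    ["C1", "C2"],
]


def assemble_identifiers(layers):
    """Enumerate identifiers built from one component per layer, in layer order."""
    return ["-".join(combo) for combo in product(*layers)]


all_identifiers = assemble_identifiers(layers)
```

Because `itertools.product` yields combinations in a deterministic order, the position of an identifier in `all_identifiers` can double as its rank in this sketch.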
- the one or more components are assembled using overlap-extension polymerase chain reaction (PCR), polymerase cycling assembly, sticky end ligation, BioBricks assembly, Golden Gate assembly, Gibson assembly, recombinase assembly, ligase cycling reaction, or template directed ligation.
- constructing the individual identifier in the identifier library comprises deleting, replacing, or inserting at least one component in a parent identifier by applying nucleic acid editing enzymes to the parent identifier.
- the parent identifier comprises a plurality of components flanked by nuclease-specific target sites, recombinase recognition sites, or distinct spacer sequences.
- the nucleic acid editing enzymes are selected from the group consisting of CRISPR-Cas, TALENs, Zinc Finger Nucleases, Recombinases, and functional variants thereof.
- the identifier library comprises a plurality of nucleic acid sequences.
- the plurality of nucleic acid sequences stores metadata of the information and/or conceals the information.
- the metadata comprises secondary information corresponding to a source of the information, an intended recipient of the information, an original format of the information, instrumentation and methods used to encode the information, a date and a time of writing the information into the identifier library, modifications made to the information, and/or a reference to other information.
- one or more identifier libraries are combined and wherein each identifier library of the one or more identifier libraries is tagged with a distinct barcode.
- each individual identifier in the identifier library comprises the distinct barcode.
- the plurality of identifiers is selected for ease of read, write, access, copy, and deletion operations. In some embodiments, the plurality of identifiers is selected to minimize write errors, mutations, degradation, and read errors.
- the present disclosure provides a method for copying information encoded in nucleic acid sequence(s), comprising: (a) providing an identifier library encoding a string of symbols, wherein the identifier library comprises a plurality of identifiers, wherein an individual identifier of the plurality of identifiers comprises one or more components, wherein an individual component of the one or more components comprises a nucleic acid sequence, and wherein the individual identifier of the plurality of identifiers corresponds to an individual symbol of the string of symbols; and (b) constructing one or more copies of the identifier library.
- the plurality of identifiers comprises one or more primer binding sites.
- the identifier library is copied using polymerase chain reaction (PCR).
- the PCR is conventional PCR or linear PCR and wherein a number of copies of the identifier library double or increase linearly, respectively, with each PCR cycle.
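The difference between the two amplification modes can be made concrete with an idealized calculation. The function below is a sketch that assumes perfect efficiency in every cycle, which real reactions do not achieve; the name `copies_after_pcr` is illustrative.

```python
def copies_after_pcr(initial_copies, cycles, mode="conventional"):
    """Idealized copy count after a number of PCR cycles.

    conventional: every molecule is duplicated each cycle (doubling).
    linear: only the original templates are copied each cycle, so the
            count grows by `initial_copies` per cycle.
    """
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    if mode == "conventional":
        return initial_copies * 2 ** cycles
    if mode == "linear":
        return initial_copies * (1 + cycles)
    raise ValueError(f"unknown mode: {mode}")


conventional = copies_after_pcr(10, 3)          # 10 -> 20 -> 40 -> 80
linear = copies_after_pcr(10, 3, mode="linear")  # 10 -> 20 -> 30 -> 40
```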
- the individual identifier in the identifier library is ligated into a circular vector prior to PCR and wherein the circular vector comprises a barcode at each end of the individual identifier.
- the identifier library comprises a plurality of nucleic acid sequences. In some embodiments, the plurality of nucleic acid sequences is copied. In some embodiments, one or more identifier libraries are combined prior to copying and wherein each library of the one or more identifier libraries comprises a distinct barcode.
- the present disclosure provides a method for accessing information encoded in nucleic acid sequence(s), comprising: (a) providing an identifier library encoding a string of symbols, wherein the identifier library comprises a plurality of identifiers, wherein an individual identifier of the plurality of identifiers comprises one or more components, wherein an individual component of the one or more components comprises a nucleic acid sequence, and wherein the individual identifier of the plurality of identifiers corresponds to an individual symbol of the string of symbols; and (b) extracting a targeted subset of the plurality of identifiers from the identifier library.
- a plurality of probes is combined with the identifier library. In some embodiments, the plurality of probes share complementarity with the targeted subset of the plurality of identifiers from the identifier library. In some embodiments, the plurality of probes hybridizes the targeted subset of the plurality of identifiers in the identifier library. In some embodiments, the plurality of probes comprises one or more affinity tags and wherein the one or more affinity tags is captured by an affinity bead or an affinity column.
- the identifier library is sequentially combined with one or more subsets of the plurality of probes and wherein a portion of the identifier library binds to the one or more subsets of the plurality of probes. In some embodiments, the portion of the identifier library that binds to the one or more subsets of the plurality of probes is removed prior to the addition of another subset of the plurality of probes to the identifier library.
- the individual identifier of the plurality of identifiers comprises one or more common primer binding regions, one or more variable primer binding regions, or any combination thereof.
- the identifier library is combined with primers that bind to the one or more common primer binding regions or to the one or more variable primer binding regions.
- the primers that bind to the one or more variable primer binding regions are used to selectively amplify the targeted subset of the identifier library.
- a portion of identifiers is removed from the identifier library by selective nuclease cleavage.
- the identifier library is combined with Cas9 and guide probes and wherein the guide probes guide the Cas9 to remove specified identifiers from the identifier library.
- the individual identifiers are single-stranded and wherein the identifier library is combined with a single-strand specific endonuclease(s).
- the identifier library is mixed with a complementary set of individual identifiers that protect target individual identifiers from degradation prior to the addition of the single-strand specific endonuclease(s).
- the individual identifiers that are not cleaved by the selective nuclease cleavage are separated by size-selective chromatography. In some embodiments, the individual identifiers that are not cleaved by the selective nuclease cleavage are amplified and wherein the individual identifiers that are cleaved by the selective nuclease cleavage are not amplified. In some embodiments, the identifier library comprises a plurality of nucleic acid sequences and wherein the plurality of nucleic acid sequences are extracted with the targeted subset of the plurality of identifiers in the identifier library.
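The selection logic behind targeted access can be modeled in software, independent of whether the physical step uses probes, primers, or nuclease cleavage. In this sketch an identifier is a tuple of hypothetical component labels, and `select_identifiers` returns the subset containing the requested components; the 'AND' and 'OR' modes mirror the operations named in the descriptions of FIGS. 17 B and 17 C.

```python
def select_identifiers(library, required_components, mode="and"):
    """Return identifiers containing all ('and') or any ('or') required components."""
    required = set(required_components)
    if mode == "and":
        return {ident for ident in library if required <= set(ident)}
    if mode == "or":
        return {ident for ident in library if required & set(ident)}
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical library of two-component identifiers.
library = {("A1", "B1"), ("A1", "B2"), ("A2", "B1"), ("A2", "B2")}

with_a1 = select_identifiers(library, ["A1"])
a1_and_b1 = select_identifiers(library, ["A1", "B1"], mode="and")
a1_or_b1 = select_identifiers(library, ["A1", "B1"], mode="or")
```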
- the present disclosure provides a method for reading information encoded in nucleic acid sequence(s), comprising: (a) providing an identifier library comprising a plurality of identifiers, wherein an individual identifier of the plurality of identifiers comprises one or more components, wherein an individual component of the one or more components comprises a nucleic acid sequence; (b) identifying the plurality of identifiers in the identifier library; (c) generating a plurality of symbols from the plurality of identifiers identified in (b), wherein an individual symbol of the plurality of symbols corresponds to the individual identifier of the plurality of identifiers; and (d) compiling the information from the plurality of symbols.
- each symbol in the string of symbols is one of two possible symbol values.
- one symbol value at each position of the string of symbols may be represented by the absence of a distinct identifier in the identifier library.
- the two possible symbol values are a bit-value of 0 and 1, wherein the individual symbol with the bit-value of 0 in the string of symbols may be represented by an absence of a distinct identifier in the identifier library, wherein the individual symbol with the bit-value of 1 in the string of symbols may be represented by a presence of the distinct identifier in the identifier library, and vice versa.
- a presence of an individual identifier in the identifier library corresponds to a first symbol value in a binary string and an absence of the individual identifier in the identifier library corresponds to a second symbol value in a binary string.
- the first symbol value is a bit value of 1 and the second symbol value is a bit value of 0.
- the first symbol value is a bit value of 0 and the second symbol value is a bit value of 1.
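Reading reverses the presence/absence mapping: once the identifiers in a library have been identified, each bit position is assigned a value according to whether its identifier was observed. The sketch below assumes a hypothetical ordered list of identifier names and a set of observed identifiers; it does not model sequencing errors.

```python
def library_to_bits(observed, identifiers, present_value="1"):
    """Rebuild the bit string from the set of identifiers observed in a library."""
    absent_value = "0" if present_value == "1" else "1"
    return "".join(
        present_value if ident in observed else absent_value
        for ident in identifiers
    )


# Hypothetical identifiers, one per bit position, and a set of observed ones.
identifiers = [f"ID{i}" for i in range(8)]
observed = {"ID1", "ID2", "ID4"}

bits = library_to_bits(observed, identifiers)

# Compile the symbols back into information, here a single byte.
decoded = bytes([int(bits, 2)])
```

With presence representing a bit value of 1, the three observed identifiers yield the bit string `01101000`, which is the byte for the ASCII character `h`.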
- identifying the plurality of identifiers comprises sequencing the plurality of identifiers in the identifier library.
- sequencing comprises digital polymerase chain reaction (PCR), quantitative PCR, a microarray, sequencing by synthesis, or massively-parallel sequencing.
- the identifier library comprises a plurality of nucleic acid sequences.
- the plurality of nucleic acid sequences store metadata of the information and/or conceal the information.
- one or more identifier libraries are combined and wherein each identifier library in the one or more identifier libraries comprises a distinct barcode.
- the barcode stores metadata of the information.
- the present disclosure provides a method for nucleic acid-based computer data storage, comprising: (a) receiving computer data, (b) synthesizing nucleic acid molecules comprising nucleic acid sequences encoding the computer data, wherein the computer data is encoded in at least a subset of nucleic acid molecules synthesized and not in a sequence of each of the nucleic acid molecules, and (c) storing the nucleic acid molecules having the nucleic acid sequences.
- the at least the subset of the nucleic acid molecules are grouped together.
- the method further comprises sequencing the nucleic acid molecule(s) to determine the nucleic acid sequence(s), thereby retrieving the computer data.
- (b) is performed in a time period that is less than about 1 day. In some embodiments, (b) is performed at an accuracy of at least about 90%.
- the present disclosure provides a method for nucleic acid-based computer data storage, comprising: (a) receiving computer data, (b) synthesizing a nucleic acid molecule comprising at least one nucleic acid sequence encoding the computer data, which synthesizing the nucleic acid molecule is in the absence of base-by-base nucleic acid synthesis, and (c) storing the nucleic acid molecule comprising the at least one nucleic acid sequence.
- the method further comprises sequencing the nucleic acid molecule to determine the nucleic acid sequence, thereby retrieving the computer data.
- (b) is performed in a time period that is less than about 1 day. In some embodiments, (b) is performed at an accuracy of at least about 90%.
- the present disclosure provides a system for encoding binary sequence data using nucleic acids, comprising: a device configured to construct an identifier library, wherein the identifier library comprises a plurality of identifiers, wherein an individual identifier of the plurality of identifiers comprises one or more components, and wherein an individual component of the one or more components is a nucleic acid sequence; and one or more computer processors operatively coupled to the device, wherein the one or more computer processors are individually or collectively programmed to (i) translate the information into a string of symbols, (ii) map the string of symbols to the plurality of identifiers, wherein the individual identifier of the plurality of identifiers corresponds to an individual symbol of the string of symbols, and (iii) construct an identifier library comprising the plurality of identifiers.
- the device comprises a plurality of partitions and wherein the identifier library is generated in one or more of the plurality of partitions.
- the plurality of partitions comprises wells.
- constructing the individual identifier in the identifier library comprises assembling the one or more components from one or more layers and wherein each layer of the one or more layers comprises a distinct set of components.
- each layer of the one or more layers is stored in a separate portion of the device and wherein the device is configured to combine the one or more components from the one or more layers.
- the identifier library comprises a plurality of nucleic acid sequences.
- one or more identifier libraries are combined in a single area of the device and wherein each identifier library of the one or more identifier libraries comprises a distinct barcode.
- the present disclosure provides a system for reading information encoded in nucleic acid sequence(s), comprising: a database that stores an identifier library comprising a plurality of identifiers, wherein an individual identifier of the plurality of identifiers comprises one or more components, wherein an individual component of the one or more components comprises a nucleic acid sequence; and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to (i) identify the plurality of identifiers in the identifier library, (ii) generate a plurality of symbols from the plurality of identifiers identified in (i), wherein an individual symbol of the plurality of symbols corresponds to the individual identifier of the plurality of identifiers, and (iii) compile the information from the plurality of symbols.
- the system further comprises a plurality of partitions.
- the partitions are wells.
- a given partition of the plurality of partitions comprises one or more identifier libraries and wherein each identifier library of the one or more identifier libraries comprises a distinct barcode.
- the system further comprises a detection unit configured to identify the plurality of identifiers in the identifier library.
- FIG. 1 schematically illustrates an overview of a process for encoding, writing, accessing, reading, and decoding digital information stored in nucleic acid sequences;
- FIGS. 2 A and 2 B schematically illustrate an example method of encoding digital data, referred to as “data at address”, using objects or identifiers (e.g., nucleic acid molecules);
- FIG. 2 A illustrates combining a rank object (or address object) with a byte-value object (or data object) to create an identifier;
- FIG. 2 B illustrates an embodiment of the data at address method wherein the rank objects and byte-value objects are themselves combinatorial concatenations of other objects;
- FIGS. 3 A and 3 B schematically illustrate an example method of encoding digital information using objects or identifiers (e.g., nucleic acid sequences);
- FIG. 3 A illustrates encoding digital information using a rank object as an identifier;
- FIG. 3 B illustrates an embodiment of the encoding method wherein the address objects are themselves combinatorial concatenations of other objects;
- FIG. 4 shows a contour plot, in log space, of a relationship between the combinatorial space of possible identifiers (C, x-axis) and the average number of identifiers (k, y-axis) that may be constructed to store information of a given size (contour lines);
- FIG. 5 schematically illustrates an overview of a method for writing information to nucleic acid sequences (e.g., deoxyribonucleic acid);
- FIGS. 6 A and 6 B illustrate an example method, referred to as the “product scheme”, for constructing identifiers (e.g., nucleic acid molecules) by combinatorially assembling distinct components (e.g., nucleic acid sequences);
- FIG. 6 A illustrates the architecture of identifiers constructed using the product scheme;
- FIG. 6 B illustrates an example of the combinatorial space of identifiers that may be constructed using the product scheme;
- FIG. 7 schematically illustrates the use of overlap extension polymerase chain reaction to construct identifiers (e.g., nucleic acid molecules) from components (e.g., nucleic acid sequences);
- FIG. 8 schematically illustrates the use of sticky end ligation to construct identifiers (e.g., nucleic acid molecules) from components (e.g., nucleic acid sequences);
- FIG. 9 schematically illustrates the use of recombinase assembly to construct identifiers (e.g., nucleic acid molecules) from components (e.g., nucleic acid sequences);
- FIGS. 10 A and 10 B demonstrate template directed ligation;
- FIG. 10 A schematically illustrates the use of template directed ligation to construct identifiers (e.g., nucleic acid molecules) from components (e.g., nucleic acid sequences);
- FIG. 10 B shows a histogram of the copy numbers (abundances) of 256 distinct nucleic acid sequences that were each combinatorially assembled from six nucleic acid sequences (e.g., components) in one pooled template directed ligation reaction;
- FIGS. 11 A- 11 G schematically illustrate an example method, referred to as the “permutation scheme”, for constructing identifiers (e.g., nucleic acid molecules) with permuted components (e.g., nucleic acid sequences);
- FIG. 11 A illustrates the architecture of identifiers constructed using the permutation scheme;
- FIG. 11 B illustrates an example of the combinatorial space of identifiers that may be constructed using the permutation scheme;
- FIG. 11 C shows an example implementation of the permutation scheme with template directed ligation;
- FIG. 11 D shows an example of how the implementation from FIG. 11 C may be modified to construct identifiers with permuted and repeated components;
- FIG. 11 E shows how the example implementation from FIG. 11 D may lead to unwanted byproducts that may be removed with nucleic acid size selection;
- FIG. 11 F shows another example of how to use template directed ligation and size selection to construct identifiers with permuted and repeated components;
- FIG. 11 G shows an example of when size selection may fail to isolate a particular identifier from unwanted byproducts;
- FIGS. 12 A- 12 D schematically illustrate an example method, referred to as the “MchooseK” scheme, for constructing identifiers (e.g., nucleic acid molecules) with any number, K, of assembled components (e.g., nucleic acid sequences) out of a larger number, M, of possible components;
- FIG. 12 A illustrates the architecture of identifiers constructed using the MchooseK scheme;
- FIG. 12 B illustrates an example of the combinatorial space of identifiers that may be constructed using the MchooseK scheme;
- FIG. 12 C shows an example implementation of the MchooseK scheme using template directed ligation;
- FIG. 12 D shows how the example implementation from FIG. 12 C may lead to unwanted byproducts that may be removed with nucleic acid size selection;
- FIGS. 13 A and 13 B schematically illustrate an example method, referred to as the “partition scheme”, for constructing identifiers with partitioned components;
- FIG. 13 A shows an example of the combinatorial space of identifiers that may be constructed using the partition scheme;
- FIG. 13 B shows an example implementation of the partition scheme using template directed ligation;
- FIGS. 14 A and 14 B schematically illustrate an example method, referred to as the “unconstrained string” (or USS) scheme, for constructing identifiers made up of any string of components from a number of possible components;
- FIG. 14 A shows an example of the combinatorial space of identifiers that may be constructed using the USS scheme;
- FIG. 14 B shows an example implementation of the USS scheme using template directed ligation;
- FIGS. 15 A and 15 B schematically illustrate an example method, referred to as “component deletion”, for constructing identifiers by removing components from a parent identifier;
- FIG. 15 A shows an example of the combinatorial space of identifiers that may be constructed using the component deletion scheme;
- FIG. 15 B shows an example implementation of the component deletion scheme using double stranded targeted cleavage and repair;
- FIG. 16 schematically illustrates a parent identifier with recombinase recognition sites where further identifiers may be constructed by applying recombinases to the parent identifier;
- FIGS. 17 A- 17 C schematically illustrate an overview of example methods for accessing portions of information stored in nucleic acid sequences by accessing a number of particular identifiers from a larger number of identifiers;
- FIG. 17 A shows example methods for using polymerase chain reaction, affinity tagged probes, and degradation targeting probes to access identifiers containing a specified component;
- FIG. 17 B shows example methods for using polymerase chain reaction to perform ‘OR’ or ‘AND’ operations to access identifiers containing multiple specified components;
- FIG. 17 C shows example methods for using affinity tags to perform ‘OR’ or ‘AND’ operations to access identifiers containing multiple specified components;
- FIGS. 18 A and 18 B show examples of encoding, writing, and reading data encoded in nucleic acid molecules;
- FIG. 18 A shows an example of encoding, writing, and reading 5,856 bits of data;
- FIG. 18 B shows an example of encoding, writing, and reading 62,824 bits of data;
- FIG. 19 shows a computer system that is programmed or otherwise configured to implement methods provided herein.
- symbol generally refers to a representation of a unit of digital information. Digital information may be divided or translated into a string of symbols. In an example, a symbol may be a bit and the bit may have a value of ‘0’ or ‘1’.
- a distinct, or unique, nucleic acid sequence may be a nucleic acid sequence that does not have the same sequence as any other nucleic acid sequence.
- a distinct, or unique, nucleic acid molecule may not have the same sequence as any other nucleic acid molecule.
- the distinct, or unique, nucleic acid sequence or molecule may share regions of similarity with another nucleic acid sequence or molecule.
- component generally refers to a nucleic acid sequence.
- a component may be a distinct nucleic acid sequence.
- a component may be concatenated or assembled with one or more other components to generate other nucleic acid sequences or molecules.
- layer generally refers to a group or pool of components. Each layer may comprise a set of distinct components such that the components in one layer are different from the components in another layer. Components from one or more layers may be assembled to generate one or more identifiers.
- identifier generally refers to a nucleic acid molecule or a nucleic acid sequence that represents the position and value of a bit-string within a larger bit-string. More generally, an identifier may refer to any object that represents or corresponds to a symbol in a string of symbols. In some embodiments, identifiers may comprise one or multiple concatenated components.
- combinatorial space generally refers to the set of all possible distinct identifiers that may be generated from a starting set of objects, such as components, and a permissible set of rules for how to modify those objects to form identifiers.
- the size of a combinatorial space of identifiers made by assembling or concatenating components may depend on the number of layers of components, the number of components in each layer, and the particular assembly method used to generate the identifiers.
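To show how the size of a combinatorial space depends on the layer structure and the assembly rules, the sketch below counts identifiers under two simplified rule sets. Both formulas are assumptions made for illustration: the first takes one component from each layer in a fixed order, and the second treats an identifier as an unordered selection of K components out of M possible components. Other assembly rules would give different counts.

```python
from math import comb, prod


def fixed_order_space(layer_sizes):
    """Count identifiers with one component from each layer, in layer order."""
    return prod(layer_sizes)


def choose_k_space(m, k):
    """Count identifiers formed as unordered selections of k out of m components."""
    return comb(m, k)


# Three hypothetical layers holding 2, 3, and 2 components.
three_layer_space = fixed_order_space([2, 3, 2])

# Any 2 components chosen out of 5 possible components.
five_choose_two_space = choose_k_space(5, 2)
```

Adding a layer multiplies the first count by that layer's size, which is why a modest number of components can address a large number of identifiers.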
- identifier rank generally refers to a relation that defines the order of identifiers in a set.
- identifier library generally refers to a collection of identifiers corresponding to the symbols in a symbol string representing digital information. In some embodiments, the absence of a given identifier in the identifier library may indicate a symbol value at a particular position.
- One or more identifier libraries may be combined in a pool, group, or set of identifiers. Each identifier library may include a unique barcode that identifies the identifier library.
- nucleic acid generally refers to deoxyribonucleic acid (DNA), ribonucleic acid (RNA), or a variant thereof.
- a nucleic acid may include one or more subunits selected from adenosine (A), cytosine (C), guanine (G), thymine (T), and uracil (U), or variants thereof.
- a nucleotide can include A, C, G, T, or U, or variants thereof.
- a nucleotide can include any subunit that can be incorporated into a growing nucleic acid strand.
- Such subunit can be A, C, G, T, or U, or any other subunit that may be specific to one or more complementary A, C, G, T, or U, or complementary to a purine (i.e., A or G, or variant thereof) or pyrimidine (i.e., C, T, or U, or variant thereof).
- a nucleic acid may be single-stranded or double-stranded. In some cases, a nucleic acid is circular.
- nucleic acid molecule or “nucleic acid sequence,” as used herein, generally refer to a polymeric form of nucleotides, or polynucleotide, that may have various lengths, either deoxyribonucleotides (DNA) or ribonucleotides (RNA), or analogs thereof.
- nucleic acid sequence may refer to the alphabetical representation of a polynucleotide; alternatively, the term may be applied to the physical polynucleotide itself. This alphabetical representation can be input into databases in a computer having a central processing unit and used for mapping nucleic acid sequences or nucleic acid molecules to symbols, or bits, encoding digital information.
- Nucleic acid sequences or oligonucleotides may include one or more non-standard nucleotide(s), nucleotide analog(s) and/or modified nucleotides.
- oligonucleotide generally refers to a single-stranded nucleic acid sequence, and is typically composed of a specific sequence of four nucleotide bases: adenine (A); cytosine (C); guanine (G), and thymine (T) or uracil (U) when the polynucleotide is RNA.
- modified nucleotides include, but are not limited to, diaminopurine, 5-fluorouracil, 5-bromouracil, 5-chlorouracil, 5-iodouracil, hypoxanthine, xanthine, 4-acetylcytosine, 5-(carboxyhydroxylmethyl)uracil, 5-carboxymethylaminomethyl-2-thiouridine, 5-carboxymethylaminomethyluracil, dihydrouracil, beta-D-galactosylqueosine, inosine, N6-isopentenyladenine, 1-methylguanine, 1-methylinosine, 2,2-dimethylguanine, 2-methyladenine, 2-methylguanine, 3-methylcytosine, 5-methylcytosine, N6-adenine, 7-methylguanine, 5-methylaminomethyluracil, 5-methoxyaminomethyl-2-thiouracil, beta-D-mannosylqueosine, 5′-methoxycarboxymethyl
- Nucleic acid molecules may also be modified at the base moiety (e.g., at one or more atoms that typically are available to form a hydrogen bond with a complementary nucleotide and/or at one or more atoms that are not typically capable of forming a hydrogen bond with a complementary nucleotide), sugar moiety or phosphate backbone.
- Nucleic acid molecules may also contain amine-modified groups, such as aminoallyl-dUTP (aa-dUTP) and aminohexylacrylamide-dCTP (aha-dCTP), to allow covalent attachment of amine-reactive moieties, such as N-hydroxysuccinimide (NHS) esters.
- primer generally refers to a strand of nucleic acid that serves as a starting point for nucleic acid synthesis, such as polymerase chain reaction (PCR).
- an enzyme that catalyzes replication starts replication at the 3′-end of a primer attached to the DNA sample and copies the opposite strand.
- polymerase or “polymerase enzyme,” as used herein, generally refers to any enzyme capable of catalyzing a polymerase reaction.
- examples of polymerases include, without limitation, a nucleic acid polymerase.
- the polymerase can be naturally occurring or synthesized.
- An example polymerase is a Φ29 polymerase or derivative thereof.
- a transcriptase or a ligase is used (i.e., enzymes which catalyze the formation of a bond) in conjunction with polymerases or as an alternative to polymerases to construct new nucleic acid sequences.
- examples of polymerases include a DNA polymerase, an RNA polymerase, a thermostable polymerase, a wild-type polymerase, a modified polymerase, E. coli DNA polymerase I, T7 DNA polymerase, bacteriophage T4 DNA polymerase, Φ29 (phi29) DNA polymerase, Taq polymerase, Tth polymerase, Tli polymerase, Pfu polymerase, Pwo polymerase, VENT polymerase, DEEPVENT polymerase, Ex-Taq polymerase, LA-Taq polymerase, Sso polymerase, Poc polymerase, Pab polymerase, Mth polymerase, ES4 polymerase, Tru polymerase, Tac polymerase, Tne polymerase, Tma polymerase, Tca polymerase, Tih polymerase, Tfi polymerase, Platinum Taq polymerases, Tbr polymerase, Tfl polymerase, Pfuturbo polymerase,
- Digital information, such as computer data, in the form of binary code can comprise a sequence or string of symbols.
- a binary code may encode or represent text or computer processor instructions using, for example, a binary number system having two binary symbols, typically 0 and 1, referred to as bits.
- Digital information may be represented in the form of non-binary code which can comprise a sequence of non-binary symbols.
- Each encoded symbol can be re-assigned to a unique bit string (or “byte”), and the unique bit string or byte can be arranged into strings of bytes or byte streams.
- a bit value for a given bit can be one of two symbols (e.g., 0 or 1).
- a byte, which can comprise a string of N bits, can have a total of 2^N unique byte-values.
- a byte comprising 8 bits can produce a total of 2^8 or 256 possible unique byte-values, and each of the 256 bytes can correspond to one of 256 possible distinct symbols, letters, or instructions which can be encoded with the bytes.
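The translation of raw data into a string of bits can be sketched as follows. This is a minimal illustration, not the disclosed method: the function names and the choice of UTF-8 text as input are assumptions made for the example.

```python
def text_to_bits(text: str) -> str:
    """Translate text into a bit string, 8 bits per byte (UTF-8 is an assumption)."""
    return "".join(f"{byte:08b}" for byte in text.encode("utf-8"))


def bits_to_text(bits: str) -> str:
    """Invert text_to_bits by regrouping the bits into 8-bit bytes."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


N = 8
print(2 ** N)          # number of unique byte-values for a byte of N = 8 bits: 256
bits = text_to_bits("DNA")
print(bits)            # 24 bits, one 8-bit byte per character
print(bits_to_text(bits))
```

The resulting bit string is the kind of symbol string that later steps map to identifiers.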
- Raw data e.g., text files and computer instructions
- Zip files, or compressed data files comprising raw data, can also be stored in byte streams. These files can be stored as byte streams in a compressed form and then decompressed into raw data before being read by the computer.
- Methods and systems of the present disclosure may be used to encode computer data or information in a plurality of identifiers, each of which may represent one or more bits of the original information.
- methods and systems of the present disclosure encode data or information using identifiers that each represent two bits of the original information.
- New methods can encode digital information (e.g., binary code) in a plurality of identifiers, or nucleic acid sequences, comprising combinatorial arrangements of components instead of relying on base-by-base or de-novo nucleic acid synthesis (e.g., phosphoramidite synthesis).
- new strategies may produce a first set of distinct nucleic acid sequences (or components) for the first request of information storage, and can thereafter reuse the same nucleic acid sequences (or components) for subsequent information storage requests.
- These approaches can significantly reduce the cost of DNA-based information storage by reducing the role of de-novo synthesis of nucleic acid sequences in the information-to-DNA encoding and writing process.
- new methods of information-to-DNA writing using identifier construction from components are highly parallelizable processes that do not necessarily use cyclical nucleic acid elongation.
- new methods may increase the speed of writing digital information to DNA compared to older methods.
- a method for encoding information into nucleic acid sequences may comprise (a) translating the information into a string of symbols, (b) mapping the string of symbols to a plurality of identifiers, and (c) constructing an identifier library comprising at least a subset of the plurality of identifiers.
- An individual identifier of the plurality of identifiers may comprise one or more components.
- An individual component of the one or more components may comprise a nucleic acid sequence.
- Each symbol at each position in the string of symbols may correspond to a distinct identifier.
- the individual identifier may correspond to an individual symbol at an individual position in the string of symbols.
- one symbol at each position in the string of symbols may correspond to the absence of an identifier.
- in a string of binary symbols (e.g., bits), each occurrence of ‘0’ may correspond to the absence of an identifier.
- a method for nucleic acid-based computer data storage may comprise (a) receiving computer data, (b) synthesizing nucleic acid molecules comprising nucleic acid sequences encoding the computer data, and (c) storing the nucleic acid molecules having the nucleic acid sequences.
- the computer data may be encoded in at least a subset of nucleic acid molecules synthesized and not in a sequence of each of the nucleic acid molecules.
- the present disclosure provides methods for writing and storing information in nucleic acid sequences.
- the method may comprise, (a) receiving or encoding a virtual identifier library that represents information, (b) physically constructing the identifier library, and (c) storing one or more physical copies of the identifier library in one or more separate locations.
- An individual identifier of the identifier library may comprise one or more components.
- An individual component of the one or more components may comprise a nucleic acid sequence.
- a method for nucleic acid-based computer data storage may comprise (a) receiving computer data, (b) synthesizing a nucleic acid molecule comprising at least one nucleic acid sequence encoding the computer data, and (c) storing the nucleic acid molecule comprising the at least one nucleic acid sequence. Synthesizing the nucleic acid molecule may be in the absence of base-by-base nucleic acid synthesis.
- a method for writing and storing information in nucleic acid sequences may comprise, (a) receiving or encoding a virtual identifier library that represents information, (b) physically constructing the identifier library, and (c) storing one or more physical copies of the identifier library in one or more separate locations.
- An individual identifier of the identifier library may comprise one or more components.
- An individual component of the one or more components may comprise a nucleic acid sequence.
- FIG. 1 illustrates an overview process for encoding information into nucleic acid sequences, writing information to the nucleic acid sequences, reading information written to nucleic acid sequences, and decoding the read information.
- Digital information, or data may be translated into one or more strings of symbols.
- the symbols are bits and each bit may have a value of either ‘0’ or ‘1’.
- Each symbol may be mapped, or encoded, to an object (e.g., identifier) representing that symbol.
- Each symbol may be represented by a distinct identifier.
- the distinct identifier may be a nucleic acid molecule made up of components.
- the components may be nucleic acid sequences.
- the digital information may be written into nucleic acid sequences by generating an identifier library corresponding to the information.
- the identifier library may be physically generated by physically constructing the identifiers that correspond to each symbol of the digital information. All or any portion of the digital information may be accessed at a time. In an example, a subset of identifiers is accessed from an identifier library. The subset of identifiers may be read by sequencing and identifying the identifiers. The identified identifiers may be associated with their corresponding symbol to decode the digital data.
- a method for encoding and reading information using the approach of FIG. 1 can, for example, include receiving a bit stream, mapping each one-bit (bit with bit-value of ‘1’) in the bit stream to a distinct nucleic acid identifier using an identifier rank or a nucleic acid index, and constructing a nucleic acid sample pool, or identifier library, comprising copies of the identifiers that correspond to bit values of 1 (and excluding identifiers for bit values of 0).
- Reading the sample can comprise using molecular biology methods (e.g., sequencing, hybridization, PCR, etc.), determining which identifiers are represented in the identifier library, and assigning bit-values of ‘1’ to the bits corresponding to those identifiers and bit-values of ‘0’ elsewhere (again referring to the identifier rank to identify the bits in the original bit-stream that each identifier corresponds to), thus decoding the information into the original encoded bit stream.
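The write and read steps just described can be sketched in a few lines. This is an illustrative model only: identifiers are represented by their integer ranks, the physical construction and sequencing steps are abstracted away, and the function names are hypothetical.

```python
def encode_bits(bits: str) -> set[int]:
    """Return the ranks of the identifiers to construct: one per '1' bit."""
    return {rank for rank, bit in enumerate(bits) if bit == "1"}


def decode_bits(present_ranks: set[int], length: int) -> str:
    """Assign '1' to every rank whose identifier was detected and '0' elsewhere."""
    return "".join("1" if rank in present_ranks else "0" for rank in range(length))


bit_stream = "0110100010"
library = encode_bits(bit_stream)              # only four identifiers are built
recovered = decode_bits(library, len(bit_stream))
print(sorted(library), recovered)
```

Note that the decoder needs the length of the original bit stream, because trailing ‘0’ bits leave no identifier behind; where that length is recorded is left open in this sketch.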
- Encoding a string of N distinct bits can use an equivalent number of unique nucleic acid sequences as possible identifiers.
- This approach to information encoding may use de-novo synthesis of identifiers (e.g., nucleic acid molecules) for each new item of information (string of N bits) to store.
- the cost of newly synthesizing identifiers (equivalent in number to or less than N) for each new item of information to store can be reduced by the one-time de-novo synthesis and subsequent maintenance of all possible identifiers, such that encoding new items of information may involve mechanically selecting and mixing together pre-synthesized (or pre-fabricated) identifiers to form an identifier library.
- both the cost of (1) de-novo synthesis of up to N identifiers for each new item of information to store or (2) maintaining and selecting from N possible identifiers for each new item of information to store, or any combination thereof may be reduced by synthesizing and maintaining a number (less than N, and in some cases much less than N) of nucleic acid sequences and then modifying these sequences through enzymatic reactions to generate up to N identifiers for each new item of information to store.
- the identifiers may be rationally designed and selected for ease of read, write, access, copy, and deletion operations.
- the identifiers may be designed and selected to minimize write errors, mutations, degradation, and read errors.
- FIGS. 2 A and 2 B schematically illustrate an example method, referred to as “data at address”, of encoding digital data in objects or identifiers (e.g., nucleic acid molecules).
- FIG. 2 A illustrates encoding a bit stream into an identifier library wherein the individual identifiers are constructed by concatenating or assembling a single component that specifies an identifier rank with a single component that specifies a byte-value.
- the data at address method uses identifiers that encode information modularly by comprising two objects: one object, the “byte-value object” (or “data object”), that identifies a byte-value and one object, the “rank object” (or “address object”), that identifies the identifier rank (or the relative position of the byte in the original bit-stream).
- FIG. 2 B illustrates an example of the data at address method wherein each rank object may be combinatorially constructed from a set of components and each byte-value object may be combinatorially constructed from a set of components.
- Such combinatorial construction of rank and byte-value objects enables more information to be written into identifiers than if the objects were made from the single components alone (e.g., FIG. 2 A ).
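A minimal sketch of the data at address idea, assuming each identifier is modeled as a pair of integers (rank object, byte-value object). The function names are hypothetical, and the combinatorial construction of each object from components (FIG. 2 B) is not modeled.

```python
import random


def encode_data_at_address(data: bytes) -> list[tuple[int, int]]:
    """Build one identifier per byte as a (rank object, byte-value object) pair."""
    return [(rank, value) for rank, value in enumerate(data)]


def decode_data_at_address(identifiers: list[tuple[int, int]]) -> bytes:
    """Sort identifiers by rank, then read off the byte-values in order."""
    return bytes(value for _rank, value in sorted(identifiers))


library = encode_data_at_address(b"DNA")
random.Random(0).shuffle(library)      # a pool of molecules has no inherent order
recovered = decode_data_at_address(library)
print(recovered)
```

Because every identifier carries its own rank, the original byte order can be recovered regardless of the order in which identifiers are read.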
- FIGS. 3 A and 3 B schematically illustrate another example method of encoding digital information in objects or identifiers (e.g., nucleic acid sequences).
- FIG. 3 A illustrates encoding a bit stream into an identifier library wherein identifiers are constructed from single components that specify identifier rank. The presence of an identifier at a particular rank (or address) specifies a bit-value of ‘1’ and the absence of an identifier at a particular rank (or address) specifies a bit-value of ‘0’.
- This type of encoding may use identifiers that solely encode rank (the relative position of a bit in the original bit stream) and use the presence or absence of those identifiers in an identifier library to encode a bit-value of ‘1’ or ‘0’, respectively. Reading and decoding the information may include identifying the identifiers present in the identifier library, assigning bit-values of ‘1’ to their corresponding ranks and assigning bit-values of ‘0’ elsewhere.
- FIG. 3 B illustrates an example encoding method where each identifier may be combinatorially constructed from a set of components such that each possible combinatorial construction specifies a rank.
- a component set may comprise five distinct components.
- the five distinct components may be assembled to generate ten distinct identifiers, each comprising two of the five components.
- the ten distinct identifiers may each have a rank (or address) that corresponds to the position of a bit in a bit stream.
- An identifier library may include the subset of those ten possible identifiers that corresponds to the positions of bit-value ‘1’, and exclude the subset of those ten possible identifiers that corresponds to the positions of the bit-value ‘0’ within a bit stream of length ten.
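The five-component example can be reproduced with a short sketch. The component labels A to E and the lexicographic assignment of ranks to component pairs are assumptions made for illustration; the source does not fix a particular ordering here.

```python
from itertools import combinations

components = ["A", "B", "C", "D", "E"]            # hypothetical component labels
identifiers = list(combinations(components, 2))   # all pairs, in lexicographic order

bit_stream = "0110100010"                         # example bit stream of length ten
# Construct only the identifiers whose rank holds a '1'.
library = [ident for ident, bit in zip(identifiers, bit_stream) if bit == "1"]

# Reading: presence of an identifier means '1', absence means '0'.
decoded = "".join("1" if ident in library else "0" for ident in identifiers)
print(len(identifiers), library, decoded)
```

Five maintained components are enough to address ten bit positions in this scheme.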
- FIG. 4 shows a contour plot, in log space, of a relationship between the combinatorial space of possible identifiers (C, x-axis) and the average number of identifiers (k, y-axis) to be physically constructed in order to store information of a given original size in bits (D, contour lines) using the encoding method shown in FIGS. 3 A and 3 B .
- This plot assumes that the original information of size D is re-coded into a string of C bits (where C may be greater than D) where a number of bits, k, has a bit-value of ‘1’.
- the plot assumes that information-to-nucleic-acid encoding is performed on the re-coded bit string and that identifiers for positions where the bit-value is ‘1’ are constructed and identifiers for positions where the bit-value is ‘0’ are not constructed.
- C choose k, written $\binom{C}{k} = \frac{C!}{k!\,(C-k)!}$, may be the mathematical formula for the number of ways to pick k unordered outcomes from C possibilities.
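The relationship between C, k, and D can be explored numerically. The sketch below rests on an assumption supplied here rather than quoted from the figure: if exactly k of C positions hold a ‘1’, there are C choose k distinguishable libraries, so at most log2(C choose k) bits can be stored. The function names are hypothetical.

```python
import math
from typing import Optional


def capacity_bits(C: int, k: int) -> float:
    """Upper bound on bits storable by choosing k identifiers out of C."""
    return math.log2(math.comb(C, k))


def min_identifiers(C: int, D: int) -> Optional[int]:
    """Smallest k with (C choose k) >= 2**D, or None if no k suffices."""
    for k in range(C + 1):
        if math.comb(C, k) >= 2 ** D:
            return k
    return None


print(capacity_bits(10, 5))      # just under 8 bits, since 10 choose 5 = 252
print(min_identifiers(10, 5))    # 2, because 10 choose 2 = 45 >= 32
print(min_identifiers(10, 8))    # None, because 252 < 256
```

Enlarging the combinatorial space C lowers the number of identifiers k that must be physically constructed for a given D, which is the trade-off the contour plot depicts.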
- FIG. 5 shows an overview method for writing information into nucleic acid sequences.
- Prior to writing the information, the information may be translated into a string of symbols and encoded into a plurality of identifiers.
- Writing the information may include setting up reactions to produce possible identifiers.
- a reaction may be set up by depositing inputs into a compartment.
- the inputs may comprise nucleic acids, components, templates, enzymes, or chemical reagents.
- the compartment may be a well, a tube, a position on a surface, a chamber in a microfluidic device, or a droplet within an emulsion.
- Multiple reactions may be set up in multiple compartments. Reactions may proceed to produce identifiers through programmed temperature incubation or cycling.
- Reactions may be selectively or ubiquitously removed (e.g., deleted). Reactions may also be selectively or ubiquitously interrupted, consolidated, and purified to collect their identifiers in one pool. Identifiers from multiple identifier libraries may be collected in the same pool.
- An individual identifier may include a barcode or a tag to identify to which identifier library it belongs. Alternatively, or in addition to, the barcode may include metadata for the encoded information.
- Supplemental nucleic acids or identifiers may also be included in an identifier pool together with an identifier library. The supplemental nucleic acids or identifiers may include metadata for the encoded information or serve to obfuscate or conceal the encoded information.
- An identifier rank (e.g., nucleic acid index) can comprise a method or key for determining the ordering of identifiers.
- the method can comprise a look-up table with all identifiers and their corresponding rank.
- the method can also comprise a look-up table with the rank of all components that constitute identifiers and a function for determining the ordering of any identifier comprising a combination of those components.
- Such a method may be referred to as lexicographical ordering and may be analogous to the manner in which words in a dictionary are alphabetically ordered.
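One way to realize such a function is a mixed-radix calculation over per-layer component ranks. The sketch below assumes identifiers built from one component per layer in a fixed order; the layer contents and function names are hypothetical.

```python
layers = [
    ["X1", "X2", "X3"],   # position in each list is the component's rank
    ["Y1", "Y2", "Y3"],
    ["Z1", "Z2", "Z3"],
]


def identifier_rank(identifier, layers):
    """Lexicographic rank of an identifier given one component per layer."""
    rank = 0
    for component, layer in zip(identifier, layers):
        rank = rank * len(layer) + layer.index(component)
    return rank


def identifier_at(rank, layers):
    """Inverse mapping: recover the identifier that holds a given rank."""
    picked = []
    for layer in reversed(layers):
        rank, index = divmod(rank, len(layer))
        picked.append(layer[index])
    return tuple(reversed(picked))


print(identifier_rank(("X1", "Y2", "Z1"), layers))   # 3
print(identifier_at(26, layers))                     # ('X3', 'Y3', 'Z3')
```

Only the component ranks need to be stored, so the table grows with the number of components rather than with the number of identifiers.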
- the identifier rank (encoded by the rank object of the identifier) may be used to determine the position of a byte (encoded by the byte-value object of the identifier) within a bit stream.
- the identifier rank (encoded by the entire identifier itself) for a present identifier may be used to determine the position of bit-value of ‘1’ within a bit stream.
- a key may assign distinct bytes to unique subsets of identifiers (e.g., nucleic acid molecules) within a sample. For example, in a simple form, a key may assign each bit in a byte to a unique nucleic acid sequence that specifies the position of the bit, and then the presence or absence of that nucleic acid sequence within a sample may specify the bit-value of 1 or 0, respectively.
- Reading the encoded information from the nucleic acid sample can comprise any number of molecular biology techniques including sequencing, hybridization, or PCR. In some embodiments, reading the encoded dataset may comprise reconstructing a portion of the dataset or reconstructing the entire encoded dataset from each nucleic acid sample.
- the nucleic acid index can be used along with the presence or absence of a unique nucleic acid sequence and the nucleic acid sample can be decoded into a bit stream (e.g., each string of bits, byte, bytes, or string of bytes).
- Identifiers may be constructed by combinatorially assembling component nucleic acid sequences.
- information may be encoded by taking a set of nucleic acid molecules (e.g., identifiers) from a defined group of molecules (e.g., combinatorial space).
- Each possible identifier of the defined group of molecules may be an assembly of nucleic acid sequences (e.g., components) from a prefabricated set of components that may be divided into layers.
- Each nucleic acid sequence from X can be assembled to each nucleic acid sequence from Y.
- the total number of nucleic acid sequences maintained in the two sets may be the sum of x and y.
- the total number of nucleic acid molecules, and hence possible identifiers, that can be generated may be the product of x and y.
- Even more nucleic acid sequences (e.g., identifiers) can be generated if the sequences from X can be assembled to the sequences of Y in any order.
- the number of nucleic acid sequences (e.g., identifiers) generated may be twice the product of x and y if the assembly order is programmable.
- This set of all possible nucleic acid sequences that can be generated may be referred to as XY.
- the order of the assembled units of unique nucleic acid sequences in XY can be controlled using nucleic acids with distinct 5′ and 3′ ends, and restriction digestion, ligation, polymerase chain reaction (PCR), and sequencing may occur with respect to the distinct 5′ and 3′ ends of the sequences.
- two layers of 10 distinct nucleic acid molecules may be assembled in a fixed order to produce 10*10 or 100 distinct nucleic acid molecules (e.g., identifiers), or one layer of 5 distinct nucleic acid molecules (e.g., components) and another layer of 10 distinct nucleic acid molecules (e.g., components) may be assembled in any order to produce 100 distinct nucleic acid molecules (e.g., identifiers).
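The counts in this example can be checked by direct enumeration. The sketch below models each identifier as an ordered pair of component labels; the labels and the function name are hypothetical.

```python
from itertools import product


def combinatorial_space(X, Y, any_order=False):
    """Enumerate identifiers assembled from one X and one Y component."""
    space = {(a, b) for a, b in product(X, Y)}
    if any_order:
        # Programmable assembly order also allows the Y component to come first.
        space |= {(b, a) for a, b in product(X, Y)}
    return space


X10 = [f"X{i}" for i in range(1, 11)]
Y10 = [f"Y{i}" for i in range(1, 11)]
X5 = X10[:5]

fixed = combinatorial_space(X10, Y10)
flexible = combinatorial_space(X5, Y10, any_order=True)
print(len(X10) + len(Y10), len(fixed))       # 20 sequences maintained, 100 identifiers
print(len(X5) + len(Y10), len(flexible))     # 15 sequences maintained, 100 identifiers
```

Both configurations reach 100 identifiers, but the programmable-order case does so while maintaining fewer component sequences.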
- Nucleic acid sequences within each layer may comprise a unique (or distinct) sequence, or barcode, in the middle, a common hybridization region on one end, and another common hybridization region on the other end.
- the barcodes may be designed to be randomly generated. Alternatively, the barcodes may be designed to avoid sequences that may create complications to the construction chemistry of identifiers or sequencing. Additionally, barcodes may be designed so that each may have a minimum Hamming distance from the other barcodes, thereby decreasing the likelihood that base-resolution mutations or read errors may interfere with the proper identification of the barcode.
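A simple way to obtain barcodes with a guaranteed minimum Hamming distance is greedy rejection sampling, sketched below. The selection strategy, parameter values, and function names are assumptions made for illustration; the sketch does not screen for sequences that complicate construction chemistry or sequencing.

```python
import random


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(base_a != base_b for base_a, base_b in zip(a, b))


def pick_barcodes(count, length, min_distance, seed=0, max_attempts=100_000):
    """Greedily accept random barcodes that stay far enough from all accepted ones."""
    rng = random.Random(seed)
    chosen = []
    for _ in range(max_attempts):
        if len(chosen) == count:
            return chosen
        candidate = "".join(rng.choice("ACGT") for _ in range(length))
        if all(hamming(candidate, other) >= min_distance for other in chosen):
            chosen.append(candidate)
    raise RuntimeError("could not find enough barcodes; relax the constraints")


barcodes = pick_barcodes(count=8, length=10, min_distance=4)
print(barcodes)
```

With a minimum distance of 4, a single substitution in a read still leaves the barcode closer to its original than to any other barcode in the set.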
- the hybridization region on one end of the nucleic acid sequence may be different in each layer, but the hybridization region may be the same for each member within a layer.
- Adjacent layers are those that have complementary hybridization regions on their components that allow them to interact with one another.
- any component from layer X may be able to attach to any component from layer Y because they may have complementary hybridization regions.
- the hybridization region on the opposite end may serve the same purpose as the hybridization region on the first end.
- any component from layer Y may attach to any component of layer X on one end and any component of layer Z on the opposite end.
- FIGS. 6 A and 6 B illustrate an example method, referred to as the “product scheme”, for constructing identifiers (e.g., nucleic acid molecules) by combinatorially assembling a distinct component (e.g., nucleic acid sequence) from each layer in a fixed order.
- FIG. 6 A illustrates the architecture of identifiers constructed using the product scheme. An identifier may be constructed by combining a single component from each layer in a fixed order. For M layers, each with N components, there are N^M possible identifiers.
- FIG. 6 B illustrates an example of the combinatorial space of identifiers that may be constructed using the product scheme. In an example, a combinatorial space may be generated from three layers each comprising three distinct components. The components may be combined such that one component from each layer may be combined in a fixed order. The entire combinatorial space for this assembly method may comprise twenty-seven possible identifiers.
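The combinatorial space of the product scheme is a Cartesian product over the layers, which can be enumerated directly. The component labels below are hypothetical.

```python
from itertools import product

layers = {
    "X": ["X1", "X2", "X3"],
    "Y": ["Y1", "Y2", "Y3"],
    "Z": ["Z1", "Z2", "Z3"],
}

# One component from each layer, always in the fixed order X, Y, Z.
space = ["".join(combo) for combo in product(*layers.values())]

N, M = 3, len(layers)
print(len(space), N ** M)        # 27 27
print(space[0], space[-1])       # X1Y1Z1 X3Y3Z3
```

Nine maintained components therefore give access to twenty-seven identifiers, and each additional layer multiplies the space by the size of that layer.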
- FIGS. 7 - 10 illustrate chemical methods for implementing the product scheme (see FIG. 6 ). Methods depicted in FIGS. 7 - 10 , along with any other methods for assembling two or more distinct components in a fixed order may be used, for example, to produce any one or more identifiers in an identifier library. Identifiers may be constructed using any of the implementation methods described in FIGS. 7 - 10 , at any time during the methods or systems disclosed herein. In some instances, all or a portion of the combinatorial space of possible identifiers may be constructed before digital information is encoded or written, and then the writing process may involve mechanically selecting and pooling the identifiers (that encode the information) from the already existing set. In other instances, the identifiers may be constructed after one or more steps of the data encoding or writing process may have occurred (i.e., as information is being written).
- Enzymatic reactions may be used to assemble components from the different layers or sets. Assembly can occur in a one pot reaction because components (e.g., nucleic acid sequences) of each layer have specific hybridization or attachment regions for components of adjacent layers. For example, a nucleic acid sequence (e.g., component) X1 from layer X, a nucleic acid sequence Y1 from layer Y, and a nucleic acid sequence Z1 from layer Z may form the assembled nucleic acid molecule (e.g., identifier) X1Y1Z1. Additionally, multiple nucleic acid molecules (e.g., identifiers) may be assembled in one reaction by including multiple nucleic acid sequences from each layer.
- Including both Y1 and Y2 in the one pot reaction of the previous example may yield two assembled products (e.g., identifiers), X1Y1Z1 and X1Y2Z1.
- This reaction multiplexing may be used to speed up writing time for the plurality of identifiers that are physically constructed.
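Reaction multiplexing can be modeled as taking the Cartesian product of the components placed in one compartment. The sketch below assumes idealized chemistry in which every compatible combination forms; the function names are hypothetical. Under that assumption, a set of target identifiers fits in a single reaction only if it equals the full product of the components it uses.

```python
from itertools import product


def one_pot_products(*layer_inputs):
    """All identifiers expected when the given components share one reaction."""
    return set(product(*layer_inputs))


def fits_one_pot(targets):
    """True if the targets are exactly the product of the components they use."""
    layer_inputs = [sorted(set(layer)) for layer in zip(*targets)]
    return set(targets) == one_pot_products(*layer_inputs)


print(one_pot_products(["X1"], ["Y1", "Y2"], ["Z1"]))
print(fits_one_pot([("X1", "Y1", "Z1"), ("X1", "Y2", "Z1")]))   # True
print(fits_one_pot([("X1", "Y1", "Z1"), ("X2", "Y2", "Z1")]))   # False
```

In the second case, pooling X1, X2, Y1, and Y2 would also produce X1Y2Z1 and X2Y1Z1, so those two targets would need separate compartments.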
- Assembly of the nucleic acid sequences may be performed in a time period that is less than or equal to about 1 day, 12 hours, 10 hours, 9 hours, 8 hours, 7 hours, 6 hours, 5 hours, 4 hours, 3 hours, 2 hours, or 1 hour.
- the accuracy of the encoded data may be at least about or equal to about 90%, 95%, 96%, 97%, 98%, 99%, or greater.
- Identifiers may be constructed in accordance with the product scheme using overlap extension polymerase chain reaction (OEPCR), as illustrated in FIG. 7 .
- Each component in each layer may comprise a double-stranded or single-stranded (as depicted in the figure) nucleic acid sequence with a common hybridization region on the sequence end that may be homologous and/or complementary to the common hybridization region on the sequence end of components from an adjacent layer.
- An individual identifier may be constructed by concatenating one component (e.g., unique sequence) from a layer X (or layer 1) comprising components X1-XA, a second component (e.g., unique sequence) from a layer Y (or layer 2) comprising Y1-YA, and a third component (e.g., unique sequence) from layer Z (or layer 3) comprising Z1-ZB.
- the components from layer X may have a 3′ end that shares complementarity with the 3′ end on components from layer Y.
- single-stranded components from layer X and Y may be annealed together at the 3′ end and may be extended using PCR to generate a double-stranded nucleic acid molecule.
- the generated double-stranded nucleic-acid molecule may be melted to generate a 3′ end that shares complementarity with a 3′ end of a component from layer Z.
- a component from layer Z may be annealed with the generated nucleic acid molecule and may be extended to generate a unique identifier comprising a single component from layers X, Y, and Z in a fixed order.
- DNA size selection e.g., with gel extraction
- Identifiers may be assembled in accordance with the product scheme using sticky end ligation, as illustrated in FIG. 8 .
- Three layers, each comprising double stranded components e.g., double stranded DNA (dsDNA)
- identifiers comprising one component from the layer X (or layer 1) comprising components X1-XA, a second component from the layer Y (or layer 2) comprising Y1-YB, and a third component from the layer Z (or layer 3) comprising Z1-ZC.
- the components in layer X can comprise a common 3′ overhang, FIG. 8 labeled a, and the components in layer Y can comprise a common, complementary 3′ overhang, a*.
- the elements in layer Y can comprise a common 3′ overhang, FIG. 8 labeled b, and the elements in layer Z can comprise a common, complementary 3′ overhang, b*.
- the 3′ overhang in layer X components can be complementary to the 3′ end in layer Y components and the other 3′ overhang in layer Y components can be complementary to the 3′ end in layer Z components allowing the components to hybridize and ligate.
- a single component from layer Y can ligate to a single component of layer X and a single component of layer Z, ensuring the formation of a complete identifier.
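The way complementary overhangs enforce a fixed layer order can be sketched with labels standing in for the overhang sequences. Everything below is an illustrative model: the tuple layout, the use of None for an outer end without an assembly overhang, and the function names are assumptions.

```python
from itertools import product

# (left overhang, component name, right overhang); None marks an outer end
# with no assembly overhang (an assumption of this model).
LAYER_X = [(None, "X1", "a"), (None, "X2", "a")]
LAYER_Y = [("a*", "Y1", "b"), ("a*", "Y2", "b")]
LAYER_Z = [("b*", "Z1", None)]


def complement(label):
    """Label of the complementary overhang: a <-> a*, b <-> b*."""
    return label[:-1] if label.endswith("*") else label + "*"


def can_ligate(left, right):
    """Two components join only if their facing overhangs are complementary."""
    return left[2] is not None and right[0] == complement(left[2])


def assemble(*layers):
    """List identifiers in which every adjacent pair of components can ligate."""
    identifiers = []
    for combo in product(*layers):
        if all(can_ligate(l, r) for l, r in zip(combo, combo[1:])):
            identifiers.append("".join(name for _, name, _ in combo))
    return identifiers


print(assemble(LAYER_X, LAYER_Y, LAYER_Z))   # four X-Y-Z identifiers
print(assemble(LAYER_X, LAYER_Z))            # [] because a does not pair with b*
```

Because layer X can pair only with layer Y, and layer Y only with layers X and Z, a complete product necessarily contains one component from each layer in the order X, Y, Z.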
- DNA size selection (for example, with gel extraction) or PCR with primers flanking the outermost layers may be implemented to isolate identifier products from other byproducts that may form in the reaction.
- the sticky ends for sticky end ligation may be generated by treating the components of each layer with restriction endonucleases.
- the components of multiple layers may be generated from one “parent” set of components.
- a single parent set of double-stranded components may have complementary restriction sites on each end (e.g., restriction sites for BamHI and BglII). Any two components may be selected for assembly, and individually digested with one or the other complementary restriction enzymes (e.g., BglII or BamHI) resulting in complementary sticky ends that can be ligated together resulting in an inert scar.
- the product nucleic acid sequence may comprise the complementary restriction sites on each end (e.g., BamHI on the 5′ end and BglII on the 3′ end), and can be further ligated to another component from the parent set following the same process. This process may cycle indefinitely. If the parent comprises N components, then each cycle may be equivalent to adding an extra layer of N components to the product scheme.
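The cyclic assembly from a parent set can be illustrated with a top-strand string model. The recognition sites and cut positions of BamHI (G^GATCC) and BglII (A^GATCT) are standard; the payload sequences, the function names, and the reduction of digestion and ligation to string slicing are assumptions made for this sketch.

```python
BAMHI = "GGATCC"   # BamHI site, cut after the first G (G^GATCC)
BGLII = "AGATCT"   # BglII site, cut after the first A (A^GATCT)


def make_component(payload: str) -> str:
    """Parent-set component: BamHI site on the 5' end, BglII site on the 3' end."""
    return BAMHI + payload + BGLII


def ligate(left: str, right: str) -> str:
    """Digest left with BglII and right with BamHI, then join the GATC sticky ends."""
    assert left.endswith(BGLII) and right.startswith(BAMHI)
    left_cut = left[:-5]     # top strand now ends in ...A
    right_cut = right[1:]    # top strand now starts with GATCC...
    return left_cut + right_cut   # junction reads AGATCC, cut by neither enzyme


parts = [make_component(p) for p in ("ACGTACGT", "TTGCAATG", "CCAATTGG")]
assembled = parts[0]
for part in parts[1:]:
    assembled = ligate(assembled, part)   # each cycle adds one more layer

print(assembled)
print(assembled.count(BAMHI), assembled.count(BGLII), assembled.count("AGATCC"))
```

The assembled product keeps exactly one BamHI site and one BglII site at its outer ends, so it can enter another cycle, while each internal junction is a scar that neither enzyme recognizes.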
- a method for using ligation to construct a sequence of nucleic acids comprising elements from set X (e.g., set 1 of dsDNA) and elements from set Y (e.g., set 2 of dsDNA) can comprise the steps of obtaining or constructing two or more pools (e.g., set 1 of dsDNA and set 2 of dsDNA) of double stranded sequences wherein a first set (e.g., set 1 of dsDNA) comprises a sticky end (e.g., a) and a second set (e.g., set 2 of dsDNA) comprises a sticky end (e.g., a*) that is complementary to the sticky end of the first set.
- Any DNA from the first set (e.g., set 1 of dsDNA) may be combined with any subset of DNA from the second set (e.g., set 2 of dsDNA).
- Identifiers may be assembled in accordance with the product scheme using site specific recombination, as illustrated in FIG. 9 .
- Identifiers may be constructed by assembling components from three different layers.
- the components in layer X (or layer 1) may comprise double-stranded molecules with an attB x recombinase site on one side of the molecule
- components from layer Y (or layer 2) may comprise double-stranded molecules with an attP x recombinase site on one side and an attB y recombinase site on the other side
- components in layer Z (or layer 3) may comprise an attP y recombinase site on one side of the molecule.
- AttB and attP sites within a pair are capable of recombining in the presence of their corresponding recombinase enzyme.
- One component from each layer may be combined such that one component from layer X associates with one component from layer Y, and one component from layer Y associates with one component from layer Z.
- Application of one or more recombinase enzymes may recombine the components to generate a double-stranded identifier comprising the ordered components.
- DNA size selection (for example, with gel extraction) or PCR with primers flanking the outermost layers may be implemented to isolate identifier products from other byproducts that may form in the reaction.
- multiple orthogonal attB and attP pairs may be used, and each pair may be used to assemble a component from an extra layer.
- up to six orthogonal attB and attP pairs may be generated per recombinase, and multiple orthogonal recombinases may be implemented as well.
- thirteen layers may be assembled by using twelve orthogonal attB and attP pairs, six orthogonal pairs from each of two large serine recombinases, such as BxbI and PhiC31. Orthogonality of attB and attP pairs ensures that an attB site from one pair does not react with an attP site from another pair.
- Recombinase-mediated recombination reactions may be reversible or irreversible depending on the recombinase system implemented.
- the large serine recombinase family catalyzes irreversible recombination reactions without requiring any high energy cofactors
- the tyrosine recombinase family catalyzes reversible reactions.
- Identifiers may be constructed in accordance with the product scheme using template directed ligation (TDL), as shown in FIG. 10 A .
- Template directed ligation utilizes single stranded nucleic acid sequences, referred to as “templates” or “staples”, to facilitate the ordered ligation of components to form identifiers.
- the templates simultaneously hybridize to components from adjacent layers and hold them adjacent to each other (3′ end against 5′ end) while a ligase ligates them. In the example from FIG. 10 A , three layers or sets of single-stranded components are combined.
- a first layer of components (e.g., layer X or layer 1) that share common sequences a on their 3′ end, which are complementary to sequences a*; a second layer of components (e.g., layer Y or layer 2) that share common sequences b and c on their 5′ and 3′ ends respectively, which are complementary to sequences b* and c*; a third layer of components (e.g., layer Z or layer 3) that share common sequence d on their 5′ end, which may be complementary to sequences d*; and a set of two templates or “staples” with the first staple comprising the sequence a*b* (5′ to 3′) and the second staple comprising the sequence c*d* (5′ to 3′).
- one or more components from each layer may be selected and mixed into a reaction with the staples, which, by complementary annealing may facilitate the ligation of one component from each layer in a defined order to form an identifier.
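The staple-directed ordering can be sketched with plain strings. The following is a simplified model, not the disclosed chemistry: it treats a staple as the reverse complement of the junction formed by the 3′ arm of one component and the 5′ arm of the next, and it ligates only when such a staple is present. All sequences, the 4-base arm length, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def staple_for(upstream_3p_arm: str, downstream_5p_arm: str) -> str:
    """Staple (written 5' to 3') that anneals across the junction upstream|downstream."""
    return revcomp(upstream_3p_arm + downstream_5p_arm)


def ligate_in_order(components, staples, arm):
    """Join components left to right; return None if any junction lacks a staple."""
    product = components[0]
    for nxt in components[1:]:
        if staple_for(product[-arm:], nxt[:arm]) not in staples:
            return None
        product += nxt
    return product


# Hypothetical common regions a, b, c, d (4 bases each) and barcodes.
a, b, c, d = "GATT", "CCAG", "TGAC", "AGGA"
x1 = "AAAA" + a            # layer 1: barcode + a on the 3' end
y1 = b + "TT" + c          # layer 2: b on the 5' end, c on the 3' end
z1 = d + "CCCC"            # layer 3: d on the 5' end
staples = {staple_for(a, b), staple_for(c, d)}

identifier = ligate_in_order([x1, y1, z1], staples, arm=4)
wrong_order = ligate_in_order([x1, z1, y1], staples, arm=4)  # no staple bridges a|d
```

Only the layer order supported by the two staples yields a product in this model, which is the ordering role the templates play in the text.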
- DNA size selection (for example, with gel extraction) or PCR with primers flanking the outermost layers may be implemented to isolate identifier products from other byproducts that may form in the reaction.
- FIG. 10 B shows a histogram of the copy numbers (abundances) of 256 distinct nucleic acid sequences that were each assembled with 6-layer TDL.
- the edge layers (first and final layers) each had one component, and each of the four internal layers had four components, consistent with the 256 distinct sequences.
- Each edge layer component was 28 bases including a 10 base hybridization region.
- Each internal layer component was 30 bases including a 10 base common hybridization region on the 5′ end, a 10 base variable (barcode) region, and a 10 base common hybridization region on the 3′ end.
- Each of the three template strands was 20 bases in length.
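The figures quoted for the 6-layer example can be cross-checked with a short calculation. This sketch assumes that the two edge layers contribute one fixed component each, which is what the count of 256 sequences implies; the assembled length is derived here and is not stated in the text.

```python
from itertools import product

EDGE_LEN = 28                       # bases per edge layer component (from the text)
INTERNAL_LEN = 30                   # bases per internal layer component (from the text)
INTERNAL_LAYERS = 4                 # 6 layers minus the 2 edge layers
COMPONENTS_PER_INTERNAL_LAYER = 4   # from the text

# One choice per internal layer; edge layers are assumed fixed (one component each).
choices = list(product(range(COMPONENTS_PER_INTERNAL_LAYER), repeat=INTERNAL_LAYERS))
n_identifiers = len(choices)                                   # 4**4
identifier_length = 2 * EDGE_LEN + INTERNAL_LAYERS * INTERNAL_LEN  # derived, in bases
```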
- Identifiers may be constructed in accordance with the product scheme using various other chemical implementations including Golden Gate assembly, Gibson assembly, and ligase cycling reaction assembly.
- FIGS. 11 A and 11 B schematically illustrate an example method, referred to as the “permutation scheme”, for constructing identifiers (e.g., nucleic acid molecules) with permuted components (e.g., nucleic acid sequences).
- FIG. 11 A illustrates the architecture of identifiers constructed using the permutation scheme. An identifier may be constructed by combining a single component from each layer in a programmable order.
- FIG. 11 B illustrates an example of the combinatorial space of identifiers that may be constructed using the permutation scheme. In an example, a combinatorial space of size six may be generated from three layers each comprising one distinct component. The components may be concatenated in any order. In general, with M layers, each with N components, the permutation scheme enables a combinatorial space of $N^{M} \cdot M!$ total identifiers.
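The size of the permutation-scheme space can be checked by direct enumeration. The sketch below is illustrative only; the component labels and the function name `permutation_scheme` are hypothetical.

```python
from itertools import permutations, product
from math import factorial


def permutation_scheme(layers):
    """Yield every identifier: one component from each layer, layers in any order."""
    for order in permutations(range(len(layers))):
        for choice in product(*(layers[i] for i in order)):
            yield choice


# Three layers with one distinct component each, as in the example in the text.
small = list(permutation_scheme([["x"], ["y"], ["z"]]))

# Two layers (M = 2) with three components each (N = 3).
layers = [["a1", "a2", "a3"], ["b1", "b2", "b3"]]
larger = list(permutation_scheme(layers))
expected = 3 ** 2 * factorial(2)   # N**M * M!
```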
- FIG. 11 C illustrates an example implementation of the permutation scheme with template directed ligation (TDL).
- Components from multiple layers are assembled in between fixed left end and right end components, referred to as edge scaffolds.
- These edge scaffolds are the same for all identifiers in the combinatorial space and thus may be added as part of the reaction master mix for the implementation.
- Templates or staples exist for any possible junction between any two layers or scaffolds such that the order in which components from different layers are incorporated into an identifier in the reaction depends on the templates selected for the reaction.
- M of those templates form junctions between layers and themselves and may be excluded for the purposes of permutation assembly as described herein. However, their inclusion can enable a larger combinatorial space with identifiers comprising repeat components as illustrated in FIGS. 11 D-G .
- DNA size selection (for example, with gel extraction) or PCR with primers targeting the edge scaffolds may be implemented to isolate identifier products from other byproducts that may form in the reaction.
- FIGS. 11 D-G illustrate example methods of how the permutation scheme may be expanded to include certain instances of identifiers with repeated components.
- FIG. 11 D shows an example of how the implementation from FIG. 11 C may be used to construct identifiers with permuted and repeated components.
- an identifier may comprise three total components assembled from two distinct components.
- a component from a layer may be present multiple times in an identifier.
- Adjacent concatenations of the same component may be achieved by using a staple with adjacent complementary hybridization regions for both the 3′ end and 5′ end of the same component, such as the a*b* (5′ to 3′) staple in the figure.
- FIG. 11 E shows how the example implementation from FIG. 11 D may lead to non-targeted nucleic acid sequences, besides the identifier, that are assembled between the edge scaffolds.
- the appropriate identifier cannot be isolated from non-targeted nucleic acid sequence with PCR because they share the same primer binding sites on the edge.
- DNA size selection (e.g., with gel extraction) may be implemented to isolate the targeted identifier (e.g., the second sequence from the top) from the non-targeted sequences since each assembled nucleic acid sequence can be designed to have a unique length (e.g., if all components have the same length).
- FIG. 11 F shows another example where constructing an identifier with repeated components may generate multiple nucleic acid sequences with equal edge sequences but distinct lengths in the same reaction. In this method, templates that assemble a component in one layer with components in other layers in an alternating pattern may be used. As with the method shown in FIG. 11 E , size selection may be used to select identifiers of the designed length.
- FIG. 11 G shows an example where constructing an identifier with repeated components may generate multiple nucleic acid sequences with equal edge sequences and, for some nucleic acid sequences (e.g., the third and fourth from the top and the sixth and seventh from the top), equal lengths.
- those nucleic acid sequences that share equal lengths may be excluded from both being individual identifiers as it may not be possible to construct one without also constructing the other, even if PCR and DNA size selection are implemented.
- FIGS. 12 A- 12 D schematically illustrate an example method, referred to as the “MchooseK scheme”, for constructing identifiers (e.g., nucleic acid molecules) with any number, K, of assembled components (e.g., nucleic acid sequences) out of a larger number, M, of possible components.
- FIG. 12 A illustrates the architecture of identifiers constructed using the MchooseK scheme. Using this method, identifiers are constructed by assembling one component from each layer in any subset of all layers (e.g., choose components from K layers out of M possible layers).
- FIG. 12 B illustrates an example of the combinatorial space of identifiers that may be constructed using the MchooseK scheme.
- the combinatorial space may comprise $N^{K} \binom{M}{K}$ possible identifiers for M layers, N components per layer, and an identifier length of K components.
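The MchooseK count can be verified by enumeration. This is a minimal sketch with hypothetical component labels; it keeps the chosen layers in their rank order, as the scheme describes.

```python
from itertools import combinations, product
from math import comb


def m_choose_k_scheme(layers, k):
    """Yield identifiers built from one component in each of k chosen layers (rank order kept)."""
    for chosen in combinations(range(len(layers)), k):
        yield from product(*(layers[i] for i in chosen))


# M = 4 layers, N = 2 components per layer, K = 2 components per identifier.
layers = [[f"L{m}c{n}" for n in range(2)] for m in range(4)]
identifiers = list(m_choose_k_scheme(layers, 2))
expected = 2 ** 2 * comb(4, 2)   # N**K * (M choose K)
```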
- the MchooseK scheme may be implemented using template directed ligation, as shown in FIG. 12 C .
- components in this example are assembled between edge scaffolds that may or may not be included in the reaction master mix.
- Templates comprise nucleic acid sequences for the 3′ to 5′ ligation of any two components with lower rank to higher rank, respectively. There are $((M+1)^{2}+M+1)/2$ such templates.
- An individual identifier of any K components from distinct layers may be constructed by combining those selected components in a ligation reaction with the corresponding K+1 staples used to bring the K components together with the edge scaffolds in their rank order.
- Such a reaction set up may yield the nucleic acid sequence corresponding to the target identifier between the edge scaffolds.
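The template count and the selection of K+1 staples can be sketched as follows. The model assumes ranks 0 and M+1 for the left and right edge scaffolds and ranks 1 to M for the layers, so that templates correspond to ordered pairs of ranks; this numbering and the function names are illustrative assumptions.

```python
from math import comb


def all_templates(m):
    """Every lower-rank to higher-rank junction among 2 edge scaffolds and m layers."""
    return [(i, j) for i in range(m + 2) for j in range(i + 1, m + 2)]


def staples_for(chosen_layers, m):
    """The K+1 junctions that chain the chosen layers between the edge scaffolds."""
    path = [0] + sorted(chosen_layers) + [m + 1]
    return list(zip(path, path[1:]))


M = 4
templates = all_templates(M)
formula = ((M + 1) ** 2 + M + 1) // 2        # count given in the text
selected = staples_for([1, 3], M)            # identifier using layers 1 and 3 (K = 2)
```

In this model the formula in the text equals the number of rank pairs, `comb(M + 2, 2)`.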
- a reaction mix comprising all templates may be combined with the select components to assemble the target identifier.
- This alternative method may generate various nucleic acid sequences with the same edge sequences but distinct lengths (if all component lengths are equal), as illustrated in FIG. 12 D .
- the target identifier (bottom) may be isolated from byproduct nucleic acid sequences by size.
- FIGS. 13 A and 13 B schematically illustrate an example method, referred to as the “partition scheme” for constructing identifiers with partitioned components.
- FIG. 13 A shows an example of the combinatorial space of identifiers that may be constructed using the partition scheme.
- An individual identifier may be constructed by assembling one component from each layer in a fixed order with the optional placement of any partition (specially classified component) between any two components of different layers. For example, a set of components may be organized into one partition component and four layers containing one component each. A component from each layer may be combined in a fixed order and a single partition component may be assembled in various locations between layers.
- An identifier in this combinatorial space may comprise no partition components, a partition component between the components from the first and second layer, a partition between the components from the second and third layer, and so on to make a combinatorial space of eight possible identifiers.
- In general, there are $N^{K}(p+1)^{M-1}$ possible identifiers that may be constructed. This method may generate identifiers of various lengths.
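The eight-identifier example (four layers of one component each and one partition component) can be reproduced by enumeration. This sketch assumes that each of the M−1 junctions independently holds either no partition or one of the p partition components, and that one component is taken from each of the M layers, so the count it checks is N^M·(p+1)^(M−1); taking the exponent on N to be M is an assumption of the sketch. Labels and names are hypothetical.

```python
from itertools import product


def partition_scheme(layers, partitions):
    """Yield identifiers: one component per layer in fixed order,
    with an optional partition component in each gap between layers."""
    gap_options = [None] + list(partitions)
    for comps in product(*layers):
        for gaps in product(gap_options, repeat=len(layers) - 1):
            ident = [comps[0]]
            for gap, comp in zip(gaps, comps[1:]):
                if gap is not None:
                    ident.append(gap)
                ident.append(comp)
            yield tuple(ident)


# Example from the text: four layers with one component each, one partition component.
example = list(partition_scheme([["A"], ["B"], ["C"], ["D"]], ["P"]))

# Two layers of two components with two partition components: 2**2 * (2+1)**(2-1).
other = list(partition_scheme([["a1", "a2"], ["b1", "b2"]], ["P", "Q"]))
lengths = sorted({len(i) for i in example})   # identifiers of various lengths
```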
- FIG. 13 B shows an example implementation of the partition scheme using template directed ligation.
- Templates comprise nucleic acid sequences for ligating together one component from each of M layers in a fixed order. For each partition component, additional pairs of templates exist that enable the partition component to ligate in between the components from any two adjacent layers.
- a pair of templates such that one template (with sequence g*b* (5′ to 3′) for example) in a pair enables the 3′ end of layer 1 (with sequence b) to ligate to the 5′ end of the partition component (with sequence g) and such that the second template in the pair (with sequence c*h* (5′ to 3′) for example) enables the 3′ end of the partition component (with sequence h) to ligate to the 5′ end of layer 2 (with sequence c).
- the standard template for ligating together those layers may be excluded in the reaction and the pair of templates for ligating the partition in that position may be selected in the reaction.
- targeting the partition component between layer 1 and layer 2 may use the pair of templates c*h* (5′ to 3′) and g*b* (5′ to 3′) to select for the reaction rather than the template c*b* (5′ to 3′).
- Components may be assembled between edge scaffolds that may be included in the reaction mix (along with their corresponding templates for ligating to the first and Mth layers, respectively).
- a total of around M−1+2*p*(M−1) selectable templates may be used for this method for M layers and p partition components.
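The template bookkeeping for the partition scheme can be sketched as a lookup: each junction between adjacent layers uses either its standard template or the pair of templates that inserts a partition there. The labels (`layer1`, `P`) and function names are hypothetical, and edge-scaffold templates are left out because the text treats them as part of the reaction mix.

```python
def selectable_template_count(m, p):
    """M-1 standard junction templates plus a pair per partition per junction."""
    return (m - 1) + 2 * p * (m - 1)


def select_templates(m, placement):
    """Return the templates to include for m layers.

    `placement` maps a junction index i (between layer i and layer i+1, 1-based)
    to the name of the partition component placed there.
    """
    selected = []
    for i in range(1, m):
        part = placement.get(i)
        if part is None:
            selected.append((f"layer{i}", f"layer{i + 1}"))   # standard template
        else:
            selected.append((f"layer{i}", part))              # layer i to partition
            selected.append((part, f"layer{i + 1}"))          # partition to layer i+1
    return selected


count = selectable_template_count(4, 1)
chosen = select_templates(4, {1: "P"})   # partition between layer 1 and layer 2
```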
- This implementation of the partition scheme may generate various nucleic acid sequences in a reaction with the same edge sequences but distinct lengths.
- the target identifier may be isolated from byproduct nucleic acid sequences by DNA size selection. Specifically, there may be exactly one nucleic acid sequence product with exactly M layer components. If the layer components are designed large enough compared to the partition components, it may be possible to define a universal size selection region whereby the identifier (and none of the non-targeted byproducts) may be selected regardless of the particular partitioning of the components within the identifier, thereby allowing for multiple partitioned identifiers from multiple reactions to be isolated in the same size selection step.
- FIGS. 14 A and 14 B schematically illustrate an example method, referred to as the “unconstrained string scheme” or “USS”, for constructing identifiers made up of any string of components from a number of possible components.
- FIG. 14 A shows an example of the combinatorial space of 3-component (or 4-scaffold) length identifiers that may be constructed using the unconstrained string scheme.
- the unconstrained string scheme constructs an individual identifier of length K components with one or more distinct components each taken from one or more layers, where each distinct component can appear at any of the K component positions in the identifier (allowing for repeats). For example, for two layers, each comprising one component, there are eight possible 3-component length identifiers.
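The unconstrained string scheme corresponds to enumerating all length-K strings over the available components. The sketch below reproduces the eight-identifier example; component labels and the function name are hypothetical.

```python
from itertools import product


def unconstrained_strings(components, k):
    """Yield every length-k identifier; any component may fill any position, repeats allowed."""
    return product(components, repeat=k)


# Two layers with one component each, identifiers of length K = 3.
identifiers = list(unconstrained_strings(["c1", "c2"], 3))
```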
- FIG. 14 B shows an example implementation of the unconstrained string scheme using template directed ligation.
- K+1 single-stranded and ordered scaffold DNA components (including two edge scaffolds and K−1 internal scaffolds) are present in the reaction mix.
- An individual identifier comprises a single component ligated between every pair of adjacent scaffolds. For example, a component ligated between scaffolds A and B, a component ligated between scaffolds C and D, and so on until all K adjacent scaffold junctions are occupied by a component.
- selected components from different layers are introduced to scaffolds along with selected pairs of staples that direct them to assemble onto the appropriate scaffolds.
- the pair of staples a*L* (5′ to 3′) and A*b* (5′ to 3′) direct the layer 1 component with a 5′ end region ‘a’ and 3′ end region ‘b’ to ligate in between the L and A scaffolds.
- 2*A*K selectable staples may be used to construct any USS identifier of length K.
- nucleic acid byproducts may form in the reaction with the same edge scaffolds as the target identifier, but with fewer than K components (fewer than K+1 scaffolds) or with more than K components (more than K+1 scaffolds).
- the targeted identifier may form with exactly K components (K+1 scaffolds) and may therefore be selectable through techniques like DNA size selection if all components are designed to be equal in length and all scaffolds are designed to be equal in length.
- that component may solely comprise a single distinct nucleic acid sequence that fulfills all three roles of (1) an identification barcode, (2) a hybridization region for staple-mediated ligation of the 5′ end to a scaffold, and (3) a hybridization region for staple mediated ligation of the 3′ end to a scaffold.
- the internal scaffolds illustrated in FIG. 14 B may be designed such that they use the same hybridization sequence for both the staple-mediated 5′ ligation of the scaffold to a component and the staple-mediated 3′ ligation of the scaffold to another (not necessarily distinct) component.
- the depicted one-scaffold, two-staple stacked hybridization events in FIG. 14 B represent the statistical back-and-forth hybridization events that occur between the scaffold and each of the staples, thus enabling both 5′ component ligation and 3′ component ligation.
- the scaffold may be designed with two concatenated hybridization regions—a distinct 3′ hybridization region for staple-mediated 3′ ligation and a distinct 5′ hybridization region for staple-mediated 5′ ligation.
- FIGS. 15 A and 15 B schematically illustrate an example method, referred to as the “component deletion scheme”, for constructing identifiers by deleting nucleic acid sequences (or components) from a parent identifier.
- FIG. 15 A shows an example of the combinatorial spaces of possible identifiers that may be constructed using the component deletion scheme.
- a parent identifier may comprise multiple components.
- a parent identifier may comprise more than or equal to about 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50 or more components.
- An individual identifier may be constructed by selectively deleting any number of components from N possible components, leading to a “full” combinatorial space of size $2^{N}$, or by deleting a fixed number of K components from N possible components, thus leading to an “NchooseK” combinatorial space of size $\binom{N}{K}$.
- For example, for a parent identifier with three components, the full combinatorial space may be 8 and the 3choose2 combinatorial space may be 3.
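Both combinatorial spaces of the component deletion scheme can be enumerated directly. This is an illustrative sketch with a hypothetical three-component parent; the function name is not from the source.

```python
from itertools import combinations


def deletion_identifiers(parent, k=None):
    """Yield identifiers obtained by deleting components from `parent`.

    With k=None, any number of components may be deleted (the "full" space);
    otherwise exactly k components are deleted (the "NchooseK" space).
    """
    n = len(parent)
    counts = range(n + 1) if k is None else [k]
    for count in counts:
        for deleted in combinations(range(n), count):
            yield tuple(c for i, c in enumerate(parent) if i not in deleted)


parent = ("A", "B", "C")
full_space = list(deletion_identifiers(parent))           # 2**3 identifiers
choose_two = list(deletion_identifiers(parent, k=2))      # 3 choose 2 identifiers
```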
- FIG. 15 B shows an example implementation of the component deletion scheme using double stranded targeted cleavage and repair (DSTCR).
- the parent sequence may be a single stranded DNA substrate comprising components flanked by nuclease-specific target sites (which can be 4 or fewer bases in length), and the parent may be incubated with one or more double-strand-specific nucleases corresponding to the target sites.
- An individual component may be targeted for deletion with a complementary single stranded DNA (or cleavage template) that binds the component DNA (and flanking nuclease sites) on the parent, thus forming a stable double stranded sequence on the parent that may be cleaved on both ends by the nucleases.
- Another single stranded DNA hybridizes to the resulting disjoint ends of the parent (between which the component sequence had been) and brings them together for ligation, either directly or bridged by a replacement sequence, such that the ligated sequences on the parent no longer contain active nuclease-targeted sites.
- the parent identifier may be a double or single stranded nucleic acid substrate comprising components separated by spacer sequences such that no two components are flanked by the same sequence.
- the parent identifier may be incubated with Cas9 nuclease.
- An individual component may be targeted for deletion with guide ribonucleic acids (the cleavage templates) that bind to the edges of the component and enable Cas9-mediated cleavage at its flanking sites.
- a single stranded nucleic acid (the repair template) may hybridize to the resulting disjoint ends of the parent identifier (e.g., between the ends where the component sequence had been), thus bringing them together for ligation.
- Ligation may be done directly or by bridging the ends with a replacement sequence, such that the ligated sequences on the parent no longer contain spacer sequences that can be targeted by Cas9.
- The Cas9-based implementation of the component deletion scheme may be referred to as sequence specific targeted cleavage and repair, or “SSTCR”.
- Identifiers may be constructed by inserting components into a parent identifier using a derivative of DSTCR.
- a parent identifier may be a single stranded nucleic acid substrate comprising nuclease-specific target sites (which can be 4 or fewer bases in length), each embedded within a distinct nucleic acid sequence.
- the parent identifier may be incubated with one or more double-strand-specific nucleases corresponding to the target sites.
- An individual target site on the parent identifier may be targeted for component insertion with a complementary single stranded nucleic acid (the cleavage template) that binds the target site and the distinct surrounding nucleic acid sequence on the parent identifier, thus forming a double stranded site.
- the double-stranded site may be cleaved by a nuclease.
- Another single stranded nucleic acid (the repair template) may hybridize to the resulting disjoint ends of the parent identifier and bring them together for ligation, bridged by a component sequence, such that the ligated sequences on the parent no longer contain active nuclease-targeted sites.
- a derivative of SSTCR may be used to insert components into a parent identifier.
- the parent identifier may be a double or single-stranded nucleic acid and the parent may be incubated with a Cas9 nuclease.
- a distinct site on the parent identifier may be targeted for cleavage with a guide RNA (the cleavage template).
- a single stranded nucleic acid (the repair template) may hybridize to the disjoint ends of the parent identifier and bring them together for ligation, bridged by a component sequence, such that the ligated sequences on the parent identifier no longer contain active nuclease-targeted sites. Size selection may be used to select for identifiers with a certain number of component insertions.
- FIG. 16 schematically illustrates a parent identifier with recombinase recognition sites.
- Recognition sites of different patterns can be recognized by different recombinases. All recognition sites for a given set of recombinases are arranged such that the nucleic acids in between them may be excised if the recombinase is applied.
- unique molecules can be generated using recombinases to excise, shift, invert, and transpose segments of DNA to create different nucleic acid molecules.
- With N recombinases, there can be $2^{N}$ possible identifiers built from a parent.
- multiple orthogonal pairs of recognition sites from different recombinases may be arranged on a parent identifier in an overlapping fashion such that the application of one recombinase affects the type of recombination event that occurs when a downstream recombinase is applied (see Roquet et al., Synthetic recombinase-based state machines in living cells, Science 353 (6297): aad8559 (2016), which is entirely incorporated herein by reference).
- Such a system may be capable of constructing a different identifier for every ordering of N recombinases, N!.
- Recombinases may be of the tyrosine family such as Flp and Cre, or of the large serine recombinase family such as PhiC31, BxbI, TP901, or A118.
- the use of recombinases from the large serine recombinase family may be advantageous because they facilitate irreversible recombination and therefore may produce identifiers more efficiently than other recombinases.
- a single nucleic acid sequence can be programmed to become many distinct nucleic acid sequences by applying numerous recombinases in a distinct order. Approximately $e \cdot M!$ distinct nucleic acid sequences may be generated by applying M recombinases in different subsets and orders thereof, when the number of recombinases, M, is less than or equal to 7 for the large serine recombinase family.
- When the number of recombinases, M, is greater than 7, the number of sequences that can be produced approximates $3.9^{M}$; see, e.g., Roquet et al., Synthetic recombinase-based state machines in living cells, Science 353 (6297): aad8559 (2016), which is entirely incorporated herein by reference. Additional methods for producing different DNA sequences from one common sequence can include targeted nucleic acid editing enzymes such as CRISPR-Cas, TALENs, and Zinc Finger Nucleases. Sequences produced by recombinases, targeted editing enzymes or the like can be used in conjunction with any of the previous methods, for example methods disclosed in any of the figures and disclosure in the present application.
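The approximation of e·M! can be related to a simple count: the number of ways to pick a subset of M recombinases and apply it in some order. The sketch below computes that count and compares it with e·M!. Whether every ordered subset yields a distinct sequence depends on the register design, so this is a counting illustration only, and the function name is hypothetical.

```python
from math import e, factorial, perm


def ordered_subsets(m):
    """Number of ordered selections of 0..m recombinases out of m (no repeats)."""
    return sum(perm(m, k) for k in range(m + 1))


# Compare the exact count with the e * M! approximation for M = 1..7.
table = [(m, ordered_subsets(m), e * factorial(m)) for m in range(1, 8)]

# For comparison: unordered subsets (2**M) and full orderings (M!) for M = 3.
unordered_m3 = 2 ** 3
orderings_m3 = factorial(3)
```

For M = 3 the exact count is 16, against an approximation of about 16.3, and the gap stays below 1 for every M in the table.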
- If the bit-stream of information to be encoded is larger than that which can be encoded by any single nucleic acid molecule, then the information can be split and indexed with nucleic acid sequence barcodes.
- any subset of size k nucleic acid molecules from the set of N nucleic acid molecules can be chosen to produce $\log_2 \binom{N}{k}$ bits of information.
- Barcodes may be assembled onto the nucleic acid molecules within the subsets of size k to encode even longer bit streams. For example, M barcodes may be used to produce $M \cdot \log_2 \binom{N}{k}$ bits of information.
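The capacity expressions can be evaluated numerically. The sketch below is a direct transcription of the two formulas; the function names and the example values of N, k, and M are hypothetical.

```python
from math import comb, floor, log2


def bits_per_subset(n, k):
    """Information carried by choosing a size-k subset of n molecules."""
    return log2(comb(n, k))


def bits_with_barcodes(m, n, k):
    """Information carried when m barcodes each index one size-k subset."""
    return m * bits_per_subset(n, k)


single = bits_per_subset(16, 8)                 # log2(12870)
whole_bits = floor(single)                      # usable whole bits per subset
barcoded = bits_with_barcodes(10, 16, 8)        # ten barcoded subsets
```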
- a method for encoding digital information can comprise steps for breaking up the bit stream and encoding the individual elements. For example, a bit stream comprising 6 bits can be split into 3 components each component comprising two bits. Each two bit component can be barcoded to form an information cassette, and grouped or pooled together to form a hyper-pool of information cassettes.
- Barcodes can facilitate information indexing when the amount of digital information to be encoded exceeds the amount that can fit in one pool alone.
- Information comprising longer strings of bits and/or multiple bytes can be encoded by layering the approach disclosed in FIG. 3 , for example, by including a tag with unique nucleic acid sequences encoded using the nucleic acid index.
- Information cassettes or identifier libraries can comprise nitrogenous bases or nucleic acid sequences that include unique nucleic acid sequences that provide location and bit-value information in addition to a barcode or tag which indicates the component or components of the bit stream that a given sequence corresponds to.
- Information cassettes can comprise one or more unique nucleic acid sequences as well as a barcode or tag.
- the barcode or tag on the information cassette can provide a reference for the information cassette and any sequences included in the information cassette.
- the tag or barcode on an information cassette can indicate which portion of the bit stream or bit component of the bit stream the unique sequence encodes information for (e.g., the bit value and bit position information for).
- a sequence of 10 bits can be separated into two sets of bytes, each byte comprising 5 bits. Each byte can be mapped to a set of 5 possible distinct identifiers. Initially, the identifiers generated for each byte can be the same, but they may be kept in separate pools or else someone reading the information may not be able to tell which byte a particular nucleic acid sequence belongs to.
- each identifier can be barcoded or tagged with a label that corresponds to the byte for which the encoded information applies (e.g., barcode one may be attached to sequences in the nucleic acid pool to provide the first five bits and barcode two may be attached to sequences in the nucleic acid pool to provide the second five bits), and then the identifiers corresponding to the two bytes can be combined into one pool (e.g., “hyper-pool” or one or more identifier libraries).
- Each identifier library of the one or more combined identifier libraries may comprise a distinct barcode that identifies a given identifier as belonging to a given identifier library.
- Methods for adding a barcode to each identifier in an identifier library can comprise using PCR, Gibson, ligation, or any other approach that enables a given barcode (e.g., barcode 1) to attach to a given nucleic acid sample pool (e.g., barcode 1 to nucleic acid sample pool 1 and barcode 2 to nucleic acid sample pool 2).
- the sample from the hyper-pool can be read with sequencing methods, and sequencing information can be parsed using the barcode or tag.
- a method using identifier libraries and barcodes with a set of M barcodes and N possible identifiers can encode a stream of bits with a length equivalent to the product of M and N.
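The barcoded hyper-pool can be modeled as a set of (barcode, identifier) pairs. The sketch below assumes that the presence of an identifier denotes a bit value of 1 and its absence a 0, which is one possible convention and is an assumption here; the function names are hypothetical. With M barcodes and N identifiers per library it covers M×N bit positions.

```python
def encode(bits, n):
    """Map a bit string to the set of (barcode, identifier) pairs present in the hyper-pool."""
    pool = set()
    for position, bit in enumerate(bits):
        if bit == "1":
            pool.add(divmod(position, n))   # (barcode index, identifier index)
    return pool


def decode(pool, m, n):
    """Rebuild the bit string by checking each (barcode, identifier) pair in order."""
    return "".join(
        "1" if (barcode, ident) in pool else "0"
        for barcode in range(m)
        for ident in range(n)
    )


message = "1011001110"               # 10 bits split across M = 2 barcodes, N = 5 identifiers
pool = encode(message, n=5)
recovered = decode(pool, m=2, n=5)
```

Without the barcode index the two 5-bit groups would map onto the same five identifiers, which is the ambiguity that barcoding removes when libraries are combined.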
- identifier libraries may be stored in an array of wells.
- the array of wells may be defined as having n columns and q rows and each well may comprise two or more identifier libraries in a hyper-pool.
- the information encoded in each well may constitute one large contiguous item of information of size n×q larger than the information contained in each of the wells.
- An aliquot may be taken from one or more of the wells in the array of wells and the encoding may be read using sequencing, hybridization, or PCR.
- a nucleic acid sample pool, hyper-pool, identifier library, group of identifier libraries, or a well, containing a nucleic acid sample pool or hyper-pool may comprise unique nucleic acid molecules (e.g., identifiers) corresponding to bits of information and a plurality of supplemental nucleic acid sequences.
- the supplemental nucleic acid sequences may not correspond to encoded data (e.g., do not correspond to a bit value).
- the supplemental nucleic acid samples may mask or encrypt the information stored in the sample pool.
- the supplemental nucleic acid sequences may be derived from a biological source or synthetically produced.
- Supplemental nucleic acid sequences derived from a biological source may include randomly fragmented nucleic acid sequences or rationally fragmented sequences.
- the biologically derived supplemental nucleic acids may hide or obscure the data-containing nucleic acids within the sample pool by providing natural genetic information along with the synthetically encoded information, especially if the synthetically encoded information (e.g., the combinatorial space of identifiers) is made to resemble natural genetic information (e.g., a fragmented genome).
- the identifiers are derived from a biological source and the supplemental nucleic acids are derived from a biological source.
- a sample pool may contain multiple sets of identifiers and supplemental nucleic acid sequences.
- Each set of identifiers and supplemental nucleic acid sequences may be derived from different organisms.
- the identifiers are derived from one or more organisms and the supplemental nucleic acid sequences are derived from a single, different organism.
- the supplemental nucleic acid sequences may also be derived from one or more organisms and the identifiers may be derived from a single organism that is different from the organism that the supplemental nucleic acids are derived from. Both the identifiers and the supplemental nucleic acid sequences may be derived from multiple different organisms.
- a key may be used to distinguish the identifiers from the supplemental nucleic acid sequences.
- the supplemental nucleic acid sequences may store metadata about the written information.
- the metadata may comprise extra information for determining and/or authorizing the source of the original information and/or the intended recipient of the original information.
- the metadata may comprise extra information about the format of the original information, the instruments and methods used to encode and write the original information, and the date and time of writing the original information into the identifiers.
- the metadata may comprise additional information about the format of the original information, the instruments and methods used to encode and write the original information, and the date and time of writing the original information into nucleic acid sequences.
- the metadata may comprise additional information about modifications made to the original information after writing the information into nucleic acid sequences.
- the metadata may comprise annotations to the original information or one or more references to external information. Alternatively, or in addition to, the metadata may be stored in one or more barcodes or tags attached to the identifiers.
- the identifiers in an identifier pool may have the same, similar, or different lengths than one another.
- the supplemental nucleic acid sequences may have a length that is less than, substantially equal to, or greater than the length of the identifiers.
- the supplemental nucleic acid sequences may have an average length that is within one base, within two bases, within three bases, within four bases, within five bases, within six bases, within seven bases, within eight bases, within nine bases, within ten bases, or within more bases of the average length of the identifiers.
- the supplemental nucleic acid sequences are the same or substantially the same length as the identifiers.
- the concentration of supplemental nucleic acid sequences may be less than, substantially equal to, or greater than the concentration of the identifiers in the identifiers library.
- the concentration of the supplemental nucleic acids may be less than or equal to about 1%, 10%, 20%, 40%, 60%, 80%, 100%, 125%, 150%, 175%, 200%, 1000%, 1×10⁴%, 1×10⁵%, 1×10⁶%, 1×10⁷%, 1×10⁸% or less than the concentration of the identifiers.
- the concentration of the supplemental nucleic acids may be greater than or equal to about 1%, 10%, 20%, 40%, 60%, 80%, 100%, 125%, 150%, 175%, 200%, 1000%, 1×10⁴%, 1×10⁵%, 1×10⁶%, 1×10⁷%, 1×10⁸% or more than the concentration of the identifiers. Larger concentrations may be beneficial for obfuscation or concealing data.
- the concentration of the supplemental nucleic acid sequences is substantially greater (e.g., 1×10⁸% greater) than the concentration of identifiers in an identifier pool.
- a method for copying information encoded in nucleic acid sequence(s) may comprise (a) providing an identifier library and (b) constructing one or more copies of the identifier library.
- An identifier library may comprise a subset of a plurality of identifiers from a larger combinatorial space. Each individual identifier of the plurality of identifiers may correspond to an individual symbol in a string of symbols.
- An identifier may comprise one or more components.
- a component may comprise a nucleic acid sequence.
- a method for accessing information encoded in nucleic acid sequences may comprise (a) providing an identifier library, and (b) extracting a portion or a subset of the identifiers present in the identifier library from the identifier library.
- An identifier library may comprise a subset of a plurality of identifiers from a larger combinatorial space. Each individual identifier of the plurality of identifiers may correspond to an individual symbol in a string of symbols.
- An identifier may comprise one or more components.
- a component may comprise a nucleic acid sequence.
- Information may be written into one or more identifier libraries as described elsewhere herein. Identifiers may be constructed using any method described elsewhere herein.
- Stored data may be copied by generating copies of the individual identifiers in an identifier library or in one or more identifier libraries. A portion of the identifiers may be copied or an entire library may be copied. Copying may be performed by amplifying the identifiers in an identifier library. When one or more identifier libraries are combined, a single identifier library or multiple identifier libraries may be copied. If an identifier library comprises supplemental nucleic acid sequences, the supplemental nucleic acid sequences may or may not be copied.
- Identifiers in an identifier library may be constructed to comprise one or more common primer binding sites.
- the one or more binding sites may be located at the edges of each identifier or interweaved throughout each identifier.
- the primer binding site may allow for an identifier library specific primer pair or a universal primer pair to bind to and amplify the identifiers.
- All the identifiers within an identifier library or all the identifiers in one or more identifier libraries may be replicated multiple times by multiple PCR cycles.
- Conventional PCR may be used to copy the identifiers and the identifiers may be exponentially replicated with each PCR cycle. The number of copies of an identifier may increase exponentially with each PCR cycle.
- Linear PCR may be used to copy the identifiers and the identifiers may be linearly replicated with each PCR cycle. The number of identifier copies may increase linearly with each PCR cycle.
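The difference between the two growth modes can be sketched numerically. The functions below assume idealized, 100% efficient cycles, which real reactions do not achieve; the function names are illustrative only.

```python
def exponential_copies(initial: int, cycles: int) -> int:
    """Idealized conventional PCR: every existing copy is duplicated each cycle."""
    return initial * 2 ** cycles


def linear_copies(initial: int, cycles: int) -> int:
    """Idealized linear PCR: only the original templates are copied each cycle."""
    return initial * (1 + cycles)


cycles = 10
conventional = exponential_copies(1, cycles)
linear = linear_copies(1, cycles)
```

Under these assumptions, ten cycles starting from one template give 1024 copies with conventional PCR and 11 with linear PCR.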
- the identifiers may be ligated into a circular vector prior to PCR amplification.
- the circular vector may comprise a barcode at each end of the identifier insertion site.
- the PCR primers for amplifying identifiers may be designed to prime to the vector such that the barcoded edges are included with the identifier in the amplification product.
- recombination between identifiers may result in copied identifiers that comprise non-correlated barcodes on each edge.
- the non-correlated barcodes may be detectable upon reading the identifiers.
- Identifiers containing non-correlated barcodes may be considered false positives and may be disregarded during the information decoding process.
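One way to picture the barcode check during decoding is shown below. The read representation (left barcode, identifier, right barcode) and the lookup table of expected barcode pairs are hypothetical; the text only states that non-correlated barcodes may be detected and the affected identifiers disregarded.

```python
def filter_correlated(reads, barcode_pairs):
    """Keep identifiers whose two edge barcodes form a known (correlated) pair."""
    kept = []
    for left, identifier, right in reads:
        if barcode_pairs.get(left) == right:
            kept.append(identifier)
    return kept


# Hypothetical vector barcodes: each left barcode has one expected right partner.
barcode_pairs = {"BC1-L": "BC1-R", "BC2-L": "BC2-R"}
reads = [
    ("BC1-L", "id-A", "BC1-R"),
    ("BC1-L", "id-B", "BC2-R"),  # mismatched edges: treated as a false positive
    ("BC2-L", "id-C", "BC2-R"),
]
clean = filter_correlated(reads, barcode_pairs)
```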
- Information may be encoded by assigning each bit of information to a unique nucleic acid molecule.
- three sample sets (X, Y, and Z) each containing two nucleic acid sequences may assemble into eight unique nucleic acid molecules and encode eight bits of data:
- the information may be accessed through sequencing or hybridization assays.
- primers or probes may be designed to bind to common regions or the barcoded region of the nucleic acid sequence. This may enable amplification of any region of the nucleic acid molecule.
- the amplification product may then be read by sequencing the amplification product or by a hybridization assay.
- a primer specific to the barcode region of the X1 nucleic acid sequence and a primer that binds to the common region of the Z set may be used to amplify the nucleic acid molecules. This may return the sequence Y1Z2, which may encode for 0100.
- the substring of that data may also be accessed by further amplifying the nucleic acid molecules with a primer that binds to the barcode region of the Y1 nucleic acid sequence and a primer that binds to the common sequence of the Z set. This may return the Z2 nucleic acid sequence, encoding the substring 01.
- the data may be accessed by checking for the presence or absence of a particular nucleic acid sequence without sequencing. For example, amplification with a primer specific to the Y2 barcode may generate amplification products for the Y2 barcode, but not for the Y1 barcode. The presence of Y2 amplification product may signal a bit value of ‘1’. Alternatively, the absence of Y2 amplification products may signal a bit value of ‘0’.
- PCR based methods can be used to access and copy data from identifier or nucleic acid sample pools. Using common primer binding sites that flank the identifiers in the pools or hyper-pools, nucleic acids containing information can be readily copied. Alternatively, other nucleic acid amplification approaches such as isothermal amplification may also be used to readily copy data from sample pools or hyper-pools (e.g., identifier libraries).
- a particular subset of information (e.g., all nucleic acids relating to a particular barcode) can be accessed by using a primer that binds the specific barcode at one edge of the identifier in the forward orientation, along with another primer that binds a common sequence on the opposite edge of the identifier in a reverse orientation.
- Various read-out methods can be used to pull information from the encoded nucleic acid; for example microarray (or any sort of fluorescent hybridization), digital PCR, quantitative PCR (qPCR), and various sequencing platforms can be further used to read out the encoded sequences and by extension digitally encoded data.
- Accessing information stored in nucleic acid molecules may be performed by selectively removing the portion of non-targeted identifiers from an identifier library or a pool of identifiers or, for example, selectively removing all identifiers of an identifier library from a pool of multiple identifier libraries. Accessing data may also be performed by selectively capturing targeted identifiers from an identifier library or pool of identifiers. The targeted identifiers may correspond to data of interest within the larger item of information.
- a pool of identifiers may comprise supplemental nucleic acid molecules.
- the supplemental nucleic acid molecules may contain metadata about the encoded information or may be used to encrypt or mask the identifiers corresponding to the information.
- FIGS. 17 A- 17 C schematically illustrate an overview of example methods for accessing portions of information stored in nucleic acid sequences by accessing a number of particular identifiers from a larger number of identifiers.
- FIG. 17 A shows example methods for using polymerase chain reaction, affinity tagged probes, and degradation targeting probes to access identifiers containing a specified component.
- a pool of identifiers (e.g., identifier library)
- the common sequences or variable sequences may be primer binding sites.
- One or more primers may bind to the common or variable regions on the identifier edges.
- the identifiers with primers bound may be amplified by PCR.
- the amplified identifiers may significantly outnumber the non-amplified identifiers.
- the amplified identifiers may be identified.
- An identifier from an identifier library may comprise sequences on one or both of its ends that are distinct to that library, thus enabling a single library to be selectively accessed from a pool or group of more than one identifier libraries.
- the components that constitute the identifiers in a pool may share complementarity with one or more probes.
- the one or more probes may bind or hybridize to the identifiers to be accessed.
- the probe may comprise an affinity tag.
- the affinity tags may bind to a bead, generating a complex comprising a bead, at least one probe, and at least one identifier.
- the beads may be magnetic, and together with a magnet, the beads may collect and isolate the identifiers to be accessed.
- the identifiers may be removed from the beads under denaturing conditions prior to reading.
- the beads may collect the non-targeted identifiers and sequester them away from the rest of the pool that can get washed into a separate vessel and read.
- the affinity tag may bind to a column.
- the identifiers to be accessed may bind to the column for capture. Column-bound identifiers may subsequently be eluted or denatured from the column prior to reading.
- the non-targeted identifiers may be selectively targeted to the column while the targeted identifiers may flow through the column. Accessing the targeted identifiers may comprise applying one or more probes to a pool of identifiers simultaneously or applying one or more probes to a pool of identifiers sequentially.
- the components that constitute the identifiers in a pool may share complementarity with one or more degradation-targeting probes.
- the probes may bind to or hybridize with distinct components on the identifiers.
- the probe may be a target for a degradation enzyme, such as an endonuclease.
- one or more identifier libraries may be combined.
- a set of probes may hybridize with one of the identifier libraries.
- the set of probes may comprise RNA and the RNA may guide a Cas9 enzyme.
- a Cas9 enzyme may be introduced to the one or more identifier libraries.
- the identifiers hybridized with the probes may be degraded by the Cas9 enzyme.
- the identifiers to be accessed may not be degraded by the degradation enzyme.
- the identifiers may be single-stranded and the identifier library may be combined with a single-strand specific endonuclease(s), such as the S1 nuclease, that selectively degrades identifiers that are not to be accessed.
- Identifiers to be accessed may be hybridized with a complementary set of identifiers to protect them from degradation by the single-strand specific endonuclease(s).
- the identifiers to be accessed may be separated from the degradation products by size selection, such as size selection chromatography (e.g., agarose gel electrophoresis).
- identifiers that are not degraded may be selectively amplified (e.g., using PCR) such that the degradation products are not amplified.
- the non-degraded identifiers may be amplified using primers that hybridize to each end of the non-degraded identifiers and therefore not to each end of the degraded or cleaved identifiers.
- FIG. 17 B shows example methods for using polymerase chain reaction to perform ‘OR’ or ‘AND’ operations to access identifiers containing multiple components.
- an ‘OR’ amplification of the union of those sets of identifiers may be accomplished by using the two forward primers together in a multiplex PCR reaction with a reverse primer that binds all of the identifiers on the right end.
- an ‘AND’ amplification of the intersection of those two sets of identifiers may be accomplished by using the forward primer and the reverse primer together as a primer pair in a PCR reaction.
- FIG. 17 C shows example methods for using affinity tags to perform ‘OR’ or ‘AND’ operations to access identifiers containing multiple components.
- affinity probe ‘P1’ captures all identifiers with component ‘C1’
- affinity probe ‘P2’ captures all identifiers with component ‘C2’
- the set of all identifiers with C1 or C2 can be captured by using P1 and P2 simultaneously (corresponding to an ‘OR’ operation).
- the set of all identifiers with C1 and C2 can be captured by using P1 and P2 sequentially (corresponding to an ‘AND’ operation).
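The 'OR' and 'AND' access operations correspond to set union and intersection. The sketch below models an identifier as a tuple of component names and a probe as a filter on one component; this representation and the `capture` helper are assumptions for illustration, not the physical procedure.

```python
def capture(pool, component):
    """Model one affinity probe: keep identifiers that contain the given component."""
    return {ident for ident in pool if component in ident}


pool = {("C1", "C3"), ("C2", "C3"), ("C1", "C2"), ("C3", "C4")}

# 'OR': probes for C1 and C2 applied simultaneously to the same pool.
either = capture(pool, "C1") | capture(pool, "C2")

# 'AND': the C2 probe applied to what the C1 probe captured (sequential capture).
both = capture(capture(pool, "C1"), "C2")
```

Sequential capture narrows the pool at each step, which is why it yields the intersection rather than the union.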
- a method for reading information encoded in nucleic acid sequences may comprise (a) providing an identifier library, (b) identifying the identifiers present in the identifier library, (c) generating a string of symbols from the identifiers present in the identifier library, and (d) compiling information from the string of symbols.
- An identifier library may comprise a subset of a plurality of identifiers from a combinatorial space. Each individual identifier of the subset of identifiers may correspond to an individual symbol in a string of symbols.
- An identifier may comprise one or more components. A component may comprise a nucleic acid sequence.
- Identifiers may be constructed using any method described elsewhere herein.
- Stored data may be copied and accessed using any method described elsewhere herein.
- the identifier may comprise information relating to a location of the encoded symbol, a value of the encoded symbol, or both the location and the value of the encoded symbol.
- An identifier may include information relating to a location of the encoded symbol and the presence or absence of the identifier in an identifier library may indicate the value of the symbol.
- the presence of an identifier in an identifier library may indicate a first symbol value (e.g., first bit value) in a binary string and the absence of an identifier in an identifier library may indicate a second symbol value (e.g., second bit value) in a binary string.
- basing a bit value on the presence or absence of an identifier in an identifier library may reduce the number of identifiers assembled and, therefore, reduce the write time.
- the presence of an identifier may indicate a bit value of ‘1’ at the mapped location and the absence of an identifier may indicate a bit value of ‘0’ at the mapped location.
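A minimal sketch of presence/absence encoding follows, assuming each identifier is labeled by the integer position it maps to (a simplification; real identifiers are nucleic acid molecules). It shows why only the '1' positions need to be assembled.

```python
def identifiers_to_write(bits: str) -> set[int]:
    """Positions whose bit is '1'; only these identifiers are assembled."""
    return {i for i, b in enumerate(bits) if b == "1"}


def read_bits(present: set[int], length: int) -> str:
    """Rebuild the bit string: present -> '1', absent -> '0'."""
    return "".join("1" if i in present else "0" for i in range(length))


bits = "01001000"
library = identifiers_to_write(bits)
recovered = read_bits(library, len(bits))
```

Here eight bits are stored with two identifiers, since absent identifiers carry the '0' values at no write cost.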
- decoding nucleic acid encoded data may be achieved by base-by-base sequencing of the nucleic acid strands, such as Illumina Sequencing, or by utilizing a sequencing technique that indicates the presence or absence of specific nucleic acid sequences, such as fragmentation analysis by capillary electrophoresis.
- the sequencing may employ the use of reversible terminators.
- the sequencing may employ the use of natural or non-natural (e.g., engineered) nucleotides or nucleotide analogs.
- decoding nucleic acid sequences may be performed using a variety of analytical techniques, including but not limited to, any methods that generate optical, electrochemical, or chemical signals.
- A variety of sequencing approaches may be used, including, but not limited to, polymerase chain reaction (PCR), digital PCR, Sanger sequencing, high-throughput sequencing, sequencing-by-synthesis, single-molecule sequencing, sequencing-by-ligation, RNA-Seq (Illumina), next generation sequencing, Digital Gene Expression (Helicos), Clonal Single MicroArray (Solexa), shotgun sequencing, Maxam-Gilbert sequencing, or massively-parallel sequencing.
- Various read-out methods can be used to pull information from the encoded nucleic acid. For example, microarray (or any sort of fluorescent hybridization), digital PCR, quantitative PCR (qPCR), and various sequencing platforms can be used to read out the encoded sequences and, by extension, the digitally encoded data.
- An identifier library may further comprise supplemental nucleic acid sequences that provide metadata about the information, encrypt or mask the information, or that both provide metadata and mask the information.
- the supplemental nucleic acids may be identified simultaneously with identification of the identifiers. Alternatively, the supplemental nucleic acids may be identified prior to or after identifying the identifiers. In an example, the supplemental nucleic acids are not identified during reading of the encoded information.
- the supplemental nucleic acid sequences may be indistinguishable from the identifiers.
- An identifier index or a key may be used to differentiate the supplemental nucleic acid molecules from the identifiers.
- the efficiency of encoding and decoding data may be increased by recoding input bit strings to enable the use of fewer nucleic acid molecules. For example, if an input string is received with a high occurrence of ‘111’ substrings, which may map to three nucleic acid molecules (e.g., identifiers) with an encoding method, it may be recoded to a ‘000’ substring which may map to a null set of nucleic acid molecules. The alternate input substring of ‘000’ may also be recoded to ‘111’. This method of recoding may reduce the total amount of nucleic acid molecules used to encode the data because there may be a reduction in the number of ‘1’s in the dataset.
- the total size of the dataset may be increased to accommodate a codebook that specifies the new mapping instructions.
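The recoding idea can be sketched as a block swap. The version below assumes the input is split into aligned three-bit blocks and that the codebook is simply the swapped pair ('111', '000'); both are illustrative choices, since the text does not fix a block alignment or codebook format.

```python
def swap_blocks(bits: str, a: str = "111", b: str = "000") -> str:
    """Swap every aligned block equal to `a` with `b`, and vice versa."""
    size = len(a)
    blocks = [bits[i:i + size] for i in range(0, len(bits), size)]
    swapped = [b if blk == a else a if blk == b else blk for blk in blocks]
    return "".join(swapped)


original = "111111010111000"
recoded = swap_blocks(original)
ones_before = original.count("1")
ones_after = recoded.count("1")
```

Applying `swap_blocks` a second time restores the input, so the decoder only needs to know which pair was swapped.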
- An alternative method for increasing encoding and decoding efficiency may be to recode the input string to reduce the variable length. For example, ‘111’ may be recoded to ‘00’ which may shrink the size of the dataset and reduce the number of ‘1’s in the dataset.
- nucleic acid sequences (e.g., identifiers)
- nucleic acid sequences that are designed for ease of detection may include nucleic acid sequences comprising a majority of nucleotides that are easier to call and detect based on their optical, electrochemical, chemical, or physical properties.
- Engineered nucleic acid sequences may be either single or double stranded.
- Engineered nucleic acid sequences may include synthetic or unnatural nucleotides that improve the detectable properties of the nucleic acid sequence.
- Engineered nucleic acid sequences may comprise all natural nucleotides, all synthetic or unnatural nucleotides, or a combination of natural, synthetic, and unnatural nucleotides.
- Synthetic nucleotides may include nucleotide analogues such as peptide nucleic acids, locked nucleic acids, glycol nucleic acids, and threose nucleic acids.
- Unnatural nucleotides may include dNaM, an artificial nucleoside containing a 3-methoxy-2-naphthyl group, and d5SICS, an artificial nucleoside containing a 6-methylisoquinoline-1-thione-2-yl group.
- Engineered nucleic acid sequences may be designed for a single enhanced property, such as enhanced optical properties, or the designed nucleic acid sequences may be designed with multiple enhanced properties, such as enhanced optical and electrochemical properties or enhanced optical and chemical properties.
- Engineered nucleic acid sequences may comprise reactive natural, synthetic, and unnatural nucleotides that do not improve the optical, electrochemical, chemical, or physical properties of the nucleic acid sequences.
- the reactive components of the nucleic acid sequences may enable the addition of a chemical moiety that confers improved properties to the nucleic acid sequence.
- Each nucleic acid sequence may include a single chemical moiety or may include multiple chemical moieties.
- Example chemical moieties may include, but are not limited to, fluorescent moieties, chemiluminescent moieties, acidic or basic moieties, hydrophobic or hydrophilic moieties, and moieties that alter oxidation state or reactivity of the nucleic acid sequence.
- a sequencing platform may be designed specifically for decoding and reading information encoded into nucleic acid sequences.
- the sequencing platform may be dedicated to sequencing single or double stranded nucleic acid molecules.
- the sequencing platform may decode nucleic acid encoded data by reading individual bases (e.g., base-by-base sequencing) or by detecting the presence or absence of an entire nucleic acid sequence (e.g., component) incorporated within the nucleic acid molecule (e.g., identifier).
- the sequencing platform may include the use of promiscuous reagents, increased read lengths, and the detection of specific nucleic acid sequences by the addition of detectable chemical moieties.
- the use of more promiscuous reagents during sequencing may increase reading efficiency by enabling faster base calling which in turn may decrease the sequencing time.
- the use of increased read lengths may enable longer sequences of encoded nucleic acids to be decoded per read.
- the addition of detectable chemical moiety tags may enable the detection of the presence or absence of a nucleic acid sequence by the presence or absence of a chemical moiety. For example, each nucleic acid sequence encoding a bit of information may be tagged with a chemical moiety that generates a unique optical, electrochemical, or chemical signal. The presence or absence of that unique optical, electrochemical, or chemical signal may indicate a ‘0’ or a ‘1’ bit value.
- the nucleic acid sequence may comprise a single chemical moiety or multiple chemical moieties.
- the chemical moiety may be added to the nucleic acid sequence prior to use of the nucleic acid sequence to encode data. Alternatively or in addition to, the chemical moiety may be added to the nucleic acid sequence after encoding the data, but prior to decoding the data.
- the chemical moiety tag may be added directly to the nucleic acid sequence or the nucleic acid sequence may comprise a synthetic or unnatural nucleotide anchor and the chemical moiety tag may be added to that anchor.
- Unique codes may be applied to minimize or detect encoding and decoding errors. Encoding and decoding errors may occur from false negatives (e.g., a nucleic acid molecule or identifier not included in a random sampling).
- An example of an error detecting code may be a checksum sequence that counts the number of identifiers in a contiguous set of possible identifiers that is included in the identifier library. While reading the identifier library, the checksum may indicate how many identifiers from that contiguous set of identifiers to expect to retrieve, and identifiers can continue to be sampled for reading until the expected number is met.
- a checksum sequence may be included for every contiguous set of R identifiers where R can be equal in size or greater than 1, 2, 5, 10, 50, 100, 200, 500, or 1000 or less than 1000, 500, 200, 100, 50, 10, 5, or 2. The smaller the value of R, the better the error detection.
- the checksums may be supplemental nucleic acid sequences.
- a set comprising seven nucleic acid sequences may be divided into two groups, nucleic acid sequences for constructing identifiers with a product scheme (components X1-X3 in layer X and Y1-Y3 in layer Y), and nucleic acid sequences for the supplemental checksums (X4-X7 and Y4-Y7).
- the checksum sequences X4-X7 may indicate whether zero, one, two, or three sequences of layer X are assembled with each member of layer Y.
- the checksum sequences Y4-Y7 may indicate whether zero, one, two, or three sequences of layer Y are assembled with each member of layer X.
- an original identifier library with identifiers {X1Y1, X1Y3, X2Y1, X2Y2, X2Y3} may be supplemented to include checksums to become the following pool: {X1Y1, X1Y3, X2Y1, X2Y2, X2Y3, X1Y6, X2Y7, X3Y4, X6Y1, X5Y2, X6Y3}.
- the checksum sequences may also be used for error correction. For example, absence of X1Y1 from the above dataset and the presence of X1Y6 and X6Y1 may enable inference that the X1Y1 nucleic acid molecule is missing from the dataset.
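The checksum example above can be reproduced with a short sketch. Identifiers are modeled as (X, Y) tuples, and checksum components are indexed by count (X4 or Y4 for zero through X7 or Y7 for three), which is consistent with the example pool. The function names and the `row_deficit` helper are illustrative assumptions.

```python
def add_checksums(library, x_data, y_data, x_check, y_check):
    """Return the library plus count-based checksum identifiers."""
    pool = set(library)
    for x in x_data:
        # Number of layer-Y data components assembled with this X component.
        count = sum((x, y) in library for y in y_data)
        pool.add((x, y_check[count]))
    for y in y_data:
        # Number of layer-X data components assembled with this Y component.
        count = sum((x, y) in library for x in x_data)
        pool.add((x_check[count], y))
    return pool


def row_deficit(observed, x, y_data, y_check):
    """How many identifiers containing x are missing, according to its checksum.

    Assumes the checksum identifier for x was itself observed.
    """
    expected = next(y_check.index(y) for (xx, y) in observed
                    if xx == x and y in y_check)
    seen = sum((x, y) in observed for y in y_data)
    return expected - seen


x_data, y_data = ["X1", "X2", "X3"], ["Y1", "Y2", "Y3"]
x_check = ["X4", "X5", "X6", "X7"]  # list index = count (0..3)
y_check = ["Y4", "Y5", "Y6", "Y7"]
library = {("X1", "Y1"), ("X1", "Y3"), ("X2", "Y1"), ("X2", "Y2"), ("X2", "Y3")}
pool = add_checksums(library, x_data, y_data, x_check, y_check)

# Simulate a false negative: X1Y1 is not sampled during reading.
observed = pool - {("X1", "Y1")}
deficit = row_deficit(observed, "X1", y_data, y_check)
```

A deficit of one for X1, combined with the analogous deficit for Y1 indicated by X6Y1, points to X1Y1 as the missing identifier.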
- the checksum sequences may indicate whether identifiers are missing from a sampling of the identifier library or an accessed portion of the identifier library. In the case of a missing checksum sequence, access methods such as PCR or affinity tagged probe hybridization may amplify and/or isolate it. In some embodiments, the checksums may not be supplemental nucleic acid sequences. The checksums may be coded directly into the information such that they are represented by identifiers.
- Noise in data encoding and decoding may be reduced by constructing identifiers palindromically, for example, by using palindromic pairs of components rather than single components in the product scheme. Then the pairs of components from different layers may be assembled to one another in a palindromic manner (e.g., YXY instead of XY for components X and Y). This palindromic method may be expanded to larger numbers of layers (e.g., ZYXYZ instead of XYZ) and may enable detection of erroneous cross reactions between identifiers.
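A simple check for the palindromic layout might look like the following. It treats 'palindromic' at the level of component order (e.g., Z, Y, X, Y, Z), as in the examples above, and represents each read as a tuple of component names; both are assumptions for illustration.

```python
def is_palindromic(components) -> bool:
    """True if the component order reads the same in both directions."""
    items = list(components)
    return items == items[::-1]


reads = [
    ("Z1", "Y2", "X1", "Y2", "Z1"),  # consistent palindromic assembly
    ("Z1", "Y2", "X1", "Y3", "Z1"),  # Y2/Y3 mismatch: possible cross reaction
]
valid = [r for r in reads if is_palindromic(r)]
```

Reads that fail the check can be flagged as candidate erroneous cross reactions rather than decoded.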
- the identifiers may be enriched from the supplemental nucleic acid sequences.
- the identifiers may be enriched by a nucleic acid amplification reaction using primers specific to the identifier ends.
- the information may be decoded without enriching the sample pool by sequencing (e.g., sequencing by synthesis) using a specific primer. In both decoding methods, it may be difficult to enrich or decode the information without having a decoding key or knowing something about the composition of the identifiers.
- Alternative access methods may also be employed such as using affinity tag based probes.
- a system for encoding digital information into nucleic acids can comprise systems, methods and devices for converting files and data (e.g., raw data, compressed zip files, integer data, and other forms of data) into bytes and encoding the bytes into segments or sequences of nucleic acids, typically DNA, or combinations thereof.
- a system for encoding binary sequence data using nucleic acids may comprise a device and one or more computer processors.
- the device may be configured to construct an identifier library.
- the one or more computer processors may be individually or collectively programmed to (i) translate the information into a string of symbols, (ii) map the string of symbols to the plurality of identifiers, and (iii) construct an identifier library comprising at least a subset of a plurality of identifiers.
- An individual identifier of the plurality of identifiers may correspond to an individual symbol of the string of symbols.
- An individual identifier of the plurality of identifiers may comprise one or more components.
- An individual component of the one or more components may comprise a nucleic acid sequence.
- a system for reading binary sequence data using nucleic acids may comprise a database and one or more computer processors.
- the database may store an identifier library encoding the information.
- the one or more computer processors may be individually or collectively programmed to (i) identify the identifiers in the identifier library, (ii) generate a plurality of symbols from identifiers identified in (i), and (iii) compile the information from the plurality of symbols.
- the identifier library may comprise a subset of a plurality of identifiers. Each individual identifier of the plurality of identifiers may correspond to an individual symbol in a string of symbols.
- An identifier may comprise one or more components.
- a component may comprise a nucleic acid sequence.
- Non-limiting embodiments of methods for using the system to encode digital data can comprise steps for receiving digital information in the form of byte streams, parsing the byte streams into individual bytes, mapping the location of a bit within the byte using a nucleic acid index (or identifier rank), and encoding sequences corresponding to either bit values of 1 or bit values of 0 into identifiers.
- Steps for retrieving digital data can comprise sequencing a nucleic acid sample or nucleic acid pool comprising sequences of nucleic acid (e.g., identifiers) that map to one or more bits, referencing an identifier rank to confirm if the identifier is present in the nucleic acid pool and decoding the location and bit-value information for each sequence into a byte comprising a sequence of digital information.
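The encode and retrieve steps can be sketched at the byte level. The example assumes that an identifier rank is the integer `byte_index * 8 + bit_index` (most significant bit first) and that only bit values of 1 are written as identifiers; the text allows either bit value, so these are illustrative choices.

```python
def encode(data: bytes) -> set[int]:
    """Return the ranks of identifiers to assemble (one per '1' bit)."""
    ranks = set()
    for byte_index, byte in enumerate(data):
        for bit_index in range(8):
            if (byte >> (7 - bit_index)) & 1:
                ranks.add(byte_index * 8 + bit_index)
    return ranks


def decode(ranks: set[int], n_bytes: int) -> bytes:
    """Rebuild the bytes from the ranks found in the sequenced pool."""
    out = bytearray(n_bytes)
    for rank in ranks:
        byte_index, bit_index = divmod(rank, 8)
        out[byte_index] |= 1 << (7 - bit_index)
    return bytes(out)


message = b"DNA"
ranks = encode(message)
restored = decode(ranks, len(message))
```

The decoder needs the original length, because trailing zero bytes leave no identifiers in the pool.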
- a system for encoding and writing information into nucleic acid molecules may include a device and one or more computer processors.
- the one or more computer processors may be programmed to parse the information into strings of symbols (e.g., strings of bits).
- the computer processor may generate an identifier rank.
- the computer processor may categorize the symbols into two or more categories.
- One category may include symbols to be represented by a presence of the corresponding identifier in the identifier library and the other category may include symbols to be represented by an absence of the corresponding identifiers in the identifier library.
- the computer processor may direct the device to assemble the identifiers corresponding to symbols to be represented by the presence of an identifier in the identifier library.
- the device may comprise a plurality of regions, sections, or partitions.
- the reagents and components to assemble the identifiers may be stored in one or more regions, sections, or partitions of the device.
- Layers may be stored in separate regions or sections of the device.
- a layer may comprise one or more unique components.
- the component in one layer may be unique from the components in another layer.
- the regions or sections may comprise vessels and the partitions may comprise wells.
- Each layer may be stored in a separate vessel or partition.
- Each reagent or nucleic acid sequence may be stored in a separate vessel or partition.
- reagents may be combined to form a master mix for identifier construction.
- the device may transfer reagents, components, and templates from one section of the device to be combined in another section.
- the device may provide the conditions for completing the assembly reaction. For example, the device may provide heating, agitation, and detection of reaction progress.
- the constructed identifiers may be directed to undergo one or more subsequent reactions to add barcodes, common sequences, variable sequences, or tags to one or more ends of the identifiers.
- the identifiers may then be directed to a region or partition to generate an identifier library.
- One or more identifier libraries may be stored in each region, section, or individual partition of the device.
- the device may transfer fluid (e.g., reagents, components, templates) using pressure, vacuum, or suction.
- the identifier libraries may be stored in the device or may be moved to a separate database.
- the database may comprise one or more identifier libraries.
- the database may provide conditions for long term storage of the identifier libraries (e.g., conditions to reduce degradation of identifiers).
- the identifier libraries may be stored in a powder, liquid, or solid form.
- the database may provide Ultra-Violet light protection, reduced temperature (e.g., refrigeration or freezing), and protection from degrading chemicals and enzymes. Prior to being transferred to a database, the identifier libraries may be lyophilized or frozen.
- the identifier libraries may include ethylenediaminetetraacetic acid (EDTA) to inactivate nucleases and/or a buffer to maintain the stability of the nucleic acid molecules.
- the database may be coupled to, include, or be separate from a device that writes the information into identifiers, copies the information, accesses the information, or reads the information.
- a portion of an identifier library may be removed from the database prior to copying, accessing, or reading.
- the device that copies the information from the database may be the same or a different device from that which writes the information.
- the device that copies the information may extract an aliquot of an identifier library from the device and combine that aliquot with the reagents and constituents to amplify a portion of or the entire identifier library.
- the device may control the temperature, pressure, and agitation of the amplification reaction.
- the device may comprise partitions, and one or more amplification reactions may occur in the partition comprising the identifier library.
- the device may copy more than one pool of identifiers at a time.
- the copied identifiers may be transferred from the copy device to an accessing device.
- the accessing device may be the same device as the copy device.
- the access device may comprise separate regions, sections, or partitions.
- the access device may have one or more columns, bead reservoirs, or magnetic regions for separating identifiers bound to affinity tags.
- the access device may have one or more size selection units.
- a size selection unit may include agarose gel electrophoresis or any other method for size selecting nucleic acid molecules. Copying and extraction may be performed in the same region of a device or in different regions of a device.
- the accessed data may be read in the same device or the accessed data may be transferred to another device.
- the reading device may comprise a detection unit to detect and identify the identifiers.
- the detection unit may be part of a sequencer, hybridization array, or other unit for identifying the presence or absence of an identifier.
- a sequencing platform may be designed specifically for decoding and reading information encoded into nucleic acid sequences.
- the sequencing platform may be dedicated to sequencing single or double stranded nucleic acid molecules.
- the sequencing platform may decode nucleic acid encoded data by reading individual bases (e.g., base-by-base sequencing) or by detecting the presence or absence of an entire nucleic acid sequence (e.g., component) incorporated within the nucleic acid molecule (e.g., identifier).
- the sequencing platform may be a system such as Illumina Sequencing or fragmentation analysis by capillary electrophoresis.
- decoding nucleic acid sequences may be performed using a variety of analytical techniques implemented by the device, including but not limited to, any methods that generate optical, electrochemical, or chemical signals.
- Information storage in nucleic acid molecules may have various applications including, but not limited to, long term information storage, sensitive information storage, and storage of medical information.
- a person's medical information e.g., medical history and records
- the information may be stored external to the body (e.g., in a wearable device) or internal to the body (e.g., in a subcutaneous capsule).
- a sample may be taken from the device or capsule and the information may be decoded with the use of a nucleic acid sequencer.
- personal storage of medical records in nucleic acid molecules may provide an alternative to computer and cloud based storage systems.
- Nucleic acid molecules used for capsule-based storage of medical records may be derived from human genomic sequences. The use of human genomic sequences may decrease the immunogenicity of the nucleic acid sequences in the event of capsule failure and leakage.
- FIG. 19 shows a computer system 1901 that is programmed or otherwise configured to encode digital information into nucleic acid sequences and/or read (e.g., decode) information derived from nucleic acid sequences.
- the computer system 1901 can regulate various aspects of the encoding and decoding procedures of the present disclosure, such as, for example, the bit-values and bit location information for a given bit or byte from an encoded bitstream or byte stream.
- the computer system 1901 includes a central processing unit (CPU, also “processor” and “computer processor” herein) 1905 , which can be a single core or multi core processor, or a plurality of processors for parallel processing.
- the computer system 1901 also includes memory or memory location 1910 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 1915 (e.g., hard disk), communication interface 1920 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 1925 , such as cache, other memory, data storage and/or electronic display adapters.
- the memory 1910 , storage unit 1915 , interface 1920 and peripheral devices 1925 are in communication with the CPU 1905 through a communication bus (solid lines), such as a motherboard.
- the storage unit 1915 can be a data storage unit (or data repository) for storing data.
- the computer system 1901 can be operatively coupled to a computer network (“network”) 1930 with the aid of the communication interface 1920 .
- the network 1930 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet.
- the network 1930 in some cases is a telecommunication and/or data network.
- the network 1930 can include one or more computer servers, which can enable distributed computing, such as cloud computing.
- the network 1930 in some cases with the aid of the computer system 1901 , can implement a peer-to-peer network, which may enable devices coupled to the computer system 1901 to behave as a client or a server.
- the CPU 1905 can execute a sequence of machine-readable instructions, which can be embodied in a program or software.
- the instructions may be stored in a memory location, such as the memory 1910 .
- the instructions can be directed to the CPU 1905 , which can subsequently program or otherwise configure the CPU 1905 to implement methods of the present disclosure. Examples of operations performed by the CPU 1905 can include fetch, decode, execute, and writeback.
- the CPU 1905 can be part of a circuit, such as an integrated circuit.
- One or more other components of the system 1901 can be included in the circuit.
- the circuit is an application specific integrated circuit (ASIC).
- the storage unit 1915 can store files, such as drivers, libraries and saved programs.
- the storage unit 1915 can store user data, e.g., user preferences and user programs.
- the computer system 1901 in some cases can include one or more additional data storage units that are external to the computer system 1901 , such as located on a remote server that is in communication with the computer system 1901 through an intranet or the Internet.
- the computer system 1901 can communicate with one or more remote computer systems through the network 1930 .
- the computer system 1901 can communicate with a remote computer system of a user or other devices and/or machinery that may be used by the user in the course of analyzing data encoded or decoded in a sequence of nucleic acids (e.g., a sequencer or other system for chemically determining the order of nitrogenous bases in a nucleic acid sequence).
- remote computer systems include personal computers (e.g., portable PC), slate or tablet PCs (e.g., Apple iPad, Samsung Galaxy Tab), telephones, smart phones (e.g., Apple iPhone, Android-enabled device, Blackberry), or personal digital assistants.
- the user can access the computer system 1901 via the network 1930 .
- Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 1901 , such as, for example, on the memory 1910 or electronic storage unit 1915 .
- the machine executable or machine readable code can be provided in the form of software.
- the code can be executed by the processor 1905.
- the code can be retrieved from the storage unit 1915 and stored on the memory 1910 for ready access by the processor 1905.
- the electronic storage unit 1915 can be precluded, and machine-executable instructions are stored on memory 1910 .
- the code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or can be compiled during runtime.
- the code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.
- aspects of the systems and methods provided herein can be embodied in programming.
- Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium.
- Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk.
- “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming.
- All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server.
- another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links.
- the physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software.
- terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.
- Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings.
- Volatile storage media include dynamic memory, such as main memory of such a computer platform.
- Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system.
- Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications.
- Common forms of computer-readable media therefore include, for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data.
- Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
- the computer system 1901 can include or be in communication with an electronic display 1935 that comprises a user interface (UI) 1940 for providing, for example, sequence output data including chromatographs, sequences as well as bits, bytes, or bit streams encoded by or read by a machine or computer system that is encoding or decoding nucleic acids, raw data, files and compressed or decompressed zip files to be encoded or decoded into DNA stored data.
- Examples of UIs include, without limitation, a graphical user interface (GUI) and web-based user interface.
- Methods and systems of the present disclosure can be implemented by way of one or more algorithms.
- An algorithm can be implemented by way of software upon execution by the central processing unit 1905 .
- the algorithm can, for example, be used with a DNA index and raw data or zip file compressed or decompressed data, to determine a customized method for coding digital information from the raw data or zip file compressed data, prior to encoding the digital information.
- Data to be encoded is a textfile containing a poem.
- the data is encoded manually with pipettes to mix together DNA components from two layers of 96 components to construct identifiers using the product scheme implemented with overlap extension PCR.
- the first layer, X, comprises 96 total DNA components.
- the second layer, Y, also comprises 96 total components.
- Prior to writing the DNA, the data is mapped to binary and then recoded to a uniform weight format in which every contiguous (adjacent, disjoint) string of 61 bits of the original data is translated to a 96-bit string with exactly 17 bit-values of 1. This uniform weight format may have natural error checking qualities.
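- One way to realize a uniform weight recoding of this shape is enumerative (combinatorial) coding, sketched below. The disclosure does not state which mapping was used, so this technique is swapped in purely for illustration, and the names `encode_block` and `decode_block` are hypothetical. The sketch relies on the fact that there are at least 2^61 distinct 96-bit words containing exactly 17 ones, which the first assertion checks.

```python
from math import comb

N_BITS, WEIGHT, DATA_BITS = 96, 17, 61

# The recoding is only possible if enough weight-17 words of length 96 exist.
assert comb(N_BITS, WEIGHT) >= 2 ** DATA_BITS


def encode_block(value: int) -> list[int]:
    """Map an integer in [0, 2**61) to a 96-bit word with exactly 17 ones."""
    if not 0 <= value < 2 ** DATA_BITS:
        raise ValueError("value must fit in 61 bits")
    word, ones_left = [], WEIGHT
    for pos in range(N_BITS):
        remaining = N_BITS - pos - 1
        # Number of valid completions that place a 0 at this position.
        with_zero = comb(remaining, ones_left)
        if value < with_zero:
            word.append(0)
        else:
            word.append(1)
            value -= with_zero
            ones_left -= 1
    return word


def decode_block(word: list[int]) -> int:
    """Invert encode_block for a 96-bit word with exactly 17 ones."""
    if len(word) != N_BITS or sum(word) != WEIGHT:
        raise ValueError("word must have 96 bits and weight 17")
    value, ones_left = 0, WEIGHT
    for pos, bit in enumerate(word):
        if bit:
            # Skip every word that has a 0 here and the same prefix.
            value += comb(N_BITS - pos - 1, ones_left)
            ones_left -= 1
    return value
```

- Because every valid word has weight 17, a decoded row with any other number of detected identifiers can be flagged as erroneous, which is one reading of the "natural error checking qualities" mentioned above.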
- the data is then hashed into a 96 by 96 table to form a reference map.
- the middle panel of FIG. 18 A shows the two-dimensional reference map of a 96 by 96 table encoding the poem into a plurality of identifiers. Dark points correspond to a ‘1’ bit-value and white points correspond to a ‘0’ bit-value.
- the data is encoded into identifiers using two layers of 96 components. Each X value and Y value of the table is assigned a component and the X and Y components are assembled into an identifier using overlap extension PCR for each (X,Y) coordinate with a ‘1’ value.
- the data was read back (e.g., decoded) by sequencing the identifier library to determine the presence or absence of each possible (X,Y) assembly.
- FIG. 18 A shows a two-dimensional heat map of the abundances of sequences present in the identifier library as determined by sequencing.
- Each pixel represents a molecule comprising the corresponding X and Y components, and the greyscale intensity at that pixel represents the relative abundance of that molecule compared to other molecules.
- Identifiers are taken as the top 17 most abundant (X, Y) assemblies in each row (as the uniform weight encoding guarantees that each contiguous string of 96 bits has exactly 17 ‘1’ values, and hence 17 corresponding identifiers).
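- The row-wise calling rule described above can be sketched as follows. The function name `call_row` and the representation of a row as a plain list of read abundances are assumptions made for illustration; ties are broken by column order here, which the disclosure does not specify.

```python
def call_row(abundances: list[float], weight: int = 17) -> list[int]:
    """Call the `weight` most abundant assemblies in a row as bit-values of 1."""
    ranked = sorted(range(len(abundances)),
                    key=lambda x: abundances[x], reverse=True)
    present = set(ranked[:weight])
    return [1 if x in present else 0 for x in range(len(abundances))]
```

- For a 96-column row the default `weight=17` matches the uniform weight format; the small weight in a toy call such as `call_row([5, 0, 9, 1], weight=2)` is only for readability.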
- Data to be encoded is a textfile of three poems totaling 62824 bits.
- the data is encoded using a Labcyte Echo Liquid Handler to mix together DNA components from two layers of 384 components to construct identifiers using the product scheme implemented with overlap extension PCR.
- the first layer, X, comprises 384 total DNA components.
- the second layer, Y, also comprises 384 total components.
- Prior to writing the DNA, the data is mapped to binary and then recoded to decrease the weight (number of bit-values of ‘1’) and include checksums.
- the checksums are established so that there is an identifier that corresponds to a checksum for every contiguous string of 192 bits of data.
- the re-coded data has a weight of approximately 10,100, which corresponds to the number of identifiers to be constructed.
- the data may then be hashed into a 384 by 384 table to form a reference map.
- the middle panel of FIG. 18 B shows a two-dimensional reference map of a 384 by 384 table encoding the textfile into a plurality of identifiers.
- Each coordinate (X,Y) corresponds to the bit of data at position X+(Y−1)*192.
- Black points correspond to a bit value of ‘1’ and white points correspond to a bit value of ‘0’.
- the black points on the right side of the figure are the checksums and the pattern of black points on the top of the figure is the codebook (e.g., dictionary for de-coding the data).
- Each X value and Y value of the table may be assigned a component and the X and Y components are assembled into an identifier using overlap extension PCR for each (X, Y) coordinate with a ‘1’ value.
- the data was read back (e.g., decoded) by sequencing the identifier library to determine the presence or absence of each possible (X, Y) assembly.
- the right panel of FIG. 18 B shows a two-dimensional heat map of the abundances of sequences present in the identifier library as determined by sequencing.
- Each pixel represents a molecule comprising the corresponding X and Y components, and the greyscale intensity at that pixel represents the relative abundance of that molecule compared to other molecules.
- Identifiers are taken as the top S most abundant (X, Y) assemblies in each row, where S for each row may be the checksum value.
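- The coordinate-to-position relation and the checksum-guided calling rule of this example can be sketched as below. Several points are assumptions rather than statements from the disclosure: coordinates and positions are taken to be 1-based, data columns are taken to span X = 1..192 so that the relation is invertible, and the names `position_of`, `coordinate_of`, and `call_row_with_checksum` are hypothetical.

```python
BLOCK = 192  # multiplier in the relation: position = X + (Y - 1) * 192


def position_of(x: int, y: int) -> int:
    """Bit position addressed by coordinate (X, Y)."""
    return x + (y - 1) * BLOCK


def coordinate_of(position: int) -> tuple[int, int]:
    """Inverse of position_of, assuming 1-based values and 1 <= X <= 192."""
    y, x = divmod(position - 1, BLOCK)
    return x + 1, y + 1


def call_row_with_checksum(abundances: list[float], s: int) -> list[int]:
    """Call the S most abundant assemblies in a row, where S is the row's checksum."""
    ranked = sorted(range(len(abundances)),
                    key=lambda i: abundances[i], reverse=True)
    present = set(ranked[:s])
    return [1 if i in present else 0 for i in range(len(abundances))]
```

- Unlike the fixed weight of 17 in the previous example, the number of identifiers expected per row varies here, so the checksum has to be decoded before the row itself can be called.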
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Chemical & Material Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Theoretical Computer Science (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Biotechnology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Genetics & Genomics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Medical Informatics (AREA)
- Organic Chemistry (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Wood Science & Technology (AREA)
- Zoology (AREA)
- Analytical Chemistry (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Biochemistry (AREA)
- Biomedical Technology (AREA)
- Databases & Information Systems (AREA)
- Bioethics (AREA)
- Microbiology (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Medicinal Chemistry (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Chemical Kinetics & Catalysis (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Plant Pathology (AREA)
- Crystallography & Structural Chemistry (AREA)
- Immunology (AREA)
Abstract
Methods and systems for encoding digital information in nucleic acid (e.g., deoxyribonucleic acid) molecules without base-by-base synthesis, by encoding bit-value information in the presence or absence of unique nucleic acid sequences within a pool. The methods comprise specifying each bit location in a bit-stream with a unique nucleic acid sequence and specifying the bit value at that location by the presence or absence of the corresponding unique nucleic acid sequence in the pool or, more generally, specifying unique bytes in a byte stream by unique subsets of nucleic acid sequences. Also disclosed are methods for generating unique nucleic acid sequences without base-by-base synthesis using combinatorial genomic strategies (e.g., assembly of multiple nucleic acid sequences or enzymatic-based editing of nucleic acid sequences).
Description
- This application is a Continuation application of International Patent Application No. PCT/US17/062098 filed Nov. 16, 2017, which claims the benefit of U.S. Provisional Patent Application Ser. No. 62/423,058, filed Nov. 16, 2016, U.S. Provisional Patent Application Ser. No. 62/457,074, filed Feb. 9, 2017, and U.S. Provisional Patent Application Ser. No. 62/466,304, filed Mar. 2, 2017, each of which is entirely incorporated herein by reference.
- Nucleic acid digital data storage is a stable approach for encoding and storing information for long periods of time, with data stored at higher densities than magnetic tape or hard drive storage systems. Additionally, digital data stored in nucleic acid molecules that are stored in cold and dry conditions can be retrieved as long as 60,000 years later or longer.
- To access digital data stored in nucleic acid molecules, the nucleic acid molecules may be sequenced. As such, nucleic acid digital data storage may be an ideal method for storing data that is not frequently accessed but may have a high volume of information to be stored or archived for long periods of time.
- Current methods rely on encoding the digital information (e.g., binary code) into base-by-base nucleic acid sequences, such that the base-to-base relationship in the sequence directly translates into the digital information (e.g., binary code). Sequencing of digital data stored in base-by-base sequences that can be read into bit-streams or bytes of digitally encoded information can be error prone, and such data can be costly to encode since de novo base-by-base nucleic acid synthesis can be expensive. Opportunities for new methods of performing nucleic acid digital data storage may provide approaches for encoding and retrieving data that are less costly and easier to commercially implement.
- Provided herein are methods and systems for encoding digital information in nucleic acid (e.g., deoxyribonucleic acid, DNA) molecules without base-by-base synthesis, by encoding bit-value information in the presence or absence of unique nucleic acid sequences within a pool. The methods comprise specifying each bit location in a bit-stream with a unique nucleic acid sequence and specifying the bit value at that location by the presence or absence of the corresponding unique nucleic acid sequence in the pool or, more generally, specifying unique bytes in a byte stream by unique subsets of nucleic acid sequences. Also disclosed are methods for generating unique nucleic acid sequences without base-by-base synthesis using combinatorial genomic strategies (e.g., assembly of multiple nucleic acid sequences or enzymatic-based editing of nucleic acid sequences).
- In an aspect, the present disclosure provides a method for writing information into nucleic acid sequence(s), comprising: (a) translating the information into a string of symbols; (b) mapping the string of symbols to a plurality of identifiers, wherein an individual identifier of the plurality of identifiers comprises one or more components, wherein an individual component of the one or more components comprises a nucleic acid sequence, and wherein the individual identifier of the plurality of identifiers corresponds to an individual symbol of the string of symbols; and (c) constructing an identifier library comprising at least a subset of the plurality of identifiers.
- In some embodiments, each symbol in the string of symbols is one of two possible symbol values. In some embodiments, one symbol value at each position of the string of symbols may be represented by the absence of a distinct identifier in the identifier library. In some embodiments, the two possible symbol values are a bit-value of 0 and 1, wherein the individual symbol with the bit-value of 0 in the string of symbols may be represented by an absence of a distinct identifier in the identifier library, wherein the individual symbol with the bit-value of 1 in the string of symbols may be represented by a presence of the distinct identifier in the identifier library, and vice versa. In some embodiments, each symbol of the string of symbols is one of one or more possible symbol values. In some embodiments, a presence of an individual identifier in the identifier library corresponds to a first symbol value in a binary string and an absence of the individual identifier corresponds to a second symbol value in a binary string. In some embodiments, the first symbol value is a bit value of 1 and the second symbol value is a bit value of 0. In some embodiments, the first symbol value is a bit value of 0 and the second symbol value is a bit value of 1.
- In some embodiments, constructing the individual identifier in the identifier library comprises assembling the one or more components from one or more layers and wherein each layer of the one or more layers comprises a distinct set of components. In some embodiments, the individual identifier from the identifier library comprises one component from each layer of the one or more layers. In some embodiments, the one or more components are assembled in a fixed order. In some embodiments, the one or more components are assembled in a random order. In some embodiments, the one or more components are assembled with one or more partitioning components disposed between two components from different layers of the one or more layers. In some embodiments, the individual identifier comprises one component from each layer of a subset of the one or more layers. In some embodiments, the individual identifier comprises at least one component from each of the one or more layers. In some embodiments, the one or more components are assembled using overlap-extension polymerase chain reaction (PCR), polymerase cycling assembly, sticky end ligation, biobricks assembly, golden gate assembly, gibson assembly, recombinase assembly, ligase cycling reaction, or template directed ligation.
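- The embodiment in which each identifier takes one component from each layer, assembled in a fixed order, can be sketched as a Cartesian product over the layers. The component labels, the function names `build_identifiers` and `rank_of`, and the particular ranking (last layer varying fastest) are assumptions for illustration only; components are represented by labels rather than nucleic acid sequences.

```python
from itertools import product


def build_identifiers(layers: list[list[str]]) -> list[tuple[str, ...]]:
    """All identifiers with one component per layer, in fixed layer order."""
    return [tuple(combo) for combo in product(*layers)]


def rank_of(identifier: tuple[str, ...], layers: list[list[str]]) -> int:
    """Mixed-radix index of an identifier, matching build_identifiers order."""
    rank = 0
    for component, layer in zip(identifier, layers):
        rank = rank * len(layer) + layer.index(component)
    return rank
```

- With two layers of 96 components each, this construction yields 96 × 96 = 9216 possible identifiers from only 192 distinct components, which is the combinatorial advantage of assembling identifiers rather than synthesizing each one base by base.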
- In some embodiments, constructing the individual identifier in the identifier library comprises deleting, replacing, or inserting at least one component in a parent identifier by applying nucleic acid editing enzymes to the parent identifier. In some embodiments, the parent identifier comprises a plurality of components flanked by nuclease-specific target sites, recombinase recognition sites, or distinct spacer sequences. In some embodiments, the nucleic acid editing enzymes are selected from the group consisting of CRISPR-Cas, TALENs, Zinc Finger Nucleases, Recombinases, and functional variants thereof.
- In some embodiments, the identifier library comprises a plurality of nucleic acid sequences. In some embodiments, the plurality of nucleic acid sequences stores metadata of the information and/or conceals the information. In some embodiments, the metadata comprises secondary information corresponding to a source of the information, an intended recipient of the information, an original format of the information, instrumentation and methods used to encode the information, a date and a time of writing the information into the identifier library, modifications made to the information, and/or a reference to other information.
- In some embodiments, one or more identifier libraries are combined and wherein each identifier library of the one or more identifier libraries is tagged with a distinct barcode. In some embodiments, each individual identifier in the identifier library comprises the distinct barcode. In some embodiments, the plurality of identifiers is selected for ease of read, write, access, copy, and deletion operations. In some embodiments, the plurality of identifiers is selected to minimize write errors, mutations, degradation, and read errors.
- In another aspect, the present disclosure provides a method for copying information encoded in nucleic acid sequence(s), comprising: (a) providing an identifier library encoding a string of symbols, wherein the identifier library comprises a plurality of identifiers, wherein an individual identifier of the plurality of identifiers comprises one or more components, wherein an individual component of the one or more components comprises a nucleic acid sequence, and wherein the individual identifier of the plurality of identifiers corresponds to an individual symbol of the string of symbols; and (b) constructing one or more copies of the identifier library.
- In some embodiments, the plurality of identifiers comprises one or more primer binding sites. In some embodiments, the identifier library is copied using polymerase chain reaction (PCR). In some embodiments, the PCR is conventional PCR or linear PCR and wherein a number of copies of the identifier library doubles or increases linearly, respectively, with each PCR cycle. In some embodiments, the individual identifier in the identifier library is ligated into a circular vector prior to PCR and wherein the circular vector comprises a barcode at each end of the individual identifier.
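- The difference between the two copy modes can be made concrete with an idealized count. The sketch below assumes perfect efficiency in every cycle, which real reactions do not achieve, and the function name `copies_after` is hypothetical. In the linear case only the original templates are copied in each cycle, so the total grows by a constant amount per cycle.

```python
def copies_after(cycles: int, start: int = 1, mode: str = "conventional") -> int:
    """Idealized number of molecules after a given number of PCR cycles."""
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    if mode == "conventional":
        # Every molecule is a template, so the count doubles each cycle.
        return start * 2 ** cycles
    if mode == "linear":
        # Only the starting templates are copied, once per cycle.
        return start * (1 + cycles)
    raise ValueError("mode must be 'conventional' or 'linear'")
```

- Under these assumptions, ten cycles turn one molecule into 1024 with conventional PCR but only 11 with linear PCR.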
- In some embodiments, the identifier library comprises a plurality of nucleic acid sequences. In some embodiments, the plurality of nucleic acid sequences is copied. In some embodiments, one or more identifier libraries are combined prior to copying and wherein each library of the one or more identifier libraries comprises a distinct barcode.
- In another aspect, the present disclosure provides a method for accessing information encoded in nucleic acid sequence(s), comprising: (a) providing an identifier library encoding a string of symbols, wherein the identifier library comprises a plurality of identifiers, wherein an individual identifier of the plurality of identifiers comprises one or more components, wherein an individual component of the one or more components comprises a nucleic acid sequence, and wherein the individual identifier of the plurality of identifiers corresponds to an individual symbol of the string of symbols; and (b) extracting a targeted subset of the plurality of identifiers from the identifier library.
- In some embodiments, a plurality of probes is combined with the identifier library. In some embodiments, the plurality of probes share complementarity with the targeted subset of the plurality of identifiers from the identifier library. In some embodiments, the plurality of probes hybridizes the targeted subset of the plurality of identifiers in the identifier library. In some embodiments, the plurality of probes comprises one or more affinity tags and wherein the one or more affinity tags is captured by an affinity bead or an affinity column.
- In some embodiments, the identifier library is sequentially combined with one or more subsets of the plurality of probes and wherein a portion of the identifier library binds to the one or more subsets of the plurality of probes. In some embodiments, the portion of the identifier library that binds to the one or more subsets of the plurality of probes is removed prior to the addition of another subset of the plurality of probes to the identifier library.
- In some embodiments, the individual identifier of the plurality of identifiers comprises one or more common primer binding regions, one or more variable primer binding regions, or any combination thereof. In some embodiments, the identifier library is combined with primers that bind to the one or more common primer binding regions or to the one or more variable primer binding regions. In some embodiments, the primers that bind to the one or more variable primer binding regions are used to selectively amplify the targeted subset of the identifier library.
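Selective access by a variable primer binding region can be modeled in software as filtering identifiers by the components they contain. The sketch below is an illustration only: it assumes each identifier is represented as a tuple of component names, and the helper names `access_any` and `access_all` are hypothetical. They mirror the 'OR' and 'AND' access operations mentioned in the description of FIGS. 17B and 17C.

```python
from typing import Iterable, List, Tuple

# Hypothetical model: an identifier is an ordered tuple of component names.
Identifier = Tuple[str, ...]


def access_any(library: Iterable[Identifier],
               targets: Iterable[str]) -> List[Identifier]:
    """'OR' access: keep identifiers containing at least one target component."""
    wanted = set(targets)
    return [ident for ident in library if wanted & set(ident)]


def access_all(library: Iterable[Identifier],
               targets: Iterable[str]) -> List[Identifier]:
    """'AND' access: keep identifiers containing every target component."""
    wanted = set(targets)
    return [ident for ident in library if wanted <= set(ident)]


if __name__ == "__main__":
    library = [("A1", "B1"), ("A1", "B2"), ("A2", "B1"), ("A2", "B2")]
    print(access_any(library, ["A1"]))        # identifiers containing A1
    print(access_all(library, ["A1", "B2"]))  # identifiers containing A1 and B2
```

In the laboratory the filtering would be carried out by primers or probes binding the corresponding component sequences; the code only captures the selection logic.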
- In some embodiments, a portion of identifiers is removed from the identifier library by selective nuclease cleavage. In some embodiments, the identifier library is combined with Cas9 and guide probes and wherein the guide probes guide the Cas9 to remove specified identifiers from the identifier library. In some embodiments, the individual identifiers are single-stranded and wherein the identifier library is combined with a single-strand specific endonuclease(s). In some embodiments, the identifier library is mixed with a complementary set of individual identifiers that protect target individual identifiers from degradation prior to the addition of the single-strand specific endonuclease(s). In some embodiments, the individual identifiers that are not cleaved by the selective nuclease cleavage are separated by size-selective chromatography. In some embodiments, the individual identifiers that are not cleaved by the selective nuclease cleavage are amplified and wherein the individual identifiers that are cleaved by the selective nuclease cleavage are not amplified. In some embodiments, the identifier library comprises a plurality of nucleic acid sequences and wherein the plurality of nucleic acid sequences are extracted with the targeted subset of the plurality of identifiers in the identifier library.
- In another aspect, the present disclosure provides a method for reading information encoded in nucleic acid sequence(s), comprising: (a) providing an identifier library comprising a plurality of identifiers, wherein an individual identifier of the plurality of identifiers comprises one or more components, wherein an individual component of the one or more components comprises a nucleic acid sequence; (b) identifying the plurality of identifiers in the identifier library; (c) generating a plurality of symbols from the plurality of identifiers identified in (b), wherein an individual symbol of the plurality of symbols corresponds to the individual identifier of the plurality of identifiers; and (d) compiling the information from the plurality of symbols.
- In some embodiments, each symbol in the string of symbols is one of two possible symbol values. In some embodiments, one symbol value at each position of the string of symbols may be represented by the absence of a distinct identifier in the identifier library. In some embodiments, the two possible symbol values are a bit-value of 0 and 1, wherein the individual symbol with the bit-value of 0 in the string of symbols may be represented by an absence of a distinct identifier in the identifier library, wherein the individual symbol with the bit-value of 1 in the string of symbols may be represented by a presence of the distinct identifier in the identifier library, and vice versa. In some embodiments, a presence of an individual identifier in the identifier library corresponds to a first symbol value in a binary string and an absence of the individual identifier in the identifier library corresponds to a second symbol value in a binary string. In some embodiments, the first symbol value is a bit value of 1 and the second symbol value is a bit value of 0. In some embodiments, the first symbol value is a bit value of 0 and the second symbol value is a bit value of 1.
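The presence/absence convention described above can be sketched as follows. This is a minimal illustration, assuming that each bit position is labeled by an integer standing in for a distinct identifier; the function names and the `present_value` parameter are hypothetical. Setting `present_value` to `"0"` gives the inverted convention in which absence represents a bit-value of 1.

```python
from typing import Set


def encode_presence(bits: str, present_value: str = "1") -> Set[int]:
    """Return the positions whose identifiers are present in the library."""
    if set(bits) - {"0", "1"}:
        raise ValueError("bits must contain only '0' and '1'")
    return {pos for pos, bit in enumerate(bits) if bit == present_value}


def decode_presence(present: Set[int], length: int,
                    present_value: str = "1") -> str:
    """Rebuild the bit string from the set of identifiers that were detected."""
    absent_value = "0" if present_value == "1" else "1"
    return "".join(present_value if pos in present else absent_value
                   for pos in range(length))


if __name__ == "__main__":
    library = encode_presence("01101")
    print(sorted(library))              # positions of identifiers to construct
    print(decode_presence(library, 5))  # recovers the original bit string
```

Note that the decoder needs the string length, because an absent identifier at the end of the string is otherwise indistinguishable from no data.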
- In some embodiments, identifying the plurality of identifiers comprises sequencing the plurality of identifiers in the identifier library. In some embodiments, sequencing comprises digital polymerase chain reaction (PCR), quantitative PCR, a microarray, sequencing by synthesis, or massively-parallel sequencing. In some embodiments, the identifier library comprises a plurality of nucleic acid sequences. In some embodiments, the plurality of nucleic acid sequences store metadata of the information and/or conceal the information. In some embodiments, one or more identifier libraries are combined and wherein each identifier library in the one or more identifier libraries comprises a distinct barcode. In some embodiments, the barcode stores metadata of the information.
- In another aspect, the present disclosure provides a method for nucleic acid-based computer data storage, comprising: (a) receiving computer data, (b) synthesizing nucleic acid molecules comprising nucleic acid sequences encoding the computer data, wherein the computer data is encoded in at least a subset of nucleic acid molecules synthesized and not in a sequence of each of the nucleic acid molecules, and (c) storing the nucleic acid molecules having the nucleic acid sequences.
- In some embodiments, the at least the subset of the nucleic acid molecules are grouped together. In some embodiments, the method further comprises sequencing the nucleic acid molecule(s) to determine the nucleic acid sequence(s), thereby retrieving the computer data. In some embodiments, (b) is performed in a time period that is less than about 1 day. In some embodiments, (b) is performed at an accuracy of at least about 90%.
- In another aspect, the present disclosure provides a method for nucleic acid-based computer data storage, comprising: (a) receiving computer data, (b) synthesizing a nucleic acid molecule comprising at least one nucleic acid sequence encoding the computer data, which synthesizing the nucleic acid molecule is in the absence of base-by-base nucleic acid synthesis, and (c) storing the nucleic acid molecule comprising the at least one nucleic acid sequence.
- In some embodiments, the method further comprises sequencing the nucleic acid molecule to determine the nucleic acid sequence, thereby retrieving the computer data. In some embodiments, (b) is performed in a time period that is less than about 1 day. In some embodiments, (b) is performed at an accuracy of at least about 90%.
- In another aspect, the present disclosure provides a system for encoding binary sequence data using nucleic acids, comprising: a device configured to construct an identifier library, wherein the identifier library comprises a plurality of identifiers, wherein an individual identifier of the plurality of identifiers comprises one or more components, and wherein an individual component of the one or more components is a nucleic acid sequence; and one or more computer processors operatively coupled to the device, wherein the one or more computer processors are individually or collectively programmed to (i) translate the information into a string of symbols, (ii) map the string of symbols to the plurality of identifiers, wherein the individual identifier of the plurality of identifiers corresponds to an individual symbol of the string of symbols, and (iii) construct an identifier library comprising the plurality of identifiers.
- In some embodiments, the device comprises a plurality of partitions and wherein the identifier library is generated in one or more of the plurality of partitions. In some embodiments, the plurality of partitions comprises wells. In some embodiments, constructing the individual identifier in the identifier library comprises assembling the one or more components from one or more layers and wherein each layer of the one or more layers comprises a distinct set of components. In some embodiments, each layer of the one or more layers is stored in a separate portion of the device and wherein the device is configured to combine the one or more components from the one or more layers. In some embodiments, the identifier library comprises a plurality of nucleic acid sequences. In some embodiments, one or more identifier libraries are combined in a single area of the device and wherein each identifier library of the one or more identifier libraries comprises a distinct barcode.
- In another aspect, the present disclosure provides a system for reading information encoded in nucleic acid sequence(s), comprising: a database that stores an identifier library comprising a plurality of identifiers, wherein an individual identifier of the plurality of identifiers comprises one or more components, wherein an individual component of the one or more components comprises a nucleic acid sequence; and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to (i) identify the plurality of identifiers in the identifier library, (ii) generate a plurality of symbols from the plurality of identifiers identified in (i), wherein an individual symbol of the plurality of symbols corresponds to the individual identifier of the plurality of identifiers, and (iii) compile the information from the plurality of symbols.
- In some embodiments, the system further comprises a plurality of partitions. In some embodiments, the partitions are wells. In some embodiments, a given partition of the plurality of partitions comprises one or more identifier libraries and wherein each identifier library of the one or more identifier libraries comprises a distinct barcode. In some embodiments, the system further comprises a detection unit configured to identify the plurality of identifiers in the identifier library.
- Additional aspects and advantages of the present disclosure will become readily apparent to those skilled in this art from the following detailed description, wherein only illustrative embodiments of the present disclosure are shown and described. As will be realized, the present disclosure is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
- All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference. To the extent publications and patents or patent applications incorporated by reference contradict the disclosure contained in the specification, the specification is intended to supersede and/or take precedence over any such contradictory material.
- The novel features of the invention are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings (also “Figure” and “FIG.” herein), of which:
-
FIG. 1 schematically illustrates an overview of a process for encoding, writing, accessing, reading, and decoding digital information stored in nucleic acid sequences; -
FIGS. 2A and 2B schematically illustrate an example method of encoding digital data, referred to as “data at address”, using objects or identifiers (e.g., nucleic acid molecules); FIG. 2A illustrates combining a rank object (or address object) with a byte-value object (or data object) to create an identifier; FIG. 2B illustrates an embodiment of the data at address method wherein the rank objects and byte-value objects are themselves combinatorial concatenations of other objects; -
FIGS. 3A and 3B schematically illustrate an example method of encoding digital information using objects or identifiers (e.g., nucleic acid sequences); FIG. 3A illustrates encoding digital information using a rank object as an identifier; FIG. 3B illustrates an embodiment of the encoding method wherein the address objects are themselves combinatorial concatenations of other objects; -
FIG. 4 shows a contour plot, in log space, of a relationship between the combinatorial space of possible identifiers (C, x-axis) and the average number of identifiers (k, y-axis) that may be constructed to store information of a given size (contour lines); -
FIG. 5 schematically illustrates an overview of a method for writing information to nucleic acid sequences (e.g., deoxyribonucleic acid); -
FIGS. 6A and 6B illustrate an example method, referred to as the “product scheme”, for constructing identifiers (e.g., nucleic acid molecules) by combinatorially assembling distinct components (e.g., nucleic acid sequences); FIG. 6A illustrates the architecture of identifiers constructed using the product scheme; FIG. 6B illustrates an example of the combinatorial space of identifiers that may be constructed using the product scheme; -
FIG. 7 schematically illustrates the use of overlap extension polymerase chain reaction to construct identifiers (e.g., nucleic acid molecules) from components (e.g., nucleic acid sequences); -
FIG. 8 schematically illustrates the use of sticky end ligation to construct identifiers (e.g., nucleic acid molecules) from components (e.g., nucleic acid sequences); -
FIG. 9 schematically illustrates the use of recombinase assembly to construct identifiers (e.g., nucleic acid molecules) from components (e.g., nucleic acid sequences); -
FIGS. 10A and 10B demonstrate template directed ligation; FIG. 10A schematically illustrates the use of template directed ligation to construct identifiers (e.g., nucleic acid molecules) from components (e.g., nucleic acid sequences); FIG. 10B shows a histogram of the copy numbers (abundances) of 256 distinct nucleic acid sequences that were each combinatorially assembled from six nucleic acid sequences (e.g., components) in one pooled template directed ligation reaction; -
FIGS. 11A-11G schematically illustrate an example method, referred to as the “permutation scheme”, for constructing identifiers (e.g., nucleic acid molecules) with permuted components (e.g., nucleic acid sequences); FIG. 11A illustrates the architecture of identifiers constructed using the permutation scheme; FIG. 11B illustrates an example of the combinatorial space of identifiers that may be constructed using the permutation scheme; FIG. 11C shows an example implementation of the permutation scheme with template directed ligation; FIG. 11D shows an example of how the implementation from FIG. 11C may be modified to construct identifiers with permuted and repeated components; FIG. 11E shows how the example implementation from FIG. 11D may lead to unwanted byproducts that may be removed with nucleic acid size selection; FIG. 11F shows another example of how to use template directed ligation and size selection to construct identifiers with permuted and repeated components; FIG. 11G shows an example of when size selection may fail to isolate a particular identifier from unwanted byproducts; -
FIGS. 12A-12D schematically illustrate an example method, referred to as the “MchooseK” scheme, for constructing identifiers (e.g., nucleic acid molecules) with any number, K, of assembled components (e.g., nucleic acid sequences) out of a larger number, M, of possible components; FIG. 12A illustrates the architecture of identifiers constructed using the MchooseK scheme; FIG. 12B illustrates an example of the combinatorial space of identifiers that may be constructed using the MchooseK scheme; FIG. 12C shows an example implementation of the MchooseK scheme using template directed ligation; FIG. 12D shows how the example implementation from FIG. 12C may lead to unwanted byproducts that may be removed with nucleic acid size selection; -
FIGS. 13A and 13B schematically illustrate an example method, referred to as the “partition scheme”, for constructing identifiers with partitioned components; FIG. 13A shows an example of the combinatorial space of identifiers that may be constructed using the partition scheme; FIG. 13B shows an example implementation of the partition scheme using template directed ligation; -
FIGS. 14A and 14B schematically illustrate an example method, referred to as the “unconstrained string” (or USS) scheme, for constructing identifiers made up of any string of components from a number of possible components; FIG. 14A shows an example of the combinatorial space of identifiers that may be constructed using the USS scheme; FIG. 14B shows an example implementation of the USS scheme using template directed ligation; -
FIGS. 15A and 15B schematically illustrate an example method, referred to as “component deletion”, for constructing identifiers by removing components from a parent identifier; FIG. 15A shows an example of the combinatorial space of identifiers that may be constructed using the component deletion scheme; FIG. 15B shows an example implementation of the component deletion scheme using double stranded targeted cleavage and repair; -
FIG. 16 schematically illustrates a parent identifier with recombinase recognition sites where further identifiers may be constructed by applying recombinases to the parent identifier; -
FIGS. 17A-17C schematically illustrate an overview of example methods for accessing portions of information stored in nucleic acid sequences by accessing a number of particular identifiers from a larger number of identifiers; FIG. 17A shows example methods for using polymerase chain reaction, affinity tagged probes, and degradation targeting probes to access identifiers containing a specified component; FIG. 17B shows example methods for using polymerase chain reaction to perform ‘OR’ or ‘AND’ operations to access identifiers containing multiple specified components; FIG. 17C shows example methods for using affinity tags to perform ‘OR’ or ‘AND’ operations to access identifiers containing multiple specified components; -
FIGS. 18A and 18B show examples of encoding, writing, and reading data encoded in nucleic acid molecules; FIG. 18A shows an example of encoding, writing, and reading 5,856 bits of data; FIG. 18B shows an example of encoding, writing, and reading 62,824 bits of data; and -
FIG. 19 shows a computer system that is programmed or otherwise configured to implement methods provided herein.
- While various embodiments of the invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions may occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed.
- The term “symbol,” as used herein, generally refers to a representation of a unit of digital information. Digital information may be divided or translated into a string of symbols. In an example, a symbol may be a bit and the bit may have a value of ‘0’ or ‘1’.
- The term “distinct,” or “unique,” as used herein, generally refers to an object that is distinguishable from other objects in a group. For example, a distinct, or unique, nucleic acid sequence may be a nucleic acid sequence that does not have the same sequence as any other nucleic acid sequence. A distinct, or unique, nucleic acid molecule may not have the same sequence as any other nucleic acid molecule. The distinct, or unique, nucleic acid sequence or molecule may share regions of similarity with another nucleic acid sequence or molecule.
- The term “component,” as used herein, generally refers to a nucleic acid sequence. A component may be a distinct nucleic acid sequence. A component may be concatenated or assembled with one or more other components to generate other nucleic acid sequences or molecules.
- The term “layer,” as used herein, generally refers to a group or pool of components. Each layer may comprise a set of distinct components such that the components in one layer are different from the components in another layer. Components from one or more layers may be assembled to generate one or more identifiers.
- The term “identifier,” as used herein, generally refers to a nucleic acid molecule or a nucleic acid sequence that represents the position and value of a bit-string within a larger bit-string. More generally, an identifier may refer to any object that represents or corresponds to a symbol in a string of symbols. In some embodiments, identifiers may comprise one or multiple concatenated components.
- The term “combinatorial space,” as used herein, generally refers to the set of all possible distinct identifiers that may be generated from a starting set of objects, such as components, and a permissible set of rules for how to modify those objects to form identifiers. The size of a combinatorial space of identifiers made by assembling or concatenating components may depend on the number of layers of components, the number of components in each layer, and the particular assembly method used to generate the identifiers.
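As an illustration of how the size of a combinatorial space depends on the layers, the following sketch counts and enumerates identifiers for one simple assembly rule. It assumes, as this sketch's reading of the “product scheme”, that each identifier takes exactly one component from every layer in a fixed layer order, so the size is the product of the layer sizes. The component names and function names are hypothetical.

```python
from itertools import product
from math import prod
from typing import List, Sequence, Tuple


def product_scheme_size(layer_sizes: Sequence[int]) -> int:
    """Number of identifiers when one component is chosen from each layer."""
    return prod(layer_sizes)


def enumerate_product_scheme(
        layers: Sequence[Sequence[str]]) -> List[Tuple[str, ...]]:
    """List every identifier as a tuple of components, one per layer."""
    return list(product(*layers))


if __name__ == "__main__":
    layers = [["A1", "A2"], ["B1", "B2", "B3"]]
    print(product_scheme_size([len(layer) for layer in layers]))
    for identifier in enumerate_product_scheme(layers):
        print("-".join(identifier))
```

Other assembly rules, such as those that permute or omit components, would give different counts for the same starting components.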
- The term “identifier rank,” as used herein generally refers to a relation that defines the order of identifiers in a set.
- The term “identifier library,” as used herein generally refers to a collection of identifiers corresponding to the symbols in a symbol string representing digital information. In some embodiments, the absence of a given identifier in the identifier library may indicate a symbol value at a particular position. One or more identifier libraries may be combined in a pool, group, or set of identifiers. Each identifier library may include a unique barcode that identifies the identifier library.
- The term “nucleic acid,” as used herein, generally refers to deoxyribonucleic acid (DNA), ribonucleic acid (RNA), or a variant thereof. A nucleic acid may include one or more subunits selected from adenosine (A), cytosine (C), guanine (G), thymine (T), and uracil (U), or variants thereof. A nucleotide can include A, C, G, T, or U, or variants thereof. A nucleotide can include any subunit that can be incorporated into a growing nucleic acid strand. Such subunit can be A, C, G, T, or U, or any other subunit that may be specific to one or more complementary A, C, G, T, or U, or complementary to a purine (i.e., A or G, or variant thereof) or pyrimidine (i.e., C, T, or U, or variant thereof). In some examples, a nucleic acid may be single-stranded or double-stranded. In some cases, a nucleic acid is circular.
- The terms “nucleic acid molecule” or “nucleic acid sequence,” as used herein, generally refer to a polymeric form of nucleotides, or polynucleotide, that may have various lengths, either deoxyribonucleotides (DNA) or ribonucleotides (RNA), or analogs thereof. The term “nucleic acid sequence” may refer to the alphabetical representation of a polynucleotide; alternatively, the term may be applied to the physical polynucleotide itself. This alphabetical representation can be input into databases in a computer having a central processing unit and used for mapping nucleic acid sequences or nucleic acid molecules to symbols, or bits, encoding digital information. Nucleic acid sequences or oligonucleotides may include one or more non-standard nucleotide(s), nucleotide analog(s) and/or modified nucleotides.
- An “oligonucleotide”, as used herein, generally refers to a single-stranded nucleic acid sequence, and is typically composed of a specific sequence of four nucleotide bases: adenine (A), cytosine (C), guanine (G), and thymine (T), or uracil (U) when the polynucleotide is RNA.
- Examples of modified nucleotides include, but are not limited to diaminopurine, 5-fluorouracil, 5-bromouracil, 5-chlorouracil, 5-iodouracil, hypoxanthine, xanthine, 4-acetylcytosine, 5-(carboxyhydroxylmethyl)uracil, 5-carboxymethylaminomethyl-2-thiouridine, 5-carboxymethylaminomethyluracil, dihydrouracil, beta-D-galactosylqueosine, inosine, N6-isopentenyladenine, 1-methylguanine, 1-methylinosine, 2,2-dimethylguanine, 2-methyladenine, 2-methylguanine, 3-methylcytosine, 5-methylcytosine, N6-adenine, 7-methylguanine, 5-methylaminomethyluracil, 5-methoxyaminomethyl-2-thiouracil, beta-D-mannosylqueosine, 5′-methoxycarboxymethyluracil, 5-methoxyuracil, 2-methylthio-N6-isopentenyladenine, uracil-5-oxyacetic acid (v), wybutoxosine, pseudouracil, queosine, 2-thiocytosine, 5-methyl-2-thiouracil, 2-thiouracil, 4-thiouracil, 5-methyluracil, uracil-5-oxyacetic acid methylester, uracil-5-oxyacetic acid (v), 5-methyl-2-thiouracil, 3-(3-amino-3-N-2-carboxypropyl)uracil, (acp3)w, 2,6-diaminopurine and the like. Nucleic acid molecules may also be modified at the base moiety (e.g., at one or more atoms that typically are available to form a hydrogen bond with a complementary nucleotide and/or at one or more atoms that are not typically capable of forming a hydrogen bond with a complementary nucleotide), sugar moiety or phosphate backbone. Nucleic acid molecules may also contain amine-modified groups, such as aminoallyl-dUTP (aa-dUTP) and aminohexylacrylamide-dCTP (aha-dCTP) to allow covalent attachment of amine reactive moieties, such as N-hydroxy succinimide esters (NHS).
- The term “primer,” as used herein, generally refers to a strand of nucleic acid that serves as a starting point for nucleic acid synthesis, such as polymerase chain reaction (PCR). In an example, during replication of a DNA sample, an enzyme that catalyzes replication starts replication at the 3′-end of a primer attached to the DNA sample and copies the opposite strand.
- The term “polymerase” or “polymerase enzyme,” as used herein, generally refers to any enzyme capable of catalyzing a polymerase reaction. Examples of polymerases include, without limitation, a nucleic acid polymerase. The polymerase can be naturally occurring or synthesized. An example polymerase is a Φ29 polymerase or derivative thereof. In some cases, a transcriptase or a ligase is used (i.e., enzymes which catalyze the formation of a bond) in conjunction with polymerases or as an alternative to polymerases to construct new nucleic acid sequences. Examples of polymerases include a DNA polymerase, an RNA polymerase, a thermostable polymerase, a wild-type polymerase, a modified polymerase, E. coli DNA polymerase I, T7 DNA polymerase, bacteriophage T4 DNA polymerase, Φ29 (phi29) DNA polymerase, Taq polymerase, Tth polymerase, Tli polymerase, Pfu polymerase, Pwo polymerase, VENT polymerase, DEEPVENT polymerase, Ex-Taq polymerase, LA-Taq polymerase, Sso polymerase, Poc polymerase, Pab polymerase, Mth polymerase, ES4 polymerase, Tru polymerase, Tac polymerase, Tne polymerase, Tma polymerase, Tca polymerase, Tih polymerase, Tfi polymerase, Platinum Taq polymerases, Tbr polymerase, Tfl polymerase, Pfuturbo polymerase, Pyrobest polymerase, KOD polymerase, Bst polymerase, Sac polymerase, Klenow fragment polymerase with 3′ to 5′ exonuclease activity, and variants, modified products and derivatives thereof.
- Digital information, such as computer data, in the form of binary code can comprise a sequence or string of symbols. A binary code may encode or represent text or computer processor instructions using, for example, a binary number system having two binary symbols, typically 0 and 1, referred to as bits. Digital information may be represented in the form of non-binary code which can comprise a sequence of non-binary symbols. Each encoded symbol can be re-assigned to a unique bit string (or “byte”), and the unique bit string or byte can be arranged into strings of bytes or byte streams. A bit value for a given bit can be one of two symbols (e.g., 0 or 1). A byte, which can comprise a string of N bits, can have a total of 2^N unique byte-values. For example, a byte comprising 8 bits can produce a total of 2^8 or 256 possible unique byte-values, and each of the 256 bytes can correspond to one of 256 possible distinct symbols, letters, or instructions which can be encoded with the bytes. Raw data (e.g., text files and computer instructions) can be represented as strings of bytes or byte streams. Zip files, or compressed data files comprising raw data, can also be stored in byte streams; these files can be stored as byte streams in a compressed form and then decompressed into raw data before being read by the computer.
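The relationship between bytes, bits, and the number of unique byte-values can be checked with a short example. This is a minimal sketch; the function names are hypothetical and the 8-bit byte is the conventional case described above.

```python
def unique_values(n_bits: int) -> int:
    """Number of distinct values that a string of n_bits bits can take."""
    return 2 ** n_bits


def to_bit_string(data: bytes) -> str:
    """Translate a byte stream into a string of '0' and '1' symbols."""
    return "".join(f"{byte:08b}" for byte in data)


def from_bit_string(bits: str) -> bytes:
    """Reassemble a byte stream from a bit string of 8-bit groups."""
    if len(bits) % 8 != 0:
        raise ValueError("bit string length must be a multiple of 8")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


if __name__ == "__main__":
    print(unique_values(8))                      # 256 byte-values for 8 bits
    bits = to_bit_string("DNA".encode("ascii"))  # text -> bytes -> bits
    print(bits)
    print(from_bit_string(bits).decode("ascii"))
```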
- Methods and systems of the present disclosure may be used to encode computer data or information in a plurality of identifiers, each of which may represent one or more bits of the original information. In some examples, methods and systems of the present disclosure encode data or information using identifiers that each represents two bits of the original information.
- Previous methods for encoding digital information into nucleic acids have relied on base-by-base synthesis of the nucleic acids, which can be costly and time consuming. Alternative methods may improve the efficiency and commercial viability of digital information storage by reducing the reliance on base-by-base nucleic acid synthesis for encoding digital information, and may eliminate the de novo synthesis of distinct nucleic acid sequences for every new information storage request.
- New methods can encode digital information (e.g., binary code) in a plurality of identifiers, or nucleic acid sequences, comprising combinatorial arrangements of components instead of relying on base-by-base or de-novo nucleic acid synthesis (e.g., phosphoramidite synthesis). As such, new strategies may produce a first set of distinct nucleic acid sequences (or components) for the first request of information storage, and can thereafter re-use the same nucleic acid sequences (or components) for subsequent information storage requests. These approaches can significantly reduce the cost of DNA-based information storage by reducing the role of de-novo synthesis of nucleic acid sequences in the information-to-DNA encoding and writing process. Moreover, unlike implementations of base-by-base synthesis, such as phosphoramidite chemistry- or template-free polymerase-based nucleic acid elongation, which may use cyclical delivery of each base to each elongating nucleic acid, new methods of information-to-DNA writing using identifier construction from components are highly parallelizable processes that do not necessarily use cyclical nucleic acid elongation. Thus, new methods may increase the speed of writing digital information to DNA compared to older methods.
- In an aspect, the present disclosure provides methods for encoding information into nucleic acid sequences. A method for encoding information into nucleic acid sequences may comprise (a) translating the information into a string of symbols, (b) mapping the string of symbols to a plurality of identifiers, and (c) constructing an identifier library comprising at least a subset of the plurality of identifiers. An individual identifier of the plurality of identifiers may comprise one or more components. An individual component of the one or more components may comprise a nucleic acid sequence. Each symbol at each position in the string of symbols may correspond to a distinct identifier. The individual identifier may correspond to an individual symbol at an individual position in the string of symbols. Moreover, one symbol at each position in the string of symbols may correspond to the absence of an identifier. For example, in a string of binary symbols (e.g., bits) of ‘0’s and ‘1’s, each occurrence of ‘0’ may correspond to the absence of an identifier.
- In another aspect, the present disclosure provides methods for nucleic acid-based computer data storage. A method for nucleic acid-based computer data storage may comprise (a) receiving computer data, (b) synthesizing nucleic acid molecules comprising nucleic acid sequences encoding the computer data, and (c) storing the nucleic acid molecules having the nucleic acid sequences. The computer data may be encoded in at least a subset of nucleic acid molecules synthesized and not in a sequence of each of the nucleic acid molecules.
- In another aspect, the present disclosure provides methods for writing and storing information in nucleic acid sequences. The method may comprise, (a) receiving or encoding a virtual identifier library that represents information, (b) physically constructing the identifier library, and (c) storing one or more physical copies of the identifier library in one or more separate locations. An individual identifier of the identifier library may comprise one or more components. An individual component of the one or more components may comprise a nucleic acid sequence.
- In another aspect, the present disclosure provides methods for nucleic acid-based computer data storage. A method for nucleic acid-based computer data storage may comprise (a) receiving computer data, (b) synthesizing a nucleic acid molecule comprising at least one nucleic acid sequence encoding the computer data, and (c) storing the nucleic acid molecule comprising the at least one nucleic acid sequence. Synthesizing the nucleic acid molecule may be in the absence of base-by-base nucleic acid synthesis.
- In another aspect, the present disclosure provides methods for writing and storing information in nucleic acid sequences. A method for writing and storing information in nucleic acid sequences may comprise, (a) receiving or encoding a virtual identifier library that represents information, (b) physically constructing the identifier library, and (c) storing one or more physical copies of the identifier library in one or more separate locations. An individual identifier of the identifier library may comprise one or more components. An individual component of the one or more components may comprise a nucleic acid sequence.
-
FIG. 1 illustrates an overview process for encoding information into nucleic acid sequences, writing information to the nucleic acid sequences, reading information written to nucleic acid sequences, and decoding the read information. Digital information, or data, may be translated into one or more strings of symbols. In an example, the symbols are bits and each bit may have a value of either ‘0’ or ‘1’. Each symbol may be mapped, or encoded, to an object (e.g., identifier) representing that symbol. Each symbol may be represented by a distinct identifier. The distinct identifier may be a nucleic acid molecule made up of components. The components may be nucleic acid sequences. The digital information may be written into nucleic acid sequences by generating an identifier library corresponding to the information. The identifier library may be physically generated by physically constructing the identifiers that correspond to each symbol of the digital information. All or any portion of the digital information may be accessed at a time. In an example, a subset of identifiers is accessed from an identifier library. The subset of identifiers may be read by sequencing and identifying the identifiers. The identified identifiers may be associated with their corresponding symbol to decode the digital data. - A method for encoding and reading information using the approach of
FIG. 1 can, for example, include receiving a bit stream and mapping each one-bit (bit with bit-value of ‘1’) in the bit stream to a distinct nucleic acid identifier using an identifier rank or a nucleic acid index. The method can further include constructing a nucleic acid sample pool, or identifier library, comprising copies of the identifiers that correspond to bit values of 1 (and excluding identifiers for bit values of 0). Reading the sample can comprise using molecular biology methods (e.g., sequencing, hybridization, PCR, etc.), determining which identifiers are represented in the identifier library, and assigning bit-values of ‘1’ to the bits corresponding to those identifiers and bit-values of ‘0’ elsewhere (again referring to the identifier rank to identify the bits in the original bit-stream that each identifier corresponds to), thus decoding the information into the original encoded bit stream. - Encoding a string of N distinct bits can use an equivalent number of unique nucleic acid sequences as possible identifiers. This approach to information encoding may use de-novo synthesis of identifiers (e.g., nucleic acid molecules) for each new item of information (string of N bits) to store. In other instances, the cost of newly synthesizing identifiers (equivalent in number to or less than N) for each new item of information to store can be reduced by the one-time de-novo synthesis and subsequent maintenance of all possible identifiers, such that encoding new items of information may involve mechanically selecting and mixing together pre-synthesized (or pre-fabricated) identifiers to form an identifier library. 
In other instances, both the cost of (1) de-novo synthesis of up to N identifiers for each new item of information to store or (2) maintaining and selecting from N possible identifiers for each new item of information to store, or any combination thereof, may be reduced by synthesizing and maintaining a number (less than N, and in some cases much less than N) of nucleic acid sequences and then modifying these sequences through enzymatic reactions to generate up to N identifiers for each new item of information to store.
- The identifiers may be rationally designed and selected for ease of read, write, access, copy, and deletion operations. The identifiers may be designed and selected to minimize write errors, mutations, degradation, and read errors.
-
FIGS. 2A and 2B schematically illustrate an example method, referred to as “data at address”, of encoding digital data in objects or identifiers (e.g., nucleic acid molecules). FIG. 2A illustrates encoding a bit stream into an identifier library wherein the individual identifiers are constructed by concatenating or assembling a single component that specifies an identifier rank with a single component that specifies a byte-value. In general, the data at address method uses identifiers that encode information modularly by comprising two objects: one object, the “byte-value object” (or “data object”), that identifies a byte-value and one object, the “rank object” (or “address object”), that identifies the identifier rank (or the relative position of the byte in the original bit-stream). FIG. 2B illustrates an example of the data at address method wherein each rank object may be combinatorially constructed from a set of components and each byte-value object may be combinatorially constructed from a set of components. Such combinatorial construction of rank and byte-value objects enables more information to be written into identifiers than if the objects were made from the single components alone (e.g., FIG. 2A). -
FIGS. 3A and 3B schematically illustrate another example method of encoding digital information in objects or identifiers (e.g., nucleic acid sequences). FIG. 3A illustrates encoding a bit stream into an identifier library wherein identifiers are constructed from single components that specify identifier rank. The presence of an identifier at a particular rank (or address) specifies a bit-value of ‘1’ and the absence of an identifier at a particular rank (or address) specifies a bit-value of ‘0’. This type of encoding may use identifiers that solely encode rank (the relative position of a bit in the original bit stream) and use the presence or absence of those identifiers in an identifier library to encode a bit-value of ‘1’ or ‘0’, respectively. Reading and decoding the information may include identifying the identifiers present in the identifier library, assigning bit-values of ‘1’ to their corresponding ranks and assigning bit-values of ‘0’ elsewhere. FIG. 3B illustrates an example encoding method where each identifier may be combinatorially constructed from a set of components such that each possible combinatorial construction specifies a rank. Such combinatorial construction enables more information to be written into identifiers than if the identifiers were made from the single components alone (e.g., FIG. 3A). For example, a component set may comprise five distinct components. The five distinct components may be assembled to generate ten distinct identifiers, each comprising two of the five components. The ten distinct identifiers may each have a rank (or address) that corresponds to the position of a bit in a bit stream. An identifier library may include the subset of those ten possible identifiers that corresponds to the positions of bit-value ‘1’, and exclude the subset of those ten possible identifiers that corresponds to the positions of the bit-value ‘0’ within a bit stream of length ten. -
FIG. 4 shows a contour plot, in log space, of a relationship between the combinatorial space of possible identifiers (C, x-axis) and the average number of identifiers (k, y-axis) to be physically constructed in order to store information of a given original size in bits (D, contour lines) using the encoding method shown in FIGS. 3A and 3B. This plot assumes that the original information of size D is re-coded into a string of C bits (where C may be greater than D) where a number of bits, k, has a bit-value of ‘1’. Moreover, the plot assumes that information-to-nucleic-acid encoding is performed on the re-coded bit string and that identifiers for positions where the bit-value is ‘1’ are constructed and identifiers for positions where the bit-value is ‘0’ are not constructed. Following the assumptions, the combinatorial space of possible identifiers has size C to identify every position in the re-coded bit string, and the number of identifiers used to encode the bit string of size D is such that $D = \log_2 \binom{C}{k}$, where $\binom{C}{k}$ (“C choose k”) is the number of ways to pick k unordered outcomes from C possibilities. Thus, as the combinatorial space of possible identifiers increases beyond the size (in bits) of a given item of information, a decreasing number of physically constructed identifiers may be used to store the given information. -
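
The relationship between D, C, and k described for FIG. 4 can be explored numerically. The following is a minimal sketch, not part of the disclosure: the function names `capacity_bits` and `min_identifiers` are hypothetical, and the values of D and C are chosen only for illustration.

```python
import math

def capacity_bits(C: int, k: int) -> float:
    """Bits representable by choosing k present identifiers out of C: log2(C choose k)."""
    return math.log2(math.comb(C, k))

def min_identifiers(D: float, C: int) -> int:
    """Smallest k (searched up to C // 2) such that log2(C choose k) >= D."""
    for k in range(C // 2 + 1):
        if capacity_bits(C, k) >= D:
            return k
    raise ValueError("combinatorial space C is too small to hold D bits")

# Illustrative values only: store D = 100 bits in spaces of growing size C.
D = 100
for C in (128, 256, 1024, 4096):
    print(f"C={C}: at least k={min_identifiers(D, C)} identifiers")
```

Running the loop shows the trend stated above: as C grows relative to D, the number of identifiers k that must be constructed does not increase and generally falls.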
FIG. 5 shows an overview method for writing information into nucleic acid sequences. Prior to writing the information, the information may be translated into a string of symbols and encoded into a plurality of identifiers. Writing the information may include setting up reactions to produce possible identifiers. A reaction may be set up by depositing inputs into a compartment. The inputs may comprise nucleic acids, components, templates, enzymes, or chemical reagents. The compartment may be a well, a tube, a position on a surface, a chamber in a microfluidic device, or a droplet within an emulsion. Multiple reactions may be set up in multiple compartments. Reactions may proceed to produce identifiers through programmed temperature incubation or cycling. Reactions may be selectively or ubiquitously removed (e.g., deleted). Reactions may also be selectively or ubiquitously interrupted, consolidated, and purified to collect their identifiers in one pool. Identifiers from multiple identifier libraries may be collected in the same pool. An individual identifier may include a barcode or a tag to identify to which identifier library it belongs. Alternatively, or in addition to, the barcode may include metadata for the encoded information. Supplemental nucleic acids or identifiers may also be included in an identifier pool together with an identifier library. The supplemental nucleic acids or identifiers may include metadata for the encoded information or serve to obfuscate or conceal the encoded information. - An identifier rank (e.g., nucleic acid index) can comprise a method or key for determining the ordering of identifiers. The method can comprise a look-up table with all identifiers and their corresponding rank. The method can also comprise a look up table with the rank of all components that constitute identifiers and a function for determining the ordering of any identifier comprising a combination of those components. 
Such a method may be referred to as lexicographical ordering and may be analogous to the manner in which words in a dictionary are alphabetically ordered. In the data at address encoding method, the identifier rank (encoded by the rank object of the identifier) may be used to determine the position of a byte (encoded by the byte-value object of the identifier) within a bit stream. In an alternative method, the identifier rank (encoded by the entire identifier itself) for a present identifier may be used to determine the position of bit-value of ‘1’ within a bit stream.
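
As an illustration of the look-up-table-plus-function form of an identifier rank, the sketch below computes a lexicographic rank from per-layer component ranks. It assumes one component per layer in a fixed layer order; the component names and the functions `identifier_rank` and `identifier_at` are hypothetical and introduced only for this example.

```python
from itertools import product

# Hypothetical look-up table: components of each layer, listed in rank order.
LAYERS = [["X1", "X2", "X3"], ["Y1", "Y2", "Y3"], ["Z1", "Z2", "Z3"]]

def identifier_rank(identifier, layers=LAYERS):
    """Lexicographic rank of an identifier given as one component per layer."""
    rank = 0
    for component, layer in zip(identifier, layers):
        rank = rank * len(layer) + layer.index(component)
    return rank

def identifier_at(rank, layers=LAYERS):
    """Inverse of identifier_rank: the identifier found at a given rank."""
    components = []
    for layer in reversed(layers):
        rank, index = divmod(rank, len(layer))
        components.append(layer[index])
    return tuple(reversed(components))

print(identifier_rank(("X1", "Y2", "Z3")))  # rank of one identifier
print(identifier_at(26))                    # last identifier in the ordering
```

The ordering matches dictionary ordering: the first layer acts like the first letter of a word, so all identifiers beginning with X1 precede those beginning with X2.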
- A key may assign distinct bytes to unique subsets of identifiers (e.g., nucleic acid molecules) within a sample. For example, in a simple form, a key may assign each bit in a byte to a unique nucleic acid sequence that specifies the position of the bit, and then the presence or absence of that nucleic acid sequence within a sample may specify the bit-value of 1 or 0, respectively. Reading the encoded information from the nucleic acid sample can comprise any number of molecular biology techniques including sequencing, hybridization, or PCR. In some embodiments, reading the encoded dataset may comprise reconstructing a portion of the dataset or reconstructing the entire encoded dataset from each nucleic acid sample. Once the sample has been read, the nucleic acid index can be used along with the presence or absence of each unique nucleic acid sequence to decode the nucleic acid sample into a bit stream (e.g., each string of bits, byte, bytes, or string of bytes).
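
The presence/absence key described above can be sketched in a few lines. This is an illustrative model only: identifiers are represented abstractly by their integer rank rather than by nucleic acid sequences, and the function names `encode` and `decode` are hypothetical.

```python
def encode(bits: str) -> set:
    """Return the ranks of identifiers to construct: one per '1' bit."""
    return {rank for rank, bit in enumerate(bits) if bit == "1"}

def decode(library: set, length: int) -> str:
    """Assign '1' to ranks whose identifier is present and '0' elsewhere."""
    return "".join("1" if rank in library else "0" for rank in range(length))

bit_stream = "0110100101"
library = encode(bit_stream)          # identifiers physically constructed
recovered = decode(library, len(bit_stream))
print(sorted(library), recovered)
```

Note that the decoder needs the length of the original bit stream (or an equivalent convention) because trailing ‘0’ bits leave no identifier in the library.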
- Identifiers may be constructed by combinatorially assembling component nucleic acid sequences. For example, information may be encoded by taking a set of nucleic acid molecules (e.g., identifiers) from a defined group of molecules (e.g., combinatorial space). Each possible identifier of the defined group of molecules may be an assembly of nucleic acid sequences (e.g., components) from a prefabricated set of components that may be divided into layers. Each individual identifier may be constructed by concatenating one component from every layer in a fixed order. For example, if there are M layers and each layer may have n components, then up to $C = n^M$ unique identifiers may be constructed and up to $2^C$ different items of information, or C bits, may be encoded and stored. For example, storage of a megabit of information may use $1 \times 10^6$ distinct identifiers or a combinatorial space of size $C = 1 \times 10^6$. The identifiers in this example may be assembled from a variety of components organized in different ways. Assemblies may be made from M=2 prefabricated layers, each containing $n = 1 \times 10^3$ components. Alternatively, assemblies may be made from M=3 layers, each containing $n = 1 \times 10^2$ components. As this example illustrates, encoding the same amount of information using a larger number of layers may allow for the total number of components to be smaller. Using a smaller number of total components may be advantageous in terms of writing cost.
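
The trade-off between the number of layers and the total number of components in the megabit example can be checked with simple arithmetic. The sketch below is illustrative; the function name `product_scheme_size` is hypothetical.

```python
def product_scheme_size(M: int, n: int):
    """Return (combinatorial space C = n**M, total components M * n)."""
    return n ** M, M * n

# The two layouts from the megabit example above.
for M, n in [(2, 1000), (3, 100)]:
    C, total = product_scheme_size(M, n)
    print(f"M={M}, n={n}: C={C} identifiers from {total} components")
```

Both layouts reach a combinatorial space of one million identifiers, but the three-layer layout maintains 300 components instead of 2,000.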
- In an example, one can start with two sets of unique nucleic acid sequences or layers, X and Y, each with x and y components (e.g., nucleic acid sequences), respectively. Each nucleic acid sequence from X can be assembled to each nucleic acid sequence from Y. Though the total number of nucleic acid sequences maintained in the two sets may be the sum of x and y, the total number of nucleic acid molecules, and hence possible identifiers, that can be generated may be the product of x and y. Even more nucleic acid sequences (e.g., identifiers) can be generated if the sequences from X can be assembled to the sequences of Y in any order. For example, the number of nucleic acid sequences (e.g., identifiers) generated may be twice the product of x and y if the assembly order is programmable. This set of all possible nucleic acid sequences that can be generated may be referred to as XY. The order of the assembled units of unique nucleic acid sequences in XY can be controlled using nucleic acids with distinct 5′ and 3′ ends, and restriction digestion, ligation, polymerase chain reaction (PCR), and sequencing may occur with respect to the distinct 5′ and 3′ ends of the sequences. Such an approach can reduce the total number of nucleic acid sequences (e.g., components) used to encode N distinct bits, by encoding information in the combinations and orders of their assembly products. For example, to encode 100 bits of information, two layers of 10 distinct nucleic acid molecules (e.g., components) each may be assembled in a fixed order to produce 10*10 or 100 distinct nucleic acid molecules (e.g., identifiers), or one layer of 5 distinct nucleic acid molecules (e.g., components) and another layer of 10 distinct nucleic acid molecules (e.g., components) may be assembled in any order to produce 100 distinct nucleic acid molecules (e.g., identifiers).
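
The counts in the two-layer example can be reproduced by enumeration. The sketch below is illustrative only: component names such as `X1` and `Y1` are hypothetical labels, and identifiers are modeled as ordered pairs of labels rather than as sequences.

```python
from itertools import product

def fixed_order(X, Y):
    """Identifiers with a component from X followed by a component from Y."""
    return [(x, y) for x, y in product(X, Y)]

def any_order(X, Y):
    """Identifiers where the X and Y components may appear in either order."""
    return fixed_order(X, Y) + [(y, x) for x, y in product(X, Y)]

X = [f"X{i}" for i in range(1, 6)]    # layer of 5 components
Y = [f"Y{i}" for i in range(1, 11)]   # layer of 10 components

print(len(fixed_order(X, Y)))  # x * y
print(len(any_order(X, Y)))    # 2 * x * y
```

With 15 maintained components, programmable order yields the same 100 identifiers that fixed order obtains from two layers of 10 components (20 components in total).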
- Nucleic acid sequences (e.g., components) within each layer may comprise a unique (or distinct) sequence, or barcode, in the middle, a common hybridization region on one end, and another common hybridization region on the other end. The barcode may contain a sufficient number of nucleotides to uniquely identify every sequence within the layer. For example, there are typically four possible nucleotides for each base position within a barcode. Therefore, a three base barcode may uniquely identify $4^3 = 64$ nucleic acid sequences. The barcodes may be designed to be randomly generated. Alternatively, the barcodes may be designed to avoid sequences that may create complications to the construction chemistry of identifiers or sequencing. Additionally, barcodes may be designed so that each may have a minimum Hamming distance from the other barcodes, thereby decreasing the likelihood that base-resolution mutations or read errors may interfere with the proper identification of the barcode.
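
One simple way to obtain barcodes with a minimum pairwise Hamming distance is a greedy filter over all candidate sequences. The sketch below is an assumption for illustration, not a method stated in the disclosure; it does not screen for sequences that complicate construction chemistry or sequencing, and the function names are hypothetical.

```python
from itertools import product

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length barcodes differ."""
    return sum(x != y for x, y in zip(a, b))

def pick_barcodes(length: int, min_distance: int, alphabet: str = "ACGT"):
    """Greedily keep candidates at least min_distance from every kept barcode."""
    chosen = []
    for candidate in map("".join, product(alphabet, repeat=length)):
        if all(hamming(candidate, kept) >= min_distance for kept in chosen):
            chosen.append(candidate)
    return chosen

print(len(pick_barcodes(3, 1)))   # every three-base barcode is allowed
print(pick_barcodes(4, 3)[:5])    # first few barcodes that differ in >= 3 positions
```

Raising the minimum distance shrinks the usable barcode set, so the barcode length may need to grow to keep enough distinct components per layer.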
- The hybridization region on one end of the nucleic acid sequence (e.g., component) may be different in each layer, but the hybridization region may be the same for each member within a layer. Adjacent layers are those that have complementary hybridization regions on their components that allow them to interact with one another. For example, any component from layer X may be able to attach to any component from layer Y because they may have complementary hybridization regions. The hybridization region on the opposite end may serve the same purpose as the hybridization region on the first end. For example, any component from layer Y may attach to any component of layer X on one end and any component of layer Z on the opposite end.
-
FIGS. 6A and 6B illustrate an example method, referred to as the “product scheme”, for constructing identifiers (e.g., nucleic acid molecules) by combinatorially assembling a distinct component (e.g., nucleic acid sequence) from each layer in a fixed order. FIG. 6A illustrates the architecture of identifiers constructed using the product scheme. An identifier may be constructed by combining a single component from each layer in a fixed order. For M layers, each with N components, there are $N^M$ possible identifiers. FIG. 6B illustrates an example of the combinatorial space of identifiers that may be constructed using the product scheme. In an example, a combinatorial space may be generated from three layers each comprising three distinct components. The components may be combined such that one component from each layer may be combined in a fixed order. The entire combinatorial space for this assembly method may comprise twenty-seven possible identifiers. -
FIGS. 7-10 illustrate chemical methods for implementing the product scheme (see FIG. 6). Methods depicted in FIGS. 7-10, along with any other methods for assembling two or more distinct components in a fixed order may be used, for example, to produce any one or more identifiers in an identifier library. Identifiers may be constructed using any of the implementation methods described in FIGS. 7-10, at any time during the methods or systems disclosed herein. In some instances, all or a portion of the combinatorial space of possible identifiers may be constructed before digital information is encoded or written, and then the writing process may involve mechanically selecting and pooling the identifiers (that encode the information) from the already existing set. In other instances, the identifiers may be constructed after one or more steps of the data encoding or writing process may have occurred (i.e., as information is being written). - Enzymatic reactions may be used to assemble components from the different layers or sets. Assembly can occur in a one pot reaction because components (e.g., nucleic acid sequences) of each layer have specific hybridization or attachment regions for components of adjacent layers. For example, a nucleic acid sequence (e.g., component) X1 from layer X, a nucleic acid sequence Y1 from layer Y, and a nucleic acid sequence Z1 from layer Z may form the assembled nucleic acid molecule (e.g., identifier) X1Y1Z1. Additionally, multiple nucleic acid molecules (e.g., identifiers) may be assembled in one reaction by including multiple nucleic acid sequences from each layer. For example, including both Y1 and Y2 in the one pot reaction of the previous example may yield two assembled products (e.g., identifiers), X1Y1Z1 and X1Y2Z1. This reaction multiplexing may be used to speed up writing time for the plurality of identifiers that are physically constructed. 
Assembly of the nucleic acid sequences may be performed in a time period that is less than or equal to about 1 day, 12 hours, 10 hours, 9 hours, 8 hours, 7 hours, 6 hours, 5 hours, 4 hours, 3 hours, 2 hours, or 1 hour. The accuracy of the encoded data may be at least about or equal to about 90%, 95%, 96%, 97%, 98%, 99%, or greater.
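
The set of identifiers expected from a multiplexed one pot reaction is the product of the components supplied for each layer. The sketch below models this with component labels only; it is an idealized illustration that ignores byproducts and reaction efficiency, and the function name `one_pot_products` is hypothetical.

```python
from itertools import product

def one_pot_products(*layers):
    """All identifiers formed by taking one supplied component from each layer, in layer order."""
    return ["".join(components) for components in product(*layers)]

# Components deposited in a single reaction, grouped by layer.
print(one_pot_products(["X1"], ["Y1", "Y2"], ["Z1"]))
```

Because every combination of supplied components forms, a multiplexed reaction is useful when all of those combinations are wanted in the identifier library.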
- Identifiers may be constructed in accordance with the product scheme using overlap extension polymerase chain reaction (OEPCR), as illustrated in
FIG. 7 . Each component in each layer may comprise a double-stranded or single stranded (as depicted in the figure) nucleic acid sequence with a common hybridization region on the sequence end that may be homologous and/or complementary to the common hybridization region on the sequence end of components from an adjacent layer. An individual identifier may be constructed by concatenating one component (e.g., unique sequence) from a layer X (or layer 1) comprising components X1-XA, a second component (e.g., unique sequence) from a layer Y (or layer 2) comprising Y1-YA, and a third component (e.g., unique sequence) from layer Z (or layer 3) comprising Z1-ZB. The components from layer X may have a 3′ end that shares complementarity with the 3′ end on components from layer Y. Thus single-stranded components from layer X and Y may be annealed together at the 3′ end and may be extended using PCR to generate a double-stranded nucleic acid molecule. The generated double-stranded nucleic-acid molecule may be melted to generate a 3′ end that shares complementarity with a 3′ end of a component from layer Z. A component from layer Z may be annealed with the generated nucleic acid molecule and may be extended to generate a unique identifier comprising a single component from layers X, Y, and Z in a fixed order. DNA size selection (e.g., with gel extraction) or polymerase chain reaction (PCR) with primers flanking the outer most layers may be implemented to isolate identifier products from other byproducts that may form in the reaction. - Identifiers may be assembled in accordance with the product scheme using sticky end ligation, as illustrated in
FIG. 8. Three layers, each comprising double stranded components (e.g., double stranded DNA (dsDNA)) with single-stranded 3′ overhangs, can be used to assemble distinct identifiers. For example, an identifier may comprise one component from the layer X (or layer 1) comprising components X1-XA, a second component from the layer Y (or layer 2) comprising Y1-YB, and a third component from the layer Z (or layer 3) comprising Z1-ZC. To combine components from layer X with components from layer Y, the components in layer X can comprise a common 3′ overhang, labeled a in FIG. 8, and the components in layer Y can comprise a common, complementary 3′ overhang, a*. To combine components from layer Y with components from layer Z, the elements in layer Y can comprise a common 3′ overhang, labeled b in FIG. 8, and the elements in layer Z can comprise a common, complementary 3′ overhang, b*. The 3′ overhang in layer X components can be complementary to the 3′ end in layer Y components and the other 3′ overhang in layer Y components can be complementary to the 3′ end in layer Z components allowing the components to hybridize and ligate. As such, components from layer X cannot hybridize with other components from layer X or layer Z, and similarly components from layer Y cannot hybridize with other elements from layer Y. Furthermore, a single component from layer Y can ligate to a single component of layer X and a single component of layer Z, ensuring the formation of a complete identifier. DNA size selection (for example with gel extraction) or PCR with primers flanking the outer most layers may be implemented to isolate identifier products from other byproducts that may form in the reaction. - The sticky ends for sticky end ligation may be generated by treating the components of each layer with restriction endonucleases. In some embodiments, the components of multiple layers may be generated from one “parent” set of components. 
For example, in one embodiment, a single parent set of double-stranded components may have complementary restriction sites on each end (e.g., restriction sites for BamHI and BglII). Any two components may be selected for assembly, and individually digested with one or the other complementary restriction enzymes (e.g., BglII or BamHI) resulting in complementary sticky ends that can be ligated together resulting in an inert scar. The product nucleic acid sequence may comprise the complementary restriction sites on each end (e.g., BamHI on the 5′ end and BglII on the 3′ end), and can be further ligated to another component from the parent set following the same process. This process may cycle indefinitely. If the parent comprises N components, then each cycle may be equivalent to adding an extra layer of N components to the product scheme.
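
The cyclic parent-set assembly can be illustrated at the level of the top-strand sequence. The sketch below is a simplified string model, not a description of the actual chemistry: it uses the standard recognition sequences of BamHI (GGATCC, cut after the first G) and BglII (AGATCT, cut after the first A), while the payload sequences and the function names `make_component` and `join` are hypothetical.

```python
BAMHI = "GGATCC"  # BamHI site; digestion leaves a GATC overhang
BGLII = "AGATCT"  # BglII site; digestion leaves a compatible GATC overhang

def make_component(payload: str) -> str:
    """Parent-set component: BamHI site on the 5' end, BglII site on the 3' end."""
    return BAMHI + payload + BGLII

def join(left: str, right: str) -> str:
    """Digest left with BglII and right with BamHI, then ligate (top strand only)."""
    assert left.endswith(BGLII) and right.startswith(BAMHI)
    left_cut = left[:-5]   # keeps the leading A of the BglII site
    right_cut = right[1:]  # keeps GATCC of the BamHI site
    return left_cut + right_cut  # junction reads AGATCC, cut by neither enzyme

a = make_component("ACGTTGCA")  # hypothetical payloads
b = make_component("TTGACCGT")
ab = join(a, b)
print(ab)
```

The product keeps a single BamHI site on its 5′ end and a single BglII site on its 3′ end, so `join(ab, make_component(...))` can be applied again, mirroring the repeated cycles described above.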
- A method for using ligation to construct a sequence of nucleic acids comprising elements from set X (e.g., set 1 of dsDNA) and elements from set Y (e.g., set 2 of dsDNA) can comprise the steps of obtaining or constructing two or more pools (e.g., set 1 of dsDNA and set 2 of dsDNA) of double stranded sequences wherein a first set (e.g., set 1 of dsDNA) comprises a sticky end (e.g., a) and a second set (e.g., set 2 of dsDNA) comprises a sticky end (e.g., a*) that is complementary to the sticky end of the first set. Any DNA from the first set (e.g., set 1 of dsDNA) and any subset of DNA from the second set (e.g., set 2 of dsDNA) can be combined and assembled and then ligated together to form a single double stranded DNA with an element from the first set and an element from the second set.
- Identifiers may be assembled in accordance with the product scheme using site specific recombination, as illustrated in
FIG. 9. Identifiers may be constructed by assembling components from three different layers. The components in layer X (or layer 1) may comprise double-stranded molecules with an attB_x recombinase site on one side of the molecule, components from layer Y (or layer 2) may comprise double-stranded molecules with an attP_x recombinase site on one side and an attB_y recombinase site on the other side, and components in layer Z (or layer 3) may comprise an attP_y recombinase site on one side of the molecule. attB and attP sites within a pair, as indicated by their subscripts, are capable of recombining in the presence of their corresponding recombinase enzyme. One component from each layer may be combined such that one component from layer X associates with one component from layer Y, and one component from layer Y associates with one component from layer Z. Application of one or more recombinase enzymes may recombine the components to generate a double-stranded identifier comprising the ordered components. DNA size selection (for example with gel extraction) or PCR with primers flanking the outer most layers may be implemented to isolate identifier products from other byproducts that may form in the reaction. In general, multiple orthogonal attB and attP pairs may be used, and each pair may be used to assemble a component from an extra layer. For the large-serine family of recombinases, up to six orthogonal attB and attP pairs may be generated per recombinase, and multiple orthogonal recombinases may be implemented as well. For example, thirteen layers may be assembled by using twelve orthogonal attB and attP pairs, six orthogonal pairs from each of two large serine recombinases, such as BxbI and PhiC31. Orthogonality of attB and attP pairs ensures that an attB site from one pair does not react with an attP site from another pair. This enables components from different layers to be assembled in a fixed order. 
Recombinase-mediated recombination reactions may be reversible or irreversible depending on the recombinase system implemented. For example, the large serine recombinase family catalyzes irreversible recombination reactions without requiring any high energy cofactors, whereas the tyrosine recombinase family catalyzes reversible reactions. - Identifiers may be constructed in accordance with the product scheme using template directed ligation (TDL), as shown in
FIG. 10A. Template directed ligation utilizes single stranded nucleic acid sequences, referred to as “templates” or “staples”, to facilitate the ordered ligation of components to form identifiers. The templates simultaneously hybridize to components from adjacent layers and hold them adjacent to each other (3′ end against 5′ end) while a ligase ligates them. In the example from FIG. 10A, three layers or sets of single-stranded components are combined. A first layer of components (e.g., layer X or layer 1) that share common sequences a on their 3′ end, which are complementary to sequences a*; a second layer of components (e.g., layer Y or layer 2) that share common sequences b and c on their 5′ and 3′ ends respectively, which are complementary to sequences b* and c*; a third layer of components (e.g., layer Z or layer 3) that share common sequence d on their 5′ end, which may be complementary to sequences d*; and a set of two templates or “staples” with the first staple comprising the sequence a*b* (5′ to 3′) and the second staple comprising a sequence c*d* (5′ to 3′). In this example, one or more components from each layer may be selected and mixed into a reaction with the staples, which, by complementary annealing may facilitate the ligation of one component from each layer in a defined order to form an identifier. DNA size selection (for example with gel extraction) or PCR with primers flanking the outer most layers may be implemented to isolate identifier products from other byproducts that may form in the reaction. -
FIG. 10B shows a histogram of the copy numbers (abundances) of 256 distinct nucleic acid sequences that were each assembled with 6-layer TDL. The edge layers (first and final layers) each had one component, and each of the internal layers (the remaining four layers) had four components, giving $1 \times 4^4 \times 1 = 256$ possible sequences. Each edge layer component was 28 bases including a 10 base hybridization region. Each internal layer component was 30 bases including a 10 base common hybridization region on the 5′ end, a 10 base variable (barcode) region, and a 10 base common hybridization region on the 3′ end. Each of the three template strands was 20 bases in length. All 256 distinct sequences were assembled in a multiplex fashion with one reaction containing all of the components and templates, T4 Polynucleotide Kinase (for phosphorylating the components), T4 Ligase, ATP, and other appropriate reaction reagents. The reaction was incubated at 37 degrees for 30 minutes and then room temperature for 1 hour. Sequencing adapters were added to the reaction product with PCR, and the product was sequenced with an Illumina MiSeq instrument. The relative copy number of each distinct assembled sequence out of 192910 total assembled sequence reads is shown. Other embodiments of this method may use double stranded components, where the components are initially melted to form single stranded versions that can anneal to the staples. Other embodiments or derivatives of this method (i.e., TDL) may be used to construct a combinatorial space of identifiers more complex than what may be accomplished in the product scheme.
- Identifiers may be constructed in accordance with the product scheme using various other chemical implementations including Golden Gate assembly, Gibson assembly, and ligase cycling reaction assembly.
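As an illustration of the 6-layer design described for FIG. 10B, the following minimal sketch enumerates one component per layer in fixed layer order. The component labels are hypothetical placeholders, since the actual barcode sequences are not given here.

```python
from itertools import product

# Hypothetical labels standing in for the real component sequences.
edge_left = ["L0"]                                   # first (edge) layer: 1 component
internal_layers = [[f"layer{i}_bc{j}" for j in range(4)]
                   for i in range(1, 5)]             # 4 internal layers x 4 components
edge_right = ["R0"]                                  # final (edge) layer: 1 component

layers = [edge_left, *internal_layers, edge_right]

# Product scheme: pick exactly one component from every layer, in layer order.
identifiers = ["-".join(combo) for combo in product(*layers)]

print(len(identifiers))  # 256
```

The count follows directly from the layer sizes (1 × 4 × 4 × 4 × 4 × 1), independent of the labels chosen.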
-
FIGS. 11A and 11B schematically illustrate an example method, referred to as the “permutation scheme”, for constructing identifiers (e.g., nucleic acid molecules) with permuted components (e.g., nucleic acid sequences). FIG. 11A illustrates the architecture of identifiers constructed using the permutation scheme. An identifier may be constructed by combining a single component from each layer in a programmable order. FIG. 11B illustrates an example of the combinatorial space of identifiers that may be constructed using the permutation scheme. In an example, a combinatorial space of size six may be generated from three layers each comprising one distinct component, because the three components may be concatenated in any of the $3! = 6$ possible orders. In general, with M layers, each with N components, the permutation scheme enables a combinatorial space of $N^M \cdot M!$ total identifiers.
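The size of the permutation-scheme space can be checked by brute force. This is a minimal sketch with hypothetical component names; it assumes components are distinct across layers, so every ordering yields a different identifier.

```python
from itertools import permutations, product
from math import factorial

def permutation_space(layers):
    """Yield every identifier: one component per layer, layers in any order."""
    for layer_order in permutations(range(len(layers))):
        yield from product(*(layers[i] for i in layer_order))

# Three layers with one component each (the size-six example).
small = list(permutation_space([["A"], ["B"], ["C"]]))
print(len(small))  # 6

# M = 3 layers, N = 2 components per layer.
layers = [["A1", "A2"], ["B1", "B2"], ["C1", "C2"]]
space = list(permutation_space(layers))
n, m = 2, 3
print(len(space) == n**m * factorial(m))  # True
```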
FIG. 11C illustrates an example implementation of the permutation scheme with template directed ligation (TDL). Components from multiple layers are assembled in between fixed left end and right end components, referred to as edge scaffolds. These edge scaffolds are the same for all identifiers in the combinatorial space and thus may be added as part of the reaction master mix for the implementation. Templates or staples exist for any possible junction between any two layers or scaffolds such that the order in which components from different layers are incorporated into an identifier in the reaction depends on the templates selected for the reaction. In order to enable any possible permutation of layers for M layers, there may be $M^2 + 2M$ distinct selectable staples, one for every possible junction ($M^2$ layer-to-layer junctions plus $2M$ junctions with the scaffolds). M of those templates (shaded in grey) form junctions between layers and themselves and may be excluded for the purposes of permutation assembly as described herein. However, their inclusion can enable a larger combinatorial space with identifiers comprising repeat components as illustrated in FIGS. 11D-G. DNA size selection (for example with gel extraction) or PCR with primers targeting the edge scaffolds may be implemented to isolate identifier products from other byproducts that may form in the reaction.
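The staple bookkeeping for this implementation can be sketched as follows. The function and scaffold names are illustrative assumptions, not terms from the disclosure; each staple is modeled simply as the ordered pair of elements it joins.

```python
def staple_junctions(num_layers, allow_repeats=True):
    """List every selectable junction (upstream element, downstream element)."""
    layers = [f"layer{i}" for i in range(1, num_layers + 1)]
    junctions = [("left_scaffold", layer) for layer in layers]
    junctions += [(a, b) for a in layers for b in layers
                  if allow_repeats or a != b]
    junctions += [(layer, "right_scaffold") for layer in layers]
    return junctions

def staples_for_order(order):
    """Staples needed to assemble the layers in the given order between scaffolds."""
    path = ["left_scaffold", *order, "right_scaffold"]
    return list(zip(path, path[1:]))

m = 3
print(len(staple_junctions(m)))                       # M^2 + 2M = 15
print(len(staple_junctions(m, allow_repeats=False)))  # M^2 + M  = 12
print(staples_for_order(["layer2", "layer1", "layer3"]))
```

Selecting a layer order then amounts to choosing M+1 staples out of the full set.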
FIGS. 11D-G illustrate example methods of how the permutation scheme may be expanded to include certain instances of identifiers with repeated components. FIG. 11D shows an example of how the implementation from FIG. 11C may be used to construct identifiers with permuted and repeated components. For example, an identifier may comprise three total components assembled from two distinct components. In this example, a component from a layer may be present multiple times in an identifier. Adjacent concatenations of the same component may be achieved by using a staple with adjacent complementary hybridization regions for both the 3′ end and 5′ end of the same component, such as the a*b* (5′ to 3′) staple in the figure. In general, for M layers, there are M such staples. Incorporation of repeated components with this implementation may generate nucleic acid sequences of more than one length (i.e., comprising one, two, three, four, or more components) that are assembled between the edge scaffolds, as demonstrated in FIG. 11E. FIG. 11E shows how the example implementation from FIG. 11D may lead to non-targeted nucleic acid sequences, besides the identifier, that are assembled between the edge scaffolds. The appropriate identifier cannot be isolated from non-targeted nucleic acid sequences with PCR because they share the same primer binding sites on the edges. However, in this example, DNA size selection (e.g., with gel extraction) may be implemented to isolate the targeted identifier (e.g., the second sequence from the top) from the non-targeted sequences since each assembled nucleic acid sequence can be designed to have a unique length (e.g., if all components have the same length). FIG. 11F shows another example where constructing an identifier with repeated components may generate multiple nucleic acid sequences with equal edge sequences but distinct lengths in the same reaction. In this method, templates that assemble a component in one layer with components in other layers in an alternating pattern may be used. As with the method shown in FIG. 11E, size selection may be used to select identifiers of the designed length. FIG. 11G shows an example where constructing an identifier with repeated components may generate multiple nucleic acid sequences with equal edge sequences and, for some nucleic acid sequences (e.g., the third and fourth from the top and the sixth and seventh from the top), equal lengths. In this example, those nucleic acid sequences that share equal lengths may be excluded from both being individual identifiers as it may not be possible to construct one without also constructing the other, even if PCR and DNA size selection are implemented.
FIGS. 12A-12D schematically illustrate an example method, referred to as the “MchooseK scheme”, for constructing identifiers (e.g., nucleic acid molecules) with any number, K, of assembled components (e.g., nucleic acid sequences) out of a larger number, M, of possible components. FIG. 12A illustrates the architecture of identifiers constructed using the MchooseK scheme. Using this method, identifiers are constructed by assembling one component from each layer in any subset of all layers (e.g., choosing components from K layers out of M possible layers). FIG. 12B illustrates an example of the combinatorial space of identifiers that may be constructed using the MchooseK scheme. In this assembly scheme the combinatorial space may comprise $N^K \binom{M}{K}$ possible identifiers for M layers, N components per layer, and an identifier length of K components. In an example, if there are five layers each comprising one component, then up to ten distinct identifiers may be assembled comprising two components each.
- The MchooseK scheme may be implemented using template directed ligation, as shown in FIG. 12C. As with the TDL implementation for the permutation scheme (FIG. 11C), components in this example are assembled between edge scaffolds that may or may not be included in the reaction master mix. Components may be divided into M layers, for example M=4 layers with predefined rank from 2 to M, where the left edge scaffold may be rank 1 and the right edge scaffold may be rank M+1. Templates comprise nucleic acid sequences for the 3′ to 5′ ligation of any two components with lower rank to higher rank, respectively. There are $((M+1)^2 + M + 1)/2$ such templates. An individual identifier of any K components from distinct layers may be constructed by combining those selected components in a ligation reaction with the corresponding K+1 staples used to bring the K components together with the edge scaffolds in their rank order. Such a reaction setup may yield the nucleic acid sequence corresponding to the target identifier between the edge scaffolds. Alternatively, a reaction mix comprising all templates may be combined with the selected components to assemble the target identifier. This alternative method may generate various nucleic acid sequences with the same edge sequences but distinct lengths (if all component lengths are equal), as illustrated in FIG. 12D. The target identifier (bottom) may be isolated from byproduct nucleic acid sequences by size.
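A short sketch of the MchooseK counting, using hypothetical component names. The template-count function evaluates the stated expression; the observation that it equals the number of lower-to-higher rank pairs among M+2 ranked items (for instance M layers plus two edge scaffolds) is an interpretation added here, not a statement from the text.

```python
from itertools import combinations, product
from math import comb

def mchoosek_space(layers, k):
    """Yield identifiers built from one component in each of k chosen layers,
    keeping the chosen layers in their rank order."""
    for chosen in combinations(range(len(layers)), k):
        yield from product(*(layers[i] for i in chosen))

def template_count(m):
    """Evaluate ((M+1)^2 + M + 1) / 2; the numerator is always even."""
    return ((m + 1) ** 2 + m + 1) // 2

# Five layers, one component each, identifiers of two components.
five = list(mchoosek_space([[f"c{i}"] for i in range(1, 6)], 2))
print(len(five))  # 10

# M = 4 layers, N = 2 components per layer, K = 2.
layers = [[f"L{i}a", f"L{i}b"] for i in range(1, 5)]
space = list(mchoosek_space(layers, 2))
print(len(space) == 2**2 * comb(4, 2))  # True

print(template_count(4), comb(4 + 2, 2))  # 15 15
```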
FIGS. 13A and 13B schematically illustrate an example method, referred to as the “partition scheme”, for constructing identifiers with partitioned components. FIG. 13A shows an example of the combinatorial space of identifiers that may be constructed using the partition scheme. An individual identifier may be constructed by assembling one component from each layer in a fixed order with the optional placement of any partition (specially classified component) between any two components of different layers. For example, a set of components may be organized into one partition component and four layers containing one component each. A component from each layer may be combined in a fixed order and a single partition component may be assembled in various locations between layers. An identifier in this combinatorial space may comprise no partition components, a partition component between the components from the first and second layer, a partition between the components from the second and third layer, and so on to make a combinatorial space of eight possible identifiers. In general, with M layers, each with N components, and p partition components, there are $N^M (p+1)^{M-1}$ possible identifiers that may be constructed. This method may generate identifiers of various lengths.
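The partition-scheme space can be enumerated directly: one component from each of M layers (N^M choices) and, independently for each of the M−1 gaps between adjacent layers, either no partition or one of the p partition components ((p+1)^(M−1) choices). This is a minimal sketch with hypothetical names.

```python
from itertools import product

def partition_space(layers, partitions):
    """Yield identifiers with layers in fixed order and an optional
    partition component in each gap between adjacent layers."""
    gap_options = [None, *partitions]
    for combo in product(*layers):
        for gaps in product(gap_options, repeat=len(layers) - 1):
            identifier = [combo[0]]
            for gap, component in zip(gaps, combo[1:]):
                if gap is not None:
                    identifier.append(gap)
                identifier.append(component)
            yield tuple(identifier)

# Four layers with one component each and a single partition component.
example = list(partition_space([["w"], ["x"], ["y"], ["z"]], ["P"]))
print(len(example))  # 8

# M = 3 layers, N = 2 components each, p = 2 partition components.
layers = [["a1", "a2"], ["b1", "b2"], ["c1", "c2"]]
space = list(partition_space(layers, ["P", "Q"]))
print(len(space) == 2**3 * (2 + 1) ** (3 - 1))  # True
```

The identifiers produced have between M and 2M−1 components, which is why this scheme yields products of several lengths.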
FIG. 13B shows an example implementation of the partition scheme using template directed ligation. Templates comprise nucleic acid sequences for ligating together one component from each of M layers in a fixed order. For each partition component, additional pairs of templates exist that enable the partition component to ligate in between the components from any two adjacent layers. For example, in one such pair, one template (with sequence g*b* (5′ to 3′), for example) enables the 3′ end of layer 1 (with sequence b) to ligate to the 5′ end of the partition component (with sequence g), and the second template (with sequence c*h* (5′ to 3′), for example) enables the 3′ end of the partition component (with sequence h) to ligate to the 5′ end of layer 2 (with sequence c). To insert a partition between any two components of adjacent layers, the standard template for ligating together those layers may be excluded from the reaction and the pair of templates for ligating the partition in that position may be selected for the reaction. In the current example, placing the partition component between layer 1 and layer 2 may use the pair of templates c*h* (5′ to 3′) and g*b* (5′ to 3′) in the reaction rather than the template c*b* (5′ to 3′). Components may be assembled between edge scaffolds that may be included in the reaction mix (along with their corresponding templates for ligating to the first and Mth layers, respectively). In general, a total of around $(M-1) + 2p(M-1)$ selectable templates may be used for this method for M layers and p partition components. This implementation of the partition scheme may generate various nucleic acid sequences in a reaction with the same edge sequences but distinct lengths. The target identifier may be isolated from byproduct nucleic acid sequences by DNA size selection. Specifically, there may be exactly one nucleic acid sequence product with exactly M layer components. If the layer components are designed large enough compared to the partition components, it may be possible to define a universal size selection region whereby the identifier (and none of the non-targeted byproducts) may be selected regardless of the particular partitioning of the components within the identifier, thereby allowing for multiple partitioned identifiers from multiple reactions to be isolated in the same size selection step.
FIGS. 14A and 14B schematically illustrate an example method, referred to as the “unconstrained string scheme” or “USS”, for constructing identifiers made up of any string of components from a number of possible components. FIG. 14A shows an example of the combinatorial space of 3-component (or 4-scaffold) length identifiers that may be constructed using the unconstrained string scheme. The unconstrained string scheme constructs an individual identifier of length K components with one or more distinct components each taken from one or more layers, where each distinct component can appear at any of the K component positions in the identifier (allowing for repeats). For example, for two layers, each comprising one component, there are eight possible 3-component length identifiers. In general, with M layers, each with one component, there are $M^K$ possible identifiers of length K components. FIG. 14B shows an example implementation of the unconstrained string scheme using template directed ligation. In this method, K+1 single-stranded and ordered scaffold DNA components (including two edge scaffolds and K−1 internal scaffolds) are present in the reaction mix. An individual identifier comprises a single component ligated between every pair of adjacent scaffolds. For example, a component is ligated between scaffolds A and B, a component is ligated between scaffolds C and D, and so on until all K adjacent scaffold junctions are occupied by a component. In a reaction, selected components from different layers are introduced to scaffolds along with selected pairs of staples that direct them to assemble onto the appropriate scaffolds. For example, the pair of staples a*L* (5′ to 3′) and A*b* (5′ to 3′) direct the layer 1 component with a 5′ end region ‘a’ and 3′ end region ‘b’ to ligate in between the L and A scaffolds. In general, with M layers and K+1 scaffolds, $2 \cdot M \cdot K$ selectable staples may be used to construct any USS identifier of length K.
Because the staples that connect a component to a scaffold on the 5′ end are disjoint from the staples that connect the same component to a scaffold on the 3′ end, nucleic acid byproducts may form in the reaction with equal edge scaffolds as the target identifier, but with fewer than K components (fewer than K+1 scaffolds) or with more than K components (more than K+1 scaffolds). The targeted identifier may form with exactly K components (K+1 scaffolds) and may therefore be selectable through techniques like DNA size selection if all components are designed to be equal in length and all scaffolds are designed to be equal in length. In certain embodiments of the unconstrained string scheme where there may be one component per layer, that component may solely comprise a single distinct nucleic acid sequence that fulfills all three roles of (1) an identification barcode, (2) a hybridization region for staple-mediated ligation of the 5′ end to a scaffold, and (3) a hybridization region for staple-mediated ligation of the 3′ end to a scaffold.
- The internal scaffolds illustrated in FIG. 14B may be designed such that they use the same hybridization sequence for both the staple-mediated 5′ ligation of the scaffold to a component and the staple-mediated 3′ ligation of the scaffold to another (not necessarily distinct) component. Thus the depicted one-scaffold, two-staple stacked hybridization events in FIG. 14B represent the statistical back-and-forth hybridization events that occur between the scaffold and each of the staples, thus enabling both 5′ component ligation and 3′ component ligation. In other embodiments of the unconstrained string scheme, the scaffold may be designed with two concatenated hybridization regions: a distinct 3′ hybridization region for staple-mediated 3′ ligation and a distinct 5′ hybridization region for staple-mediated 5′ ligation.
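The unconstrained string scheme can be sketched as strings of length K over M components, with two staples per component per scaffold junction (2·M·K in total). The names below are hypothetical, and staples are modeled only as ordered pairs of the elements they join.

```python
from itertools import product

def uss_space(components, k):
    """All length-k strings of components, repeats allowed (M^K identifiers)."""
    return list(product(components, repeat=k))

def uss_staples(components, scaffolds):
    """Two staples per component per junction between adjacent scaffolds."""
    staples = []
    for left, right in zip(scaffolds, scaffolds[1:]):
        for component in components:
            staples.append((left, component))   # scaffold -> component junction
            staples.append((component, right))  # component -> scaffold junction
    return staples

components = ["x", "y"]               # M = 2 layers, one component each
scaffolds = ["L", "A", "B", "R"]      # K + 1 = 4 scaffolds, so K = 3

print(len(uss_space(components, 3)))           # 8
print(len(uss_staples(components, scaffolds)))  # 2 * M * K = 12
```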
FIGS. 15A and 15B schematically illustrate an example method, referred to as the “component deletion scheme”, for constructing identifiers by deleting nucleic acid sequences (or components) from a parent identifier. FIG. 15A shows an example of the combinatorial spaces of possible identifiers that may be constructed using the component deletion scheme. In this example, a parent identifier may comprise multiple components. A parent identifier may comprise more than or equal to about 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50 or more components. An individual identifier may be constructed by selectively deleting any number of components from N possible components, leading to a “full” combinatorial space of size $2^N$, or by deleting a fixed number of K components from N possible components, thus leading to an “NchooseK” combinatorial space of size $\binom{N}{K}$. In an example with a parent identifier with 3 components, the full combinatorial space may be $2^3 = 8$ and the 3choose2 combinatorial space may be $\binom{3}{2} = 3$.
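The two deletion spaces can be enumerated with a few lines of code. This is an illustrative sketch only; the parent is modeled as a tuple of hypothetical component labels.

```python
from itertools import combinations

def deletion_space(parent, num_deleted=None):
    """Identifiers obtained by deleting components from the parent.

    num_deleted=None deletes any number of components (the "full" space);
    an integer K deletes exactly K components (the "NchooseK" space).
    """
    n = len(parent)
    counts = range(n + 1) if num_deleted is None else [num_deleted]
    results = []
    for k in counts:
        for removed in combinations(range(n), k):
            results.append(tuple(c for i, c in enumerate(parent)
                                 if i not in removed))
    return results

parent = ("A", "B", "C")
print(len(deletion_space(parent)))     # 8
print(deletion_space(parent, 2))       # [('C',), ('B',), ('A',)]
```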
FIG. 15B shows an example implementation of the component deletion scheme using double stranded targeted cleavage and repair (DSTCR). The parent sequence may be a single stranded DNA substrate comprising components flanked by nuclease-specific target sites (which can be 4 or fewer bases in length), and the parent may be incubated with one or more double-strand-specific nucleases corresponding to the target sites. An individual component may be targeted for deletion with a complementary single stranded DNA (or cleavage template) that binds the component DNA (and flanking nuclease sites) on the parent, thus forming a stable double stranded sequence on the parent that may be cleaved on both ends by the nucleases. Another single stranded DNA (or repair template) hybridizes to the resulting disjoint ends of the parent (between which the component sequence had been) and brings them together for ligation, either directly or bridged by a replacement sequence, such that the ligated sequences on the parent no longer contain active nuclease-targeted sites. We refer to this method as “double stranded targeted cleavage and repair” or “DSTCR”. Size selection may be used to select for identifiers with a certain number of deleted components.
- Alternatively, or in addition, the parent identifier may be a double or single stranded nucleic acid substrate comprising components separated by spacer sequences such that no two components are flanked by the same sequence. The parent identifier may be incubated with Cas9 nuclease. An individual component may be targeted for deletion with guide ribonucleic acids (the cleavage templates) that bind to the edges of the component and enable Cas9-mediated cleavage at its flanking sites. A single stranded nucleic acid (the repair template) may hybridize to the resulting disjoint ends of the parent identifier (e.g., between the ends where the component sequence had been), thus bringing them together for ligation. Ligation may be done directly or by bridging the ends with a replacement sequence, such that the ligated sequences on the parent no longer contain spacer sequences that can be targeted by Cas9. We refer to this method as “sequence specific targeted cleavage and repair” or “SSTCR”.
- Identifiers may be constructed by inserting components into a parent identifier using a derivative of DSTCR. A parent identifier may be a single stranded nucleic acid substrate comprising nuclease-specific target sites (which can be 4 or fewer bases in length), each embedded within a distinct nucleic acid sequence. The parent identifier may be incubated with one or more double-strand-specific nucleases corresponding to the target sites. An individual target site on the parent identifier may be targeted for component insertion with a complementary single stranded nucleic acid (the cleavage template) that binds the target site and the distinct surrounding nucleic acid sequence on the parent identifier, thus forming a double stranded site. The double-stranded site may be cleaved by a nuclease. Another single stranded nucleic acid (the repair template) may hybridize to the resulting disjoint ends of the parent identifier and bring them together for ligation, bridged by a component sequence, such that the ligated sequences on the parent no longer contain active nuclease-targeted sites. Alternatively, a derivative of SSTCR may be used to insert components into a parent identifier. The parent identifier may be a double or single-stranded nucleic acid and the parent may be incubated with a Cas9 nuclease. A distinct site on the parent identifier may be targeted for cleavage with a guide RNA (the cleavage template). A single stranded nucleic acid (the repair template) may hybridize to the disjoint ends of the parent identifier and bring them together for ligation, bridged by a component sequence, such that the ligated sequences on the parent identifier no longer contain active nuclease-targeted sites. Size selection may be used to select for identifiers with a certain number of component insertions.
-
FIG. 16 schematically illustrates a parent identifier with recombinase recognition sites. Recognition sites of different patterns can be recognized by different recombinases. All recognition sites for a given set of recombinases are arranged such that the nucleic acids in between them may be excised if the recombinase is applied. The nucleic acid strand shown in FIG. 16 can adopt $2^5 = 32$ different sequences depending on the subset of recombinases that are applied to it. In some embodiments, as depicted in FIG. 16, unique molecules can be generated using recombinases to excise, shift, invert, and transpose segments of DNA to create different nucleic acid molecules. In general, with N recombinases there can be $2^N$ possible identifiers built from a parent. In some embodiments, multiple orthogonal pairs of recognition sites from different recombinases may be arranged on a parent identifier in an overlapping fashion such that the application of one recombinase affects the type of recombination event that occurs when a downstream recombinase is applied (see Roquet et al., Synthetic recombinase-based state machines in living cells, Science 353 (6297): aad8559 (2016), which is entirely incorporated herein by reference). Such a system may be capable of constructing a different identifier for each of the $N!$ orderings of N recombinases. Recombinases may be of the tyrosine family such as Flp and Cre, or of the large serine recombinase family such as PhiC31, BxbI, TP901, or A118. The use of recombinases from the large serine recombinase family may be advantageous because they facilitate irreversible recombination and therefore may produce identifiers more efficiently than other recombinases.
- In some instances, a single nucleic acid sequence can be programmed to become many distinct nucleic acid sequences by applying numerous recombinases in a distinct order. Approximately $e \cdot M!$ distinct nucleic acid sequences may be generated by applying M recombinases in different subsets and orders thereof, when the number of recombinases, M, is less than or equal to 7 for the large serine recombinase family. When the number of recombinases, M, is greater than 7, the number of sequences that can be produced approximates $3.9^M$; see, e.g., Roquet et al., Synthetic recombinase-based state machines in living cells, Science 353 (6297): aad8559 (2016), which is entirely incorporated herein by reference. Additional methods for producing different DNA sequences from one common sequence can include targeted nucleic acid editing enzymes such as CRISPR-Cas, TALENs, and zinc finger nucleases. Sequences produced by recombinases, targeted editing enzymes or the like can be used in conjunction with any of the previous methods, for example methods disclosed in any of the figures and disclosure in the present application.
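The counts quoted for recombinase-based identifiers can be reproduced numerically. The sketch below counts subsets (order ignored) and ordered subsets of M recombinases; treating each ordered subset as yielding a distinct sequence is an assumption made here for illustration, since the achievable number depends on how the recognition sites are arranged.

```python
from math import e, factorial, perm

def subset_count(m):
    """Subsets of m recombinases, order ignored: 2^m."""
    return 2 ** m

def ordered_subset_count(m):
    """Ordered subsets of m recombinases: sum over k of m! / (m - k)!."""
    return sum(perm(m, k) for k in range(m + 1))

print(subset_count(5))  # 32

for m in range(1, 8):
    # The exact count stays within 1 of e * m!.
    print(m, ordered_subset_count(m), round(e * factorial(m), 2))
```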
- If the bit-stream of information to be encoded is larger than that which can be encoded by any single nucleic acid molecule, then the information can be split and indexed with nucleic acid sequence barcodes. Moreover, any subset of size k nucleic acid molecules from the set of N nucleic acid molecules can be chosen to produce $\log_2 \binom{N}{k}$ bits of information. Barcodes may be assembled onto the nucleic acid molecules within the subsets of size k to encode even longer bit streams. For example, M barcodes may be used to produce $M \cdot \log_2 \binom{N}{k}$ bits of information. Given a number, N, of available nucleic acid molecules in a set and a number, M, of available barcodes, subsets of size k=k● may be chosen to minimize the total number of molecules in a pool to encode a piece of information. A method for encoding digital information can comprise steps for breaking up the bit stream and encoding the individual elements. For example, a bit stream comprising 6 bits can be split into 3 components, each component comprising two bits. Each two bit component can be barcoded to form an information cassette, and grouped or pooled together to form a hyper-pool of information cassettes.
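The capacity expression and the 6-bit splitting example can be sketched as follows. The function names, the example bit string, and the use of integer barcode labels are assumptions for illustration.

```python
from math import comb, floor, log2

def subset_bits(n, k):
    """Bits conveyed by choosing a size-k subset out of n molecules."""
    return log2(comb(n, k))

def split_bits(bits, size):
    """Break a bit string into consecutive components of the given size."""
    return [bits[i:i + size] for i in range(0, len(bits), size)]

# Choosing 5 of 10 molecules gives log2(252) bits, i.e. 7 whole bits.
print(floor(subset_bits(10, 5)))  # 7

# With M barcodes the capacity scales linearly.
m_barcodes = 3
print(m_barcodes * subset_bits(10, 5))

# A 6-bit stream split into three 2-bit components, each tagged with a barcode.
cassettes = list(enumerate(split_bits("110100", 2), start=1))
print(cassettes)  # [(1, '11'), (2, '01'), (3, '00')]
```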
- Barcodes can facilitate information indexing when the amount of digital information to be encoded exceeds the amount that can fit in one pool alone. Information comprising longer strings of bits and/or multiple bytes can be encoded by layering the approach disclosed in FIG. 3, for example, by including a tag with unique nucleic acid sequences encoded using the nucleic acid index. Information cassettes or identifier libraries can comprise nitrogenous bases or nucleic acid sequences that include unique nucleic acid sequences that provide location and bit-value information in addition to a barcode or tag which indicates the component or components of the bit stream that a given sequence corresponds to. Information cassettes can comprise one or more unique nucleic acid sequences as well as a barcode or tag. The barcode or tag on the information cassette can provide a reference for the information cassette and any sequences included in the information cassette. For example, the tag or barcode on an information cassette can indicate which portion of the bit stream or bit component of the bit stream the unique sequence encodes information for (e.g., the bit value and bit position information for).
- Using barcodes, more information in bits can be encoded in a pool than the size of the combinatorial space of possible identifiers. A sequence of 10 bits, for example, can be separated into two bytes, each byte comprising 5 bits. Each byte can be mapped to a set of 5 possible distinct identifiers. Initially, the identifiers generated for each byte can be the same, but they may be kept in separate pools or else someone reading the information may not be able to tell which byte a particular nucleic acid sequence belongs to. However, each identifier can be barcoded or tagged with a label that corresponds to the byte for which the encoded information applies (e.g., barcode one may be attached to sequences in the nucleic acid pool to provide the first five bits and barcode two may be attached to sequences in the nucleic acid pool to provide the second five bits), and then the identifiers corresponding to the two bytes can be combined into one pool (e.g., “hyper-pool” or one or more identifier libraries). Each identifier library of the one or more combined identifier libraries may comprise a distinct barcode that identifies a given identifier as belonging to a given identifier library. Methods for adding a barcode to each identifier in an identifier library can comprise using PCR, Gibson, ligation, or any other approach that enables a given barcode (e.g., barcode 1) to attach to a given nucleic acid sample pool (e.g., barcode 1 to nucleic acid sample pool 1 and barcode 2 to nucleic acid sample pool 2). The sample from the hyper-pool can be read with sequencing methods, and sequencing information can be parsed using the barcode or tag. A method using identifier libraries and barcodes with a set of M barcodes and N possible identifiers (the combinatorial space) can encode a stream of bits with a length equivalent to the product of M and N.
- In some embodiments, identifier libraries may be stored in an array of wells. The array of wells may be defined as having n columns and q rows and each well may comprise two or more identifier libraries in a hyper-pool. The information encoded in each well may constitute one large contiguous item of information of size n×q larger than the information contained in each of the wells. An aliquot may be taken from one or more of the wells in the array of wells and the encoding may be read using sequencing, hybridization, or PCR.
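The barcoded hyper-pool idea can be illustrated with a toy encoder and decoder. This sketch assumes, as a simplification, that each bit position maps to one identifier and that the identifier is present in the pool when the bit is 1 and absent when it is 0; barcodes and identifiers are modeled as integer indices rather than sequences.

```python
def write_hyperpool(bits, n_identifiers):
    """Encode a bit string as a set of (barcode, identifier) pairs.

    Each block of n_identifiers bits gets its own barcode; within a block,
    identifier i is added to the pool only if bit i is '1'.
    """
    pool = set()
    for start in range(0, len(bits), n_identifiers):
        barcode = start // n_identifiers
        block = bits[start:start + n_identifiers]
        for position, bit in enumerate(block):
            if bit == "1":
                pool.add((barcode, position))
    return pool

def read_hyperpool(pool, n_barcodes, n_identifiers):
    """Decode by checking presence of every (barcode, identifier) pair."""
    return "".join(
        "1" if (barcode, position) in pool else "0"
        for barcode in range(n_barcodes)
        for position in range(n_identifiers)
    )

message = "1011000111"            # 10 bits = 2 barcodes x 5 identifiers
pool = write_hyperpool(message, 5)
print(sorted(pool))
print(read_hyperpool(pool, 2, 5) == message)  # True
```

With M barcodes and N identifiers the pool can represent M × N bit positions, matching the product stated in the text.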
- A nucleic acid sample pool, hyper-pool, identifier library, group of identifier libraries, or a well, containing a nucleic acid sample pool or hyper-pool may comprise unique nucleic acid molecules (e.g., identifiers) corresponding to bits of information and a plurality of supplemental nucleic acid sequences. The supplemental nucleic acid sequences may not correspond to encoded data (e.g., do not correspond to a bit value). The supplemental nucleic acid samples may mask or encrypt the information stored in the sample pool. The supplemental nucleic acid sequences may be derived from a biological source or synthetically produced. Supplemental nucleic acid sequences derived from a biological source may include randomly fragmented nucleic acid sequences or rationally fragmented sequences. The biologically derived supplemental nucleic acids may hide or obscure the data-containing nucleic acids within the sample pool by providing natural genetic information along with the synthetically encoded information, especially if the synthetically encoded information (e.g., the combinatorial space of identifiers) is made to resemble natural genetic information (e.g., a fragmented genome). In an example, the identifiers are derived from a biological source and the supplemental nucleic acids are derived from a biological source. A sample pool may contain multiple sets of identifiers and supplemental nucleic acid sequences. Each set of identifiers and supplemental nucleic acid sequences may be derived from different organisms. In an example, the identifiers are derived from one or more organisms and the supplemental nucleic acid sequences are derived from a single, different organism. The supplemental nucleic acid sequences may also be derived from one or more organism and the identifiers may be derived from a single organism that is different from the organism that the supplemental nucleic acids are derived from. 
Both the identifiers and the supplemental nucleic acid sequences may be derived from multiple different organisms. A key may be used to distinguish the identifiers from the supplemental nucleic acid sequences.
- The supplemental nucleic acid sequences may store metadata about the written information. The metadata may comprise extra information for determining and/or authorizing the source of the original information and or the intended recipient of the original information. The metadata may comprise extra information about the format of the original information, the instruments and methods used to encode and write the original information, and the date and time of writing the original information into the identifiers. The metadata may comprise additional information about the format of the original information, the instruments and methods used to encode and write the original information, and the date and time of writing the original information into nucleic acid sequences. The metadata may comprise additional information about modifications made to the original information after writing the information into nucleic acid sequences. The metadata may comprise annotations to the original information or one or more references to external information. Alternatively, or in addition to, the metadata may be stored in one or more barcodes or tags attached to the identifiers.
- The identifiers in an identifier pool may have the same, similar, or different lengths than one another. The supplemental nucleic acid sequences may have a length that is less than, substantially equal to, or greater than the length of the identifiers. The supplemental nucleic acid sequences may have an average length that is within one base, within two bases, within three bases, within four bases, within five bases, within six bases, within seven bases, within eight bases, within nine bases, within ten bases, or within more bases of the average length of the identifiers. In an example, the supplemental nucleic acid sequences are the same or substantially the same length as the identifiers. The concentration of supplemental nucleic acid sequences may be less than, substantially equal to, or greater than the concentration of the identifiers in the identifier library. The concentration of the supplemental nucleic acids may be less than or equal to about 1%, 10%, 20%, 40%, 60%, 80%, 100%, 125%, 150%, 175%, 200%, 1000%, 1×10^4%, 1×10^5%, 1×10^6%, 1×10^7%, 1×10^8% or less than the concentration of the identifiers. The concentration of the supplemental nucleic acids may be greater than or equal to about 1%, 10%, 20%, 40%, 60%, 80%, 100%, 125%, 150%, 175%, 200%, 1000%, 1×10^4%, 1×10^5%, 1×10^6%, 1×10^7%, 1×10^8% or more than the concentration of the identifiers. Larger concentrations may be beneficial for obfuscation or concealing data. In an example, the concentration of the supplemental nucleic acid sequences is substantially greater (e.g., 1×10^8% greater) than the concentration of identifiers in an identifier pool.
- In another aspect, the present disclosure provides methods for copying information encoded in nucleic acid sequence(s). A method for copying information encoded in nucleic acid sequence(s) may comprise (a) providing an identifier library and (b) constructing one or more copies of the identifier library. An identifier library may comprise a subset of a plurality of identifiers from a larger combinatorial space. Each individual identifier of the plurality of identifiers may correspond to an individual symbol in a string of symbols. An identifier may comprise one or more components. A component may comprise a nucleic acid sequence.
- In another aspect, the present disclosure provides methods for accessing information encoded in nucleic acid sequences. A method for accessing information encoded in nucleic acid sequences may comprise (a) providing an identifier library, and (b) extracting a portion or a subset of the identifiers present in the identifier library from the identifier library. An identifier library may comprise a subset of a plurality of identifiers from a larger combinatorial space. Each individual identifier of the plurality of identifiers may correspond to an individual symbol in a string of symbols. An identifier may comprise one or more components. A component may comprise a nucleic acid sequence.
- Information may be written into one or more identifier libraries as described elsewhere herein. Identifiers may be constructed using any method described elsewhere herein. Stored data may be copied by generating copies of the individual identifiers in an identifier library or in one or more identifier libraries. A portion of the identifiers may be copied or an entire library may be copied. Copying may be performed by amplifying the identifiers in an identifier library. When one or more identifier libraries are combined, a single identifier library or multiple identifier libraries may be copied. If an identifier library comprises supplemental nucleic acid sequences, the supplemental nucleic acid sequences may or may not be copied.
- Identifiers in an identifier library may be constructed to comprise one or more common primer binding sites. The one or more binding sites may be located at the edges of each identifier or interweaved throughout each identifier. The primer binding site may allow for an identifier library specific primer pair or a universal primer pair to bind to and amplify the identifiers. All the identifiers within an identifier library or all the identifiers in one or more identifier libraries may be replicated multiple times by multiple PCR cycles. Conventional PCR may be used to copy the identifiers and the identifiers may be exponentially replicated with each PCR cycle. The number of copies of an identifier may increase exponentially with each PCR cycle. Linear PCR may be used to copy the identifiers and the identifiers may be linearly replicated with each PCR cycle. The number of identifier copies may increase linearly with each PCR cycle. The identifiers may be ligated into a circular vector prior to PCR amplification. The circular vector may comprise a barcode at each end of the identifier insertion site. The PCR primers for amplifying identifiers may be designed to prime to the vector such that the barcoded edges are included with the identifier in the amplification product. During amplification, recombination between identifiers may result in copied identifiers that comprise non-correlated barcodes on each edge. The non-correlated barcodes may be detectable upon reading the identifiers. Identifiers containing non-correlated barcodes may be considered false positives and may be disregarded during the information decoding process.
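- The barcode check described above, in which identifiers carrying non-correlated edge barcodes are disregarded, might be sketched as follows. This is a minimal illustration only: the barcode names, the `BARCODE_PAIRS` table, and the read format (left barcode, identifier, right barcode) are assumptions made for the example and are not specified in the disclosure.

```python
# Hypothetical pairing table: each vector carries one left barcode and its matching right barcode.
BARCODE_PAIRS = {"L1": "R1", "L2": "R2"}


def filter_reads(reads):
    """Keep identifiers whose two edge barcodes are correlated (come from the same vector)."""
    return [ident for left, ident, right in reads if BARCODE_PAIRS.get(left) == right]


# Each read is (left barcode, identifier, right barcode); the second read mixes barcodes
# from two vectors, as might result from recombination during amplification.
reads = [
    ("L1", "X1Y1Z2", "R1"),
    ("L1", "X2Y1Z1", "R2"),
    ("L2", "X2Y2Z1", "R2"),
]
kept = filter_reads(reads)
print(kept)
```

Reads with an unknown left barcode are also dropped, because the lookup returns no matching right barcode.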
- Information may be encoded by assigning each bit of information to a unique nucleic acid molecule. For example, three sample sets (X, Y, and Z) each containing two nucleic acid sequences may assemble into eight unique nucleic acid molecules and encode eight bits of data:
- N1=X1Y1Z1
- N2=X1Y1Z2
- N3=X1Y2Z1
- N4=X1Y2Z2
- N5=X2Y1Z1
- N6=X2Y1Z2
- N7=X2Y2Z1
- N8=X2Y2Z2
Each bit in a string may then be assigned to the corresponding nucleic acid molecule (e.g., N1 may specify the first bit, N2 may specify the second bit, N3 may specify the third bit, and so forth). The entire bit string may be assigned to a combination of nucleic acid molecules where the nucleic acid molecules corresponding to bit-values of ‘1’ are included in the combination or pool. For example, in UTF-8 encoding, the letter ‘K’ may be represented by the 8-bit string 01001011, which may be encoded by the presence of four nucleic acid molecules (e.g., X1Y1Z2, X2Y1Z1, X2Y2Z1, and X2Y2Z2 in the above example).
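- The mapping from a bit string to a pool of identifiers in the example above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the disclosed implementation: the names `LAYERS`, `identifier_space`, and `encode_bits` are hypothetical, and identifiers are modeled as plain strings ordered N1 through N8 as in the list above.

```python
from itertools import product

# Three sample sets (layers), each with two components, as in the example.
LAYERS = [["X1", "X2"], ["Y1", "Y2"], ["Z1", "Z2"]]


def identifier_space(layers):
    """Enumerate every identifier in order (N1, N2, ...), one component per layer."""
    return ["".join(combo) for combo in product(*layers)]


def encode_bits(bits, layers=LAYERS):
    """Return the identifiers to include in the pool: those at positions holding a '1'."""
    space = identifier_space(layers)
    if len(bits) != len(space):
        raise ValueError("bit string length must equal the number of identifiers")
    return [ident for ident, bit in zip(space, bits) if bit == "1"]


bits = format(ord("K"), "08b")  # 8-bit code for the letter 'K'
pool = encode_bits(bits)
print(bits, pool)
```

Positions holding a ‘0’ contribute nothing to the pool, so only four of the eight possible identifiers are constructed for ‘K’.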
- The information may be accessed through sequencing or hybridization assays. For example, primers or probes may be designed to bind to common regions or the barcoded region of the nucleic acid sequence. This may enable amplification of any region of the nucleic acid molecule. The amplification product may then be read by sequencing the amplification product or by a hybridization assay. In the above example encoding the letter ‘K’, if the first half of the data is of interest, a primer specific to the barcode region of the X1 nucleic acid sequence and a primer that binds to the common region of the Z set may be used to amplify the nucleic acid molecules. This may return the sequence Y1Z2, which may encode for 0100. The substring of that data may also be accessed by further amplifying the nucleic acid molecules with a primer that binds to the barcode region of the Y1 nucleic acid sequence and a primer that binds to the common sequence of the Z set. This may return the Z2 nucleic acid sequence, encoding the substring 01. Alternatively, the data may be accessed by checking for the presence or absence of a particular nucleic acid sequence without sequencing. For example, amplification with a primer specific to the Y2 barcode may generate amplification products for the Y2 barcode, but not for the Y1 barcode. The presence of Y2 amplification product may signal a bit value of ‘1’. Alternatively, the absence of Y2 amplification products may signal a bit value of ‘0’.
- PCR based methods can be used to access and copy data from identifier or nucleic acid sample pools. Using common primer binding sites that flank the identifiers in the pools or hyper-pools, nucleic acids containing information can be readily copied. Alternatively, other nucleic acid amplification approaches such as isothermal amplification may also be used to readily copy data from sample pools or hyper-pools (e.g., identifier libraries). In instances where the sample comprises hyper-pools, a particular subset of information (e.g., all nucleic acids relating to a particular barcode) can be accessed and retrieved by using a primer that binds the specific barcode at one edge of the identifier in the forward orientation, along with another primer that binds a common sequence on the opposite edge of the identifier in a reverse orientation. Various read-out methods can be used to pull information from the encoded nucleic acid; for example, microarray (or any sort of fluorescent hybridization), digital PCR, quantitative PCR (qPCR), and various sequencing platforms can be further used to read out the encoded sequences and by extension digitally encoded data.
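- The barcode-directed access in the letter ‘K’ example (selecting on X1 to read 0100, then on X1 and Y1 to read 01) can be modeled as filtering the pool by leading components. The sketch below is only an in-silico analogy for selective amplification; `POOL`, `access`, and `read_bits` are hypothetical names, and identifiers are represented as tuples of component labels.

```python
from itertools import product

LAYERS = [["X1", "X2"], ["Y1", "Y2"], ["Z1", "Z2"]]

# Pool encoding 'K' (01001011), with each identifier as a tuple of its components.
POOL = {
    ("X1", "Y1", "Z2"),
    ("X2", "Y1", "Z1"),
    ("X2", "Y2", "Z1"),
    ("X2", "Y2", "Z2"),
}


def access(pool, prefix):
    """Keep identifiers whose leading components match the prefix (models selective amplification)."""
    prefix = tuple(prefix)
    return {ident for ident in pool if ident[:len(prefix)] == prefix}


def read_bits(selected, prefix, layers=LAYERS):
    """Read the substring addressed by the prefix: '1' if the identifier is present, else '0'."""
    prefix = tuple(prefix)
    remaining = layers[len(prefix):]
    return "".join(
        "1" if prefix + tail in selected else "0" for tail in product(*remaining)
    )


first_half = read_bits(access(POOL, ["X1"]), ["X1"])
first_quarter = read_bits(access(POOL, ["X1", "Y1"]), ["X1", "Y1"])
print(first_half, first_quarter)
```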
- Accessing information stored in nucleic acid molecules (e.g., identifiers) may be performed by selectively removing the portion of non-targeted identifiers from an identifier library or a pool of identifiers or, for example, selectively removing all identifiers of an identifier library from a pool of multiple identifier libraries. Accessing data may also be performed by selectively capturing targeted identifiers from an identifier library or pool of identifiers. The targeted identifiers may correspond to data of interest within the larger item of information. A pool of identifiers may comprise supplemental nucleic acid molecules. The supplemental nucleic acid molecules may contain metadata about the encoded information or may be used to encrypt or mask the identifiers corresponding to the information. The supplemental nucleic acid molecules may or may not be extracted while accessing the targeted identifiers.
- FIGS. 17A-17C schematically illustrate an overview of example methods for accessing portions of information stored in nucleic acid sequences by accessing a number of particular identifiers from a larger number of identifiers. FIG. 17A shows example methods for using polymerase chain reaction, affinity tagged probes, and degradation targeting probes to access identifiers containing a specified component. For PCR-based access, a pool of identifiers (e.g., identifier library) may comprise identifiers with a common sequence at each end, a variable sequence at each end, or one of a common sequence or a variable sequence at each end. The common sequences or variable sequences may be primer binding sites. One or more primers may bind to the common or variable regions on the identifier edges. The identifiers with primers bound may be amplified by PCR. The amplified identifiers may significantly outnumber the non-amplified identifiers. During reading, the amplified identifiers may be identified. An identifier from an identifier library may comprise sequences on one or both of its ends that are distinct to that library, thus enabling a single library to be selectively accessed from a pool or group of more than one identifier libraries.
- For affinity-tag based access, the components that constitute the identifiers in a pool may share complementarity with one or more probes. The one or more probes may bind or hybridize to the identifiers to be accessed. The probe may comprise an affinity tag. The affinity tags may bind to a bead, generating a complex comprising a bead, at least one probe, and at least one identifier. The beads may be magnetic, and together with a magnet, the beads may collect and isolate the identifiers to be accessed. The identifiers may be removed from the beads under denaturing conditions prior to reading. 
Alternatively, or in addition to, the beads may collect the non-targeted identifiers and sequester them away from the rest of the pool that can get washed into a separate vessel and read. The affinity tag may bind to a column. The identifiers to be accessed may bind to the column for capture. Column-bound identifiers may subsequently be eluted or denatured from the column prior to reading. Alternatively, the non-targeted identifiers may be selectively targeted to the column while the targeted identifiers may flow through the column. Accessing the targeted identifiers may comprise applying one or more probes to a pool of identifiers simultaneously or applying one or more probes to a pool of identifiers sequentially.
- For degradation based access, the components that constitute the identifiers in a pool may share complementarity with one or more degradation-targeting probes. The probes may bind to or hybridize with distinct components on the identifiers. The probe may be a target for a degradation enzyme, such as an endonuclease. In an example, one or more identifier libraries may be combined. A set of probes may hybridize with one of the identifier libraries. The set of probes may comprise RNA and the RNA may guide a Cas9 enzyme. A Cas9 enzyme may be introduced to the one or more identifier libraries. The identifiers hybridized with the probes may be degraded by the Cas9 enzyme. The identifiers to be accessed may not be degraded by the degradation enzyme. In another example, the identifiers may be single-stranded and the identifier library may be combined with a single-strand specific endonuclease(s), such as the S1 nuclease, that selectively degrades identifiers that are not to be accessed. Identifiers to be accessed may be hybridized with a complementary set of identifiers to protect them from degradation by the single-strand specific endonuclease(s). The identifiers to be accessed may be separated from the degradation products by size selection, such as size selection chromatography (e.g., agarose gel electrophoresis). Alternatively, or in addition, identifiers that are not degraded may be selectively amplified (e.g., using PCR) such that the degradation products are not amplified. The non-degraded identifiers may be amplified using primers that hybridize to each end of the non-degraded identifiers and therefore not to each end of the degraded or cleaved identifiers.
- FIG. 17B shows example methods for using polymerase chain reaction to perform ‘OR’ or ‘AND’ operations to access identifiers containing multiple components. In an example, if two forward primers bind distinct sets of identifiers on the left end, then an ‘OR’ amplification of the union of those sets of identifiers may be accomplished by using the two forward primers together in a multiplex PCR reaction with a reverse primer that binds all of the identifiers on the right end. In another example, if one forward primer binds a set of identifiers on the left end and one reverse primer binds a set of identifiers on the right end, then an ‘AND’ amplification of the intersection of those two sets of identifiers may be accomplished by using the forward primer and the reverse primer together as a primer pair in a PCR reaction.
- FIG. 17C shows example methods for using affinity tags to perform ‘OR’ or ‘AND’ operations to access identifiers containing multiple components. In an example, if affinity probe ‘P1’ captures all identifiers with component ‘C1’ and another affinity probe ‘P2’ captures all identifiers with component ‘C2’, then the set of all identifiers with C1 or C2 can be captured by using P1 and P2 simultaneously (corresponding to an ‘OR’ operation). In another example with the same components and probes, the set of all identifiers with C1 and C2 can be captured by using P1 and P2 sequentially (corresponding to an ‘AND’ operation).
- In another aspect, the present disclosure provides methods for reading information encoded in nucleic acid sequences. A method for reading information encoded in nucleic acid sequences may comprise (a) providing an identifier library, (b) identifying the identifiers present in the identifier library, (c) generating a string of symbols from the identifiers present in the identifier library, and (d) compiling information from the string of symbols. An identifier library may comprise a subset of a plurality of identifiers from a combinatorial space. Each individual identifier of the subset of identifiers may correspond to an individual symbol in a string of symbols. An identifier may comprise one or more components. A component may comprise a nucleic acid sequence.
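- The ‘OR’ and ‘AND’ access operations described for FIGS. 17B and 17C correspond to set union and set intersection over the identifiers that contain given components. The following is a minimal sketch of that logic only; the function names are hypothetical, identifiers are modeled as tuples of component labels, and the example pool is the one encoding the letter ‘K’.

```python
def with_component(pool, component):
    """Identifiers (tuples of component names) that contain the given component."""
    return {ident for ident in pool if component in ident}


def capture_or(pool, c1, c2):
    # Both probes applied simultaneously: anything bound by either probe is captured.
    return with_component(pool, c1) | with_component(pool, c2)


def capture_and(pool, c1, c2):
    # Probes applied sequentially: the second probe acts only on what the first captured.
    return with_component(with_component(pool, c1), c2)


pool = {
    ("X1", "Y1", "Z2"),
    ("X2", "Y1", "Z1"),
    ("X2", "Y2", "Z1"),
    ("X2", "Y2", "Z2"),
}
or_set = capture_or(pool, "X1", "Z1")
and_set = capture_and(pool, "X2", "Z1")
print(sorted(or_set), sorted(and_set))
```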
- Information may be written into one or more identifier libraries as described elsewhere herein. Identifiers may be constructed using any method described elsewhere herein. Stored data may be copied and accessed using any method described elsewhere herein.
- The identifier may comprise information relating to a location of the encoded symbol, a value of the encoded symbol, or both the location and the value of the encoded symbol. An identifier may include information relating to a location of the encoded symbol and the presence or absence of the identifier in an identifier library may indicate the value of the symbol. The presence of an identifier in an identifier library may indicate a first symbol value (e.g., first bit value) in a binary string and the absence of an identifier in an identifier library may indicate a second symbol value (e.g., second bit value) in a binary string. In a binary system, basing a bit value on the presence or absence of an identifier in an identifier library may reduce the number of identifiers assembled and, therefore, reduce the write time. In an example, the presence of an identifier may indicate a bit value of ‘1’ at the mapped location and the absence of an identifier may indicate a bit value of ‘0’ at the mapped location.
- Generating symbols (e.g., bit values) for a piece of information may include identifying the presence or absence of the identifier that the symbol (e.g., bit) may be mapped or encoded to. Determining the presence or absence of an identifier may include sequencing the present identifiers or using a hybridization array to detect the presence of an identifier. In an example, decoding and reading the encoded sequences may be performed using sequencing platforms. Examples of sequencing platforms are described in U.S. patent application Ser. No. 14/465,685 filed Aug. 21, 2014, U.S. patent application Ser. No. 13/886,234 filed May 2, 2013, and U.S. patent application Ser. No. 12/400,593 filed Mar. 9, 2009, each of which is entirely incorporated herein by reference.
- In an example, decoding nucleic acid encoded data may be achieved by base-by-base sequencing of the nucleic acid strands, such as Illumina Sequencing, or by utilizing a sequencing technique that indicates the presence or absence of specific nucleic acid sequences, such as fragmentation analysis by capillary electrophoresis. The sequencing may employ the use of reversible terminators. The sequencing may employ the use of natural or non-natural (e.g., engineered) nucleotides or nucleotide analogs. Alternatively or in addition to, decoding nucleic acid sequences may be performed using a variety of analytical techniques, including but not limited to, any methods that generate optical, electrochemical, or chemical signals. A variety of sequencing approaches may be used including, but not limited to, polymerase chain reaction (PCR), digital PCR, Sanger sequencing, high-throughput sequencing, sequencing-by-synthesis, single-molecule sequencing, sequencing-by-ligation, RNA-Seq (Illumina), Next generation sequencing, Digital Gene Expression (Helicos), Clonal Single MicroArray (Solexa), shotgun sequencing, Maxam-Gilbert sequencing, or massively-parallel sequencing.
- Various read-out methods can be used to pull information from the encoded nucleic acid. In an example, microarray (or any sort of fluorescent hybridization), digital PCR, quantitative PCR (qPCR), and various sequencing platforms can be further used to read out the encoded sequences and by extension digitally encoded data.
- An identifier library may further comprise supplemental nucleic acid sequences that provide metadata about the information, encrypt or mask the information, or that both provide metadata and mask the information. The supplemental nucleic acids may be identified simultaneously with identification of the identifiers. Alternatively, the supplemental nucleic acids may be identified prior to or after identifying the identifiers. In an example, the supplemental nucleic acids are not identified during reading of the encoded information. The supplemental nucleic acid sequences may be indistinguishable from the identifiers. An identifier index or a key may be used to differentiate the supplemental nucleic acid molecules from the identifiers.
- The efficiency of encoding and decoding data may be increased by recoding input bit strings to enable the use of fewer nucleic acid molecules. For example, if an input string is received with a high occurrence of ‘111’ substrings, which may map to three nucleic acid molecules (e.g., identifiers) with an encoding method, it may be recoded to a ‘000’ substring which may map to a null set of nucleic acid molecules. The alternate input substring of ‘000’ may also be recoded to ‘111’. This method of recoding may reduce the total amount of nucleic acid molecules used to encode the data because there may be a reduction in the number of ‘1’s in the dataset. In this example, the total size of the dataset may be increased to accommodate a codebook that specifies the new mapping instructions. An alternative method for increasing encoding and decoding efficiency may be to recode the input string to reduce the variable length. For example, ‘111’ may be recoded to ‘00’ which may shrink the size of the dataset and reduce the number of ‘1’s in the dataset.
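- The first recoding approach described above, in which ‘111’ and ‘000’ substrings are swapped, might be sketched as follows. The sketch assumes fixed-width, non-overlapping 3-bit blocks aligned to the start of the string, which the disclosure does not specify; the function name `recode` and the sample input are hypothetical. In this form the codebook is simply the swap rule, and applying it a second time restores the input.

```python
def recode(bits, block=3):
    """Swap all-ones and all-zeros blocks; applying the function twice restores the input."""
    ones, zeros = "1" * block, "0" * block
    swap = {ones: zeros, zeros: ones}
    chunks = [bits[i:i + block] for i in range(0, len(bits), block)]
    # Blocks that are neither all ones nor all zeros pass through unchanged.
    return "".join(swap.get(chunk, chunk) for chunk in chunks)


original = "111111010111000"
recoded = recode(original)
print(recoded, original.count("1"), recoded.count("1"))
```

For this input the number of ‘1’ bits, and therefore the number of identifiers to construct under a presence/absence scheme, falls from ten to four.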
- The speed and efficiency of decoding nucleic acid encoded data may be controlled (e.g., increased) by specifically designing identifiers for ease of detection. For example, nucleic acid sequences (e.g., identifiers) that are designed for ease of detection may include nucleic acid sequences comprising a majority of nucleotides that are easier to call and detect based on their optical, electrochemical, chemical, or physical properties. Engineered nucleic acid sequences may be either single or double stranded. Engineered nucleic acid sequences may include synthetic or unnatural nucleotides that improve the detectable properties of the nucleic acid sequence. Engineered nucleic acid sequences may comprise all natural nucleotides, all synthetic or unnatural nucleotides, or a combination of natural, synthetic, and unnatural nucleotides. Synthetic nucleotides may include nucleotide analogues such as peptide nucleic acids, locked nucleic acids, glycol nucleic acids, and threose nucleic acids. Unnatural nucleotides may include dNaM, an artificial nucleoside containing a 3-methoxy-2-naphthyl group, and d5SICS, an artificial nucleoside containing a 6-methylisoquinoline-1-thione-2-yl group. Engineered nucleic acid sequences may be designed for a single enhanced property, such as enhanced optical properties, or the designed nucleic acid sequences may be designed with multiple enhanced properties, such as enhanced optical and electrochemical properties or enhanced optical and chemical properties.
- Engineered nucleic acid sequences may comprise reactive natural, synthetic, and unnatural nucleotides that do not improve the optical, electrochemical, chemical, or physical properties of the nucleic acid sequences. The reactive components of the nucleic acid sequences may enable the addition of a chemical moiety that confers improved properties to the nucleic acid sequence. Each nucleic acid sequence may include a single chemical moiety or may include multiple chemical moieties. Example chemical moieties may include, but are not limited to, fluorescent moieties, chemiluminescent moieties, acidic or basic moieties, hydrophobic or hydrophilic moieties, and moieties that alter oxidation state or reactivity of the nucleic acid sequence.
- A sequencing platform may be designed specifically for decoding and reading information encoded into nucleic acid sequences. The sequencing platform may be dedicated to sequencing single or double stranded nucleic acid molecules. The sequencing platform may decode nucleic acid encoded data by reading individual bases (e.g., base-by-base sequencing) or by detecting the presence or absence of an entire nucleic acid sequence (e.g., component) incorporated within the nucleic acid molecule (e.g., identifier). The sequencing platform may include the use of promiscuous reagents, increased read lengths, and the detection of specific nucleic acid sequences by the addition of detectable chemical moieties. The use of more promiscuous reagents during sequencing may increase reading efficiency by enabling faster base calling which in turn may decrease the sequencing time. The use of increased read lengths may enable longer sequences of encoded nucleic acids to be decoded per read. The addition of detectable chemical moiety tags may enable the detection of the presence or absence of a nucleic acid sequence by the presence or absence of a chemical moiety. For example, each nucleic acid sequence encoding a bit of information may be tagged with a chemical moiety that generates a unique optical, electrochemical, or chemical signal. The presence or absence of that unique optical, electrochemical, or chemical signal may indicate a ‘0’ or a ‘1’ bit value. The nucleic acid sequence may comprise a single chemical moiety or multiple chemical moieties. The chemical moiety may be added to the nucleic acid sequence prior to use of the nucleic acid sequence to encode data. Alternatively or in addition to, the chemical moiety may be added to the nucleic acid sequence after encoding the data, but prior to decoding the data. 
The chemical moiety tag may be added directly to the nucleic acid sequence or the nucleic acid sequence may comprise a synthetic or unnatural nucleotide anchor and the chemical moiety tag may be added to that anchor.
- Unique codes may be applied to minimize or detect encoding and decoding errors. Encoding and decoding errors may occur from false negatives (e.g., a nucleic acid molecule or identifier not included in a random sampling). An example of an error detecting code may be a checksum sequence that counts the number of identifiers in a contiguous set of possible identifiers that is included in the identifier library. While reading the identifier library, the checksum may indicate how many identifiers from that contiguous set of identifiers to expect to retrieve, and identifiers can continue to be sampled for reading until the expected number is met. In some embodiments, a checksum sequence may be included for every contiguous set of R identifiers where R can be equal in size or greater than 1, 2, 5, 10, 50, 100, 200, 500, or 1000 or less than 1000, 500, 200, 100, 50, 10, 5, or 2. The smaller the value of R, the better the error detection. In some embodiments, the checksums may be supplemental nucleic acid sequences. For example, a set comprising seven nucleic acid sequences (e.g., components) may be divided into two groups, nucleic acid sequences for constructing identifiers with a product scheme (components X1-X3 in layer X and Y1-Y3 in layer Y), and nucleic acid sequences for the supplemental checksums (X4-X7 and Y4-Y7). The checksum sequences X4-X7 may indicate whether zero, one, two, or three sequences of layer X are assembled with each member of layer Y. Alternatively, the checksum sequences Y4-Y7 may indicate whether zero, one, two, or three sequences of layer Y are assembled with each member of layer X. In this example, an original identifier library with identifiers {X1Y1, X1Y3, X2Y1, X2Y2, X2Y3} may be supplemented to include checksums to become the following pool: {X1Y1, X1Y3, X2Y1, X2Y2, X2Y3, X1Y6, X2Y7, X3Y4, X6Y1, X5Y2, X6Y3}. The checksum sequences may also be used for error correction. 
For example, absence of X1Y1 from the above dataset and the presence of X1Y6 and X6Y1 may enable inference that the X1Y1 nucleic acid molecule is missing from the dataset. The checksum sequences may indicate whether identifiers are missing from a sampling of the identifier library or an accessed portion of the identifier library. In the case of a missing checksum sequence, access methods such as PCR or affinity tagged probe hybridization may amplify and/or isolate it. In some embodiments, the checksums may not be supplemental nucleic acid sequences. The checksums may be coded directly into the information such that they are represented by identifiers.
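- The checksum example above can be reproduced with a short sketch. It assumes, consistent with the example pool, that Y4 through Y7 report a count of zero through three for each layer X component and that X4 through X7 do the same for each layer Y component. The function names and the tuple representation of identifiers are hypothetical, and the error-correction step shown is one simple way to draw the inference described, not necessarily the disclosed one.

```python
X_DATA, Y_DATA = ["X1", "X2", "X3"], ["Y1", "Y2", "Y3"]
X_COUNT = ["X4", "X5", "X6", "X7"]  # index in list = count of 0..3
Y_COUNT = ["Y4", "Y5", "Y6", "Y7"]


def checksums(library):
    """Build checksum identifiers for a library of (x, y) identifier pairs."""
    checks = []
    for x in X_DATA:  # how many layer Y components are assembled with x
        checks.append((x, Y_COUNT[sum((x, y) in library for y in Y_DATA)]))
    for y in Y_DATA:  # how many layer X components are assembled with y
        checks.append((X_COUNT[sum((x, y) in library for x in X_DATA)], y))
    return checks


def infer_missing(observed, checks):
    """Identifiers absent from the observed set whose row and column both fall short."""
    short_rows = [x for x, c in checks if x in X_DATA
                  and sum((x, y) in observed for y in Y_DATA) < Y_COUNT.index(c)]
    short_cols = [y for c, y in checks if y in Y_DATA
                  and sum((x, y) in observed for x in X_DATA) < X_COUNT.index(c)]
    return [(x, y) for x in short_rows for y in short_cols if (x, y) not in observed]


library = {("X1", "Y1"), ("X1", "Y3"), ("X2", "Y1"), ("X2", "Y2"), ("X2", "Y3")}
checks = checksums(library)
missing = infer_missing(library - {("X1", "Y1")}, checks)
print(checks, missing)
```

When several identifiers are missing at once, the row and column shortfalls may be consistent with more than one candidate set, so this intersection is only a starting point for correction.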
- Noise in data encoding and decoding may be reduced by constructing identifiers palindromically, for example, by using palindromic pairs of components rather than single components in the product scheme. Then the pairs of components from different layers may be assembled to one another in a palindromic manner (e.g., YXY instead of XY for components X and Y). This palindromic method may be expanded to larger numbers of layers (e.g., ZYXYZ instead of XYZ) and may enable detection of erroneous cross reactions between identifiers.
- Adding supplemental nucleic acid sequences in excess (e.g., vast excess) to the identifiers may prevent sequencing from recovering the encoded identifiers. Prior to decoding the information, the identifiers may be enriched from the supplemental nucleic acid sequences. For example, the identifiers may be enriched by a nucleic acid amplification reaction using primers specific to the identifier ends. Alternatively, or in addition to, the information may be decoded without enriching the sample pool by sequencing (e.g., sequencing by synthesis) using a specific primer. In both decoding methods, it may be difficult to enrich or decode the information without having a decoding key or knowing something about the composition of the identifiers. Alternative access methods may also be employed such as using affinity tag based probes.
- A system for encoding digital information into nucleic acids (e.g., DNA) can comprise systems, methods and devices for converting files and data (e.g., raw data, compressed zip files, integer data, and other forms of data) into bytes and encoding the bytes into segments or sequences of nucleic acids, typically DNA, or combinations thereof.
- In an aspect, the present disclosure provides systems for encoding binary sequence data using nucleic acids. A system for encoding binary sequence data using nucleic acids may comprise a device and one or more computer processors. The device may be configured to construct an identifier library. The one or more computer processors may be individually or collectively programmed to (i) translate the information into a string of symbols, (ii) map the string of symbols to the plurality of identifiers, and (iii) construct an identifier library comprising at least a subset of a plurality of identifiers. An individual identifier of the plurality of identifiers may correspond to an individual symbol of the string of symbols. An individual identifier of the plurality of identifiers may comprise one or more components. An individual component of the one or more components may comprise a nucleic acid sequence.
- In another aspect, the present disclosure provides systems for reading binary sequence data using nucleic acids. A system for reading binary sequence data using nucleic acids may comprise a database and one or more computer processors. The database may store an identifier library encoding the information. The one or more computer processors may be individually or collectively programmed to (i) identify the identifiers in the identifier library, (ii) generate a plurality of symbols from identifiers identified in (i), and (iii) compile the information from the plurality of symbols. The identifier library may comprise a subset of a plurality of identifiers. Each individual identifier of the plurality of identifiers may correspond to an individual symbol in a string of symbols. An identifier may comprise one or more components. A component may comprise a nucleic acid sequence.
- Non-limiting embodiments of methods for using the system to encode digital data can comprise steps for receiving digital information in the form of byte streams, parsing the byte streams into individual bytes, mapping the location of a bit within the byte using a nucleic acid index (or identifier rank), and encoding sequences corresponding to either bit values of 1 or bit values of 0 into identifiers. Steps for retrieving digital data can comprise sequencing a nucleic acid sample or nucleic acid pool comprising sequences of nucleic acid (e.g., identifiers) that map to one or more bits, referencing an identifier rank to confirm whether the identifier is present in the nucleic acid pool, and decoding the location and bit-value information for each sequence into a byte comprising a sequence of digital information.
- Systems for encoding, writing, copying, accessing, reading, and decoding information encoded and written into nucleic acid molecules may be a single integrated unit or may be multiple units configured to execute one or more of the aforementioned operations. A system for encoding and writing information into nucleic acid molecules (e.g., identifiers) may include a device and one or more computer processors. The one or more computer processors may be programmed to parse the information into strings of symbols (e.g., strings of bits). The computer processor may generate an identifier rank. The computer processor may categorize the symbols into two or more categories. One category may include symbols to be represented by a presence of the corresponding identifier in the identifier library and the other category may include symbols to be represented by an absence of the corresponding identifiers in the identifier library. The computer processor may direct the device to assemble the identifiers corresponding to symbols to be represented by the presence of an identifier in the identifier library.
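- The presence/absence scheme described above can be illustrated with a minimal Python sketch. It assumes the simplest possible identifier rank, in which the identifier for a bit is just that bit's position in the stream, and it models the identifier library as a set of ranks. The function names (`bytes_to_bits`, `encode_presence`, `decode_presence`, `bits_to_bytes`) are hypothetical and are not taken from the disclosure.

```python
from typing import Set


def bytes_to_bits(data: bytes) -> str:
    """Parse a byte stream into a string of '0'/'1' symbols, 8 per byte."""
    return "".join(f"{byte:08b}" for byte in data)


def bits_to_bytes(bits: str) -> bytes:
    """Inverse of bytes_to_bits; len(bits) must be a multiple of 8."""
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def encode_presence(bits: str) -> Set[int]:
    """Return the ranks of the identifiers that would be constructed.

    Assumption: rank == bit position. A '1' is represented by the presence
    of the identifier with that rank, a '0' by its absence.
    """
    if set(bits) - {"0", "1"}:
        raise ValueError("bits must contain only '0' and '1'")
    return {rank for rank, bit in enumerate(bits) if bit == "1"}


def decode_presence(present: Set[int], length: int) -> str:
    """Rebuild the bit string by checking each rank against the library."""
    return "".join("1" if rank in present else "0" for rank in range(length))


bits = bytes_to_bits(b"DNA")
library = encode_presence(bits)          # only the '1' positions are "written"
restored = bits_to_bytes(decode_presence(library, len(bits)))
```

In this sketch the number of identifiers to construct equals the number of 1 bits, which is why the examples later in the disclosure recode the data to control its weight before writing.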
- The device may comprise a plurality of regions, sections, or partitions. The reagents and components to assemble the identifiers may be stored in one or more regions, sections, or partitions of the device. Layers may be stored in separate regions or sections of the device. A layer may comprise one or more unique components. The components in one layer may be distinct from the components in another layer. The regions or sections may comprise vessels and the partitions may comprise wells. Each layer may be stored in a separate vessel or partition. Each reagent or nucleic acid sequence may be stored in a separate vessel or partition. Alternatively, or in addition, reagents may be combined to form a master mix for identifier construction. The device may transfer reagents, components, and templates from one section of the device to be combined in another section. The device may provide the conditions for completing the assembly reaction. For example, the device may provide heating, agitation, and detection of reaction progress. The constructed identifiers may be directed to undergo one or more subsequent reactions to add barcodes, common sequences, variable sequences, or tags to one or more ends of the identifiers. The identifiers may then be directed to a region or partition to generate an identifier library. One or more identifier libraries may be stored in each region, section, or individual partition of the device. The device may transfer fluid (e.g., reagents, components, templates) using pressure, vacuum, or suction.
- The identifier libraries may be stored in the device or may be moved to a separate database. The database may comprise one or more identifier libraries. The database may provide conditions for long term storage of the identifier libraries (e.g., conditions to reduce degradation of identifiers). The identifier libraries may be stored in a powder, liquid, or solid form. The database may provide Ultra-Violet light protection, reduced temperature (e.g., refrigeration or freezing), and protection from degrading chemicals and enzymes. Prior to being transferred to a database, the identifier libraries may be lyophilized or frozen. The identifier libraries may include ethylenediaminetetraacetic acid (EDTA) to inactivate nucleases and/or a buffer to maintain the stability of the nucleic acid molecules.
- The database may be coupled to, include, or be separate from a device that writes the information into identifiers, copies the information, accesses the information, or reads the information. A portion of an identifier library may be removed from the database prior to copying, accessing or reading. The device that copies the information from the database may be the same or a different device from that which writes the information. The device that copies the information may extract an aliquot of an identifier library from the device and combine that aliquot with the reagents and constituents to amplify a portion of or the entire identifier library. The device may control the temperature, pressure, and agitation of the amplification reaction. The device may comprise partitions and one or more amplification reactions may occur in the partition comprising the identifier library. The device may copy more than one pool of identifiers at a time.
- The copied identifiers may be transferred from the copy device to an accessing device. The accessing device may be the same device as the copy device. The access device may comprise separate regions, sections, or partitions. The access device may have one or more columns, bead reservoirs, or magnetic regions for separating identifiers bound to affinity tags. Alternatively, or in addition to, the access device may have one or more size selection units. A size selection unit may include agarose gel electrophoresis or any other method for size selecting nucleic acid molecules. Copying and extraction may be performed in the same region of a device or in different regions of a device.
- The accessed data may be read in the same device or the accessed data may be transferred to another device. The reading device may comprise a detection unit to detect and identify the identifiers. The detection unit may be part of a sequencer, hybridization array, or other unit for identifying the presence or absence of an identifier. A sequencing platform may be designed specifically for decoding and reading information encoded into nucleic acid sequences. The sequencing platform may be dedicated to sequencing single or double stranded nucleic acid molecules. The sequencing platform may decode nucleic acid encoded data by reading individual bases (e.g., base-by-base sequencing) or by detecting the presence or absence of an entire nucleic acid sequence (e.g., component) incorporated within the nucleic acid molecule (e.g., identifier). Alternatively, the sequencing platform may be a system such as Illumina Sequencing or fragmentation analysis by capillary electrophoresis. Alternatively or in addition to, decoding nucleic acid sequences may be performed using a variety of analytical techniques implemented by the device, including but not limited to, any methods that generate optical, electrochemical, or chemical signals.
- Information storage in nucleic acid molecules may have various applications including, but not limited to, long term information storage, sensitive information storage, and storage of medical information. In an example, a person's medical information (e.g., medical history and records) may be stored in nucleic acid molecules and carried on his or her person. The information may be stored external to the body (e.g., in a wearable device) or internal to the body (e.g., in a subcutaneous capsule). When a patient is brought into a medical office or hospital, a sample may be taken from the device or capsule and the information may be decoded with the use of a nucleic acid sequencer. Personal storage of medical records in nucleic acid molecules may provide an alternative to computer and cloud based storage systems. Personal storage of medical records in nucleic acid molecules may reduce the instance or prevalence of medical records being hacked. Nucleic acid molecules used for capsule-based storage of medical records may be derived from human genomic sequences. The use of human genomic sequences may decrease the immunogenicity of the nucleic acid sequences in the event of capsule failure and leakage.
- The present disclosure provides computer systems that are programmed to implement methods of the disclosure.
- FIG. 19 shows a computer system 1901 that is programmed or otherwise configured to encode digital information into nucleic acid sequences and/or read (e.g., decode) information derived from nucleic acid sequences. The computer system 1901 can regulate various aspects of the encoding and decoding procedures of the present disclosure, such as, for example, the bit-values and bit location information for a given bit or byte from an encoded bitstream or byte stream.
- The computer system 1901 includes a central processing unit (CPU, also “processor” and “computer processor” herein) 1905, which can be a single core or multi core processor, or a plurality of processors for parallel processing. The computer system 1901 also includes memory or memory location 1910 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 1915 (e.g., hard disk), communication interface 1920 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 1925, such as cache, other memory, data storage and/or electronic display adapters. The memory 1910, storage unit 1915, interface 1920 and peripheral devices 1925 are in communication with the CPU 1905 through a communication bus (solid lines), such as a motherboard. The storage unit 1915 can be a data storage unit (or data repository) for storing data. The computer system 1901 can be operatively coupled to a computer network (“network”) 1930 with the aid of the communication interface 1920. The network 1930 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet. The network 1930 in some cases is a telecommunication and/or data network. The network 1930 can include one or more computer servers, which can enable distributed computing, such as cloud computing. The network 1930, in some cases with the aid of the computer system 1901, can implement a peer-to-peer network, which may enable devices coupled to the computer system 1901 to behave as a client or a server.
- The CPU 1905 can execute a sequence of machine-readable instructions, which can be embodied in a program or software. The instructions may be stored in a memory location, such as the memory 1910. The instructions can be directed to the CPU 1905, which can subsequently program or otherwise configure the CPU 1905 to implement methods of the present disclosure. Examples of operations performed by the CPU 1905 can include fetch, decode, execute, and writeback.
- The CPU 1905 can be part of a circuit, such as an integrated circuit. One or more other components of the system 1901 can be included in the circuit. In some cases, the circuit is an application specific integrated circuit (ASIC).
- The storage unit 1915 can store files, such as drivers, libraries and saved programs. The storage unit 1915 can store user data, e.g., user preferences and user programs. The computer system 1901 in some cases can include one or more additional data storage units that are external to the computer system 1901, such as located on a remote server that is in communication with the computer system 1901 through an intranet or the Internet.
- The computer system 1901 can communicate with one or more remote computer systems through the network 1930. For instance, the computer system 1901 can communicate with a remote computer system of a user or other devices and/or machinery that may be used by the user in the course of analyzing data encoded or decoded in a sequence of nucleic acids (e.g., a sequencer or other system for chemically determining the order of nitrogenous bases in a nucleic acid sequence). Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PCs (e.g., Apple iPad, Samsung Galaxy Tab), telephones, smart phones (e.g., Apple iPhone, Android-enabled device, Blackberry), or personal digital assistants. The user can access the computer system 1901 via the network 1930.
- Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 1901, such as, for example, on the memory 1910 or electronic storage unit 1915. The machine executable or machine readable code can be provided in the form of software. During use, the code can be executed by the processor 1905. In some cases, the code can be retrieved from the storage unit 1915 and stored on the memory 1910 for ready access by the processor 1905. In some situations, the electronic storage unit 1915 can be precluded, and machine-executable instructions are stored on memory 1910.
- The code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or can be compiled during runtime. The code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.
- Aspects of the systems and methods provided herein, such as the computer system 1901, can be embodied in programming. Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk. “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.
- Hence, a machine readable medium, such as computer-executable code, may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings. Volatile storage media include dynamic memory, such as main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
- The computer system 1901 can include or be in communication with an electronic display 1935 that comprises a user interface (UI) 1940 for providing, for example, sequence output data including chromatographs, sequences as well as bits, bytes, or bit streams encoded by or read by a machine or computer system that is encoding or decoding nucleic acids, raw data, files and compressed or decompressed zip files to be encoded or decoded into DNA stored data. Examples of UIs include, without limitation, a graphical user interface (GUI) and web-based user interface.
- Methods and systems of the present disclosure can be implemented by way of one or more algorithms. An algorithm can be implemented by way of software upon execution by the central processing unit 1905. The algorithm can, for example, be used with a DNA index and raw data or zip file compressed or decompressed data, to determine a customized method for coding digital information from the raw data or zip file compressed data, prior to encoding the digital information.
- Data to be encoded is a textfile containing a poem. The data is encoded manually with pipettes to mix together DNA components from two layers of 96 components to construct identifiers using the product scheme implemented with overlap extension PCR. The first layer, X, comprises 96 total DNA components. The second layer, Y, also comprises 96 total components. Prior to writing the DNA, the data is mapped to binary and then recoded to a uniform weight format where every contiguous (adjacent disjoint) string of 61 bits of the original data is translated to a 96 bit string with exactly 17 bit-values of 1. This uniform weight format may have natural error checking qualities. The data is then hashed into a 96 by 96 table to form a reference map.
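- The uniform weight recoding described above (61 data bits to a 96-bit word with exactly 17 ones) is possible because the number of such words, C(96, 17), is roughly 3.1 × 10^18 and therefore exceeds 2^61. The disclosure does not state which mapping was used; the sketch below assumes one standard choice, the combinatorial number system, which assigns each integer the constant-weight word of that index in lexicographic order. All function and constant names are hypothetical.

```python
from math import comb

DATA_BITS, WORD_LEN, WEIGHT = 61, 96, 17


def to_constant_weight(value: int, n: int = WORD_LEN, k: int = WEIGHT) -> str:
    """Return the value-th n-bit word with exactly k ones (lexicographic order)."""
    if not 0 <= value < comb(n, k):
        raise ValueError("value does not fit in a word of this length and weight")
    out = []
    for pos in range(n):
        remaining = n - pos - 1
        # Number of valid words that put a 0 here: all k ones go to the right.
        with_zero_here = comb(remaining, k)
        if value < with_zero_here:
            out.append("0")
        else:
            out.append("1")
            value -= with_zero_here
            k -= 1
    return "".join(out)


def from_constant_weight(word: str) -> int:
    """Inverse of to_constant_weight: recover the index of a constant-weight word."""
    n, k = len(word), word.count("1")
    value = 0
    for pos, bit in enumerate(word):
        if bit == "1":
            value += comb(n - pos - 1, k)
            k -= 1
    return value


def encode_block(bits61: str) -> str:
    """Recode one 61-bit block as a 96-bit word of weight 17."""
    if len(bits61) != DATA_BITS:
        raise ValueError("expected exactly 61 bits")
    return to_constant_weight(int(bits61, 2))


def decode_block(word: str) -> str:
    """Recover the original 61-bit block from a 96-bit constant-weight word."""
    return format(from_constant_weight(word), f"0{DATA_BITS}b")
```

Under this assumption, a decoded row whose weight is not 17 can be flagged as corrupted without any further redundancy, which is one way to read the "natural error checking qualities" noted above.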
- The middle panel of FIG. 18A shows the two-dimensional reference map of a 96 by 96 table encoding the poem into a plurality of identifiers. Dark points correspond to a ‘1’ bit-value and white points correspond to a ‘0’ bit-value. The data is encoded into identifiers using two layers of 96 components. Each X value and Y value of the table is assigned a component and the X and Y components are assembled into an identifier using overlap extension PCR for each (X,Y) coordinate with a ‘1’ value. The data was read back (e.g., decoded) by sequencing the identifier library to determine the presence or absence of each possible (X,Y) assembly.
- The right panel of FIG. 18A shows a two-dimensional heat map of the abundances of sequences present in the identifier library as determined by sequencing. Each pixel represents a molecule comprising the corresponding X and Y components, and the greyscale intensity at that pixel represents the relative abundance of that molecule compared to other molecules. Identifiers are taken as the top 17 most abundant (X, Y) assemblies in each row (as the uniform weight encoding guarantees that each contiguous string of 96 bits may have exactly 17 ‘1’ values, and hence 17 corresponding identifiers).
- Data to be encoded is a textfile of three poems totaling 62824 bits. The data is encoded using a Labcyte Echo Liquid Handler to mix together DNA components from two layers of 384 components to construct identifiers using the product scheme implemented with overlap extension PCR. The first layer, X, comprises 384 total DNA components. The second layer, Y, also comprises 384 total components. Prior to writing the DNA, the data is mapped to binary and then recoded to decrease the weight (number of bit-values of ‘1’) and include checksums. The checksums are established so that there is an identifier that corresponds to a checksum for every contiguous string of 192 bits of data. The re-coded data has a weight of approximately 10,100, which corresponds to the number of identifiers to be constructed. The data may then be hashed into a 384 by 384 table to form a reference map.
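- The read-back rule described for the first example, keeping the 17 most abundant (X, Y) assemblies in each row of the sequencing heat map, amounts to a per-row top-k threshold. The sketch below is illustrative only: it assumes the abundances are available as a list of rows of read counts, it breaks ties in favor of the lower column index, and the names `call_row` and `call_table` are hypothetical.

```python
from typing import List, Sequence


def call_row(abundances: Sequence[float], k: int) -> List[int]:
    """Return a bit per column: 1 for the k most abundant assemblies, else 0."""
    # sorted() is stable, so equal abundances keep their original column order.
    ranked = sorted(range(len(abundances)),
                    key=lambda col: abundances[col],
                    reverse=True)
    keep = set(ranked[:k])
    return [1 if col in keep else 0 for col in range(len(abundances))]


def call_table(table: Sequence[Sequence[float]], k: int = 17) -> List[List[int]]:
    """Apply the same top-k rule to every row of the abundance table."""
    return [call_row(row, k) for row in table]


# Toy row with two true identifiers (columns 1 and 3) over background reads.
toy_bits = call_row([5, 900, 3, 1200, 40], k=2)
```

A fixed k suits the uniform weight format, where every row is known to hold exactly 17 identifiers. For a format with variable row weight, k would instead have to be supplied per row.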
- The middle panel of FIG. 18B shows a two-dimensional reference map of a 384 by 384 table encoding the textfile into a plurality of identifiers. Each coordinate (X,Y) corresponds to the bit of data at position X+(Y−1)*192. Black points correspond to a bit value of ‘1’ and white points correspond to a bit value of ‘0’. The black points on the right side of the figure are the checksums and the pattern of black points on the top of the figure is the codebook (e.g., dictionary for de-coding the data). Each X value and Y value of the table may be assigned a component and the X and Y components are assembled into an identifier using overlap extension PCR for each (X, Y) coordinate with a ‘1’ value. The data was read back (e.g., decoded) by sequencing the identifier library to determine the presence or absence of each possible (X, Y) assembly.
- The right panel of FIG. 18B shows a two-dimensional heat map of the abundances of sequences present in the identifier library as determined by sequencing. Each pixel represents a molecule comprising the corresponding X and Y components, and the greyscale intensity at that pixel represents the relative abundance of that molecule compared to other molecules. Identifiers are taken as the top S most abundant (X, Y) assemblies in each row, where S for each row may be the checksum value.
- While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. It is not intended that the invention be limited by the specific examples provided within the specification. While the invention has been described with reference to the aforementioned specification, the descriptions and illustrations of the embodiments herein are not meant to be construed in a limiting sense. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. Furthermore, it shall be understood that all aspects of the invention are not limited to the specific depictions, configurations or relative proportions set forth herein which depend upon a variety of conditions and variables. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is therefore contemplated that the invention shall also cover any such alternatives, modifications, variations or equivalents. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.
Claims (21)
1.-29. (canceled)
30. A method for writing information into nucleic acid sequence(s), comprising:
(a) translating said information into a string of symbols;
(b) mapping said string of symbols to a plurality of identifiers, each identifier encoding a value and a position of the symbol in the string of symbols,
wherein an individual identifier of said plurality of identifiers comprises one or more components, wherein an individual component of said one or more components comprises a nucleic acid sequence; and
(c) constructing an identifier library comprising at least a subset of said plurality of identifiers by inserting at least one component into a parent identifier by applying a nucleic acid editing enzyme to said parent identifier.
31. The method of claim 30, wherein said parent identifier comprises a plurality of components flanked by nuclease-specific target sites, recombinase recognition sites, or distinct spacer sequences.
32. The method of claim 30, wherein said nucleic acid editing enzyme is selected from the group consisting of CRISPR-Cas, TALENs, Zinc Finger Nucleases, Recombinases, and functional variants thereof.
33. The method of claim 30, wherein one symbol value at each position of said string of symbols may be represented by the absence of a distinct identifier in the identifier library.
34. The method of claim 30, wherein each symbol in said string of symbols is one of two possible symbol values, said two possible symbol values being a bit-value of 0 and 1,
wherein said individual symbol with said bit-value of 0 in said string of symbols may be represented by an absence of a distinct identifier in said identifier library, wherein said individual symbol with said bit-value of 1 in said string of symbols may be represented by a presence of said distinct identifier in said identifier library, and vice versa.
35. The method of claim 30, wherein constructing said individual identifier in said identifier library comprises assembling said one or more components from one or more layers and wherein each layer of said one or more layers comprises a distinct set of components.
36. The method of claim 35, wherein said individual identifier from said identifier library comprises one component from each layer of said one or more layers.
37. The method of claim 36, wherein said one or more components are assembled in a fixed order.
38. The method of claim 36, wherein said one or more components are assembled in any order.
39. The method of claim 36, wherein said one or more components are assembled with one or more partitioning components disposed between two components from different layers of said one or more layers.
40. The method of claim 35, wherein said individual identifier comprises one component from each layer of a subset of said one or more layers.
41. The method of claim 35, wherein said individual identifier comprises at least one component from each of said one or more layers.
42. The method of claim 35, wherein said one or more components are assembled using overlap-extension polymerase chain reaction (PCR), polymerase cycling assembly, sticky end ligation, biobricks assembly, golden gate assembly, gibson assembly, recombinase assembly, ligase cycling reaction, or template directed ligation.
43. The method of claim 30, wherein said identifier library comprises a plurality of nucleic acid sequences.
44. The method of claim 43, wherein said plurality of nucleic acid sequences stores metadata of said information and/or conceals said information.
45. The method of claim 44, wherein said metadata comprises secondary information corresponding to a source of said information, an intended recipient of said information, an original format of said information, instrumentation and methods used to encode said information, a date and a time of writing said information into said identifier library, modifications made to said information, and/or a reference to other information.
46. The method of claim 30, wherein one or more identifier libraries are combined and wherein each identifier library of said one or more identifier libraries is tagged with a distinct barcode.
47. The method of claim 46, wherein each individual identifier in said identifier library comprises said distinct barcode.
48. The method of claim 30, wherein said plurality of identifiers is selected for ease of read, write, access, copy, and deletion operations.
49. The method of claim 30, wherein said plurality of identifiers is selected to minimize write errors, mutations, degradation, and read errors.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/230,385 US20230376788A1 (en) | 2016-11-16 | 2023-08-04 | Nucleic acid-based data storage |
Applications Claiming Priority (7)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201662423058P | 2016-11-16 | 2016-11-16 | |
US201762457074P | 2017-02-09 | 2017-02-09 | |
US201762466304P | 2017-03-02 | 2017-03-02 | |
PCT/US2017/062098 WO2018094108A1 (en) | 2016-11-16 | 2017-11-16 | Nucleic acid-based data storage |
US15/850,112 US10650312B2 (en) | 2016-11-16 | 2017-12-21 | Nucleic acid-based data storage |
US16/847,064 US20200250546A1 (en) | 2016-11-16 | 2020-04-13 | Nucleic acid-based data storage |
US18/230,385 US20230376788A1 (en) | 2016-11-16 | 2023-08-04 | Nucleic acid-based data storage |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/847,064 Continuation US20200250546A1 (en) | 2016-11-16 | 2020-04-13 | Nucleic acid-based data storage |
Publications (1)
Publication Number | Publication Date |
---|---|
US20230376788A1 (en) | 2023-11-23 |
Family
ID=62146775
Family Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/461,774 Active 2040-02-14 US11763169B2 (en) | 2016-11-16 | 2017-11-16 | Systems for nucleic acid-based data storage |
US18/230,382 Pending US20230376786A1 (en) | 2016-11-16 | 2023-08-04 | Nucleic acid-based data storage |
US18/230,383 Pending US20230376787A1 (en) | 2016-11-16 | 2023-08-04 | Nucleic acid-based data storage |
US18/230,385 Pending US20230376788A1 (en) | 2016-11-16 | 2023-08-04 | Nucleic acid-based data storage |
US18/230,273 Active US12001962B2 (en) | 2016-11-16 | 2023-08-04 | Systems for nucleic acid-based data storage |
Family Applications Before (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/461,774 Active 2040-02-14 US11763169B2 (en) | 2016-11-16 | 2017-11-16 | Systems for nucleic acid-based data storage |
US18/230,382 Pending US20230376786A1 (en) | 2016-11-16 | 2023-08-04 | Nucleic acid-based data storage |
US18/230,383 Pending US20230376787A1 (en) | 2016-11-16 | 2023-08-04 | Nucleic acid-based data storage |
Family Applications After (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/230,273 Active US12001962B2 (en) | 2016-11-16 | 2023-08-04 | Systems for nucleic acid-based data storage |
Country Status (9)
Country | Link |
---|---|
US (5) | US11763169B2 (en) |
EP (3) | EP3542295A4 (en) |
JP (3) | JP7179008B2 (en) |
KR (4) | KR102521152B1 (en) |
AU (4) | AU2017363139B2 (en) |
CA (2) | CA3043884A1 (en) |
ES (1) | ES2979182T3 (en) |
GB (1) | GB2563105B (en) |
WO (2) | WO2018094115A1 (en) |
Families Citing this family (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10650312B2 (en) | 2016-11-16 | 2020-05-12 | Catalog Technologies, Inc. | Nucleic acid-based data storage |
KR102521152B1 (en) | 2016-11-16 | 2023-04-13 | 카탈로그 테크놀로지스, 인크. | Nucleic Acid-Based Systems for Data Storage |
US10982276B2 (en) * | 2017-05-31 | 2021-04-20 | Molecular Assemblies, Inc. | Homopolymer encoded nucleic acid memory |
US11612873B2 (en) | 2017-05-31 | 2023-03-28 | Molecular Assemblies, Inc. | Homopolymer encoded nucleic acid memory |
US11174512B2 (en) | 2017-05-31 | 2021-11-16 | Molecular Assemblies, Inc. | Homopolymer encoded nucleic acid memory |
US11810651B2 (en) * | 2017-09-01 | 2023-11-07 | Seagate Technology Llc | Multi-dimensional mapping of binary data to DNA sequences |
SG11201903333SA (en) | 2017-12-29 | 2019-08-27 | Clear Labs Inc | Automated priming and library loading services |
EP3766077A4 (en) | 2018-03-16 | 2021-12-08 | Catalog Technologies, Inc. | Chemical methods for nucleic acid-based data storage |
US20200193301A1 (en) | 2018-05-16 | 2020-06-18 | Catalog Technologies, Inc. | Compositions and methods for nucleic acid-based data storage |
AU2019270160B2 (en) | 2018-05-16 | 2024-09-19 | Catalog Technologies, Inc. | Printer-finisher system for data storage in DNA |
AU2019315604A1 (en) * | 2018-08-03 | 2021-03-25 | Catalog Technologies, Inc. | Systems and methods for storing and reading nucleic acid-based data with error protection |
EP3847649A4 (en) * | 2018-09-07 | 2022-08-31 | Iridia, Inc. | Improved systems and methods for writing and reading data stored in a polymer |
US11921750B2 (en) * | 2018-10-29 | 2024-03-05 | Salesforce, Inc. | Database systems and applications for assigning records to chunks of a partition in a non-relational database system with auto-balancing |
EP3874058A4 (en) * | 2018-11-01 | 2022-08-03 | President And Fellows Of Harvard College | Nucleic acid-based barcoding |
US11249941B2 (en) * | 2018-12-21 | 2022-02-15 | Palo Alto Research Center Incorporated | Exabyte-scale data storage using sequence-controlled polymers |
EP3904527A4 (en) * | 2018-12-26 | 2022-08-10 | BGI Shenzhen | Method and device for fixed-point editing of nucleotide sequence stored with data |
WO2020227718A1 (en) | 2019-05-09 | 2020-11-12 | Catalog Technologies, Inc. | Data structures and operations for searching, computing, and indexing in dna-based data storage |
GB201907460D0 (en) | 2019-05-27 | 2019-07-10 | Vib Vzw | A method of storing information in pools of nucleic acid molecules |
CN112703558A (en) * | 2019-05-31 | 2021-04-23 | 伊鲁米那股份有限公司 | System and method for storage |
EP4022300A1 (en) | 2019-08-27 | 2022-07-06 | President and Fellows of Harvard College | Modifying messages stored in mixtures of molecules using thin-layer chromatography |
CA3157804A1 (en) | 2019-10-11 | 2021-04-15 | Catalog Technologies, Inc. | Nucleic acid security and authentication |
CA3159718A1 (en) * | 2019-11-26 | 2021-06-03 | Michael Borg | Methods and compositions for providing identification and/or traceability of biological material |
US10917109B1 (en) * | 2020-03-06 | 2021-02-09 | Centre National De La Recherche Scientifique | Methods for storing digital data as, and for transforming digital data into, synthetic DNA |
KR20230008877A (en) | 2020-05-11 | 2023-01-16 | Catalog Technologies, Inc. | Programs and functions of DNA-based data storage |
WO2022055885A1 (en) | 2020-09-08 | 2022-03-17 | Catalog Technologies, Inc. | Systems and methods for writing by sequencing of nucleic acids |
AU2021347675A1 (en) * | 2020-09-22 | 2023-04-20 | Catalog Technologies, Inc. | Temperature-controlled fluidic reactions system |
WO2022203958A1 (en) | 2021-03-24 | 2022-09-29 | Catalog Technologies, Inc. | Fixed point number representation and computation circuits |
KR20240024899A (en) * | 2021-06-25 | 2024-02-26 | Catalog Technologies, Inc. | Processing methods for storing nucleic acid data |
WO2023100188A1 (en) * | 2021-12-05 | 2023-06-08 | Ramot At Tel-Aviv University Ltd. | Efficient information coding in living organisms |
CN118715528A (en) * | 2021-12-31 | 2024-09-27 | 卡特默瑞有限公司 | Apparatus and method for embedding data in genetic material |
EP4254416A1 (en) * | 2022-04-01 | 2023-10-04 | BioSistemika d.o.o. | A device and a method for recording data in nucleic acids |
WO2023187132A1 (en) | 2022-04-01 | 2023-10-05 | Biosistemika D.O.O. | A device and a method for recording data in nucleic acids |
Family Cites Families (185)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050032048A1 (en) | 1988-05-03 | 2005-02-10 | Oxford Gene Technology Limited | Analyzing polynucleotide sequences |
US5821886A (en) | 1996-10-18 | 1998-10-13 | Samsung Electronics Company, Ltd. | Variable length code detection in a signal processing system |
US6419883B1 (en) | 1998-01-16 | 2002-07-16 | University Of Washington | Chemical synthesis using solvent microdroplets |
IL131978A0 (en) | 1997-03-20 | 2001-03-19 | Univ Washington | Solvent for biopolymer synthesis solvent microdots and methods of use |
EP2327797B1 (en) | 1997-04-01 | 2015-11-25 | Illumina Cambridge Limited | Method of nucleic acid sequencing |
US6537747B1 (en) | 1998-02-03 | 2003-03-25 | Lucent Technologies Inc. | Data transmission using DNA oligomers |
US6187537B1 (en) | 1998-04-27 | 2001-02-13 | Donald E. Zinn, Jr. | Process and apparatus for forming a dry DNA transfer film, a transfer film product formed thereby and an analyzing process using the same |
US6458583B1 (en) | 1998-09-09 | 2002-10-01 | Agilent Technologies, Inc. | Method and apparatus for making nucleic acid arrays |
US6309828B1 (en) | 1998-11-18 | 2001-10-30 | Agilent Technologies, Inc. | Method and apparatus for fabricating replicate arrays of nucleic acid molecules |
US6221653B1 (en) | 1999-04-27 | 2001-04-24 | Agilent Technologies, Inc. | Method of performing array-based hybridization assays using thermal inkjet deposition of sample fluids |
US7501245B2 (en) | 1999-06-28 | 2009-03-10 | Helicos Biosciences Corp. | Methods and apparatuses for analyzing polynucleotide sequences |
US6446642B1 (en) | 1999-11-22 | 2002-09-10 | Agilent Technologies, Inc. | Method and apparatus to clean an inkjet reagent deposition device |
KR20080072102A (en) | 2001-05-11 | 2008-08-05 | 마츠시타 덴끼 산교 가부시키가이샤 | Biomolecular substrate and method and apparatus for examination and diagnosis using the same |
WO2003025123A2 (en) * | 2001-08-28 | 2003-03-27 | Mount Sinai School Of Medicine | Dna: a medium for long-term information storage specification |
CA2461413A1 (en) | 2001-09-25 | 2003-04-03 | Kabushiki Kaisha Dnaform | Printed materials comprising a support having an oligomer and/or a polymer applied thereon, a method for preparing the same and a method for delivering and/or storing the same |
US7361310B1 (en) | 2001-11-30 | 2008-04-22 | Northwestern University | Direct write nanolithographic deposition of nucleic acids from nanoscopic tips |
US20030116630A1 (en) | 2001-12-21 | 2003-06-26 | Kba-Giori S.A. | Encrypted biometric encoded security documents |
US6773888B2 (en) | 2002-04-08 | 2004-08-10 | Affymetrix, Inc. | Photoactivatable silane compounds and methods for their synthesis and use |
US7306316B2 (en) | 2002-05-29 | 2007-12-11 | Arizona Board Of Regents | Nanoscale ink-jet printing |
US20040043390A1 (en) * | 2002-07-18 | 2004-03-04 | Asat Ag Applied Science & Technology | Use of nucleotide sequences as carrier of cultural information |
US8071168B2 (en) | 2002-08-26 | 2011-12-06 | Nanoink, Inc. | Micrometric direct-write methods for patterning conductive material and applications to flat panel display repair |
US7491422B2 (en) | 2002-10-21 | 2009-02-17 | Nanoink, Inc. | Direct-write nanolithography method of transporting ink with an elastomeric polymer coated nanoscopic tip to form a structure having internal hollows on a substrate |
DE10308931A1 (en) | 2003-02-28 | 2004-09-23 | Apibio Sas | System and method for the synthesis of polymers |
US6943417B2 (en) | 2003-05-01 | 2005-09-13 | Clemson University | DNA-based memory device and method of reading and writing same |
JP2005080523A (en) | 2003-09-05 | 2005-03-31 | Sony Corp | Dna to be introduced into biogene, gene-introducing vector, cell, method for introducing information into biogene, information-treating apparatus and method, recording medium, and program |
WO2005038431A2 (en) | 2003-10-14 | 2005-04-28 | Verseon | Method and device for partitioning a molecule |
US20050239102A1 (en) | 2003-10-31 | 2005-10-27 | Verdine Gregory L | Nucleic acid binding oligonucleotides |
DE102005012567B4 (en) | 2005-03-04 | 2008-09-04 | Identif Gmbh | Marking solution, its use and process for its preparation |
EP1752213A1 (en) | 2005-08-12 | 2007-02-14 | Samsung Electronics Co., Ltd. | Device for printing droplet or ink on substrate or paper |
US9616661B2 (en) | 2005-10-07 | 2017-04-11 | Koninklijke Philips N.V. | Inkjet device and method for the controlled positioning of droplets of a substance onto a substrate |
EP1933974A1 (en) | 2005-10-07 | 2008-06-25 | Koninklijke Philips Electronics N.V. | Ink jet device for the controlled positioning of droplets of a substance onto a substrate, method for the controlled positioning of droplets of a substance, and use of an ink jet device |
EP1782886A1 (en) | 2005-11-02 | 2007-05-09 | Sony Deutschland GmbH | A method of patterning molecules on a substrate using a micro-contact printing process |
US20080309701A1 (en) | 2005-11-28 | 2008-12-18 | Koninklijke Philips Electronics, N.V. | Ink Jet Device for Releasing Controllably a Plurality of Substances Onto a Substrate, Method of Discrimination Between a Plurality of Substances and Use of an Ink Jet Device |
JP2009520598A (en) | 2005-12-22 | 2009-05-28 | Koninklijke Philips Electronics N.V. | Ink jet device for positioning material on substrate, method for positioning material on substrate, and use of ink jet device |
EP1976637A1 (en) | 2006-01-12 | 2008-10-08 | Koninklijke Philips Electronics N.V. | Ink jet device and method for releasing a plurality of substances onto a substrate |
EP2007907A2 (en) | 2006-04-19 | 2008-12-31 | Applera Corporation | Reagents, methods, and libraries for gel-free bead-based sequencing |
GB0610045D0 (en) | 2006-05-19 | 2006-06-28 | Plant Bioscience Ltd | Improved uracil-excision based molecular cloning |
US20100029490A1 (en) | 2006-09-21 | 2010-02-04 | Koninklijke Philips Electronics N.V. | Ink-jet device and method for producing a biological assay substrate using a printing head and means for accelerated motion |
EP2084532A1 (en) | 2006-10-30 | 2009-08-05 | Koninklijke Philips Electronics N.V. | Porous biological assay substrate and method and device for producing such substrate |
CA2681443A1 (en) | 2007-05-09 | 2008-11-20 | Nanoink, Inc. | Compact nanofabrication apparatus |
EP2175983A2 (en) | 2007-06-20 | 2010-04-21 | Northwestern University | Matrix assisted ink transport |
WO2009011709A1 (en) | 2007-07-19 | 2009-01-22 | The Board Of Trustees Of The University Of Illinois | High resolution electrohydrodynamic jet printing for manufacturing systems |
US9684678B2 (en) | 2007-07-26 | 2017-06-20 | Hamid Hatami-Hanza | Methods and system for investigation of compositions of ontological subjects |
US8452725B2 (en) | 2008-09-03 | 2013-05-28 | Hamid Hatami-Hanza | System and method of ontological subject mapping for knowledge processing applications |
CZ301799B6 (en) | 2007-07-30 | 2010-06-23 | Kencl, Lukáš | Processing of data information in a system |
JP5356389B2 (en) | 2007-08-20 | 2013-12-04 | ムーア ウォリス ノース アメリカ、 インコーポレーテッド | Composition applicable to jet printing and printing method |
DE102007057802B3 (en) | 2007-11-30 | 2009-06-10 | Geneart Ag | Steganographic embedding of information in coding genes |
JP5171346B2 (en) | 2008-03-28 | 2013-03-27 | 株式会社日立ハイテクノロジーズ | String search system and method |
EP2329425B1 (en) | 2008-09-10 | 2013-07-31 | DataLase Ltd | Multi-coloured codes |
WO2010029629A1 (en) | 2008-09-11 | 2010-03-18 | 長浜バイオラボラトリー株式会社 | Dna-containing ink composition |
US8769689B2 (en) | 2009-04-24 | 2014-07-01 | Hb Gary, Inc. | Digital DNA sequence |
US8806127B2 (en) | 2009-10-26 | 2014-08-12 | Genisyss Llc | Data storage device with integrated DNA storage media |
US20110269119A1 (en) | 2009-10-30 | 2011-11-03 | Synthetic Genomics, Inc. | Encoding text into nucleic acid sequences |
US8735327B2 (en) | 2010-01-07 | 2014-05-27 | Jeansee, Llc | Combinatorial DNA taggants and methods of preparation and use thereof |
US9187777B2 (en) | 2010-05-28 | 2015-11-17 | Gen9, Inc. | Methods and devices for in situ nucleic acid synthesis |
US8398940B2 (en) | 2010-06-17 | 2013-03-19 | Silverbrook Research Pty Ltd | USB-interfaceable portable test module for electrochemiluminescent detection of targets |
US9114399B2 (en) | 2010-08-31 | 2015-08-25 | Canon U.S. Life Sciences, Inc. | System and method for serial processing of multiple nucleic acid assays |
CA2815076C (en) | 2010-10-22 | 2021-01-12 | Cold Spring Harbor Laboratory | Varietal counting of nucleic acids for obtaining genomic copy number information |
EP2633080B1 (en) | 2010-10-29 | 2018-12-05 | President and Fellows of Harvard College | Method of detecting targets using fluorescently labelled nucleic acid nanotube probes |
US20120329561A1 (en) | 2010-12-09 | 2012-12-27 | Genomic Arts, LLC | System and methods for generating avatars and art |
KR101345337B1 (en) | 2011-06-13 | 2013-12-30 | 한국생명공학연구원 | Preparation apparatus and method of nanopositioning for one-tip multicomponent nano-inking system in the dip-pen nanolithography |
MX342569B (en) | 2011-07-20 | 2016-10-05 | Univ California | Dual-pore device. |
US20130253839A1 (en) | 2012-03-23 | 2013-09-26 | International Business Machines Corporation | Surprisal data reduction of genetic data for transmission, storage, and analysis |
US20150083797A1 (en) | 2012-05-09 | 2015-03-26 | Apdn (B.V.I.) Inc. | Verification of physical encryption taggants using digital representatives and authentications thereof |
DK2856375T3 (en) | 2012-06-01 | 2018-11-05 | European Molecular Biology Laboratory | High capacity storage of digital information in DNA |
EP2875458A2 (en) | 2012-07-19 | 2015-05-27 | President and Fellows of Harvard College | Methods of storing information using nucleic acids |
JP6175453B2 (en) * | 2012-08-07 | 2017-08-02 | 日立造船株式会社 | Encryption and decryption method using nucleic acid |
US9266370B2 (en) | 2012-10-10 | 2016-02-23 | Apdn (B.V.I) Inc. | DNA marking of previously undistinguished items for traceability |
US8937564B2 (en) | 2013-01-10 | 2015-01-20 | Infinidat Ltd. | System, method and non-transitory computer readable medium for compressing genetic information |
EP2951319B1 (en) | 2013-02-01 | 2021-03-10 | The Regents of the University of California | Methods for genome assembly and haplotype phasing |
EP2953524B1 (en) | 2013-02-06 | 2018-08-01 | Freenome Holdings Inc. | Systems and methods for early disease detection and real-time disease monitoring |
KR102245192B1 (en) | 2013-05-06 | 2021-04-29 | 온테라 인크. | Target detection with nanopore |
CA2926436A1 (en) | 2013-10-07 | 2015-04-16 | Judith Murrah | Multimode image and spectral reader |
US10027347B2 (en) | 2014-03-28 | 2018-07-17 | Thomson Licensing | Methods for storing and reading digital data on a set of DNA strands |
US10020826B2 (en) | 2014-04-02 | 2018-07-10 | International Business Machines Corporation | Generating molecular encoding information for data storage |
US20150312212A1 (en) | 2014-04-24 | 2015-10-29 | David Holmes | Holistic embodiment of dna and ipv6 |
EP2958238A1 (en) | 2014-06-17 | 2015-12-23 | Thomson Licensing | Method and apparatus for encoding information units in code word sequences avoiding reverse complementarity |
WO2015199440A1 (en) * | 2014-06-24 | 2015-12-30 | 서울대학교산학협력단 | Nucleic acid sequence security method, device, and recording medium having same saved therein |
KR101788673B1 (en) | 2014-06-24 | 2017-11-15 | 싸이퍼롬, 인코퍼레이티드 | Method for protecting nucleic acid sequence data security and computer readable storage medium storing the method |
US20170218228A1 (en) | 2014-07-30 | 2017-08-03 | Tufts University | Three Dimensional Printing of Bio-Ink Compositions |
WO2016015701A1 (en) | 2014-07-31 | 2016-02-04 | Schebo Biotech Ag | Bioanalysis device, the production thereof and method for detecting bioanalytes by means of the device |
EP2983297A1 (en) * | 2014-08-08 | 2016-02-10 | Thomson Licensing | Code generation method, code generating apparatus and computer readable storage medium |
JP6630347B2 (en) | 2014-09-03 | 2020-01-15 | ナントヘルス,インコーポレーテッド | Synthetic genomic variant-based secure transaction devices, systems, and methods |
US10860562B1 (en) | 2014-09-12 | 2020-12-08 | Amazon Technologies, Inc. | Dynamic predicate indexing for data stores |
SG11201703138RA (en) * | 2014-10-18 | 2017-05-30 | Girik Malik | A biomolecule based data storage system |
EP4350056A3 (en) | 2014-10-22 | 2024-06-26 | 48Hour Discovery Inc. | Genetic encoding of chemical post-translational modification for phage-displayed libraries |
EP3215895B1 (en) | 2014-11-03 | 2022-02-23 | Universität Osnabrück | Method for carrying out capillary nanoprinting, field of ink drops and field of wires obtained according to the method |
AU2015349782B2 (en) | 2014-11-20 | 2020-08-13 | Cytonics Corporation | Therapeutic variant alpha-2-macroglobulin compositions |
EP3067809A1 (en) | 2015-03-13 | 2016-09-14 | Thomson Licensing | Method and apparatus for storing and selectively retrieving data encoded in nucleic acid molecules |
WO2016164779A1 (en) * | 2015-04-10 | 2016-10-13 | University Of Washington | Integrated system for nucleic acid-based storage of digital data |
US10385387B2 (en) | 2015-04-20 | 2019-08-20 | Pacific Biosciences Of California, Inc. | Methods for selectively amplifying and tagging nucleic acids |
WO2016182814A2 (en) | 2015-05-08 | 2016-11-17 | Illumina, Inc. | Cationic polymers and method of surface application |
US10423341B1 (en) | 2015-06-12 | 2019-09-24 | Bahram Ghaffarzadeh Kermani | Accurate and efficient DNA-based storage of electronic data |
US9898579B2 (en) | 2015-06-16 | 2018-02-20 | Microsoft Technology Licensing, Llc | Relational DNA operations |
JP6920275B2 (en) | 2015-07-13 | 2021-08-18 | President and Fellows of Harvard College | Methods for Retrievable Information Memory Using Nucleic Acids |
US10474654B2 (en) | 2015-08-26 | 2019-11-12 | Storagecraft Technology Corporation | Structural data transfer over a network |
WO2017053450A1 (en) | 2015-09-22 | 2017-03-30 | Twist Bioscience Corporation | Flexible substrates for nucleic acid synthesis |
US20170093851A1 (en) | 2015-09-30 | 2017-03-30 | Aetna Inc. | Biometric authentication system |
EP3160049A1 (en) * | 2015-10-19 | 2017-04-26 | Thomson Licensing | Data processing method and device for recovering valid code words from a corrupted code word sequence |
WO2017082978A1 (en) | 2015-11-13 | 2017-05-18 | SoluDot LLC | Method for high throughput dispensing of biological samples |
US10566077B1 (en) | 2015-11-19 | 2020-02-18 | The Board Of Trustees Of The University Of Illinois | Re-writable DNA-based digital storage with random access |
US10047235B2 (en) | 2015-12-08 | 2018-08-14 | Xerox Corporation | Encoding liquid ink with a device specific biomarker |
WO2017106777A1 (en) | 2015-12-16 | 2017-06-22 | Fluidigm Corporation | High-level multiplex amplification |
WO2017142999A2 (en) | 2016-02-18 | 2017-08-24 | President And Fellows Of Harvard College | Methods and systems of molecular recording by crispr-cas system |
US10640822B2 (en) | 2016-02-29 | 2020-05-05 | Iridia, Inc. | Systems and methods for writing, reading, and controlling data stored in a polymer |
WO2017151195A1 (en) | 2016-02-29 | 2017-09-08 | The Penn State Research Foundation | Nucleic acid molecular diagnosis |
US10438662B2 (en) | 2016-02-29 | 2019-10-08 | Iridia, Inc. | Methods, compositions, and devices for information storage |
WO2017184677A1 (en) | 2016-04-21 | 2017-10-26 | President And Fellows Of Harvard College | Method and system of nanopore-based information encoding |
WO2017189914A1 (en) | 2016-04-27 | 2017-11-02 | Massachusetts Institute Of Technology | Sequence-controlled polymer random access memory storage |
US12123878B2 (en) | 2016-05-02 | 2024-10-22 | Encodia, Inc. | Macromolecule analysis employing nucleic acid encoding |
EP3470997B1 (en) | 2016-05-04 | 2024-10-23 | BGI Shenzhen | Method for using dna to store text information, decoding method therefor and application thereof |
EP3478852B1 (en) | 2016-07-01 | 2020-08-12 | Microsoft Technology Licensing, LLC | Storage through iterative dna editing |
US11326200B2 (en) | 2016-07-22 | 2022-05-10 | Hewlett-Packard Development Company, L.P. | Method of preparing test samples |
CN110352253A (en) | 2016-07-22 | 2019-10-18 | 核素示踪有限公司 | The method of amplifying nucleic acid sequence |
WO2018049272A1 (en) | 2016-09-08 | 2018-03-15 | Thomas Villwock | Methods and systems for authenticating goods using analyte encoded security fluids |
KR102217487B1 (en) | 2016-09-21 | 2021-02-23 | Twist Bioscience Corporation | Nucleic acid-based data storage |
US10370246B1 (en) | 2016-10-20 | 2019-08-06 | The Board Of Trustees Of The University Of Illinois | Portable and low-error DNA-based data storage |
EP3532965A1 (en) | 2016-10-28 | 2019-09-04 | Integrated DNA Technologies Inc. | Dna data storage using reusable nucleic acids |
KR102521152B1 (en) | 2016-11-16 | 2023-04-13 | Catalog Technologies, Inc. | Nucleic Acid-Based Systems for Data Storage |
US10650312B2 (en) | 2016-11-16 | 2020-05-12 | Catalog Technologies, Inc. | Nucleic acid-based data storage |
US10853244B2 (en) | 2016-12-07 | 2020-12-01 | Sandisk Technologies Llc | Randomly writable memory device and method of operating thereof |
US10417208B2 (en) | 2016-12-15 | 2019-09-17 | Sap Se | Constant range minimum query |
US10984029B2 (en) | 2016-12-15 | 2021-04-20 | Sap Se | Multi-level directory tree with fixed superblock and block sizes for select operations on bit vectors |
WO2018108328A1 (en) | 2016-12-16 | 2018-06-21 | F. Hoffmann-La Roche Ag | Method for increasing throughput of single molecule sequencing by concatenating short dna fragments |
KR102622275B1 (en) | 2017-01-10 | 2024-01-05 | Roswell Biotechnologies, Inc. | Methods and systems for DNA data storage |
US10793897B2 (en) | 2017-02-08 | 2020-10-06 | Microsoft Technology Licensing, Llc | Primer and payload design for retrieval of stored polynucleotides |
US20200038859A1 (en) | 2017-02-08 | 2020-02-06 | Essenlix Corporation | Digital Assay |
US10787699B2 (en) | 2017-02-08 | 2020-09-29 | Microsoft Technology Licensing, Llc | Generating pluralities of primer and payload designs for retrieval of stored nucleotides |
WO2018148257A1 (en) | 2017-02-13 | 2018-08-16 | Thomson Licensing | Apparatus, method and system for digital information storage in deoxyribonucleic acid (dna) |
CA3054303A1 (en) | 2017-02-22 | 2018-08-30 | Twist Bioscience Corporation | Nucleic acid based data storage |
US10774379B2 (en) | 2017-03-15 | 2020-09-15 | Microsoft Technology Licensing, Llc | Random access of data encoded by polynucleotides |
WO2018213856A2 (en) | 2017-05-16 | 2018-11-22 | Artentika (Pty) Ltd | Digital data minutiae processing for the analysis of cultural artefacts |
US11174512B2 (en) | 2017-05-31 | 2021-11-16 | Molecular Assemblies, Inc. | Homopolymer encoded nucleic acid memory |
US10982276B2 (en) | 2017-05-31 | 2021-04-20 | Molecular Assemblies, Inc. | Homopolymer encoded nucleic acid memory |
US11612873B2 (en) | 2017-05-31 | 2023-03-28 | Molecular Assemblies, Inc. | Homopolymer encoded nucleic acid memory |
US10742233B2 (en) | 2017-07-11 | 2020-08-11 | Erlich Lab Llc | Efficient encoding of data for storage in polymers such as DNA |
WO2019046768A1 (en) | 2017-08-31 | 2019-03-07 | William Marsh Rice University | Symbolic sequencing of dna and rna via sequence encoding |
GB201714827D0 (en) | 2017-09-14 | 2017-11-01 | Nuclera Nucleics Ltd | Novel use |
US11100404B2 (en) | 2017-10-10 | 2021-08-24 | Roswell Biotechnologies, Inc. | Methods, apparatus and systems for amplification-free DNA data storage |
EP3682449A1 (en) | 2017-10-27 | 2020-07-22 | ETH Zurich | Encoding and decoding information in synthetic dna with cryptographic keys generated based on polymorphic features of nucleic acids |
US10940171B2 (en) | 2017-11-10 | 2021-03-09 | Massachusetts Institute Of Technology | Microbial production of pure single stranded nucleic acids |
IL275818B2 (en) | 2018-01-04 | 2024-10-01 | Twist Bioscience Corp | Dna-based digital information storage |
EP3765063A4 (en) | 2018-03-15 | 2021-12-15 | Twinstrand Biosciences, Inc. | Methods and reagents for enrichment of nucleic acid material for sequencing applications and other nucleic acid material interrogations |
EP3766077A4 (en) | 2018-03-16 | 2021-12-08 | Catalog Technologies, Inc. | Chemical methods for nucleic acid-based data storage |
US11339423B2 (en) | 2018-03-18 | 2022-05-24 | Bryan Bishop | Systems and methods for data storage in nucleic acids |
WO2019195479A1 (en) | 2018-04-03 | 2019-10-10 | Ippsec Inc. | Systems and methods of physical infrastructure and information technology infrastructure security |
KR102138864B1 (en) | 2018-04-11 | 2020-07-28 | 경희대학교 산학협력단 | Dna digital data storage device and method, and decoding method of dna digital data storage device |
US11106633B2 (en) | 2018-04-24 | 2021-08-31 | EMC IP Holding Company, LLC | DNA-based data center with deduplication capability |
AU2019270160B2 (en) | 2018-05-16 | 2024-09-19 | Catalog Technologies, Inc. | Printer-finisher system for data storage in DNA |
US20200193301A1 (en) | 2018-05-16 | 2020-06-18 | Catalog Technologies, Inc. | Compositions and methods for nucleic acid-based data storage |
GB2574197B (en) | 2018-05-23 | 2022-01-05 | Oxford Nanopore Tech Ltd | Double stranded polynucleotide synthesis method and system. |
US11093547B2 (en) | 2018-06-19 | 2021-08-17 | Intel Corporation | Data storage based on encoded DNA sequences |
US11093865B2 (en) | 2018-06-20 | 2021-08-17 | Brown University | Methods of chemical computation |
US20230027270A1 (en) | 2018-06-20 | 2023-01-26 | Brown University | Methods of chemical computation |
US11651836B2 (en) | 2018-06-29 | 2023-05-16 | Microsoft Technology Licensing, Llc | Whole pool amplification and in-sequencer random-access of data encoded by polynucleotides |
WO2020014478A1 (en) | 2018-07-11 | 2020-01-16 | The Regents Of The University Of California | Nucleic acid-based electrically readable, read-only memory |
AU2019315604A1 (en) | 2018-08-03 | 2021-03-25 | Catalog Technologies, Inc. | Systems and methods for storing and reading nucleic acid-based data with error protection |
US10673847B2 (en) | 2018-08-28 | 2020-06-02 | Ofer A. LIDSKY | Systems and methods for user authentication based on a genetic sequence |
EP3847649A4 (en) | 2018-09-07 | 2022-08-31 | Iridia, Inc. | Improved systems and methods for writing and reading data stored in a polymer |
US11164190B2 (en) | 2018-11-29 | 2021-11-02 | International Business Machines Corporation | Method for product authentication using a microfluidic reader |
US11162950B2 (en) | 2018-11-29 | 2021-11-02 | International Business Machines Corporation | Zonal nanofluidic anti-tamper device for product authentication |
GB201821155D0 (en) | 2018-12-21 | 2019-02-06 | Oxford Nanopore Tech Ltd | Method |
US11704575B2 (en) | 2018-12-21 | 2023-07-18 | Microsoft Technology Licensing, Llc | Neural networks implemented with DSD circuits |
EP3904527A4 (en) | 2018-12-26 | 2022-08-10 | BGI Shenzhen | Method and device for fixed-point editing of nucleotide sequence stored with data |
US11507135B2 (en) | 2019-04-15 | 2022-11-22 | Government Of The United States Of America, As Represented By The Secretary Of Commerce | Molecular scrivener for reading or writing data to a macromolecule |
WO2020227718A1 (en) | 2019-05-09 | 2020-11-12 | Catalog Technologies, Inc. | Data structures and operations for searching, computing, and indexing in dna-based data storage |
US10956806B2 (en) | 2019-06-10 | 2021-03-23 | International Business Machines Corporation | Efficient assembly of oligonucleotides for nucleic acid based data storage |
US11066661B2 (en) | 2019-08-20 | 2021-07-20 | Seagate Technology Llc | Methods of gene assembly and their use in DNA data storage |
US20210074380A1 (en) | 2019-09-05 | 2021-03-11 | Microsoft Technology Licensing, Llc | Reverse concatenation of error-correcting codes in dna data storage |
US11495324B2 (en) | 2019-10-01 | 2022-11-08 | Microsoft Technology Licensing, Llc | Flexible decoding in DNA data storage based on redundancy codes |
US11755922B2 (en) | 2019-10-04 | 2023-09-12 | The Board Of Trustees Of The University Of Illinois | On-chip nanoscale storage system using chimeric DNA |
US10917109B1 (en) | 2020-03-06 | 2021-02-09 | Centre National De La Recherche Scientifique | Methods for storing digital data as, and for transforming digital data into, synthetic DNA |
US11702689B2 (en) | 2020-04-24 | 2023-07-18 | Microsoft Technology Licensing, Llc | Homopolymer primers for amplification of polynucleotides created by enzymatic synthesis |
KR20230008877A (en) | 2020-05-11 | 2023-01-16 | Catalog Technologies, Inc. | Programs and functions of DNA-based data storage |
US20230230636A1 (en) | 2020-05-15 | 2023-07-20 | The Curators of the University of Missouri | Nanopore unzipping-sequencing for dna data storage |
CN111858510B (en) | 2020-07-16 | 2021-08-20 | 中国科学院北京基因组研究所(国家生物信息中心) | DNA type storage system and method |
US11720801B2 (en) | 2020-08-25 | 2023-08-08 | International Business Machines Corporation | Chemical reaction network for estimating concentration of chemical species based on an identified pattern of output chemical species |
WO2022055885A1 (en) | 2020-09-08 | 2022-03-17 | Catalog Technologies, Inc. | Systems and methods for writing by sequencing of nucleic acids |
WO2022204442A1 (en) | 2021-03-24 | 2022-09-29 | Northeastern University | Method and system for decoding information stored on a polymer sequence |
US20220389483A1 (en) | 2021-06-03 | 2022-12-08 | Microsoft Technology Licensing, Llc | OLIGONUCLEOTIDE ASSEMBLY USING pH BASED ELECTRODE CONTROLLED HYBRIDIZATION |
GB2610380A (en) | 2021-08-23 | 2023-03-08 | Cambridge Entpr Ltd | Nucleic acid detection |
US20230161995A1 (en) | 2021-11-23 | 2023-05-25 | International Business Machines Corporation | Dna data storage using composite fragments |
US20230215516A1 (en) | 2022-01-05 | 2023-07-06 | Quantum Corporation | Joint multi-nanopore sequencing for reliable data retrieval in nucleic acid storage |
US20230257789A1 (en) | 2022-02-11 | 2023-08-17 | Microsoft Technology Licensing, Llc | Enzymatic oligonucleotide assembly using hairpins and enzymatic cleavage |
US20230257788A1 (en) | 2022-02-11 | 2023-08-17 | Microsoft Technology Licensing, Llc | Oligonucleotide assembly using hairpins and invading strands |
-
2017
- 2017-11-16 KR KR1020197017138A patent/KR102521152B1/en active IP Right Grant
- 2017-11-16 KR KR1020197017136A patent/KR102534408B1/en active IP Right Grant
- 2017-11-16 CA CA3043884A patent/CA3043884A1/en active Pending
- 2017-11-16 KR KR1020237016476A patent/KR20230074828A/en not_active Application Discontinuation
- 2017-11-16 GB GB1721459.4A patent/GB2563105B/en active Active
- 2017-11-16 WO PCT/US2017/062106 patent/WO2018094115A1/en unknown
- 2017-11-16 EP EP17872574.3A patent/EP3542295A4/en active Pending
- 2017-11-16 AU AU2017363139A patent/AU2017363139B2/en active Active
- 2017-11-16 KR KR1020237012100A patent/KR20230054484A/en not_active Application Discontinuation
- 2017-11-16 JP JP2019547250A patent/JP7179008B2/en active Active
- 2017-11-16 JP JP2019547252A patent/JP7107956B2/en active Active
- 2017-11-16 EP EP17872172.6A patent/EP3542294B1/en active Active
- 2017-11-16 EP EP24174128.9A patent/EP4424824A2/en active Pending
- 2017-11-16 CA CA3043887A patent/CA3043887A1/en active Pending
- 2017-11-16 WO PCT/US2017/062098 patent/WO2018094108A1/en active Application Filing
- 2017-11-16 US US16/461,774 patent/US11763169B2/en active Active
- 2017-11-16 ES ES17872172T patent/ES2979182T3/en active Active
- 2017-11-16 AU AU2017363146A patent/AU2017363146B2/en active Active
-
2022
- 2022-11-15 JP JP2022182278A patent/JP2023029836A/en active Pending
-
2023
- 2023-08-04 US US18/230,382 patent/US20230376786A1/en active Pending
- 2023-08-04 US US18/230,383 patent/US20230376787A1/en active Pending
- 2023-08-04 US US18/230,385 patent/US20230376788A1/en active Pending
- 2023-08-04 US US18/230,273 patent/US12001962B2/en active Active
- 2023-12-21 AU AU2023285827A patent/AU2023285827A1/en active Pending
-
2024
- 2024-01-30 AU AU2024200559A patent/AU2024200559A1/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11379729B2 (en) | | Nucleic acid-based data storage |
US20230376788A1 (en) | | Nucleic acid-based data storage |
US11227219B2 (en) | | Compositions and methods for nucleic acid-based data storage |
US12006497B2 (en) | | Chemical methods for nucleic acid-based data storage |
JP2022551186A (en) | | Nucleic acid security and authentication |
AU2023234435A1 (en) | | Combinatorial enumeration and search for nucleic acid-based data storage |
KR20230160898A (en) | | Fixed-point number representation and calculation circuit |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| AS | Assignment | Owner name: CATALOG TECHNOLOGIES, INC., MASSACHUSETTS. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: ROQUET, NATHANIEL; PARK, HYUNJUN; BHATIA, SWAPNIL P.; REEL/FRAME: 064539/0151. Effective date: 2018-04-17 |
| STPP | Information on status: patent application and granting procedure in general | Free format text: SPECIAL NEW |