Linear-Time Encoding/Decoding of Irreducible Words for Codes Correcting Tandem Duplications. Tuan Thanh Nguyen, Nanyang Technological University (NTU), Singapore. Joint work with Yeow Meng Chee, Han Mao Kiah, and Johan Chrisnata.
Our motivation
• Applications that store data in living organisms
• Shipman et al. (2017): CRISPR-Cas encoding of a digital movie into the genomes of a population of living bacteria.
Our motivation
• Errors due to biological mutations: deletion, insertion, substitution, duplication, inversion, translocation
• Duplication
  • is one of the two common repeats found in the human genome (more than 50%)
  • plays an important role in determining an individual’s inherited traits
  • is believed to be the cause of several disorders
• Example of a duplication: A C G A T G C → A C G A T G A T G C (the segment G A T is duplicated)
Problem Classification
• Problems are classified by the duplication length and by the number of errors:
  1. Fixed-length duplications: 1.1 Bounded, 1.2 Unbounded
  2. Variable-length duplications: 2.1 Bounded, 2.2 Unbounded
• Tandem duplication example: given A C G A G C A T,
  • fixed-length duplications can produce A C G A C G A G C A G C A T
  • variable-length duplications can produce A A C G C G A G C A G C A T T
• We focus on the worst-case scenario!
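The two example strings on this slide can be reproduced with a minimal sketch of a single tandem duplication. The function name `tandem_duplicate` and its parameters are hypothetical, chosen only for illustration; the slide does not specify an implementation.

```python
def tandem_duplicate(s: str, start: int, length: int) -> str:
    """Insert a copy of s[start:start+length] immediately after that segment."""
    if length <= 0 or start < 0 or start + length > len(s):
        raise ValueError("segment out of range")
    segment = s[start:start + length]
    return s[:start + length] + segment + s[start + length:]


# Fixed-length duplications (every duplicated segment has length 3).
fixed = tandem_duplicate("ACGAGCAT", 0, 3)   # ACG is duplicated
fixed = tandem_duplicate(fixed, 6, 3)        # AGC is duplicated
print(fixed)                                 # ACGACGAGCAGCAT

# Variable-length duplications (segment lengths 1, 2, 3 and 1).
var = tandem_duplicate("ACGAGCAT", 0, 1)     # A
var = tandem_duplicate(var, 2, 2)            # CG
var = tandem_duplicate(var, 6, 3)            # AGC
var = tandem_duplicate(var, 13, 1)           # T
print(var)                                   # AACGCGAGCAGCATT
```

The number of calls corresponds to the number of errors, and the `length` argument to the duplication length, which are the two axes of the classification.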
Notation
• Given an alphabet and an integer :
• 012 is -irreducible
• 0012, 012012, 01122, 01212012 and 00112012212012 are -descendants of 012; all such words together form the -descendant cone of 012
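As a rough illustration of this notation, the sketch below tests irreducibility and reduces a word to its root by deleting one copy of each tandem repeat. The symbol for the length bound is lost in the transcript, so the parameter name `k` and the value `k = 3` used here are assumptions, as are the function names.

```python
def find_repeat(word: str, k: int):
    """Return (start, length) of some tandem repeat uu with |u| <= k, or None."""
    for length in range(1, k + 1):
        for start in range(len(word) - 2 * length + 1):
            if word[start:start + length] == word[start + length:start + 2 * length]:
                return start, length
    return None


def is_irreducible(word: str, k: int) -> bool:
    """A word is irreducible if it contains no tandem repeat of length <= k."""
    return find_repeat(word, k) is None


def root(word: str, k: int) -> str:
    """Delete one copy of a tandem repeat until none of length <= k remains."""
    while True:
        hit = find_repeat(word, k)
        if hit is None:
            return word
        start, length = hit
        word = word[:start + length] + word[start + 2 * length:]


K = 3  # assumed bound; the value on the slide is not recoverable
print(is_irreducible("012", K))  # True
for w in ["0012", "012012", "01122", "01212012", "00112012212012"]:
    print(w, "->", root(w, K))   # each word reduces to 012
```

Every example word on the slide reduces to 012 under this assumption, which is consistent with reading them as descendants of 012.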
Problem Formulation
• Goal: Given , construct a code such that “For all ”
Previous Works
• Optimal codes are found when (Jain et al. 2017)
• A method to construct codes when is provided (Jain et al. 2017)
  • Main idea: using “irreducible words”
• There is no known result when
(Figure labels: 0122, 0112, 01122)
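The code condition is lost in the transcript. Assuming it is the usual requirement that the descendant cones of distinct codewords do not intersect, the sketch below enumerates a length-bounded part of each cone and checks for overlap. The names `descendants`, `k` and `max_len` are hypothetical, and `k = 3` is an assumed value.

```python
def descendants(word: str, k: int, max_len: int) -> set:
    """Words reachable from `word` by tandem duplications of length <= k,
    restricted to total length <= max_len (the word itself is included)."""
    seen = {word}
    frontier = [word]
    while frontier:
        current = frontier.pop()
        for length in range(1, k + 1):
            if len(current) + length > max_len:
                break
            for start in range(len(current) - length + 1):
                segment = current[start:start + length]
                child = current[:start + length] + segment + current[start + length:]
                if child not in seen:
                    seen.add(child)
                    frontier.append(child)
    return seen


# 0112 and 0122 share the descendant 01122, so they cannot both be codewords.
common = descendants("0112", 3, 5) & descendants("0122", 3, 5)
print("01122" in common)  # True

# Two different irreducible words: no shared descendant up to length 8.
print(descendants("0120", 3, 8).isdisjoint(descendants("0121", 3, 8)))  # True
```

The search is exhaustive only up to `max_len`, so it can refute disjointness but cannot prove it for the full, infinite cones.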
Previous Work
• The code is optimal
• For , different irreducible words generate different descendants!
(Figure: descendant cones labelled a, b, c, d for the irreducible words 0121, 1201, 1210, 0120)
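Previous Work
• For , we can choose more than one codeword in each cone!
• Irreducible words form an “almost optimal” code!
(Figure: descendant cones labelled a, b, c, d for the irreducible words 0120, 0121, 1201, 1210, with additional points labelled A, C, D)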
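To make the idea of a code built from irreducible words concrete, the following brute-force sketch lists all irreducible words of a given length. The ternary alphabet, the length 4 and the bound `k = 3` are assumptions chosen to match the words in the figure; the function names are hypothetical.

```python
from itertools import product


def is_irreducible(word: str, k: int) -> bool:
    """True if `word` contains no tandem repeat uu with |u| <= k."""
    for length in range(1, k + 1):
        for start in range(len(word) - 2 * length + 1):
            if word[start:start + length] == word[start + length:start + 2 * length]:
                return False
    return True


def irreducible_words(alphabet: str, n: int, k: int) -> list:
    """All irreducible words of length n, by exhaustive search over alphabet^n."""
    words = ("".join(letters) for letters in product(alphabet, repeat=n))
    return [w for w in words if is_irreducible(w, k)]


code = irreducible_words("012", 4, 3)
print(len(code))  # 18
print(all(w in code for w in ["0120", "0121", "1201", "1210"]))  # True
```

Exhaustive search takes time exponential in `n`, so this is only useful for inspecting small cases; it is not the linear-time encoder the talk describes.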
Our Main Results
• Detailed analysis of the codes constructed from irreducible words when ; such codes are denoted by
  • Provide an explicit formula to compute the size and asymptotic rate
  • Provide an upper bound for the optimal code and hence conclude that is almost optimal
• Linear-time encoder for
  • The extension of this encoder provides the first known encoder for previously constructed codes.
Publication: IEEE International Symposium on Information Theory 2018.
Encoder of for (Sketched idea)
(Figure: Input x → Encoder → irreducible word y → Duplication channel → y' → Decoder, made up of an Error-decoder and an Irr-decoder → output x)
• To achieve encoding rates at least optimal rate, we only require
• For we define the neighbours of
(Figure: … x … mapped through blocks Irr, Irr, Irr to … y …)
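The channel and error-decoder stages of this pipeline can be simulated with a small sketch. It assumes that the Error-decoder recovers y from y' by deleting tandem repeats until an irreducible word remains, which the slide does not spell out, and it assumes a bound `k = 3`. The Irr-encoder and Irr-decoder (the maps between x and y) are not reproduced; all names are hypothetical.

```python
import random


def duplication_channel(word: str, k: int, num_errors: int, rng: random.Random) -> str:
    """Apply `num_errors` random tandem duplications of length <= k."""
    for _ in range(num_errors):
        length = rng.randint(1, min(k, len(word)))
        start = rng.randint(0, len(word) - length)
        segment = word[start:start + length]
        word = word[:start + length] + segment + word[start + length:]
    return word


def error_decode(word: str, k: int) -> str:
    """Assumed error-decoder: remove tandem repeats of length <= k until none remain."""
    changed = True
    while changed:
        changed = False
        for length in range(1, k + 1):
            for start in range(len(word) - 2 * length + 1):
                if word[start:start + length] == word[start + length:start + 2 * length]:
                    word = word[:start + length] + word[start + 2 * length:]
                    changed = True
                    break
            if changed:
                break
    return word


y = "0121021"                      # an irreducible word used as the codeword
rng = random.Random(0)             # fixed seed so the run is repeatable
y_received = duplication_channel(y, 3, 10, rng)
print(y_received)
print(error_decode(y_received, 3) == y)
```

The number of duplications is a free parameter here, reflecting the unbounded-errors setting.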
Example
(Figure: input 20101, shown as 1 0 1 0 2; block labels 010, 212, 010, 120, 210, 120 and 120, 210, 010, 212; string 212010210120120)
Recent Work: Special attention on
The GC-content of a DNA string refers to the number of nucleotides that are G or C. DNA strings whose GC-content is too high or too low are more prone to both synthesis and sequencing errors. Many recent works use DNA strings whose GC-content is close to 50% or exactly 50%. This is referred to as the “GC-balanced constraint”.
Our updated encoder:
(Figure: example strings ATGCTACG, ATACTA and AAAA placed under the labels Irreducible GC-balanced and Irreducible)
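The GC-balanced constraint can be checked with a few lines. This is a minimal sketch; the function names and the `tolerance` parameter (allowing "close to 50%" rather than exactly 50%) are assumptions made for illustration.

```python
def gc_content(dna: str) -> int:
    """Number of nucleotides in `dna` that are G or C."""
    return sum(1 for base in dna if base in "GC")


def is_gc_balanced(dna: str, tolerance: float = 0.0) -> bool:
    """True if the GC fraction is within `tolerance` of 50%.
    Written with integers: |2*gc - n| <= 2*tolerance*n."""
    n = len(dna)
    return abs(2 * gc_content(dna) - n) <= 2 * tolerance * n


for s in ["ATGCTACG", "ATACTA", "AAAA"]:
    print(s, gc_content(s), is_gc_balanced(s))
# ATGCTACG 4 True
# ATACTA 1 False
# AAAA 0 False
```

Of the three example strings, only ATGCTACG has exactly half of its nucleotides equal to G or C.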
Knuth Balancing Method
• Input: 0 1 0 0 0 1 0 0
• Flip: 1 0 1 1 0 1 0 0
• Output codeword: 1 0 1 1 0 1 0 0
• Redundancy: (to encode the index t + a look-up table)
Modified Knuth Method (Irreducible + GC-balanced)
• Input: A T C A T G A T
• Flip: G C T G T G A T
• Output codeword: G C T G T G A T
• Redundancy: (linear-time encoding + no need for a look-up table)
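The binary example follows the classical Knuth balancing idea: flip a prefix of the word, choosing the prefix length t so that the result has equally many 0s and 1s. The sketch below implements that classical step only. The DNA helper `flip_prefix_dna` uses the swap A↔G, T↔C, which is inferred from the slide's example; how the modified method selects t while also preserving irreducibility is not reproduced here. All names are hypothetical.

```python
def knuth_balance(bits):
    """Return (t, word) where `word` is `bits` with its first t bits flipped
    and t is the smallest index that makes the word balanced."""
    n = len(bits)
    if n % 2 != 0:
        raise ValueError("length must be even")
    word = list(bits)
    for t in range(n + 1):
        if 2 * sum(word) == n:      # as many 1s as 0s
            return t, word
        if t < n:
            word[t] ^= 1            # extend the flipped prefix by one bit
    raise AssertionError("unreachable: an even-length word can always be balanced")


FLIP = {"A": "G", "G": "A", "T": "C", "C": "T"}  # swap inferred from the slide


def flip_prefix_dna(dna: str, t: int) -> str:
    """Flip the first t nucleotides using the A<->G, T<->C swap."""
    return "".join(FLIP[base] for base in dna[:t]) + dna[t:]


t, balanced = knuth_balance([0, 1, 0, 0, 0, 1, 0, 0])
print(t, balanced)                       # 4 [1, 0, 1, 1, 0, 1, 0, 0]
print(flip_prefix_dna("ATCATGAT", 4))    # GCTGTGAT
```

The index t has to be conveyed to the decoder so that the flip can be undone, which is the source of the redundancy mentioned on the slide.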
Recent Work: Design codes when
• The size of our code is at least
• In terms of rate:
Summary
Goal: “Given , construct the largest code where each codeword is of length over a -ary alphabet that can correct unbounded tandem duplications of length at most .”
Previous Works
• Optimal codes when (Jain et al. 2017)
• A method to construct codes when
Our work
• Provide an upper bound and a lower bound for codes when
• Linear-time encoder for known TD codes ( ) (IEEE ISIT 2018)
• Linear-time encoder for TD GC-balanced code
• A method to construct codes when
Further work
• Find optimal codes when
• Reduce the redundancy of the encoder for TD GC-balanced code
• Design better codes when