Representative publications

Click here to return to the main page.

Codes for distributed storage

  • Explicit constructions of high-rate MDS array codes with optimal repair bandwidth, with M. Ye, 2016, preprint
    Explicit constructions of optimal-access MDS codes with nearly optimal sub-packetization, with M. Ye, 2016, preprint
  • Abstract: Maximum distance separable (MDS) codes are optimal error-correcting codes in the sense that they provide the maximum failure-tolerance for a given number of parity nodes. An $(n,k,l)$ MDS array code of length $n,$ dimension $k=n-r$ and sub-packetization $l$ is formed of $l\times n$ matrices over a finite field $F$, with every column of the matrix stored on a separate node in a distributed storage system and viewed as a coordinate of the codeword. Repair of a failed node (recovery of one erased column) can be performed by accessing a set of $d\le n-1$ surviving (helper) nodes. It is known that if $h$ out of the $n$ nodes are inaccessible and $d$ surviving (helper) nodes are used to recover the lost data, then we need to download at least $h/(d+h-k)$ fraction of the data stored in each of the helper nodes (Dimakis et al., 2010 and Cadambe et al., 2013). If this lower bound is achieved for the repair of any $h$ erased nodes from any $d$ helper nodes, we say that the MDS code has the $(h,d)$-optimal repair property.

    In the first of these two papers we study high-rate MDS array codes with the optimal repair property. Explicit constructions of such codes in the literature are only available for the cases where there are at most 3 parity nodes, and these existing constructions can only optimally repair a single node failure by accessing all the surviving nodes. In this paper, given any $r$ and $n$, we present two explicit constructions of MDS array codes with the $(h,d)$-optimal repair property for all $h\le r$ and $k\le d\le n-h$ simultaneously. Codes in the first family can be constructed over any base field $F$ as long as $|F|\ge sn,$ where $s=\text{lcm}(1,2,\dots,r).$ The encoding, decoding, repair of failed nodes, and update procedures of these codes all have low complexity.
    The code is said to have the optimal access property if the amount of data accessed at each of the helper nodes meets a lower bound on this quantity. Codes in the second family have the optimal access property and can be constructed over any base field $F$ as long as $|F|\ge n+1.$ Moreover, both code families have the optimal error resilience capability when repairing failed nodes. We also construct several other related families of MDS codes with the optimal repair property.
    In the conference version of this paper (ISIT2016) we also construct a family of Reed-Solomon codes with asymptotically optimal repair bandwidth, improving upon a recent work of V. Guruswami and M. Wootters.

    In the second paper (on nearly optimal sub-packetization) we address construction of codes that target the "node contents", i.e., the length of the columns in the encoding. In optimal-access MDS codes with $d=n-1$, the sub-packetization $l$ satisfies the bound $l\ge r^{(k-1)/r}.$ Codes discussed above have sub-packetization $l=r^{n-1}.$ Here we take up the question of reducing the value $l$ to make it approach the lower bound. We construct an explicit family of optimal-access codes with $l=r^{\lceil n/r\rceil},$ which differs from the optimal value by at most a factor of $r^2$. These codes can be constructed over any finite field $F$ as long as $|F|\ge r\lceil n/r\rceil,$ and afford low-complexity encoding and decoding procedures.

  • A family of optimal locally recoverable codes
    with I. Tamo, IEEE Trans. Inform. Theory, 60, no.8, 2014, 4661-4676, preprint
  • Abstract: We consider codes over a finite field ${\mathbb F}_q$ with the property that every symbol of the codeword can be recovered from the values of some other $r$ symbols in that codeword. This property, called local recoverability, is useful for data coding for distributed storage. Apart from recoverability, it is desirable that the code have large minimum distance in order to correct multiple disk failures. Let $(n,k,d,r)$ denote the length, dimension, distance, and the recovery parameter of the code. It is known that $ d\le n-k-\lceil\frac kr\rceil+2. $ This bound, which is analogous to the classical Singleton bound of coding theory, establishes the maximum possible minimum distance of an LRC code. In this work we construct $q$-ary codes of a given length $n$ whose parameters attain this bound. The codes are constructed as evaluations of specially designed polynomials over ${\mathbb F}_q.$ The recovery of one lost symbol (correction of one erasure) is done using simple polynomial interpolation.
    We also construct families of codes with two disjoint recovering sets for every symbol as well as observe several algebraic properties of the constructed code families.

  • Constructions of LRC codes; Codes with availability

    This is a group of ongoing projects related to the above work. So far three papers have been completed.
    Here we extend the RS-type construction of the above paper to codes on algebraic curves. It turns out that codes on curves are very well suited to handle locality. We also give examples of code families that asymptotically improve a Gilbert-Varshamov type bound derived in another recent work. In this paper we study a cyclic subfamily of the RS-type family, deriving conditions for locality in terms of code's zeros and studying binary LRC codes. As a part of this work, we address the problem of bounding above the distance of a cyclic codes in terms of the zeros of the code, and prove several estimates of the distance.

    Tutorial on "Codes with locality constraints," joint with Itzhak Tamo. 2016 IEEE Int. Sympos. Inform. Theory (ISIT)

    A tutorial on codes for storage at 2015 San Paulo Coding School.
    Part I: Slides, video;
    Part II: Slides, video.

    Polar codes

  • Polar codes for q-ary channels, q=2r
    with Woomyong Park, IEEE Trans. Inform. Theory, 59, no.2, 2013, pp. 955-969, preprint.
  • Abstract: Polarization phenomenon enables one to construct codes that attain capacity of discrete memoryless channels. The original construction of polar codes was restricted to binary-input channels. In this paper we addressed the question of constructing codes for codes with input alphabet of size $q=2^r, r\ge 2,$ relying on Arikan's transform matrix to attain polarization, with operations modulo $q.$ It turns out that message symbols polarize to $r+1$ extremal configurations, from symbols all of whose bits are approximately uniform to symbols transmitting reliably one, two, ..., all $r$ bits. Interestingly, if bit $i$ of the symbol is reliable, then all the bits $i+1,\dots,r$ are reliable, too. This result follows from a monotone property of the Bhattacharyya parameters proved in the paper. Later it was shown that multilevel polarization occurs for many other group alphabets, giving rise to different sets of extremal configurations. Apart from the monotone property, another technical result in this paper is a bound on the Bhattacharyya parameters of channels of capacity decreased with respect to the original channel after one polarization step (the so-called minus-channels).
    In a later associated paper we modified the technique of this paper to arrive at codes that reduce the set of $r+1$ extremal configurations to any number between $r+1$ and $2$ (as in Arikan's construction). Interestingly, it is possible to control the polarization process in any pre-defined way, although we have not been able to find a good application of this feature.

  • Polar codes for distributed hierarchical source coding
    with Min Ye, Advances in Mathematics of Communications, 9, no. 1, 2015, 87--103. preprint
    Interactive function computation via polar codes
    with T. C. Gulcu, Probl. Inform. Trans., 52, no. 1, 2016, pp. 66-91. preprint
  • Abstract: The successive refinement problem (also known as source divisibility) is a scheme of hierarchical compression in which two users are provided with a coarse and a refined representation of a memoryless source in such a way that the refined representation is obtained by adjoining a bit sequence to the already computed coarse representation. Studying discrete memoryless sources, we construct a polar coding scheme for successive refinement that attains the rate-distortion functions found by Koshelev (1981) and later independently by Equitz and Cover (1991). We also extend our construction to a distributed version of the Wyner-Ziv problem (lossy compression with side information), again attaining the optimal compression rates found in the literature.
    The second paper explores a similar setting, looking at applications of distributed compression for interative computation of a function by two terminals observing correlated time-independent realizations of a pair of random variables. We construct a polar-coding scheme that attains optimal rates of compression for this problem.

    Algebraic combinatorics; applications of association schemes

  • Linear codes on posets with extension property
    with L. Felix, M. Firer, and M. Spreafico, Discrete Mathematics, 317, 2014, 1-13, preprint
  • Abstract: One of the classical parts of coding theory connects additive codes over group and field alphabets with translation association schemes. In particular, such schemes offer a straightforward way of proving MacWilliams-like identities for code enumerators. Translation schemes have naturally defined duals, and linear-algebraic duality of linear codes in the Hamming space is usually identified with the duality of schemes.
    The true picture is more complicated in the sense that the dual code may not follow the structure of the dual scheme. A good example is given by codes in partially ordered metric spaces (i.e., spaces in which the distance is governed by a partial order on the set of coordinates). We study the question of when the two mentioned dualities coincide, focusing on partial orders saitsfy the extension property (meaning that automorphisms of ideals extend to automorphisms of the poset). The main result is that the dual association scheme is isomorphic to the association scheme on the dual poset if and only if the poset is self-dual.
    Partially ordered metric spaces have a number of interesting applications in coding theory, including list decoding, and in approximation theory.

    I summarized my work on ordered metrics in an hour-long talk at the 2016 IEEE International Symposium on Information Theory (Barcelona, July 15, 2016). The slides can be found on this page.

  • Association schemes on general measure spaces and zero-dimensional Abelian groups
    with M. Skriganov, Advances in Mathematics, 281, 2015, pp. 142-247, preprint | video
  • Abstract: Association schemes form one of the main objects of algebraic combinatorics, classically defined on finite sets. In classical theory, a symmetric association scheme is defined on a finite set. Given a finite set $X$, let us color the edges of the complete graph $K_{|X|}$ into $s$ colors so that, given any two vertices $x,y\in X$ such that the edge $(x,y)$ has color $k$, the number of neighbors $z$ such that the color of $(x,z)$ is $i$ and the color of $(y,z)$ is $j$, depends only on $k$ but not on $x,y.$ This definition can be phrased in algebraic terms, giving rise to a commutative matrix algebra called the Bose-Mesner algebra. Its properties have a number of far-reaching consequences for graphs, designs, codes and other objects.

    At the same time, direct extensions of this concept to infinite sets encounter some problems even in the case of countable sets, for instance, countable discrete Abelian groups. In an attempt to resolve these difficulties, we define association schemes on arbitrary, possibly uncountable sets with a measure. We study operator realizations of the adjacency algebras of schemes and derive simple properties of these algebras. However, constructing a complete theory in the general case faces a set of obstacles related to the properties of the adjacency algebras and associated projection operators. To develop a theory of association schemes, we focus on schemes on topological Abelian groups where we can employ duality theory and the machinery of harmonic analysis. Using the language of spectrally dual partitions, we prove that such groups support the construction of general Abelian (translation) schemes and establish properties of their spectral parameters (eigenvalues).

    Addressing the existence question of spectrally dual partitions, we show that they arise naturally on topological zero-dimensional Abelian groups, for instance, Cantor-type groups or the groups of $p$-adic numbers. This enables us to construct large classes of examples of dual pairs of association schemes on zero-dimensional groups with respect to their Haar measure, and to compute their eigenvalues and intersection numbers (structural constants). We also derive properties of infinite metric schemes, connecting them with the properties of the non-Archimedean metric on the group. Next we focus on the connection between schemes on zero-dimensional groups and harmonic analysis. We show that the eigenvalues have a natural interpretation in terms of Littlewood-Paley wavelet bases, and in the (equivalent) language of martingale theory. For a class of nonmetric schemes constructed in the paper, the eigenvalues coincide with values of orthogonal function systems on zero-dimensional groups. We observe that these functions, which we call Haar-like bases, have the properties of wavelet bases on the group, including in some special cases the self-similarity property. This establishes a seemingly new link between algebraic combinatorics and (non-Archimedean) harmonic analysis. We conclude the paper by studying some analogs of problems of classical coding theory related to the theory of association schemes.

  • Bounds on ordered codes and orthogonal arrays
    with P. Purkayastha, Moscow Mathematical Journal 9, 2, 2009, pp. 211-243, preprint
  • Abstract: Consider the following norm on the space of $r$-vectors over a finite field ${\mathbb F}_q:$ given a vector $x\ne 0,$ let $\|x\|=\max\{i: x_i\ne 0\}.$ Further, given $x\in {\mathbb F_q}^{N}, N=r\,n$ compute its norm as the sum of the norms of all of its $r$-blocks. It is not difficult to convince oneself that this is a properly defined norm (metric), and it is called the Niederreiter-Rosenbloom-Tsfasman or NRT distance. The main aim of this paper is to derive bounds on the maximum cardinality of a code $C\subset {\mathbb F_q}^{N}$ with a given value of the minimum distance.
    We derive a Bassalygo-Elias type bound on codes and analogs of the classical linear programming bounds of coding theory. Along the way we establish a set of properties of the association scheme of the space ${\mathbb F_q}^{N}$ (the NRT scheme) defined previously here, as well as some properties of multivariate Krawtchouk polynomials, which are the eigenvalues of the scheme. In particular, we derive (matrix) three-term recurrence relations that give a tool for computing the bounds.

    Spherical few-distance sets

  • Finite two-distance tight frames
    with A. Glazyrin, K. Okoudjou and W.-H. Yu, Linear Algebra Appl., 475, 2015, pp. 163--175, preprint
  • Abstract: A finite collection of unit vectors $S\subset{\mathbb R}^n$ is called a spherical two-distance set if there are two numbers $a$ and $b$ such that the inner products of distinct vectors from $S$ are either $a$ or $b$. We prove that if $a\ne -b,$ then a two-distance set that forms a tight frame for ${\mathbb R}^n$ is a spherical embedding of a strongly regular graph, and every strongly regular graph gives rise to two-distance tight frames through standard spherical embeddings. Together with an earlier work by S. Waldron on the equiangular case this completely characterizes two-distance tight frames. As an intermediate result, we obtain a classification of all two-distance spherical 2-designs.

  • New bounds for equiangular lines
    in: Discrete Geometry and Algebraic Combinatorics, A. Barg and O. Musin, Editors, AMS Series: Contemporary Mathematics vol. 625, 2014, pp.111--121, preprint
  • Abstract: We consider the problem of bounding the maximum number of lines that pass through the origin in ${\mathbb R}^n$ with the property that the smaller angle between any two of them is awlays the same. This problem has been attracting attention in the literature since at least 1970s, and there are a few classical bounds on the size of such sets. We use semidefinite programming to improve the known estimates in many dimensions. Improvements are obtained in dimensions $24 \leq n \leq 136$. In particular, we show that the maximum number of equiangular lines in ${\mathbb R}^n$ is $276$ for all $24 \leq n \leq 41$ and is 344 for $n=43.$ This provides a partial resolution of the conjecture set forth by Lemmens and Seidel (1973).

    Earlier works

  • Digital fingerprinting codes: Problem statements, constructions, identification of traitors
    with G. R. Blakley and G. Kabatiansky, IEEE Trans. Inform. Theory, 49, no.4, 2003, 852-865. [pdf]
    Fingerprinting capacity under the marking assumption
    with N.P. Anthapadmanabhan and I. Dumer, IEEE Trans. Inform. Theory, 54, no.6, 2008, 2678-2689, preprint

  • Abstract: These two papers formulate and study the problem of fingerprinting capacity (a communication scenario with the channel having inputs $x_1,\dots,x_t$ and one output $y_t$ which is determined from the inputs in an arbitrary way given that that $|\{x_1,\dots,x_t\}|\ge 2)$). The goal of the receiver (motivated by the abstraction of the digital copyright protection problem) is to correctly recover at least one of the messages transmitted over the channel. The main result of the 2003 paper is the statement of the fingerprinting problem as an information theoretic problem and a coding procedure that yields a positive communication rate (substantially smaller than the value of capacity). The construction uses concatenation of separating codes and algebraic codes such as Reed-Solomon or algebraic-geometric codes. The feature that enables the decoder to recover a message with high probability is related to the so-called list recoverability of algebraic codes, i.e., the feature that enables the Reed-Solomon decoder to recover the codeword closest by to a subset of vectors in the Hamming space.
    The 2008 paper advances this setting in two ways. First it uses the typicality properties of the input to improve the lower bounds on the code rate in the fingerprinting channel. Next, it introduces ideas from converse theorems of information theory to prove the first upper bounds on the fingerprinting capacity. The so-called "strong converse" for the multiple-access channel by Ahlswede-Cai is an important component in the proof of the bounds. For $t=2,3$ we show that the capacity of binary fingerprinting is bounded, respectively, as $0.25\le C_{2}\le 0.322$ and $0.083 \le C_{3}\le 0.199.$

  • Distance properties of expander codes
    with G. ZĂ©mor, IEEE Trans. Inform. Theory, 52, no.1, 2006, 78-90, preprint
  • Abstract: Define a code on a regular degree $d$ bipartite graph $\Gamma$ by assigning a bit to each edge and requiring that the edges incident to every vertex form a codeword in a code of length $d,$ termed the "local code". Bipartite-graph codes generalize the standard product construction which is obtained when $\Gamma=K_{d,d},$ a complete bipartite graph. By assuming that the degree is much smaller than the number $n$ of vertices in each part of the graph it is possible to obtain new properties of the code family. In particular, Sipser and Spielman introduced the second eigenvalue of $\Gamma$ as a tool for the analysis of the distance of the bipartite-graph code. This work is formed of two parts. In the first part, we study the weight spectrum and the minimum distance of a random ensemble of such codes are computed and it is shown that it sometimes meets the Gilbert-Varshamov (GV) bound. In the second part, we switch to explicit code families whose analysis is based on the eigenvalues of $\Gamma.$ We prove a lower bound on the minimum distances of constructive families of expander codes, showing that it exceeds the product bound, which is a common benchmark in the analysis of two-level code constructions. As a consequence of this, we find a polynomially constructible family of expander codes whose relative distance exceeds the Zyablov bound on the distance of serial concatenations. This paper extends our earlier work which explored analogies between serial concatenations and graph-based codes in terms of code distance and error exponents. The main result there is that graph codes can be decoded iteratively to correct the number of errors (or meet the error exponent) similar to that of standard concatenated codes.

  • Bounds on packings of spheres in the Grassmann manifolds
    with D. Yu. Nogin, IEEE Trans. Inform. Theory, 48, no. 9, 2002, 2450-2454. [pdf] Erratum: IEEE Trans.
  • Abstract: The Grassmann manifold $G_{n,k}$ is the set of $k$-dimeinsional subspaces passing through the origin in the $n$-dimensional space $L^n,$ where $L={\mathbb R}$ or $\mathbb C.$ It is a compact manifold with a finite volume, and so it makes sense to consider the question of the maximum number of spheres of a certain radius that can be packed into it.
    Given two $k$-dimensional planes $p$ and $q,$ let $\theta_1$ be the maximum possible angle between vectors $x_1\in p$ and $y_1\in q.$. Having found a pair $x_1,y_1$ with angle $\theta_1,$ consider the subspaces $p_1=\{x\in p: x\bot x_1\}$ and $q_1=\{y \in q: y\bot y_1\}$ and define $\theta_2$ to be the maximum attainable angle between a pair of vectors $x_2\in p_1, y_2\in q_1.$ In this way we obtain $k$ angles $\theta_1,\dots,\theta_k,$ called the principal angles between $p$ and $q.$ Metrics on $G_{n,k}$ are defined using the vector of principal angles $\theta=(\theta_1,\dots,\theta_k),$ and there are several distance functions that are of interest in applications. We restrict our attention to the intrinsic (or geodesic) distance $d(x,y)=\|\theta\|_2$ and to the chordal distance $d(x,y)=\|\sin \theta\|_2.$ Our main contribution is computing the asymptotic volume of the metric ball in $G_{n,k}$ with respect to these distances as $n\to\infty$ and $k$ constant. This is done by writing the explicit integral of the volume form and using the multidimensional Laplace method. As a conclusion, we find expressions for the Hamming and GV bounds on codes in $G_{n,k}(L),$ where $L={\mathbb R}$ or $\mathbb C.$
    In a follow-up work we used Blichfeldt's density method to derive an Elias-like upper bound on the cardinality of a code in $G_{k,n}(L).$ A forgotten link between Blichfeldt's method and the Elias bound is apparent in the work of Rankin on an analogous result for spherical codes.

  • Random codes: minimum distances and error exponents
    with G. D. Forney, Jr. IEEE Trans. Inform. Theory 48, 9, 2002, 2568-2573. [pdf]
  • Abstract: Basic results for the standard ensembles of random binary codes (the Shannon ensemble of random codes and the Elias ensemble of random linear codes) have been established in classical papers and books in the early days of coding theory, but never published in one accessible place. We organize these results in a short paper, presenting self-contained proofs. In particular, we show that with high probability, a random linear code attains the Gilbert-Varshamov bound, while a random unrestricted code does not. We also analyze typical events for maximum likelihood decoding of random linear codes, finding the typical weight of the error vector that accounts for the error for a given code rate, and similar quantities. These considerations also relate to the error exponent of random linear and unrestricted codes.

  • Distance distribution of binary codes and the error probability of decoding
    with A. McGregor, IEEE Trans. Inform. Theory, 51, no.12, 2005, 4237-4246.  preprint
  • Abstract: One of the classic problems in information theory is bounding the error probability of decoding for optimal codes on discrete channels starting with such simple models as the Binary Symmetric Channel. The basic setting was established in the papers by Gallager (1965) and Shannon-Gallager-Berlekamp (1968) for the binary symmetric channel, and a similar set of results was obtained to the Gaussian channel (mostly in Shannon's work of 1959, with some later additions). The main result of this work is a new bound on the exponent of the error probability (the error exponent, or the reliability function) for the binary symmetric channel. Interestingly, the bound gives a new region of rates for which the reliability function is known exactly.
    This result is also extended to the Gaussian channel, with similar results for the error exponent.

  • Binomial moments of the distance distribution: Bounds and applications
    with A. Ashikhmin, IEEE Trans. Inform. Theory, 45, no. 2, 1999, 438--452. [pdf]
  • Abstract: Binomial moments were initially defined for linear codes in the work of MacWilliams (1965) to prove what has become known as the MacWilliams identities. In this paper we observe that it is possible to derive universal lower bounds on the binomial moments of a code of given cardinality (linear or not). The approach we take is an easy modification of the main Delsarte bound on codes (the so-called linear programming bound). We derive asymptotic bounds for binomial moments and used them to prove new bounds on the probability of undetected error of a sequence of codes of a given rate.
    We also observe that there are codes whose binomial moments $B_w$ meet the Delsarte bound with equality for all $w=1,\dots,n.$ It turned out much later that such codes can be thought of as universally optimal configurations in the Hamming space.

    Click here to return to the main page.