GlycoCT: A Unified Sequence Format for Carbohydrates in Glycobiology

GlycoCT: a unifying sequence format for carbohydrates
S. Herget, R. Ranzinger, K. Maass and C.-W. v. d. Lieth
Presented by Yingxin Guo

An overview of the sequence formats used in glycobioinformatics

Special structural features

Uniqueness: a central requirement for encoding carbohydrate sequences
• Why
  • Serves as the primary key in a database
  • Simplifies the implementation of exact structure search
• How
  • Apply strict sorting rules
  • Define a controlled vocabulary
  • Support encoding of uncertain linkages and unspecified monosaccharides
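The primary-key argument can be illustrated with a toy example. This is only a sketch: the `canonical` function and the tuple-based tree layout are hypothetical, and the plain lexical sort used here stands in for GlycoCT's actual sorting rules. It shows that once child order is fixed by a rule, two descriptions of the same structure produce the same string and can share one database key.

```python
def canonical(tree):
    """Serialize a (name, [subtrees]) tuple so that child order does not matter.

    Assumption: a plain lexical sort of the serialized children, used here
    only as a stand-in for a real set of sorting rules.
    """
    name, kids = tree
    parts = sorted(canonical(k) for k in kids)
    return name + "(" + ",".join(parts) + ")"


# The same branched structure, entered with its children in different order.
a = ("Man", [("Gal", []), ("Glc", [])])
b = ("Man", [("Glc", []), ("Gal", [])])

# Use the canonical string as the key of a toy "database".
db = {canonical(a): "entry-1"}
found = canonical(b) in db
print(found)
```

Without the sorting step, the two inputs would serialize differently and an exact structure search would miss the stored entry.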

General idea of GlycoCT

Basic monosaccharide namespace

Basic residue (RES) entities in GlycoCT
• Substituents and other entities

Modeling the topology
• Residue entities are modeled in the RES section.
• Linkages are modeled in the LIN section.
• Atom replacement schema.
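One way to picture the split between residues and linkages is as two separate lists that refer to each other by index. The sketch below is an in-memory illustration only: the class names `Residue` and `Linkage`, their fields, and the example residue names are assumptions, not the GlycoCT file syntax, and the atom replacement schema is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Residue:
    index: int  # position in the residue list
    kind: str   # "basetype" or "substituent" (labels assumed for this sketch)
    name: str   # residue name (illustrative values only)


@dataclass(frozen=True)
class Linkage:
    index: int       # position in the linkage list
    parent: int      # index of the parent residue
    child: int       # index of the child residue
    parent_pos: int  # attachment position on the parent
    child_pos: int   # attachment position on the child


def children(linkages, parent_index):
    """Return the child residue indices attached to one parent, in list order."""
    return [link.child for link in linkages if link.parent == parent_index]


res = [
    Residue(1, "basetype", "glucose"),
    Residue(2, "substituent", "n-acetyl"),
    Residue(3, "basetype", "glucose"),
]
lin = [
    Linkage(1, parent=1, child=2, parent_pos=2, child_pos=1),
    Linkage(2, parent=1, child=3, parent_pos=4, child_pos=1),
]

print(children(lin, 1))
```

Keeping linkages apart from residues means a substituent is attached the same way as any other residue: through an entry in the linkage list.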

Encoding linkage

Encoding repeating units

Encoding alternative units

Encoding underdetermined units

Sorting
• Why
  • One central requirement is to generate a unique representation for every carbohydrate.
  • Sorting determines the order in which elements appear.
• How
  • A set of hierarchical rules is used in GlycoCT to define the ordering of residues, linkages and special structural features.
  • Residue comparison algorithm
  • Linkage comparison algorithm
  • Underdetermined subtree comparison algorithm
  • Alternative subtree comparison algorithm

Residue comparison
• Applied when multiple starting points exist.
• Rules
  • Number of child residues.
  • Length of the longest branch.
  • Number of terminal residues.
  • Number of branching points.
  • Lexical order.
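The hierarchical rules above map naturally onto a tuple sort key, where each later criterion only breaks ties left by the earlier ones. The following is a minimal sketch under several assumptions that the slides do not settle: "child residues" is read as direct children, larger counts are placed first, lexical order is taken on the residue name, and `Node` and `residue_key` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)


def longest_branch(node):
    """Number of linkages on the longest path from this residue to a leaf."""
    if not node.children:
        return 0
    return 1 + max(longest_branch(c) for c in node.children)


def terminal_count(node):
    """Number of leaf residues in the subtree rooted at this residue."""
    if not node.children:
        return 1
    return sum(terminal_count(c) for c in node.children)


def branching_points(node):
    """Number of residues in the subtree that have more than one child."""
    own = 1 if len(node.children) > 1 else 0
    return own + sum(branching_points(c) for c in node.children)


def residue_key(node):
    # Criteria in the order listed above; negation puts larger values first
    # (the direction is an assumption of this sketch).
    return (
        -len(node.children),
        -longest_branch(node),
        -terminal_count(node),
        -branching_points(node),
        node.name,
    )


candidates = [
    Node("Man", [Node("x"), Node("y")]),    # 2 children, branched
    Node("Gal", [Node("x", [Node("y")])]),  # 1 child, linear chain
    Node("Glc", [Node("x"), Node("y")]),    # same shape as "Man"
]
ordered = sorted(candidates, key=residue_key)
print([n.name for n in ordered])
```

Here "Glc" and "Man" tie on every structural criterion, so only the final lexical rule separates them.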

Linkage comparison
• Rules
  • Number of bonds between parent and child residues.
  • Atom linkage position at the parent residue.
  • Atom linkage position at the child residue.
  • Linkage type at the parent residue.
  • Comparison of child residues with the residue comparison algorithm.
• Decides the internal order of the RES and LIN sections.
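The linkage rules can be sketched the same way, as a tuple key compared left to right. Everything named here is hypothetical: the `Link` class, its field names, the single-letter linkage type codes, and the ascending sort direction are assumptions. The final criterion, which in GlycoCT is the residue comparison algorithm, is replaced by a plain comparison of the child's name to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    bonds: int        # number of bonds between parent and child
    parent_pos: int   # linkage position at the parent residue
    child_pos: int    # linkage position at the child residue
    parent_type: str  # linkage type at the parent (codes assumed)
    child_name: str   # stand-in for the full child residue comparison


def linkage_key(link):
    # Criteria in the order listed above; ascending order is an assumption.
    return (
        link.bonds,
        link.parent_pos,
        link.child_pos,
        link.parent_type,
        link.child_name,
    )


links = [
    Link(1, 6, 1, "o", "Man"),
    Link(1, 3, 1, "o", "Man"),
    Link(1, 3, 1, "o", "Gal"),
]
ordered_links = sorted(links, key=linkage_key)
print([(l.parent_pos, l.child_name) for l in ordered_links])
```

The two linkages at position 3 are identical up to the last criterion, so the order of their children decides which is written first.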

Underdetermined subtree (UND) and alternative subtree (ALT) comparison
• The encoding of UNDs and ALTs is handled separately from the description of the other topological features.
• The rules of the residue and linkage comparison algorithms are applied within each UND and ALT to determine its internal order.
• The reducing residues of UNDs and ALTs are compared with the residue comparison algorithm.
• If two compared UNDs are identical, their parent residues and linkages (the linkages between the UND and the main graph) are compared.

First application and results
• All monosaccharides from CarbBank were translated to the naming defined by GlycoCT.
• 1439 different names in CarbBank resulted in 474 different basetypes and 29 different substituents, reducing the number of distinct residues by 65%.
• Two main reasons for the reduction
  • The separation of monosaccharides into basetype and substituents
  • The unique encoding for monosaccharides
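The 65% figure follows from the counts on the slide if basetypes and substituents are added together as the new set of distinct entities. A short check, with variable names chosen for this sketch:

```python
names_before = 1439  # distinct monosaccharide names in CarbBank
basetypes = 474      # distinct basetypes after translation
substituents = 29    # distinct substituents after translation

names_after = basetypes + substituents
reduction = 1 - names_after / names_before

print(f"{names_after} distinct entities, {reduction:.0%} fewer")
```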

Conclusion
• GlycoCT offers a superset of the capabilities of all known sequence formats in glycobioinformatics.
• It supports structurally undetermined sequences.
• The consistent naming scheme for monosaccharides can be easily maintained.
