Introduction

XML stands for eXtensible Markup Language. It is designed to transport and store data, not to display it. XML is similar to HTML, but its tags are not predefined; they are defined by users. XML is a W3C recommendation.

abdalla

Presentation Transcript


  1. Introduction
     • XML stands for eXtensible Markup Language.
     • It is designed to transport and store data, not to display it.
     • XML is similar to HTML, but tags are not predefined.
     • Tags are defined by users.
     • XML is a W3C recommendation.
     The main idea is to compress well-formed XML files that an application generates from database queries.

  2. Xml file structures
     <!header…>                          xml file head
     <Main Tag>                          main xml ELEMENT
       <Row1 Tag>                        xml ELEMENT by query’s row
         <Col1 Tag>Data</Col1 Tag>       xml ELEMENT by query’s col
         …
         <Coln Tag>Data</Coln Tag>
       </Row1 Tag>
       …
     </Main Tag>

  3. Algorithm
     • The algorithm takes advantage of the well-defined structure of the XML files.
     • It also exploits how often the same value recurs in a row’s columns. This is the key idea of the algorithm!
     • It uses compression strategies similar to a static dictionary, where XML tags and “DataKeys” are replaced by unused ASCII characters.
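The search for unused ASCII characters mentioned above can be sketched as follows. This is a minimal Python sketch, not the original tool: the function name `available_chars` and the choice of ASCII punctuation as the candidate set are assumptions made for illustration.

```python
import string

def available_chars(text: str, candidates: str = string.punctuation) -> list[str]:
    """Return the candidate characters that never occur in `text`.

    Assumption: candidates default to ASCII punctuation; the slides do not
    say which range of characters the original program considers.
    """
    used = set(text)  # every character that appears in the file
    return [c for c in candidates if c not in used]

# Hypothetical one-record sample in the style of the CD catalog example.
sample = "<CATALOG><CD><TITLE>Thriller</TITLE><PRICE>11.90</PRICE></CD></CATALOG>"
free = available_chars(sample)
```

Characters that already occur in the file, such as `<`, `/` and `.`, are excluded automatically, so a substituted character can never be confused with original data.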

  4. Description Compression Algorithm
     • The file is processed in two (2) phases.
     • Phase One identifies the XML tags, the available ASCII characters, and the DataKeys.
     • DataKeys are sorted by the following rule: Length(DataKey) * frequency - (Length(DataKey) + frequency), that is, the characters saved when the key is stored once and each occurrence becomes a single character. Any DataKey beyond the number of available characters is discarded.
       Example:
       Key len = 20, frequency = 10: 30 instead of 200, saving 170
       Key len = 15, frequency = 10: 25 instead of 150, saving 125
       Key len = 30, frequency = 5: 35 instead of 150, saving 115
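The ranking rule above can be written as a short Python sketch. The names `saving` and `rank_datakeys` are illustrative, and sorting in descending order of saving (so that the least profitable keys are the ones discarded) is an assumption; the slides only say that DataKeys are sorted by the rule.

```python
def saving(length: int, frequency: int) -> int:
    """Characters saved by a DataKey: Length * frequency - (Length + frequency)."""
    return length * frequency - (length + frequency)

def rank_datakeys(counts: dict[str, int], slots: int) -> list[str]:
    """Keep the `slots` most profitable DataKeys.

    counts maps each candidate DataKey to its frequency in the file.
    Assumption: highest saving first; keys beyond `slots` are discarded.
    """
    ranked = sorted(counts, key=lambda k: saving(len(k), counts[k]), reverse=True)
    return ranked[:slots]

# The three cases from the slide.
examples = [saving(20, 10), saving(15, 10), saving(30, 5)]

# Hypothetical frequencies for the COUNTRY values of the CD catalog example.
kept = rank_datakeys({"USA": 2, "UK": 2, "Colombia": 1}, slots=2)
```

Note that a key occurring only once has a negative saving, so an implementation may also want to drop keys whose saving is not positive; the slides do not state this.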

  5. Description Compression Algorithm
     • Phase II reads the XML file again and creates a new file. The new file starts with a header built from the information gathered in Phase I (its layout is shown later), which is needed to reconstruct the XML file. In the rest of the file, Tags/DataKeys are replaced by the available ASCII characters.

  6. Description Compression Algorithm
     • Rules to replace Tags/DataKeys
       • Main Tag is skipped.
       • Row Tag: an ASCII char is assigned.
       • Column Tag: an ASCII char is assigned.
       • If Column Data is a DataKey
         • If an ASCII char is assigned: just the assigned ASCII char
         • Else: assigned Column Char + Column Data
       • Else
         • Assigned Column Char + Column Data
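The replacement rules above can be sketched for a single row as follows. This is a hedged Python illustration, not the original code: `encode_row` and its parameters are hypothetical, the `+` character for the COUNTRY column is invented for the example, and tokens are concatenated without separators.

```python
def encode_row(row: list[str], row_char: str, col_chars: list[str],
               key_chars: dict[tuple[int, str], str]) -> str:
    """Encode one query row following the replacement rules.

    row       -- column values in column order
    row_char  -- character assigned to the row tag
    col_chars -- character assigned to each column tag, by column index
    key_chars -- (column index, value) -> character, only for DataKeys
                 that received a character of their own
    """
    out = [row_char]
    for col, data in enumerate(row):
        key_char = key_chars.get((col, data))
        if key_char is not None:
            out.append(key_char)                # DataKey with an assigned char
        else:
            out.append(col_chars[col] + data)   # column char + column data
    return "".join(out)

# Characters taken from the CD catalog example; '+' for COUNTRY is assumed.
col_chars = ["&", "!", "+", "*", "$", "#"]
key_chars = {(2, "USA"): "%", (2, "UK"): "~", (2, "Colombia"): "^"}
encoded = encode_row(
    ["Empire Burlesque", "Bob Dylan", "USA", "Columbia", "10.90", "1985"],
    "@", col_chars, key_chars)
```

A DataKey that was discarded in Phase One is simply absent from `key_chars`, so it falls through to the "column char + column data" branch, matching the inner Else rule.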

  7. Description Decompression Algorithm
     • Read the file header.
     • The first four (4) characters mean:
       • Number of BitWise characters (the used ASCII chars)
       • First used ASCII char
       • Number of Element tags
       • Number of DataKey sets
     • According to Char 4, read the Col/Num pairs.
     • According to Char 1, read the Bitwise.
     • According to Char 3, read the Element strings.
     • According to the total Num from the pairs, read the DataKeys.
     • Read the rest of the file, replacing the assigned ASCII chars.

  8. Application Syntax
     xmlzip [-c filename.xml [-k column_1 … column_n]] | [-d filename.xzp]
     Where
     -c: Compressing
     -k: Column numbers to be Data Keys
     -d: Decompressing
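For illustration, the command-line syntax above could be parsed with Python's standard `argparse` module. This is only a sketch of the option handling: the slides do not say which language xmlzip is written in, and both the requirement that exactly one of `-c` or `-d` be given and the file name `catalog.xml` are assumptions.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the xmlzip options -c, -k and -d."""
    parser = argparse.ArgumentParser(prog="xmlzip")
    # Assumption: exactly one of -c / -d must be present.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("-c", metavar="filename.xml", help="compress this XML file")
    mode.add_argument("-d", metavar="filename.xzp", help="decompress this file")
    parser.add_argument("-k", metavar="column", type=int, nargs="+", default=[],
                        help="column numbers to be Data Keys (used with -c)")
    return parser

# Parse a hypothetical command line: xmlzip -c catalog.xml -k 3
args = build_parser().parse_args(["-c", "catalog.xml", "-k", "3"])
```

The sketch does not reject `-k` combined with `-d`; a fuller version would check that after parsing.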

  9. Note that the header length is proportional to the characters found in the XML file, the XML file Elements, and the DataKeys found in the XML file:

     $$H = 4 + 2 \cdot \mathrm{DATAKEYNUM} + \mathrm{NUMBITWISE} + \sum_{i=1}^{\mathrm{NUMELEMENT}} \left[\mathrm{length}(\mathrm{ELEMENTSTR}_i) + 1\right] + \sum_{j=1}^{\mathrm{SUBDATAKEY}} \left[\mathrm{length}(\mathrm{DATAKEYSTR}_j) + 1\right] + 1$$

     In this case, the file HEADER is:
     H = 4 + 2 * 1 + 12 + 8 + 3 + 6 + 7 + 8 + 8 + 6 + 5 + 4 + 3 + 9 + 1 = 86
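The header-length formula can be evaluated directly. The following Python sketch plugs in the values of the CD catalog example: eight element names, three DataKeys, one DataKey column, and NUMBITWISE = 12 as given in the slide. The function name `header_length` and its parameter names are illustrative.

```python
def header_length(num_bitwise: int, datakey_num: int,
                  elements: list[str], datakeys: list[str]) -> int:
    """Evaluate H = 4 + DATAKEYNUM*2 + NUMBITWISE
    + sum(length(ELEMENTSTR_i) + 1) + sum(length(DATAKEYSTR_j) + 1) + 1."""
    return (4
            + datakey_num * 2
            + num_bitwise
            + sum(len(e) + 1 for e in elements)   # each element string plus one
            + sum(len(k) + 1 for k in datakeys)   # each DataKey string plus one
            + 1)

elements = ["CATALOG", "CD", "TITLE", "ARTIST", "COUNTRY", "COMPANY", "PRICE", "YEAR"]
datakeys = ["USA", "UK", "Colombia"]
h = header_length(num_bitwise=12, datakey_num=1, elements=elements, datakeys=datakeys)
print(h)
```

The per-string terms reproduce the slide's list: 8, 3, 6, 7, 8, 8, 6, 5 for the elements and 4, 3, 9 for the DataKeys.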

  10. Original XML file:

      <?xml version="1.0" encoding="ISO-8859-1"?>
      <CATALOG>
        <CD>
          <TITLE>Empire Burlesque</TITLE>
          <ARTIST>Bob Dylan</ARTIST>
          <COUNTRY>USA</COUNTRY>
          <COMPANY>Columbia</COMPANY>
          <PRICE>10.90</PRICE>
          <YEAR>1985</YEAR>
        </CD>
        <CD>
          <TITLE>Hide your heart</TITLE>
          <ARTIST>Bonnie Tyler</ARTIST>
          <COUNTRY>UK</COUNTRY>
          <COMPANY>CBS Records</COMPANY>
          <PRICE>9.90</PRICE>
          <YEAR>1988</YEAR>
        </CD>
        <CD>
          <TITLE>Thriller</TITLE>
          <ARTIST>Michael Jackson</ARTIST>
          <COUNTRY>USA</COUNTRY>
          <COMPANY>Columbia</COMPANY>
          <PRICE>11.90</PRICE>
          <YEAR>1985</YEAR>
        </CD>
        <CD>
          <TITLE>Love Songs</TITLE>
          <ARTIST>Bee Gee</ARTIST>
          <COUNTRY>UK</COUNTRY>
          <COMPANY>Records</COMPANY>
          <PRICE>12.00</PRICE>
          <YEAR>1980</YEAR>
        </CD>
        <CD>
          <TITLE>Oral Fixation</TITLE>
          <ARTIST>Shaquira</ARTIST>
          <COUNTRY>Colombia</COUNTRY>
          <COMPANY>Epic</COMPANY>
          <PRICE>18.70</PRICE>
          <YEAR>2006</YEAR>
        </CD>
      </CATALOG>

      Compressed file:

      HEADER
      <?xml version="1.0" encoding="ISO-8859-1"?>
      @ &Empire Burlesque !Bob Dylan % *Columbia $10.90 #1985
      @ &Hide your heart !Bonnie Tyler ~ *CBS Records $9.90 #1988
      @ &Thriller !Michael Jackson % *Columbia $11.90 #1985
      @ &Love Songs !Bee Gee ~ *Records $12.00 #1980
      @ &Oral Fixation !Shaquira ^ *Epic $18.70 #2006
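To show how the compressed body maps back to XML, here is a minimal Python sketch of the final decompression step (replacing the assigned ASCII characters). It assumes the character tables have already been recovered from the header, that tokens are stored without separators, and that the input is well formed. The function `decode_body` and its parameters are hypothetical names.

```python
def decode_body(body: str, row_char: str, row_tag: str,
                col_tags: dict[str, str],
                key_values: dict[str, tuple[str, str]]) -> list[str]:
    """Rebuild one XML line per row from a compressed body.

    col_tags   -- column char -> tag name
    key_values -- DataKey char -> (tag name, value)
    """
    special = {row_char, *col_tags, *key_values}
    rows: list[list[tuple[str, str]]] = []
    i = 0
    while i < len(body):
        ch = body[i]
        i += 1
        if ch == row_char:
            rows.append([])                      # start a new row
        elif ch in key_values:
            rows[-1].append(key_values[ch])      # DataKey: tag and value together
        elif ch in col_tags:
            start = i                            # data runs to the next special char
            while i < len(body) and body[i] not in special:
                i += 1
            rows[-1].append((col_tags[ch], body[start:i]))
    return [f"<{row_tag}>"
            + "".join(f"<{tag}>{value}</{tag}>" for tag, value in row)
            + f"</{row_tag}>"
            for row in rows]

col_tags = {"&": "TITLE", "!": "ARTIST", "*": "COMPANY", "$": "PRICE", "#": "YEAR"}
key_values = {"%": ("COUNTRY", "USA"), "~": ("COUNTRY", "UK"), "^": ("COUNTRY", "Colombia")}
lines = decode_body("@&Thriller!Michael Jackson%*Columbia$11.90#1985",
                    "@", "CD", col_tags, key_values)
```

Because the substituted characters were chosen from those unused in the original file, the scan can treat any of them as an unambiguous token boundary.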

  11. Next
     • The next step is to make the algorithm generic, specifically its feature of taking advantage of column frequency.
     • That feature can be driven by the tag’s name instead of the column number. I didn’t implement this for lack of time, but it would avoid any conflict caused by column order.
     • XML attribute recognition also needs to be implemented. It’s almost done, but I stopped because of time constraints. It would be useful to let the user specify, through parameters, which attributes are taken into account. One case to handle: element tags and attribute names can share the same name even though they are different kinds of data.
     • Last but not least, complete the implementation of a modified PPM algorithm. The first task would be adding to the HEADER those DataKeys beyond the available ASCII chars that satisfy Length(DataKey) > Largest Context and frequency > 1 (at least), in order to add them to a “temporary” count array, where the size of the DataKey does not matter.