World Wide Web: Basic Technologies

Basic WWW Technologies Thanks to P. Smyth at UCI and Hayes at UNC

What Is the World Wide Web? • The world wide web (web) is a network of information resources. The web relies on three mechanisms to make these resources readily available to the widest possible audience: • 1. A uniform naming scheme for locating resources on the web (e.g., URIs). • 2. Protocols, for access to named resources over the web (e.g., HTTP). • 3. Hypertext, for easy navigation among resources (e.g., HTML).

Internet vs. Web • Internet: • Internet is a more general term • Includes physical aspect of underlying networks and mechanisms such as email, FTP, HTTP… • Web: • Associated with information stored on the Internet • Refers to a broader class of networks, i.e. Web of English Literature • Both Internet and web are networks

Essential Components of WWW • Resources: • Conceptual mappings to concrete or abstract entities, which do not change in the short term • ex: IST402 website (web pages and other kinds of files) • Resource identifiers (hyperlinks): • Strings of characters represent generalized addresses that may contain instructions for accessing the identified resource • http://clgiles.ist.psu.edu/IST402 is used to identify our course homepage • Transfer protocols: • Conventions that regulate the communication between a browser (web user agent) and a server

Standard Generalized Markup Language (SGML) • Based on GML (generalized markup language), developed by IBM in the 1960s • An international standard (ISO 8879:1986) defines how descriptive markup should be embedded in a document • Gave birth to the extensible markup language (XML), W3C recommendation in 1998

The Purpose of SGML • from Goldfarb's Infrequently Asked Questions:SGML is designed to make your information last longer than the systems that created it. Such longevity also implies immunity to short-term changes -- such as a change from one application program to another -- so SGML is also inherently designed for re-purposing and portability.

What is SGML? • SGML (and it's derivatives, HTML and XML) are ASCII character based representations of data • Remember, it's all bits--meaning is derived from how they are organized… • Think of SGML docs as strings that must be parsed--A web browser parses an HTML doc and uses the markup codes to display the data contained • Since it's all ASCII, these docs can also be handled by non parsing tools (such as vi, emacs, perl, etc.)

SGML Components • SGML documents have three parts: • Declaration: specifies which characters and delimiters may appear in the application • DTD (document type definition) / style sheet: defines the syntax of markup constructs • Document instance: actual text (with the tag) of the documents • More info could be found: http://www.W3.Org/markup/SGML

Structure of SGML documents • Prolog • SGML Declaration--information about the dialect of SGML used, codes used, delimiters. • Document Type Description (DTD)--external description of the relationship of data elements • Instance • Content • Descriptive Markup • Output Specifications (eg. a style sheet) • DSSSL (Document Style Semantic Specification Language) • FOSI (Formatted Output Specification Instance)

SGML Markup • Looks like HTML (really, HTML looks like SGML, because it is SGML!) • What is done with tagged text determined by applications <anthology><poem><title>The SICK ROSE <stanza> <line>O Rose thou art sick. <line>The invisible worm, <line>That flies in the night <line>In the howling storm: <stanza> <line>Has found out thy bed <line>Of crimson joy: <line>And his dark secret love <line>Does thy life destroy. <poem>  </anthology>

The DTD • In SGML, documents are given a type, defined in the Document Type Definition • The DTD is just another text file that: • Lists Constituent Parts in a series of Declaration Statements • Insures Consistent Structure • Think of this as being analogous to objects, which have specific properties and values • SGML Documents can be checked against the DTD by a parser

Simple Example of a DTD • "!" marks a Declaration Statement • Elements are named and have Start Tags and End Tags • Declarations consist of three parts: • Type and Name, consisting of other Elements or Reserved keywords (eg. #PCDATA or Element) • Minimization Rules governing tags • Content Model, with Occurrence Indicators (+ ? *) or Group Connectors (, & |) <!ELEMENT anthology - - (poem+)> <!ELEMENT poem - - (title?, stanza+)> <!ELEMENT title - O (#PCDATA) > <!ELEMENT stanza - O (line+) > <!ELEMENT line O O (#PCDATA) >

Using Data:The "Traditional" Model

Using Data:The SGML Approach

The Up Side • Data Independence--data structure controlled by use of an open standard • Longevity--structure is determined by DTD, not a monolithic and possibly proprietary application • Flexibility--separation of formatting and content description yields multiple uses by different parsing systems

The Down Side • Strict encoding • DTDs • Lack of SGML Applications

DTD Example One • <!ELEMENT UL - - (LI)+> • ELEMENT is a keyword that introduces a new element type unordered list (UL) • The two hyphens indicate that both the start tag <UL> and the end tag </UL> for this element type are required • Any text between the two tags is treated as a list item (LI)

DTD Example Two • <!ELEMENT IMG - O EMPTY> • The element type being declared is IMG • The hyphen and the following "O" indicate that the end tag can be omitted • Together with the content model "EMPTY", this is strengthened to the rule that the end tag must be omitted. (no closing tag)

HTML Background • HTML was originally developed by Tim Berners-Lee while at CERN, and popularized by the Mosaic browser developed at NCSA. • The Web depends on Web page authors and vendors sharing the same conventions for HTML. This has motivated joint work on specifications for HTML. • HTML standards are organized by W3C : http://www.w3.org/MarkUp/

HTML Functionalities • HTML gives authors the means to: • Publish online documents with headings, text, tables, lists, photos, etc • Include spread-sheets, video clips, sound clips, and other applications directly in their documents • Link information via hypertext links, at the click of a button • Design forms for conducting transactions with remote services, for use in searching for information, making reservations, ordering products, etc

HTML Versions • HTML 4.01 is a revision of the HTML 4.0 Recommendation first released on 18th December 1997. • HTML 4.01 Specification: • http://www.w3.org/TR/1999/REC-html401-19991224/html40.txt • HTML 4.0 was first released as a W3C Recommendation on 18 December 1997 • HTML 3.2 was W3C's first Recommendation for HTML which represented the consensus on HTML features for 1996 • HTML 2.0 (RFC 1866) was developed by the IETF's HTML Working Group, which set the standard for core HTML features based upon current practice in 1994.

Sample Webpage

Sample Webpage HTML Structure • <HTML> • <HEAD> • <TITLE>The title of the webpage</TITLE> </HEAD> • <BODY> <P>Body of the webpage • </BODY> • </HTML>

HTML Structure • An HTML document is divided into a head section (here, between <HEAD> and </HEAD>) and a body (here, between <BODY> and </BODY>) • The title of the document appears in the head (along with other information about the document) • The content of the document appears in the body. The body in this example contains just one paragraph, marked up with <P>

HTML Hyperlink • <a href="relations/alumni">alumni</a> • A link is a connection from one Web resource to another • It has two ends, called anchors, and a direction • Starts at the "source" anchor and points to the "destination" anchor, which may be any Web resource (e.g., an image, a video clip, a sound bite, a program, an HTML document)

Resource Identifiers • URI: Uniform Resource Identifiers • URL: Uniform Resource Locators • URN: Uniform Resource Names

Ping – TCP/IP Troubleshooting • PING - Packet Internet Groper; a utility used to determine whether a particular computer is currently connected to the Internet. It works by sending a packet to the specified IP address and waiting for a reply.

Ping (Packet Internet Groper) ping command

Introduction to URIs • Every resource available on the Web has an address that may be encoded by a URI • URIs typically consist of three pieces: • The naming scheme of the mechanism used to access the resource. (HTTP, FTP) • The name of the machine hosting the resource • The name of the resource itself, given as a path

URI Example • http://www.w3.org/TR • There is a document available via the HTTP protocol • Residing on the machines hosting www.w3.org • Accessible via the path "/TR"

Protocols • Describe how messages are encoded and exchanged • Different Layering Architectures • ISO OSI 7-Layer Architecture • TCP/IP 4-Layer Architecture

ISO OSI Layering Architecture

ISO’s Design Principles • A layer should be created where a different level of abstraction is needed • Each layer should perform a well-defined function • The layer boundaries should be chosen to minimize information flow across the interfaces • The number of layers should be large enough that distinct functions need not be thrown together in the same layer, and small enough that the architecture does not become unwieldy

TCP/IP Layering Architecture

TCP/IP Layering Architecture • A simplified model, provides the end-to-end reliable connection • The network layer • Hosts drop packages into this layer, layer routes towards destination • Only promise “Try my best” • The transport layer • Reliable byte-oriented stream

Hypertext Transfer Protocol (HTTP) • A connection-oriented protocol (TCP) used to carry WWW traffic between a browser and a server • One of the transport layer protocol supported by Internet • HTTP communication is established via a TCP connection and server port 80

GET Method in HTTP

Domain Name System • DNS (domain name service): mapping from domain names to IP address • IPv4: • IPv4 was initially deployed January 1st. 1983 and is still the most commonly used version. • 32 bit address, a string of 4 decimal numbers separated by dot, range from 0.0.0.0 to 255.255.255.255. • IPv6: • Revision of IPv4 with 128 bit address

IP Addresses All devices connected to the Internet have a 32-bit IP (IPv4) address associated with it. 232 = total addresses? Think of the IP address as a logical address (possibly temporary), while the 48-bit address on every NIC is the physical, or permanent address. Computers, networks and routers use the 32-bit binary address, but a more readable form is the dotted decimal notation.

IP Addresses • For example, the 32-bit binary address • 10000000 10011100 00001110 00000111 (4 octets) • translates to • 128.156.14.7 (called dotted decimal notation) • Range of octets is 0-255 = 28 • There are basically four types of IP addresses: • Classes A, B, C and D. • A particular class address has a unique network address size and a unique host address size.

DNS Lookup • http://www.bankes.com/nslookup.htm • nslookup - Name Server Lookup; A linux/windows utility used to query Internet domain name servers. An nslookup is usually used to find the IP address corresponding to a hostname.

Top Level Domains (TLD) • Top level domain names, .com, .edu, .gov and ISO 3166 country codes • There are three types of top-level domains: • Generic domains were created for use by the Internet public • Country code domains were created to be used by individual country • The .arpa domain Address and Routing Parameter Area domain is designated to be used exclusively for Internet-infrastructure purposes

Registrars • Domain names ending with .aero, .biz, .com, .coop, .info, .museum, .name, .net, .org, or .pro can be registered through many different companies (known as "registrars") that compete with one another • InterNIC at http://internic.net • Registrars Directory: http://www.internic.net/regist.html

Server Log Files • Server Transfer Log: transactions between a browser and server are logged • IP address, the time of the request • Method of the request (GET, HEAD, POST…) • Status code, a response from the server • Size in byte of the transaction • Referrer Log: where the request originated • Agent Log: browser software making the request (spider) • Error Log: request resulted in errors (404)

Server Log Analysis • Most and least visited web pages • Entry and exit pages • Referrals from other sites or search engines • What are the searched keywords • How many clicks/page views a page received • Error reports, like broken links

Server Log Analysis

Search Engines • According to Pew Internet Project Report (2002), search engines are the most popular way to locate information online • About 33 million U.S. Internet users query on search engines on a typical day. • More than 80% have used search engines • Search Engines are measured by coverage and recency

How to Improve the Coverage? • Meta-search engine: dispatch the user query to several engines at same time, collect and merge the results into one list to the user. • Any suggestions?

Web Crawler • A crawler is a program that picks up a page and follows all the links on that page • Crawler = Spider • Types of crawler: • Breadth First • Depth First

Breadth First Crawlers • Use breadth-first search (BFS) algorithm • Get all links from the starting page, and add them to a queue • Pick the 1st link from the queue, get all links on the page and add to the queue • Repeat above step till queue is empty

World Wide Web: Basic Technologies