
COMS E6125 Web-enHanced Information Management (WHIM)



  1. COMS E6125 Web-enHanced Information Management (WHIM) Prof. Gail Kaiser Spring 2007 Kaiser: COMS E6125

  2. Reminders • Class attendance required! • Preliminary paper proposal January 29th • Preliminary project proposal March 5th • Paper must be individual, projects may be teams of 2-5 students • See advice about team formation at http://york.cs.columbia.edu/classes/cs6125/team_advice.htm

  3. Class Attendance is Required! • Attendance will be taken at every class meeting, starting TODAY • Final grade reduced one notch for first miss (e.g., A- -> B+) • Final grade reduced full letter grade for second miss (e.g., A- -> B-) • Fail (or drop) course for third miss

  4. Today’s Topic: Basic Mechanics of the Web • URI (~URL) • HTTP • Client/Server Intermediaries

  5. What is a “URI”? • Uniform Resource Identifier • Compact string of characters, that conform to a certain syntax, for identifying an abstract or physical resource • Simple and extensible format • Example: http://york.cs.columbia.edu/classes/cs6125

  6. What is a “Resource”? • Some piece of information that can be identified by a URI • The most common kind of resource is a file • But may also be a dynamically-generated query result, the output of a script, a document available in several languages, etc.

  7. Uniform Resource Identifier • Uniform: aka Universal, same string can be used with the same semantic interpretation, even when mechanisms used to access the resource differ • Resource: Conceptual mapping to an entity or set of entities - not necessarily the entity which corresponds to that mapping at any particular instance in time, not always network “retrievable” • Identifier: An object that can act as a reference to something that has identity

  8. Key requirement: Transcribability • Sequence of characters • May be transcribed from non-network source • Often needs to be remembered by people • Should consist of characters that are most likely to be able to be typed into a computer, within the constraints imposed by keyboards (and related input devices) across languages and locales

  9. Why do we usually say URL rather than URI? • A Uniform Resource Locator (URL) refers to the subset of URI that identify resources via a representation of their primary access mechanism (e.g., their network “location”) • Most popular form of URI

  10. What’s a URI that’s not a URL? • URN = Uniform Resource Name • Subset of URIs that denote a resource independent of its current location or the name by which it is known or the mechanism by which it is accessed • Required to remain globally unique and persistent even when the resource ceases to exist or becomes unavailable • Thus not necessarily retrievable

  11. URN vs. URL Example • Assume a published book (the resource) • The ISBN assigned to the book - this is the URN • Assume the entire contents of the book were placed on a Web server at http://www.xyz.com/book.gz and an FTP server at ftp://ftp.xyz.com/book.gz - both of these are URLs

  12. URL Notation • <scheme>://<authority><path>?<query> • <authority>: typically, an Internet domain name • <path>: specific to the authority, identifies the resource within the scope of the scheme and authority • <query>: a string of information to be interpreted by the resource
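
The notation above can be tried out with Python's standard `urllib.parse` module. This is only an illustrative sketch: the URL is a made-up example (not from the slides), chosen so that every component is present.

```python
from urllib.parse import urlsplit

# Hypothetical URL, chosen only so that every component is non-empty.
url = "http://www.example.com:2007/classes/cs6125/index.html?term=spring&year=2007"
parts = urlsplit(url)

scheme = parts.scheme      # the part before "://"
authority = parts.netloc   # domain name plus optional port
path = parts.path          # identifies the resource within the authority
query = parts.query        # everything after "?", interpreted by the resource

print(scheme, authority, path, query, sep="\n")
```

Note that `urlsplit` only splits the string; it does not contact the host or check that the resource exists.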

  13. What’s a “domain name”? • Domain Name System (DNS) • Maps domain names to IP addresses and vice versa • Hierarchy of DNS servers for top level domains (.com, .edu, .uk, etc.), second level domains (columbia.edu, ibm.com, etc), and so on • Eventually finds IP address for individual host (e.g., www.cs.columbia.edu) • Originated ~1982, for email (gk60@CMUA -> gk60@CMUA.arpa -> gk60@a.cs.cmu.edu)

  14. What is a “scheme”? • <scheme>:<scheme-specific-part> • In a URL, the protocol employed for retrieval (http, ftp, file, mailto, etc.) • More generally, a specification for defining the syntax and semantics of the rest of the URI • Extensible because new schemes can be defined, with their own scheme-specific format after the colon (:)

  15. Example URLs • http://www.ietf.org/rfc/rfc3986.txt • gopher://gopher.quux.org/1/Software/Gopher • mailto:kaiser+6125@cs.columbia.edu • news:news.newusers.questions • telnet:cs.columbia.edu

  16. Example Absolute URIs • http://somehost/absolute/URI/with/absolute/path/to/resource.txt • ftp://somehost/resource.txt • urn:a-rose-by-any-other-name

  17. Example Relative URIs • Base URI: http://somehost/absolute/URI/with/absolute/path/to/resource.txt • /relative/URI/with/absolute/path/to/resource.txt • relative/path/to/resource.txt • ../../../resource.txt • resource.txt • /resource.txt#frag01 • #frag01 • [empty string]

  18. Relative Addresses • Allows document trees to be (partially) independent of their location and scheme • A single set of hypertext documents can be simultaneously traversable via each of the ftp, http and file schemes if the documents refer to each other using relative URIs • Such document trees can be moved, as a whole, without changing any of the relative references
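
To make relative resolution concrete, the sketch below resolves a few of the relative references from the examples against a base URI using Python's standard `urljoin`. Treating the absolute `http://somehost/...` URI as the base is an assumption, and the second base (`ftp://otherhost/...`) is a hypothetical location added only to show that the same references keep working after the tree moves.

```python
from urllib.parse import urljoin

# Assumed base: the absolute URI from the examples.
BASE = "http://somehost/absolute/URI/with/absolute/path/to/resource.txt"
# Hypothetical new home for the same document tree, under another scheme.
MOVED_BASE = "ftp://otherhost/mirror/path/to/resource.txt"

REFERENCES = [
    "/relative/URI/with/absolute/path/to/resource.txt",
    "../../../resource.txt",
    "resource.txt",
    "#frag01",
    "",  # the empty reference resolves to the base itself
]

resolved = {ref: urljoin(BASE, ref) for ref in REFERENCES}
resolved_after_move = {ref: urljoin(MOVED_BASE, ref) for ref in REFERENCES}

for ref in REFERENCES:
    print(repr(ref), "->", resolved[ref], "|", resolved_after_move[ref])
```

The relative references themselves never change; only the base they are resolved against does, which is what lets a document tree be moved as a whole.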

  19. URI “Standard” • URI is an Internet protocol element defined currently in RFC 3986 (2005) • Originally RFC1630 (1994)

  20. What is an “RFC”? • Request for Comments • One of a series, begun in 1969, of numbered Internet informational documents and standards widely followed by commercial software and freeware in the Internet and Unix communities • All Internet standards are recorded in RFCs

  21. Who keeps track of RFCs? • IETF = Internet Engineering Task Force • Open, all-volunteer organization, with no formal membership or membership requirements • Organized into a large number of working groups, each dealing with a specific topic • April 1st RFCs, e.g., http://www.apps.ietf.org/rfc/rfc3514.html

  22. What is “W3C”? • World Wide Web Consortium defines data formats and usage conventions as well as Internet protocols relevant to the Web • Members pay fees depending on country, revenues and non-profit/for-profit status (e.g., $953 vs. $63,500) • Otherwise organized similarly to IETF, but writes “Recommendations” instead of “Requests for Comments” • http://www.w3.org/

  23. Back to URLs • Most (?) Web documents use the “http” scheme • What is “http” (HyperText Transfer Protocol)?

  24. HTTP • The default Internet protocol used to deliver data on the World Wide Web • Usually through TCP/IP sockets on port 80, but can use any port and can be implemented on top of any reliable networking protocol • A Web browser (HTTP client) sends requests to a Web server (HTTP server), which sends responses back to the client

  25. What’s “TCP/IP”? • IP = Internet Protocol • Delivers individual packets from one host to another, based on their IP address (in IPv4, four 8-bit octets as in 128.59.16.20) • Network routers direct traffic of IP packets
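
As a small illustration of the "four 8-bit octets" remark, the sketch below uses Python's standard `ipaddress` module to show that the dotted form 128.59.16.20 is just a readable spelling of one 32-bit number. The variable names are illustrative only.

```python
import ipaddress

addr = ipaddress.IPv4Address("128.59.16.20")

octets = list(addr.packed)   # the four bytes, most significant first
as_int = int(addr)           # the same address as a single 32-bit integer

# Rebuild the integer by hand from the octets to show the relationship.
rebuilt = (octets[0] << 24) | (octets[1] << 16) | (octets[2] << 8) | octets[3]

print(octets, as_int, rebuilt == as_int)
```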

  26. What’s “TCP/IP”? • TCP = Transmission Control Protocol • Provides an abstraction of reliable, bidirectional connections for the delivery of IP packets to a particular port at a given IP address • The so-called well known ports (< 1024) are reserved for specific protocols • By default, HTTP uses port 80; this can change in the URL • http://www.foo.com:2007/doc.html

  27. HTTP History • HTTP/0.9 (1990) - simple protocol for raw data transfer • HTTP/1.0 (RFC 1945, 1996) - Allowed MIME-like messages, containing meta-information about the resources transferred and modifiers on the request/response semantics • HTTP/1.1 (RFC 2616, 1999) • HTTP Extension Framework (RFC 2774, 2000)

  28. What is “MIME”? • Multipurpose Internet Mail Extensions • Standard representation for “complex” message bodies (numerous RFCs since 1993) • Examples include messages with embedded graphics or audio clips, messages with file attachments, messages in Japanese or Russian, signed messages

  29. MIME Header Fields • Mime-Version, Content-Type, Content-Transfer-Encoding, Content-Description, Content-ID, Content-Location, Content-Disposition, Part Body • Discrete (text, image, audio) and Multipart (mixed, digest) content types
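
To see several of these header fields appear in practice, here is a minimal sketch using Python's standard `email` package. The addresses, subject, file name and the six-byte "image" payload are all placeholders invented for the example; the point is only that adding an attachment to a text message produces a multipart/mixed message whose parts carry their own Content-* headers.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Slides attached"          # placeholder values throughout
msg["From"] = "sender@example.com"
msg["To"] = "recipient@example.com"

# A discrete text/plain body ...
msg.set_content("The slides are attached.")
# ... plus a discrete image/gif part turns the message into multipart/mixed.
msg.add_attachment(b"GIF89a", maintype="image", subtype="gif", filename="logo.gif")

content_type = msg.get_content_type()
part_types = [part.get_content_type() for part in msg.iter_parts()]
attachment = list(msg.iter_attachments())[0]

print(content_type, part_types)
print(attachment["Content-Disposition"], attachment["Content-Transfer-Encoding"])
```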

  30. HTTP Request/Response [Diagram with labels: HTTP Client, HTTP request, Port 80, Processing, Response, Other port]

  31. HTTP Requests and Responses • Consist of a start-line, zero or more headers (one per line), an empty line (CRLF) indicating the end of the header fields, and possibly a message-body • Message body only allowed with certain request methods and response status codes (e.g., a 200 OK response normally carries a body, while 204 No Content and 304 Not Modified never do)

  32. Sample HTTP Exchange • To retrieve the file at the URL http://www.somehost.com/path/file.html • First open a socket to the host www.somehost.com, port 80 (use the default port of 80 because none is specified in the URL)

  33. Sample • Then, send something like the following through the socket:
        GET /path/file.html HTTP/1.0
        From: someuser@columbia.edu
        User-Agent: HTTPTool/1.0
        Accept: text/html, image/gif, image/jpeg
        [blank line here]

  34. The server should respond with something like the following:
        HTTP/1.0 200 OK
        Server: Apache/1.3.0 (Linux)
        Date: Sun, 31 Dec 2006 23:59:59 GMT
        Last-Modified: Sun, 31 Dec 2006 23:59:58 GMT
        Content-Type: text/html
        Content-Length: 1354
        [blank line here]
        <html> <body> <h1>Happy New Year!</h1> (more file contents) . . . </body> </html>
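
The structure of such a response (status line, headers, blank line, body) can be illustrated by taking one apart by hand. The sketch below is a deliberately simplified parser written for this example, not a complete HTTP implementation; the response bytes are a shortened stand-in for the sample above, with a Content-Length computed from the shortened body.

```python
# Shortened stand-in for the sample response; Content-Length matches this body.
BODY = b"<html><body><h1>Happy New Year!</h1></body></html>"
RAW_RESPONSE = b"\r\n".join([
    b"HTTP/1.0 200 OK",
    b"Server: Apache/1.3.0 (Linux)",
    b"Content-Type: text/html",
    b"Content-Length: " + str(len(BODY)).encode("ascii"),
    b"",            # the empty line that ends the header section
    BODY,
])

def parse_response(raw: bytes):
    """Split a raw response into (version, status, reason, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    version, status, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # header names are case-insensitive
    return version, int(status), reason, headers, body

version, status, reason, headers, body = parse_response(RAW_RESPONSE)
print(version, status, reason, headers["content-type"], len(body))
```

In real code the standard `http.client` module handles this, along with details this sketch ignores such as repeated headers.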

  35. Some Request Headers • From: gives the email address of whoever's making the request, or running the program doing so (for bots) • User-Agent: identifies the program that's making the request, in the form "Program-name/x.xx", where x.xx is the alphanumeric version of the program (e.g., browser) • User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)

  36. Some Response Headers • Server: analogous to User-Agent:, identifies the server software in the form "Program-name/x.xx" • Server: Apache/1.3.12 (Unix) • Last-Modified: gives the modification date of the resource that's being returned, e.g., for use in caching • Use Greenwich Mean Time, in the format Last-Modified: Tue, 23 Jan 2007 00:00:01 GMT
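
The date format used by Last-Modified can be produced and parsed with Python's standard `email.utils` helpers, as in this small sketch. The timestamp is the one from the example above; the variable names are illustrative.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

# The modification time from the example, expressed in UTC.
stamp = datetime(2007, 1, 23, 0, 0, 1, tzinfo=timezone.utc)

# usegmt=True emits the "GMT" suffix rather than "+0000".
header_value = format_datetime(stamp, usegmt=True)
print("Last-Modified:", header_value)

# Parsing the header text yields a timezone-aware datetime again.
parsed = parsedate_to_datetime(header_value)
print(parsed == stamp)
```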

  37. Start Line • HTTP Version (0.9, 1.0, 1.1) • URI • Method (request) or Status Code (response)

  38. HTTP URIs • Servers may limit URIs to some bounded length (often 255) or treat them as “unbounded”; a server that cannot handle a long URI returns status code 414 (Request-URI Too Long) • Equivalence comparison: the following three URIs are equivalent • http://abc.com:80/~smith/home.html • http://ABC.com/%7Esmith/home.html • http://ABC.com:/%7esmith/home.html
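
One way to see why those three URIs compare equal is to normalize each one and compare the results. The function below is a simplified sketch written for this example, not the complete comparison rules of the URI and HTTP specifications: it lowercases the scheme and host, drops an empty or default port, and decodes percent-escapes of unreserved characters such as "~". The names `normalize`, `DEFAULT_PORTS` and `UNRESERVED` are illustrative.

```python
import re
from urllib.parse import urlsplit, urlunsplit

UNRESERVED = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
DEFAULT_PORTS = {"http": 80, "https": 443}

def _normalize_escape(match):
    """Decode %XX if it encodes an unreserved character, else uppercase the hex."""
    char = chr(int(match.group(1), 16))
    return char if char in UNRESERVED else "%" + match.group(1).upper()

def normalize(uri: str) -> str:
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()      # host names are case-insensitive
    port = parts.port                          # None when absent or empty
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = re.sub(r"%([0-9A-Fa-f]{2})", _normalize_escape, parts.path) or "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))

URIS = [
    "http://abc.com:80/~smith/home.html",
    "http://ABC.com/%7Esmith/home.html",
    "http://ABC.com:/%7esmith/home.html",
]
normalized = {normalize(u) for u in URIS}
print(normalized)
```

The path is left case-sensitive on purpose: only the scheme and host are compared without regard to case.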

  39. Request Messages • Method SP Request-URI SP HTTP-Version CRLF • GET http://www.w3.org/pub/WWW/TheProject.html HTTP/1.1 • Equivalent to client making TCP connection to www.w3.org on port 80, then sending GET /pub/WWW/TheProject.html HTTP/1.1 Host: www.w3.org • Host field allows for virtual hosts
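
The step from an absolute URL to the request actually sent over the connection can be sketched as follows. `build_get_request` is a hypothetical helper written for this example; it only builds the bytes and does not open a connection.

```python
from urllib.parse import urlsplit

def build_get_request(url: str, version: str = "HTTP/1.1") -> bytes:
    """Build the request line and Host header a client would send for url."""
    parts = urlsplit(url)
    target = parts.path or "/"           # an empty path is sent as "/"
    if parts.query:
        target += "?" + parts.query
    lines = [
        f"GET {target} {version}",       # Method SP Request-URI SP HTTP-Version
        f"Host: {parts.netloc}",         # lets one server distinguish virtual hosts
        "",                              # empty line ends the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")

request = build_get_request("http://www.w3.org/pub/WWW/TheProject.html")
print(request.decode("ascii"))
```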

  40. What is a “virtual host”? • Enables the same machine to host multiple domain names, sometimes at the same IP address (name-based virtual hosting) • Important for website hosting (e.g., www.foo.com maps to /www/foo/site1 and www.bar.com maps to /www/bar/site2), but usually there can be only one secure https website per IP address/port
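
Name-based virtual hosting comes down to a lookup keyed on the Host header. The sketch below uses the two mappings from the example; the fallback directory `/www/default` and the function name are assumptions made for illustration, and real servers configure this in their own ways.

```python
# Mappings taken from the example; a real server would read these from its configuration.
DOCUMENT_ROOTS = {
    "www.foo.com": "/www/foo/site1",
    "www.bar.com": "/www/bar/site2",
}

def document_root(host_header: str, default: str = "/www/default") -> str:
    """Pick a document root from the Host header (hypothetical default path)."""
    host = host_header.split(":", 1)[0].strip().lower()  # drop any port, ignore case
    return DOCUMENT_ROOTS.get(host, default)

print(document_root("www.foo.com"))
print(document_root("WWW.BAR.COM:80"))
print(document_root("unknown.example"))
```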

  41. GET • Retrieve whatever information (in the form of an entity) is identified by the URI • If the URI refers to a data-producing process, it is the produced data (given the input parameters after the “?”, if any) that is returned as the entity in the response - not the source text of the process (unless that text happens to be the output of the process) • http://foo.com/run.cgi?name1=val1&name2=val2
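
The input parameters after the "?" are ordinary name-value pairs, which Python's standard `urllib.parse` can both build and take apart. The sketch reuses the example URL; the extra pair with a space and an ampersand is an invented value showing why encoding matters.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Build the query string from name-value pairs.
params = {"name1": "val1", "name2": "val2"}
url = "http://foo.com/run.cgi?" + urlencode(params)

# Recover the pairs on the receiving side; each name maps to a list of values.
decoded = parse_qs(urlsplit(url).query)

# Characters with special meaning in a query must be escaped (invented value).
escaped = urlencode({"q": "a b&c"})

print(url)
print(decoded)
print(escaped)
```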

  42. Conditional and Partial GET • Conditional if the request message includes an If-Modified-Since, If-Unmodified-Since, If-Match, If-None-Match, or If-Range header field • Partial if the request message includes a Range header field • Don’t retrieve data the client doesn’t need (e.g., data already in the cache, at least in part, and still up to date)
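
The server side of a conditional GET with If-Modified-Since can be sketched as a single comparison. `conditional_status` is a hypothetical helper written for this example and covers only that one header: it returns 304 (Not Modified) when the client's copy is current and 200 otherwise, and it treats a missing or unparseable header as an ordinary GET.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

def conditional_status(last_modified: datetime, if_modified_since: Optional[str]) -> int:
    """Return 304 if the resource is unchanged since the given date, else 200."""
    if if_modified_since is None:
        return 200                        # unconditional request
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200                        # ignore a header that cannot be parsed
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop sub-second precision.
    if last_modified.replace(microsecond=0) <= since:
        return 304
    return 200

modified = datetime(2006, 12, 31, 23, 59, 58, tzinfo=timezone.utc)
print(conditional_status(modified, "Sun, 31 Dec 2006 23:59:59 GMT"))
print(conditional_status(modified, "Sat, 30 Dec 2006 00:00:00 GMT"))
```

A 304 response carries no body, so the client reuses its cached copy and only the headers cross the network.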

  43. HEAD • Identical to GET except that the server must not return a message-body in the response - only returns headers • Often used for testing hypertext links for validity and modification • Can mark cache entries as stale if certain header information changes (e.g., length, last-modified)

  44. POST • Used to request that the origin server accept the entity enclosed in the request as a new subordinate of the resource identified by the Request-URI in the Request-Line • Actual function performed by the POST method is determined by the server, usually dependent on the Request-URI

  45. POST supports several functions • Annotation of an existing resource • Posting a message to a bulletin board, newsgroup, mailing list, or similar group of articles • Providing a block of data, such as the result of submitting a form, to a data-handling process • Extending a database through an append operation

  46. POST vs. GET • GET can be used to send small amounts of data to a server, with the data following the ? character • The rest of the request-URI (before the ?) refers to some kind of processing program GET /path/script.cgi?field1=value1&field2=value2 HTTP/1.0

  47. PUT and DELETE • Often unsupported (501 Not Implemented) • PUT requests that the enclosed entity be stored under the supplied Request-URI • May create a new resource at a new URI, or modify an existing resource already at that URI • DELETE requests that the origin server delete the resource identified by the Request-URI • May be overridden, e.g., by human intervention, even if status code indicates successfully completed

  48. OPTIONS and TRACE • OPTIONS allows the client to determine the requirements associated with a resource, or the capabilities of a server (OPTIONS *), without implying a resource action or initiating a resource retrieval • TRACE used to invoke application-layer loop-back of the request message, allowing the client to see what is being received at the other end of the request chain for testing or diagnostic information

  49. HTTP is “Stateless” • Server doesn’t remember anything about client between connections • Not even between requests during the same persistent connection, except TCP data • But some state can be encoded in complex URLs or in forms • Or saved on client in “cookies”

  50. Cookies • Opaque string associated with a website, stored at the browser • Created in an HTTP response with “Set-Cookie:” • In all subsequent requests to this site, until the cookie’s expiration, the client sends the HTTP header “Cookie:” • Name-value pairs • Cookie: user=“alex”; lastvisit=“20070123-11:00” • Interpretation up to the Web application
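
Both directions of the cookie exchange can be sketched with Python's standard `http.cookies` module. The names and values come from the example above (written without the surrounding quotes); the Path and Max-Age attributes are assumptions added to show where an expiration would go.

```python
from http.cookies import SimpleCookie

# Server side: create the cookie that goes into the response.
response_cookie = SimpleCookie()
response_cookie["user"] = "alex"
response_cookie["user"]["path"] = "/"        # assumed attribute, for illustration
response_cookie["user"]["max-age"] = 3600    # assumed one-hour lifetime
set_cookie_line = response_cookie.output()   # a full "Set-Cookie: ..." header line
print(set_cookie_line)

# Server side, later request: parse what the browser sends back.
request_cookie = SimpleCookie()
request_cookie.load("user=alex; lastvisit=20070123-11:00")
user = request_cookie["user"].value
last_visit = request_cookie["lastvisit"].value
print(user, last_visit)
```

What "alex" or the visit timestamp mean is entirely up to the Web application; to HTTP they are opaque strings.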
