SENG2220 Web Development II

SENG2220 Web Development II • Mohammed A. Saleh • http://ifm.ac.tz/staff/msaleh/SENG2220.html • 6th November 2009

Module Content The HTTP Protocol • HTTP version 1.0. GET and POST methods. Request line. Status line. Headers. Carrying Data. Relationship to TCP. HTTP version 1.1. Methods available. Persistent connections. Chunked encoding. Mandatory headers. Future evolution of HTTP. Suitability of HTTP as transport for higher protocols.

Hypertext Transfer Protocol (HTTP) • HTTP is a simple stateless request-response protocol • It is the network protocol used to deliver virtually all files and other data (collectively called resources) on the World Wide Web • The file may contain static data • HTML pages, GIFs, JPEGs, Microsoft Word documents, Adobe PDF documents, etc. • The file may be a program that runs on the server to output data • ASP, PHP, Perl, JSP, etc. • Takes place through TCP/IP sockets • A web client (user agent) requests a resource identified by a uniform resource locator (URL) • The web server identified in the URL responds with the file identified in the URL

Cont … • The standard (and default) port for HTTP servers to listen on is 80 What are "Resources"? • A resource is some chunk of information that can be identified by a URL (it's the R in URL). • The most common kind of resource is a file, but a resource may also be a dynamically-generated query result or the output of a CGI script • HTTP/1.0 highly successful • HTTP/1.1 introduced to address flaws in 1.0 and improve network performance • pipelining requests and responses

WWW Architecture [diagram] • Client (platform: Win, Mac, Unix; browser: IE, Mozilla, Opera) sends a request, e.g. http://www.ifm.ac.tz/about/ • Network: HTTP over TCP/IP • Server (platform: Win, Mac, Unix; web server: Apache, IIS) returns a response, e.g. <html>…</html>

WWW Architecture • Client-Server Request-Response architecture • You request a web page • e.g. http://www.ifm.ac.tz/about/index.html • HTTP request • The web server responds with data • HTTP response • usually in the form of a web page (HTML document) • could be any file format • web page is written using HyperText Markup Language (HTML) • Web pages are identified by a Uniform Resource Locator (URL) • protocol: e.g. http • web server: e.g. www.ifm.ac.tz • [machine name].[domain name] • web page: e.g. about/index.html

Structure of HTTP Transactions • HTTP uses the client-server model • An HTTP client opens a connection and sends a request message to an HTTP server • The server then returns a response message, usually containing the resource that was requested • After delivering the response, the server closes the connection (the HTTP/1.0 default) • The formats of the request and response messages are similar, and English-oriented • an initial line, • zero or more header lines, • a blank line (i.e. a CRLF by itself), and • an optional message body (e.g. a file, or query data, or query output). • CR and LF here mean ASCII values 13 and 10, even though some platforms may use different characters
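The message layout described above can be sketched in Python. This is a minimal illustration under stated assumptions, not a full parser: the function name split_message is hypothetical, and the sample request simply reuses the host and path from this module.

```python
def split_message(raw: bytes):
    """Split a raw HTTP message into (initial line, header lines, body)."""
    # The blank line (a CRLF by itself) separates the head from the optional body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    initial_line, header_lines = lines[0], lines[1:]
    return initial_line, header_lines, body


# Sample request assembled for illustration only.
raw = (
    b"GET /about/index.html HTTP/1.0\r\n"
    b"Host: www.ifm.ac.tz\r\n"
    b"Accept: text/html\r\n"
    b"\r\n"
)
initial_line, header_lines, body = split_message(raw)
print(initial_line)
print(header_lines)
print(body)
```

The same function works for responses, since both message types share the initial line, headers, blank line, body layout.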

Initial Request Line • The initial line is different for the request than for the response • A request line has three parts, separated by spaces • a method name • the local path of the requested resource • the version of HTTP being used • A typical request line is: • GET /path/to/file/index.html HTTP/1.1 • Notes: • GET is the most common HTTP method; it says "give me this resource". Other methods include POST and HEAD. Method names are always uppercase • The path is the part of the URL after the host name, also called the request URI • The HTTP version always takes the form "HTTP/x.x", uppercase.
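As a rough sketch of the three-part request line described above, the following Python splits a request line on spaces and checks the version prefix. The function name parse_request_line is an assumption made for this example.

```python
def parse_request_line(line: str):
    """Split a request line into (method, path, version)."""
    parts = line.split(" ")
    if len(parts) != 3:
        raise ValueError(f"expected three parts, got {len(parts)}")
    method, path, version = parts
    # The version always takes the form "HTTP/x.x", uppercase.
    if not version.startswith("HTTP/"):
        raise ValueError(f"bad HTTP version: {version!r}")
    return method, path, version


method, path, version = parse_request_line("GET /path/to/file/index.html HTTP/1.1")
print(method, path, version)
```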

HTTP Request
• Request line (method, file name, HTTP version):
GET /msaleh/index.html HTTP/1.1
• Headers:
Host: staff.ifm.ac.tz
Connection: close
Accept: text/xml,text/html,text/plain,image/png,*/*
Accept-Language: en-gb,en
User-Agent: Mozilla/4.0 (compatible;MSIE 6.0;Windows NT 5.0)
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*
If-Modified-Since: Mon, 18 Sep 2006 22:57:19 GMT
Referer: http://web-sniffer.net
• Blank line
• Data: none for GET

Initial Response Line • The initial response line, called the status line, also has three parts separated by spaces • the HTTP version, • a response status code that gives the result of the request, and • an English reason phrase describing the status code • Typical status lines are: • HTTP/1.0 200 OK or • HTTP/1.0 404 Not Found • Notes: • The HTTP version is in the same format as in the request line, "HTTP/x.x". • The status code is meant to be computer-readable; the reason phrase is meant to be human-readable, and may vary.
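The status line can be handled in a similar way. In this sketch (the function name parse_status_line is hypothetical) the split is limited to two separators, because the reason phrase may itself contain spaces.

```python
def parse_status_line(line: str):
    """Split a status line into (version, status code, reason phrase)."""
    # Split at most twice so a phrase such as "Not Found" stays in one piece.
    version, code, reason = line.split(" ", 2)
    return version, int(code), reason


for line in ("HTTP/1.0 200 OK", "HTTP/1.0 404 Not Found"):
    version, code, reason = parse_status_line(line)
    # Programs should act on the numeric code; the phrase is for humans and may vary.
    print(version, code, reason)
```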

HTTP Response
• Status line (HTTP version, status code, reason phrase):
HTTP/1.0 200 OK
• Headers:
Date: Thu, 21 Sep 2006 22:06:05 GMT
Server: Apache/1.3.33 (Unix) PHP/4.3.10
Connection: close
Content-Type: text/html
ETag: "5d150-141c-450f244f"
Last-Modified: Mon, 18 Sep 2006 22:57:19 GMT
Content-Length: 5184
• Blank line
• Data:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict
<html xmlns="http://www.w3.org/1999/xhtml"> ... </html>

HTTP Server Status Codes

Header Lines • Headers are name/value pairs that appear in both request and response messages • The name of the header is separated from the value by a single colon • For example, this line in a request message: • User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) • provides a header called User-Agent whose value is Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) • The purpose of this particular header is to supply the web server with information about the type of browser making the request
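A minimal sketch of reading header lines into name/value pairs follows. The function name parse_headers is an assumption; the sample lines are taken from this module. Splitting on the first colon only matters because values such as URLs contain colons of their own, and names are lowercased here because HTTP header names are case-insensitive.

```python
def parse_headers(lines):
    """Turn 'Name: value' lines into a dict keyed by lowercased header name."""
    headers = {}
    for line in lines:
        # partition() splits on the first colon only, so "http://..." stays intact.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


headers = parse_headers([
    "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Referer: http://www.httpwatch.com/httpgallery/headers/",
])
print(headers["user-agent"])
print(headers["referer"])
```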

Request Headers • HTTP clients use headers in the request message to identify themselves and control how content is returned. • Example if you are using IE: • Accept: */* • This header indicates that the browser will accept all types of content. • Accept-Language: en-gb • The browser prefers British English content. • Accept-Encoding: gzip, deflate • The browser can handle gzip or deflate compressed content • Connection: Keep-Alive • The browser is requesting the use of persistent TCP connections. • Referer: http://www.httpwatch.com/httpgallery/headers/ • This is supplied by the browser to indicate that the current request was the result of a link from another web page • User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) • This identifies the browser as Internet Explorer Version 6 running on Windows XP.

Response Headers • HTTP servers use headers in the response message to specify how content is being returned and how it should be handled • Example if using IE: • Cache-Control: no-cache • This header indicates whether the resource may be cached by the browser or any intermediate caches. • The value no-cache tells caches not to reuse a stored copy without first checking with the server • Content-Length: 2748 • This header contains the length in bytes of the resource (i.e. the gif image) that follows the headers. • Content-Type: image/gif • The content is in GIF format. • Date: Wed, 4 Oct 2004 12:00:00 GMT • This is the current date and time on the web server. • Expires: -1 • The Expires header specifies when the content should be considered to be out of date. The value -1 indicates that the content expires immediately and would have to be re-requested before being displayed again.

Response Headers • Server: Microsoft-IIS/6.0 • The web server is an IIS 6 web server. • X-AspNet-Version: 2.0.50727 • The web server is running ASP.NET 2.0 • X-Powered-By: ASP.NET • The web server is running ASP.NET.

HTTP Methods • The HTTP method is supplied in the request line and specifies the operation that the client has requested. • The two most commonly used methods are GET and POST • GET for queries that can be safely repeated • POST for operations that may have side effects (e.g. ordering a book from an on-line store). The GET Method • It is used to retrieve information from a specified URI and is assumed to be a safe, repeatable operation by browsers, caches and other HTTP-aware components • Operations have no side effects and GET requests can be re-issued • For example, displaying the balance of a bank account has no effect on the account and can be safely repeated

HTTP Methods • Most browsers will allow a user to refresh a page that resulted from a GET, without displaying any kind of warning • Proxies may automatically retry GET requests if they encounter a temporary network connection problem. • A downside of GET requests is that they can only supply data in the form of parameters encoded in the URI (known as a query string) • They cannot be used for uploading files or other operations that require large amounts of data to be sent to the server. The POST Method • Used for operations that have side effects and cannot be safely repeated

HTTP Methods • For example, transferring money from one bank account to another has side effects and should not be repeated without explicit approval by the user • If you try to refresh a page in Internet Explorer that resulted from a POST, it displays the following message to warn you that there may be side effects:

HTTP Methods • The POST request message has a content body that is normally used to send parameters and data • The IIS server returns two status codes in its response for a POST request • The first is 100 Continue to indicate that it has successfully received the POST request • The second is 200 OK after the request has been processed.
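The difference in how GET and POST carry data can be sketched as follows. The host bank.example, the paths and the form fields are all hypothetical, chosen only to echo the bank account examples above; urlencode comes from the Python standard library.

```python
from urllib.parse import urlencode


def build_get(host, path, params):
    """GET: parameters travel in the URI as a query string; there is no body."""
    query = urlencode(params)
    return f"GET {path}?{query} HTTP/1.1\r\nHost: {host}\r\n\r\n".encode("ascii")


def build_post(host, path, params):
    """POST: parameters travel in the content body, sized by Content-Length."""
    body = urlencode(params).encode("ascii")
    head = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("ascii")
    return head + body


# Safe, repeatable query: show a balance.
get_request = build_get("bank.example", "/balance", {"account": "savings"})
# Operation with side effects: transfer money.
post_request = build_post(
    "bank.example", "/transfer",
    {"from": "savings", "to": "current", "amount": "100"},
)
print(get_request.decode("ascii"))
print(post_request.decode("ascii"))
```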

HTTP Recap • HTTP is a stateless protocol • Each HTTP request is independent of previous and subsequent requests • HTTP/1.0 defaults to Connection: close • closes the channel of communication immediately after a response • Connection: keep-alive was introduced to enable persistent connections • no need to re-negotiate a connection for each request • a connection can be re-used for multiple requests • HTTP/1.1 defaults to keep-alive for efficiency • supports pipelining to allow multiple requests to be sent on one connection without waiting for each response • The stateless nature of HTTP has a big impact on how web applications are designed

State Preservation • State preservation mechanisms come in three basic variations: • Cookies • a cookie is a small piece of text stored by the web browser on the user's computer (locally, on the hard disk drive) • stores a small amount of information on the client • sent to the server with each HTTP request • Session variables • a unique identifier is used to associate information stored on the server with a particular client • Passing data at each request-response cycle • store information in the web page • appending data to a URL • hidden fields in HTML forms
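The first two mechanisms above are often combined: a cookie carries only a unique identifier, and the server keeps the actual session variables. The sketch below uses Python's standard http.cookies module; the identifier abc123, the user name and the in-memory sessions dictionary are assumptions made purely for illustration.

```python
from http.cookies import SimpleCookie

# Server side: build a Set-Cookie response header carrying a session identifier.
cookie = SimpleCookie()
cookie["session_id"] = "abc123"          # hypothetical identifier
cookie["session_id"]["path"] = "/"
set_cookie_header = cookie.output()      # one "Set-Cookie: ..." header line
print(set_cookie_header)

# Client side: the browser sends the stored pair back with each request.
cookie_header = "Cookie: session_id=abc123"

# Server side again: parse the Cookie header and look up the server-held state.
sessions = {"abc123": {"user": "msaleh", "basket": ["book"]}}  # hypothetical store
received = SimpleCookie()
received.load(cookie_header.split(":", 1)[1].strip())
session = sessions[received["session_id"].value]
print(session["user"])
```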

Caching • Web pages often contain content that remains unchanged for long periods of time. • For example, an image containing a company logo may be used without modification for many years. • It is wasteful in terms of bandwidth and round trips to repeatedly download images or other content that is not regularly updated • HTTP supports caching so that content can be stored locally by the browser and reused when required • By carefully controlling caching, it is possible to reuse static content and prevent the storage of dynamic data. • Browser caching is controlled by the use of the Cache-Control, Last-Modified and Expires response headers

Caching • Preventing Caching • Servers set the Cache-Control response header to no-cache to indicate that content should not be cached by the browser: • Cache-Control: no-cache • Allowing Caching • The Cache-Control header can be set to one of the following values to allow caching: • <absent>: If the Cache-Control header is not set, then any cache may store the content. • private: The content is intended for use by a single user and should only be cached locally in the browser. • public: The content may be cached in public caches (e.g. shared proxies)

Caching • If the browser is to make effective use of cached content, two extra pieces of information should be supplied. • The first is the modification date/time of the content. The server supplies this in the Last-Modified response header: • Last-Modified: Wed, 15 Sep 2004 12:00:00 GMT • The second piece of information is the expiration date, which is specified with the Expires header: • Expires: Sun, 17 Jan 2038 19:14:07 GMT • If a cached entry has a valid expiration date, the browser can reuse the content without having to contact the server at all when a page or site is revisited • This greatly reduces the number of network round trips for frequently visited pages

Caching • For example, the Google logo is set to expire in 2038 and will only be downloaded on your first visit to google.com
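The expiration check a browser cache performs can be sketched as below. This is a simplified illustration that looks only at the Expires header and ignores Cache-Control; the function name is_fresh and the choice of the lecture date as "now" are assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_fresh(expires_header: str, now: datetime) -> bool:
    """Return True if a cached entry may be reused without contacting the server."""
    try:
        expires = parsedate_to_datetime(expires_header)
    except (TypeError, ValueError):
        # Invalid values such as "-1" are treated as already expired.
        return False
    return now < expires


now = datetime(2009, 11, 6, tzinfo=timezone.utc)  # assumed "current" time
print(is_fresh("Sun, 17 Jan 2038 19:14:07 GMT", now))  # far-future expiry
print(is_fresh("Wed, 15 Sep 2004 12:00:00 GMT", now))  # already in the past
print(is_fresh("-1", now))                             # expires immediately
```

Both exception types are caught because different Python versions report an unparseable date differently.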

Let us s-QUIZ our BRAINS • What do the following acronyms stand for? HTML, HTTP, TCP, UDP, IP, FTP, SMTP, DNS and OSI • How many layers are found on the OSI reference model? How about the TCP/IP protocol stack? List them • Why is HTTP considered to be a stateless protocol? • Is there a need to maintain state on the web? What is a cookie? • Mention the two main HTTP headers. How do they differ?

Encoding • When an HTTP client is reading a response message from a server it needs to know when it has reached the end of the message. • It is important with persistent (keep alive) connections, because a connection can only be re-used by another HTTP transaction after the response message has been fully received • Three ways in which an HTTP server can indicate the end of the response message:

Cont … Connection Closed by Server • The connection can be closed at the end of the response message by the server • Prevents connections being re-used. Content-Length Header • The length of the content after the response headers can be specified in bytes with the Content-Length header Chunked Encoding • The content can be broken up into a number of chunks, each of which is prefixed by its size in bytes (written in hexadecimal) • A zero size chunk indicates the end of the response message.

Cont … • If a server is using chunked encoding it must set the Transfer-Encoding header to "chunked". • Useful when a large amount of data is being returned to the client and the total size of the response may not be known until the request has been fully processed • An example of this is generating an HTML table of results from a database query • If you wanted to use the Content-Length header you would have to buffer the whole result set before calculating the total content size • with chunked encoding you could just write the data one row at a time and write a zero sized chunk when the end of the query was reached.
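A minimal sketch of chunked encoding follows, assuming the table rows arrive one at a time as in the database example above. The function names are hypothetical; chunk sizes are written in hexadecimal, as the HTTP/1.1 chunked format requires, and the decoder ignores trailers, which a complete implementation would need to handle.

```python
def encode_chunked(chunks):
    """Encode an iterable of byte strings as a chunked message body."""
    out = b""
    for chunk in chunks:
        if chunk:  # a zero-size chunk would end the message early
            out += f"{len(chunk):X}\r\n".encode("ascii") + chunk + b"\r\n"
    # The zero-size chunk marks the end of the response message.
    return out + b"0\r\n\r\n"


def decode_chunked(data: bytes) -> bytes:
    """Reassemble the body from a chunked message body (no trailer support)."""
    body = b""
    pos = 0
    while True:
        line_end = data.index(b"\r\n", pos)
        size_text = data[pos:line_end].split(b";")[0].decode("ascii")
        size = int(size_text, 16)
        if size == 0:
            return body
        start = line_end + 2
        body += data[start:start + size]
        pos = start + size + 2  # skip the CRLF that follows the chunk data


rows = [b"<tr><td>1</td></tr>", b"<tr><td>2</td></tr>"]  # written one row at a time
encoded = encode_chunked(rows)
print(encoded)
print(decode_chunked(encoded))
```

The server never needs to know the total size in advance, which is the point of the technique.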

Key differences between HTTP/1.1 and HTTP/1.0 • Version numbers • The version number in an HTTP message refers to the hop-by-hop sender of the message, not the end-to-end sender • For example, if an HTTP/1.1 origin server receives a message forwarded by an HTTP/1.1 proxy, it cannot tell from that message whether the ultimate client uses HTTP/1.0 or HTTP/1.1 • HTTP/1.1 defines a Via header that describes the path followed by a forwarded message • The OPTIONS method • HTTP/1.1 introduces the OPTIONS method • A way for a client to learn about the capabilities of a server without actually requesting a resource

Cont … • Upgrading to other protocols • To ease the deployment of incompatible future protocols, HTTP/1.1 includes the new Upgrade request-header • A client can inform a server of the set of protocols it supports as an alternate means of communication • Caching • effective because a few resources are requested often by many users, or repeatedly by a given user • employed in most Web browsers and in many proxy servers; occasionally they are also employed in conjunction with certain origin servers • eliminates the network communication with the origin server • reduces bandwidth consumption, by avoiding the transmission of unnecessary network packets • can reduce the load on origin servers

Cont … • Caching in HTTP/1.0 • An origin server may mark a response, using the Expires header, with a time until which a cache may reuse it • A cache may check the current validity of a response with a conditional request (If-Modified-Since) • Shortcomings: It did not allow either origin servers or clients to give full and explicit instructions to caches • Problems: incorrect caching of some responses that should not have been cached and failure to cache some responses that could have been cached • Caching in HTTP/1.1 • provides explicit and extensible protocol mechanisms for caching • a cache entry is fresh until it reaches its expiration time, at which point it becomes stale • A cache need not discard a stale entry • but it normally must revalidate it with the origin server
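Revalidating a stale entry can be sketched from the origin server's point of view. This assumes validation by date only, using the If-Modified-Since request header and the Last-Modified value of the resource; the function name revalidate is hypothetical. A 304 Not Modified response carries no body and lets the cache keep using its stored copy.

```python
from email.utils import parsedate_to_datetime


def revalidate(if_modified_since: str, last_modified: str) -> int:
    """Return the status code an origin server would send for a conditional GET."""
    cached = parsedate_to_datetime(if_modified_since)
    current = parsedate_to_datetime(last_modified)
    # 304 Not Modified: the cache may keep using its stale entry.
    # 200 OK: the resource changed, so a fresh copy is sent.
    return 304 if current <= cached else 200


# The resource is unchanged since the cache stored it.
print(revalidate("Mon, 18 Sep 2006 22:57:19 GMT", "Mon, 18 Sep 2006 22:57:19 GMT"))
# The resource was modified after the cache stored it.
print(revalidate("Mon, 18 Sep 2006 22:57:19 GMT", "Thu, 21 Sep 2006 22:06:05 GMT"))
```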

Cont … • Bandwidth optimization • Network bandwidth is almost always limited • Queueing delay is caused by congestion • Wasting bandwidth increases latency • HTTP/1.0 wastes bandwidth in several ways that HTTP/1.1 addresses • A typical example is a server sending an entire (large) resource when the client only needs a small part of it • There was no way in HTTP/1.0 to request partial objects • It is also possible for bandwidth to be wasted in the forward direction • If an HTTP/1.0 server could not accept large requests, it would return an error code after bandwidth had already been consumed • What was missing?

Cont … • What was missing was the ability to negotiate with a server and to ensure its ability to handle such requests before sending them • A client may need only part of a resource, for example to display just the beginning of a long document • HTTP/1.1 range requests allow a client to request portions of a resource • Persistent connections • HTTP/1.0 made no provision for persistent connections • Some HTTP/1.0 implementations used a Keep-Alive header to request that a connection persist • This design did not interoperate well with intermediate proxies • HTTP/1.1 makes persistent connections the default • HTTP/1.1 clients, servers, and proxies assume that a connection will be kept open after the transmission of a request and its response
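The range requests mentioned above can be sketched from the server side. This simplified example handles only a single range of the form bytes=first-last (or bytes=first-) and falls back to the whole resource otherwise; the function name apply_range and the 100-byte sample resource are assumptions.

```python
def apply_range(resource: bytes, range_header: str):
    """Serve a single 'bytes=first-last' range; returns (status, headers, body)."""
    unit, _, spec = range_header.partition("=")
    first_text, _, last_text = spec.partition("-")
    if unit.strip() != "bytes" or not first_text:
        # Unsupported form in this sketch: send the whole resource.
        return 200, {"Content-Length": str(len(resource))}, resource
    first = int(first_text)
    last = int(last_text) if last_text else len(resource) - 1
    last = min(last, len(resource) - 1)
    if first > last:
        # 416: the requested range lies outside the resource.
        return 416, {"Content-Range": f"bytes */{len(resource)}"}, b""
    body = resource[first:last + 1]
    headers = {
        "Content-Range": f"bytes {first}-{last}/{len(resource)}",
        "Content-Length": str(len(body)),
    }
    # 206 Partial Content: only the requested portion is transmitted.
    return 206, headers, body


resource = bytes(range(100))  # stand-in for a long document
status, headers, body = apply_range(resource, "bytes=0-9")
print(status, headers, len(body))
```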

Cont … • Persistent connections may be cleanly terminated for resource-management reasons • Pipelining • HTTP/1.1 encourages the transmission of multiple requests over a single TCP connection • each request must still be sent in one contiguous message • and a server must send responses (on a given connection) in the order that it received the corresponding requests • However, a client need not wait to receive the response for one request before sending another request on the same connection • a client could send a large number of requests over a TCP connection before receiving any of the responses • This practice is known as pipelining.
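Pipelining can be illustrated without a network by treating the connection as a byte stream. In this sketch the client writes several body-less GET requests back-to-back, and the server side recovers them in arrival order so that responses can be sent in the same order. The function names and the requested paths are hypothetical.

```python
def pipeline(paths, host):
    """Concatenate several GET requests to send back-to-back on one connection."""
    return b"".join(
        f"GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n".encode("ascii")
        for path in paths
    )


def split_requests(stream: bytes):
    """Server side: recover body-less requests from the stream, in arrival order."""
    return [request for request in stream.split(b"\r\n\r\n") if request]


# The client sends all three requests before reading any response.
stream = pipeline(["/index.html", "/logo.gif", "/style.css"], "www.ifm.ac.tz")
request_lines = [r.split(b"\r\n")[0].decode("ascii") for r in split_requests(stream)]
for line in request_lines:
    print(line)
```

Each request remains one contiguous message; only the waiting between requests is removed.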

Questions
