Lecture 31
5/3/04

Reading from a URL

Last time we looked at code to read from a file. That code looked like this:

// create a stream to access dictionary file
fileReader = new BufferedReader(new FileReader("file.txt"));

// get the first line of the file
nextLine = fileReader.readLine();
			
// Read lines until we reach the end of the file
while (nextLine != null) {
   // Code to process the line omitted

   // get the next line from the file
   nextLine = fileReader.readLine();
}

// Close the file
fileReader.close();

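How different are Web pages from files? There are two obvious differences:

1. The text in a Web page is written in HTML.
2. The Web page is probably stored on a different machine.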

The fact that the text in a Web page is HTML is only an issue if we want to display it nicely, like a Web browser does. HTML is, in fact, just text but with a funny syntax, in the same way that a Java file is just text with a funny syntax.

What we care about is the second point: the Web page is probably on a different machine. Because of this, when we read a Web page in Java we can't use a FileReader as above. Fortunately, Java makes it easy for us to read Web pages with a URL. We need to construct the stream reader differently, but after we do that the code to read from the stream is identical. Here it is:

// Note the different way of constructing a stream to read a URL
// over the network
pageReader = new BufferedReader(new InputStreamReader(url.openStream()));

// Read the first line
nextLine = pageReader.readLine();

// Loop until all the lines are read
while (nextLine != null) {
   // Code to process next line omitted

   nextLine = pageReader.readLine();
}
// Close the stream
pageReader.close();
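
The fragment above assumes that a variable named url already holds a java.net.URL object. As a rough sketch of how the pieces might fit into a complete program, the class below builds the URL from a command-line argument and counts the lines of the page. The class name PageLineCounter and the method countLines are made up for this illustration and do not come from the lab code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;

// Hypothetical example class; the name is not from the lecture or the labs.
class PageLineCounter {

    // Reads every line from the reader using the loop shown above,
    // closes the reader, and returns how many lines were read.
    static int countLines(BufferedReader reader) throws IOException {
        int count = 0;
        String nextLine = reader.readLine();
        while (nextLine != null) {
            count++;
            nextLine = reader.readLine();
        }
        reader.close();
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java PageLineCounter <url>");
            return;
        }
        // Build the URL object that the lecture code calls "url".
        // (Older code often writes new URL(args[0]) instead.)
        URL url = URI.create(args[0]).toURL();
        BufferedReader pageReader =
            new BufferedReader(new InputStreamReader(url.openStream()));
        System.out.println(countLines(pageReader) + " lines");
    }
}
```

Because countLines only needs a BufferedReader, the same method works unchanged on a reader built from a FileReader, which is the point of the comparison with last lecture's code.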

How client/server systems work

Now, let's talk a bit about how networking programs work. When we write a networking program like the HTMLLinkFinder or the SpamFilter from lab, our program is communicating with another program over the network. Our program is a client. The other program is a server. In the case of HTMLLinkFinder, we are communicating with a Web server. In the case of the SpamFilter, we are communicating with a POP (Post Office Protocol) server, which is one type of server that can be used to read mail.

Let's talk about how a Web client/server pair works a bit more. First, we'll look at URLs and understand what they mean better. A simple URL may look like:

http://www.williams.edu

A more complicated URL is:

http://www.williams.edu/index.html

What are all these pieces?

http
This specifies the protocol that will be used in communication. "http" means HyperText Transfer Protocol. This is the protocol used by Web servers. More on protocols later.

www.williams.edu
This identifies the name of the machine where the Web server is running.

everything else
The rest is information that is passed to the Web server to identify the Web page we want to view.

To connect to a server, we need to identify the machine and the port to use. Think of the port as something like a telephone extension. One machine may run several servers, so we identify the specific server we want by identifying its port. Common servers, like Web servers, have default port numbers. The default port number for a Web server is 80. If a server is running at its default port, we can omit the port number in the URL.
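
To make the pieces of a URL concrete, here is a small sketch that asks Java's URL class for the protocol, machine name, port, and page name. The class UrlParts and its method names are invented for this example. The sketch relies on getPort() returning -1 when the URL does not name a port, in which case getDefaultPort() supplies the protocol's default.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical helper class for illustration only.
class UrlParts {

    // Returns the port a client would connect to: the explicit port if the
    // URL names one, otherwise the protocol's default (80 for http).
    static int effectivePort(String address) throws MalformedURLException {
        URL url = URI.create(address).toURL();
        int port = url.getPort();   // -1 when the URL omits the port
        return (port == -1) ? url.getDefaultPort() : port;
    }

    // Describes the pieces of a URL, one per line.
    static String describe(String address) throws MalformedURLException {
        URL url = URI.create(address).toURL();
        return "protocol: " + url.getProtocol() + "\n"
             + "machine:  " + url.getHost() + "\n"
             + "port:     " + effectivePort(address) + "\n"
             + "page:     " + url.getPath();
    }

    public static void main(String[] args) throws MalformedURLException {
        System.out.println(describe("http://www.williams.edu/index.html"));
    }
}
```

Running main prints:

protocol: http
machine:  www.williams.edu
port:     80
page:     /index.html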

Once connected, the protocol defines a set of commands that we can use to talk to the server. The protocol also specifies the arguments that the command requires and the return values that it produces.

HTTP is very simple. The only command we need is:

GET
The GET command takes an argument which is the name of the content we want the Web server to return. It returns a long string, which is the Web page content.

For example, if I am connected to cortland, I can get the home page for 134 by sending the string "GET /~cs134/s04/" to the Web server.

The easiest way to get a feeling for protocols is to run a program called telnet from a command line (like a DOS shell, or Mac Terminal). With telnet, we can send commands like the GET line above to the server just by typing them on the keyboard and hitting return. The responses made by the server appear on the terminal.

Here is a sample session using telnet to download a page from a Web server:

First we connect to the Web server. Using telnet we must explicitly state what the port number is.

-> telnet cortland.cs.williams.edu 80

Telnet reports that the connection has been made:

Trying 137.165.8.5...
Connected to cortland.cs.williams.edu.
Escape character is '^]'.

Now we request the home page for cs134:

GET /~cs134/s04/

The server responds with the Web page:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
   "DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<!-- XML file produced from file: 134page.tex
     using Hyperlatex v 2.6 (c) Otfried Cheong
     on Emacs 21.1.1, Wed Nov 19 01:12:59 2003 -->
<head>
<title>Computer Science 134</title>

<style type="text/css">
.maketitle { align : center }
div.abstract { margin-left: 20% }
h3.abstract  { align : center }
div.verse, div.quote, div.quotation {
  margin-left : 10%;  margin-right : 10% }
</style>


</head>
<body BGCOLOR="FFFFFF">




<table><tbody><tr><td colspan="1" align="LEFT">

<img alt="" src="http://www.cs.williams.edu/~tom/courses/CSlogo.gif">
</td><td colspan="1" align="LEFT">
<H1> CS 134<BR> Introduction to Computer Science</H1>

</td><td colspan="1" align="RIGHT">
<ul>
<li><a href="134page_1.html">Instructors</a>
<li><a href="134page_2.html">Lectures and Readings</a>
<li><a href="134page_3.html">Programming Assignments and Laboratories</a>
<li><a href="134page_4.html">Documentation and Handouts</a>
<li><a href="134page_6.html">Exams</a>
<li><a href="134page_7.html">Text</a>
</ul>

</td></tr></tbody></table>

<hr />

<p>Computer Science 134 is an introduction to algorithm development
emphasizing structured, object-oriented design.  Algorithms will be
implemented as programs in the Java programming language.  We will
introduce data structures and recursion as tools to construct correct,
understandable, and efficient algorithms.  These topics will be
developed further in <a href="http://www.cs.williams.edu/ifg/courses/cs136.html">Computer Science 136</a>.  We highly
recommend the combination of Computer Science 134 and Computer Science
136 for those who wish a good introduction to the science of
computing.
<p>This course is a prerequisite for all upper level <a href="http://www.cs.williams.edu/ifg/allcourses.html">Computer
Science courses</a>.  In
Computer Science 134 <em>we do not assume that you have had any
previous computer programming experience</em>.  
If you have had extensive previous experience you might consider
<a href="http://www.cs.williams.edu/ifg/courses/cs136.html">CS 136</a>.  Please discuss this with
any member of the department's faculty
if you feel you fall into this category.
<p><hr />To simplify printing of the information about CS 134 found
on these pages, a single page on which all the 
<a href="full134page.html">information about CS 134</a> is grouped
together is also available.
<hr />
</body></html>
Connection closed by foreign host.

The HTTP protocol automatically closes the connection after returning the Web page. If I wanted another Web page, I would need to create another connection to get the second page.

Talking to a Web server

Now, let's consider how to make an HTTP connection in Java. (Because of Java's URL class, we don't actually need to do this, but it helps to understand how the network connection is created and used.)

We need to first create a connection. To do this we create a Socket passing in the name of the machine and the port number. For example, to connect to cortland's Web server we would say:

Socket tcpSocket = new Socket("cortland.cs.williams.edu", 80);

A socket is a lot like a phone connection. When a phone connection is established, you can actually think of this as two connections bundled together. One of these connections allows you to speak into your mouthpiece and be heard via the earpiece on the other end. The other connection allows the other person to speak into a mouthpiece and for you to hear what they are saying in your earpiece.

A socket is similarly composed of two streams. The client writes on one stream which the server reads from. This stream is used to send commands from the client to the server. The server writes on another stream which the client reads from. The server uses this stream to send the results of commands to the client.

The next step in writing a client in Java is to get the stream to write to and the stream to read from and create a writer and a reader for them. Notice that we need to make slightly different calls to create the reader and writer, but once created we can use them in the same way as earlier streams:

// Get the stream reader and writer to enable communication
input = new BufferedReader(
	   new InputStreamReader(tcpSocket.getInputStream()));
output = new PrintWriter(tcpSocket.getOutputStream(), true);

Now that the connection is established, we can request a Web page by writing the GET command on the output stream. So, to get a Web page, we send the GET command to the server:

output.println("GET " + pageName);
To get the cs134 home page, we would set pageName to "/~cs134/s04/".

Now, we need to read the Web page that the server returns over the network connection. This is a loop in a style similar to what we have seen before:

String response = "";
String curline = input.readLine();
while (curline != null) {
   response = response + "\n" + curline;
   curline = input.readLine();
} 
		
return response;
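
The steps above can be collected into one class. The following is only a sketch: the class name SimpleWebClient and the method names requestPage and fetchPage are hypothetical, and it sends the bare "GET pageName" line exactly as in the telnet session. Many present-day Web servers may expect a longer request (for example a line such as "GET /page HTTP/1.0" followed by a blank line), so this form is not guaranteed to work everywhere. The reading and writing are kept in a method that takes a reader and a writer, so it does not care whether they came from a socket.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

// Hypothetical class assembling the steps above; names are illustrative.
class SimpleWebClient {

    // Sends the GET command on output, then reads lines from input until
    // the server closes the connection. Returns the lines read, each
    // followed by a newline.
    static String requestPage(BufferedReader input, PrintWriter output,
                              String pageName) throws IOException {
        output.println("GET " + pageName);
        output.flush();

        StringBuilder response = new StringBuilder();
        String curline = input.readLine();
        while (curline != null) {
            response.append(curline).append("\n");
            curline = input.readLine();
        }
        return response.toString();
    }

    // Opens a socket to the Web server, fetches one page, and closes the
    // connection. try-with-resources closes the socket even on an error.
    static String fetchPage(String machine, int port, String pageName)
            throws IOException {
        try (Socket tcpSocket = new Socket(machine, port)) {
            BufferedReader input = new BufferedReader(
                new InputStreamReader(tcpSocket.getInputStream()));
            PrintWriter output =
                new PrintWriter(tcpSocket.getOutputStream(), true);
            return requestPage(input, output, pageName);
        }
    }
}
```

A call such as SimpleWebClient.fetchPage("cortland.cs.williams.edu", 80, "/~cs134/s04/") would then correspond to the telnet session shown earlier. The sketch uses a StringBuilder rather than repeated string concatenation, which avoids copying the whole response each time a line is added.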

Talking to a POP server

Talking to other types of servers is done in a similar fashion. In all cases, we create a socket and then extract the input and output streams so that we can send commands and read replies. What differs is the protocol, that is, the list of commands that the server will understand. For each command, we need to decide how to process it.

The default port for a POP server is 110.

Here are the commands provided by a POP server:

USER username
This command tells the POP server which user's mail to read. Instead of username you would identify the account to log into. This command responds with a single line which starts with the string +OK.

PASS password
This command tells the POP server the password corresponding to the previous user account. Instead of password you would provide the user's actual password. This command responds with a single line. If the line begins with +OK the login worked. If the login failed, it returns a line beginning with -ERR.

The remaining commands will only work after a successful login.

STAT
This returns some simple statistics about the mailbox. The line it returns has the following form:
+OK 3 496
The first number reports the number of messages in the mailbox. The second reports the total number of characters in all the messages.

TOP 1 0
This returns the header of the message followed by some number of lines. The first number identifies the message number to return. The second number indicates how many lines of the message body to return. As shown above, the command would return only the header for the 1st message.

This command returns first with a line that begins either with +OK or -ERR. If the line it returns begins with +OK, it then sends multiple lines that are the header and the number of body lines requested. You can tell when it is done sending lines because the last line will consist of a single period character (.).

RETR 1
This command is a lot like TOP except that it returns the entire message requested, both header and body, in their entirety. The number you provide is the message number.

This command returns first with a line that begins either with +OK or -ERR. If the line it returns begins with +OK, it then sends multiple lines that are the header and complete body of the mail message requested. You can tell when it is done sending lines because the last line will consist of a single period character (.).

QUIT
This command ends the connection with the mail server. It always returns with a single line beginning with +OK.

You can try out these commands using telnet to connect to a mail server on port 110. Just type in the commands shown above and see the responses that you get back.
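
The same commands can be sent from Java once a reader and a writer have been created from a socket, just as for the Web server. The sketch below is one possible arrangement, not the lab's SpamFilter code: the class PopSketch and all of its method names are hypothetical. It assumes that the server sends a one-line greeting as soon as the connection opens (as POP servers do) and reads that line first, and it ends each command with a carriage return and line feed, which is what the POP specification asks for.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; none of these names come from the lab code.
class PopSketch {

    // Sends one command and returns the single status line that follows.
    static String sendCommand(BufferedReader input, PrintWriter output,
                              String command) throws IOException {
        output.print(command + "\r\n");   // POP commands end with CR LF
        output.flush();
        return input.readLine();
    }

    // True when a status line reports success.
    static boolean isOk(String statusLine) {
        return statusLine != null && statusLine.startsWith("+OK");
    }

    // Extracts the message count from a STAT reply such as "+OK 3 496".
    static int messageCount(String statLine) {
        String[] parts = statLine.trim().split("\\s+");
        return Integer.parseInt(parts[1]);
    }

    // Reads the lines of a TOP or RETR reply up to, but not including,
    // the line that holds a single period.
    static List<String> readUntilPeriod(BufferedReader input)
            throws IOException {
        List<String> lines = new ArrayList<>();
        String line = input.readLine();
        while (line != null && !line.equals(".")) {
            lines.add(line);
            line = input.readLine();
        }
        return lines;
    }

    // Logs in, checks the mailbox with STAT, and returns the header of
    // message 1. Returns an empty list if any step fails or the mailbox
    // is empty.
    static List<String> firstHeader(BufferedReader input, PrintWriter output,
                                    String user, String password)
            throws IOException {
        List<String> header = new ArrayList<>();
        if (!isOk(input.readLine())) return header;   // greeting line
        if (!isOk(sendCommand(input, output, "USER " + user))) return header;
        if (!isOk(sendCommand(input, output, "PASS " + password))) return header;

        String stat = sendCommand(input, output, "STAT");
        if (isOk(stat) && messageCount(stat) > 0
                && isOk(sendCommand(input, output, "TOP 1 0"))) {
            header = readUntilPeriod(input);
        }
        sendCommand(input, output, "QUIT");
        return header;
    }
}
```

To use this with a real server, the reader and writer would be created from a Socket connected to the mail server on port 110, in the same way as for the Web server.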