Given the high failure rate, the process generated much distrust and resentment in the user community. Managers knew that this was going to cost them downtime and did their best to avoid having their users' machines JumpStarted. The process was costing the company a lot of money. After the second day of watching this, I realized several things:
config's biggest advantage over add_install_client is that it checks for almost every problem that could prevent a JumpStart from working. This includes errors that prevent the machine from booting correctly via the network and errors that would cause the Custom JumpStart to fail. Specifically, config checks and corrects the appropriate NIS maps and the inetboot boot block in /tftpboot on the appropriate boot server. It checks that the RARP, bootparams and TFTP servers are enabled or running on the appropriate boot server. config checks for rogue bootparams servers and will eventually check for rogue RARP servers. config also ensures that the CD-ROM image is available and exported on the appropriate NFS server. Lastly, the script checks the hardware of the target machine to make sure it matches one of the hardware rules for the Custom JumpStart. (Note to reader: This will all be filled in with more detail in the final paper. In fact, most of it is already written and just needs to be pasted in here.)
config requires root access to the NIS master. This allows it to automatically make NIS map corrections and push the updated maps. Otherwise the operator is responsible for making the map changes. Removing dependencies on the operator typing information correctly is one of the main reasons for developing config.
add_install_client does not have a built-in option for bulk configuration. Automating bulk configuration with add_install_client is difficult because it requires knowledge of the client architecture for each machine. With config I built in support for bulk configuration. The script takes a list of hostnames and attempts to configure all of the machines on the list. config queries each machine for the information it needs, including the architecture. The operator need only create the list of hostnames.
In combination with the scripts mentioned below, config tracks the status of each machine. This is especially useful when using the bulk configuration feature described above. The operator compiles a list of several hundred hostnames and runs them through config. The script will print a report at the end of its run listing which machines failed and the reason for the failure. The operator can then fix the problems and rerun config with the same list of hostnames. config knows which machines are already configured and skips over them. This saves hours of waiting when working with a large number of machines.
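The rerun behavior described above can be sketched as follows. This is a minimal illustration in Python, not the actual config script: the function names, the status values "configured" and "failed", and the configure_one callback are all hypothetical stand-ins for whatever config does internally.

```python
from typing import Callable, Dict, Iterable, Optional


def configure_all(
    hostnames: Iterable[str],
    status: Dict[str, str],
    configure_one: Callable[[str], Optional[str]],
) -> Dict[str, str]:
    """Configure every host that is not already marked as configured.

    configure_one(host) is a hypothetical callback that returns None on
    success or a short failure reason. The status dict is updated in place.
    Returns {host: reason} for the hosts that failed on this run.
    """
    failures: Dict[str, str] = {}
    for host in hostnames:
        if status.get(host) == "configured":
            continue  # finished on an earlier run, so skip it
        reason = configure_one(host)
        if reason is None:
            status[host] = "configured"
        else:
            status[host] = "failed"
            failures[host] = reason
    return failures


def print_report(failures: Dict[str, str]) -> None:
    """Print one line per failed host with the reason for the failure."""
    for host, reason in sorted(failures.items()):
        print(f"{host}: {reason}")
```

Running configure_all a second time with the same list and the same status dict only touches the machines that failed earlier, which is where the time savings come from.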
Another advantage of config is that it automatically picks the boot and NFS servers for a machine when configuring it. This allows the operator to be blissfully ignorant of the network topology and the various servers. With add_install_client, the operator has to know which boot server is on the same subnet as the machine. While usually easy to figure out, it is one more mistake that may be made.
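One way to pick a boot server automatically is to look up the client's address in a table of subnets. The sketch below assumes such a table exists; the subnets and server names are invented for illustration, and the real config script may use a different mechanism entirely.

```python
import ipaddress
from typing import Dict, Optional

# Hypothetical table mapping each subnet to its designated boot server.
BOOT_SERVERS: Dict[str, str] = {
    "192.168.10.0/24": "boot-a",
    "192.168.20.0/24": "boot-b",
}


def pick_boot_server(
    client_ip: str, table: Dict[str, str] = BOOT_SERVERS
) -> Optional[str]:
    """Return the boot server on the client's subnet, or None if none is listed."""
    addr = ipaddress.ip_address(client_ip)
    for subnet, server in table.items():
        if addr in ipaddress.ip_network(subnet):
            return server
    return None
```

A result of None would be reported as a configuration failure, since the client cannot network boot without a boot server on its own subnet.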
config creates a centralized place to make JumpStart configurations. With add_install_client in a large environment, the operator must know which boot server to use for each machine and log into that server to perform JumpStart configuration. With config, the operator always logs into the same server and doesn't need to know which boot server to use. This prevents uninformed operators from setting up more than one boot server on a subnet.
config also introduces a way for users and administrators to tag machines as non-standard. Tagged machines are skipped automatically, which eliminates the need for a failure-prone manual method of keeping track of them.
As you can see, there are many advantages to replacing add_install_client with a more capable script. Hopefully Sun will realize this and improve add_install_client.
The first consideration is boot servers. The protocols that Sun uses to boot via the network are not routable, so a boot server must be connected to each subnet. Almost any sort of machine will work as a boot server, since the total amount of data transferred from the boot server for each machine is approximately 150 KB. If your network design already includes a server on each subnet that has Sun clients, then boot servers are not a concern. If not, you will need to put something together. We used several SPARCstation 5s with quad Ethernet cards, which allowed a single machine to serve up to five subnets.
The next consideration is network bandwidth. As a rough estimate, each workstation will transfer about 500 MB in the course of the JumpStart. The client's network connection is of little concern, as 10BASE-T can easily handle that much data in an hour. However, the bandwidth of the NFS servers' network connections must be considered. It is easy to figure out the required network bandwidth. (Note to reader: I have an example calculation that will be inserted here in the final paper.) Keep in mind that the network traffic of JumpStarting many machines will overload shared Ethernet, so avoid it if possible. Server interfaces on shared Ethernet will only produce about 10% of the rated bandwidth during JumpStarts.
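As a rough illustration of the kind of estimate involved, the sketch below computes how many clients one server interface can feed within a given time window. It is not the calculation promised for the final paper. It assumes 1 MB is 10^6 bytes, ignores protocol overhead, and assumes every client transfers its data evenly across the window; the function and parameter names are made up for this example.

```python
import math


def clients_per_interface(
    link_mbit: float,
    efficiency: float = 1.0,
    image_mb: float = 500.0,
    window_s: float = 3600.0,
) -> int:
    """Estimate how many clients one server interface can feed in the window.

    link_mbit  -- rated link speed in Mbit/s
    efficiency -- fraction of the rated speed actually delivered
    image_mb   -- data transferred per client in MB (about 500 in the text)
    window_s   -- time allowed for the whole batch, in seconds
    """
    total_mbit = link_mbit * efficiency * window_s  # capacity over the window
    per_client_mbit = image_mb * 8.0                # data needed per client
    return math.floor(total_mbit / per_client_mbit)


# A 100 Mbit/s interface at full efficiency over one hour:
print(clients_per_interface(100.0))  # 90
# A dedicated 10 Mbit/s interface over one hour:
print(clients_per_interface(10.0))   # 9
```

Passing efficiency=0.1 models the shared Ethernet case mentioned above, where the same interface delivers only a tenth of its rated bandwidth.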
The last issue, and possibly the hardest to judge, is disk bandwidth. While the bandwidth of the various forms of the SCSI bus is well known, it is more difficult to find the realistic bandwidth of individual drives or RAID volumes. We found that a single drive in a SPARCstation could keep up with a 10BASE-T connection and that a four-disk RAID 0 stripe in an Auspex fileserver was able to keep up with three FDDI (100 Mbit/s) connections. Note that the Auspex caches much more aggressively than a regular Sun fileserver, and JumpStarts are perfect for caching. (Note to reader: I'll research this a little more and get some real numbers.)
My start script also generates web pages with the status of all of the machines indicated by red, yellow or green dots and a status message. The script runs in a loop and checks on the status of the machines periodically as the JumpStarts progress. This allowed the company previously mentioned to scale up to performing several hundred simultaneous JumpStarts. We had several people on hand to deal with the rare problems that cropped up. They would watch the web pages for troubled machines, figure out what the problem was and attempt to fix it. Our record was just over 250 machines in an hour and a half.
The main difficulty we had with the scripts I wrote was the status file that the various scripts used to keep track of the status of all the machines. Similar to the problems that large webservers encounter with their log files, this file turned out to be a bottleneck in the middle of JumpStarting a large number of machines. The multiple forks of the start script were in contention with each other in trying to update the status file. In the future I will split the status file so that there is a separate file for each machine. This will eliminate the contention problems within start. It will also allow multiple operators to configure machines at the same time and not have to wait for access to the monolithic status file.
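The per-machine layout could look something like the sketch below. It is an illustration of the idea under assumed conventions, not the planned implementation: the directory layout, the tab-separated "state, message" format, and the function names are all hypothetical. Each host gets its own file, and each update is written to a temporary file and then renamed into place, so writers for different hosts never wait on each other and readers never see a half-written file.

```python
import os
import tempfile
from pathlib import Path
from typing import Dict, Tuple


def write_status(status_dir: str, host: str, state: str, message: str = "") -> None:
    """Atomically replace the status file for a single host."""
    directory = Path(status_dir)
    directory.mkdir(parents=True, exist_ok=True)
    # Write to a hidden temporary file in the same directory, then rename it.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=f".{host}.")
    with os.fdopen(fd, "w") as f:
        f.write(f"{state}\t{message}\n")
    os.replace(tmp, directory / host)


def read_all(status_dir: str) -> Dict[str, Tuple[str, str]]:
    """Collect {host: (state, message)} from every per-host status file."""
    result: Dict[str, Tuple[str, str]] = {}
    for path in Path(status_dir).iterdir():
        if path.name.startswith("."):
            continue  # skip temporary files that are still being written
        state, _, message = path.read_text().rstrip("\n").partition("\t")
        result[path.name] = (state, message)
    return result
```

A status page generator would call read_all on each pass through its loop, while each fork of the start script calls write_status only for the machine it is handling.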
Another change that will improve the system is to split the start script into a true start script and a script that keeps track of the status of the JumpStart on each machine. These two scripts will run simultaneously during a rollout. On the other hand, helpdesk personnel or an operator could use just the basic start script to remotely JumpStart a single machine. This was done at the company after I left and made the rollouts proceed more smoothly.
Earlier I mentioned that we check for rogue bootparams servers but not rogue RARP servers. This is because the rpcinfo program provides a feature that makes it easy to check for bootparams servers on a subnet. If given an RPC service number and the -b flag, rpcinfo will broadcast a request on the subnet for that service and report all of the servers that respond. Any servers other than our designated boot server can be flagged as needing to be disabled. Unfortunately there isn't a similar program to check for rogue RARP servers. Eventually, I will write a program to do this.
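The bootparams check can be sketched as below. The sketch assumes the bootparam RPC program number 100026 (as listed in /etc/rpc) and version 1, and it assumes that each line of rpcinfo -b output consists of an address followed by a hostname. The exact output format varies between systems, so the parsing should be treated as an assumption to verify locally. The function names are hypothetical.

```python
import subprocess
from typing import Iterable, List, Set

BOOTPARAM_PROG = "100026"  # bootparam program number, assumed from /etc/rpc
BOOTPARAM_VERS = "1"       # assumed protocol version


def parse_responders(output: str) -> Set[str]:
    """Extract hostnames from rpcinfo -b output (assumed 'address hostname' lines)."""
    hosts: Set[str] = set()
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            hosts.add(fields[1])
    return hosts


def find_rogues(output: str, allowed: Iterable[str]) -> List[str]:
    """Return responders that are not designated boot servers, sorted by name."""
    return sorted(parse_responders(output) - set(allowed))


def broadcast_bootparams() -> str:
    """Run rpcinfo -b on the local subnet and return its raw output."""
    result = subprocess.run(
        ["rpcinfo", "-b", BOOTPARAM_PROG, BOOTPARAM_VERS],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

Calling find_rogues(broadcast_bootparams(), ["boot-a"]) on a boot server would list every other machine on that subnet answering bootparams requests, where boot-a stands in for the designated server's name.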
The greatest improvement that I have planned is to switch to multicasting to distribute the disk images to the workstations. Each machine of a given architecture gets an identical image, thus this is a prime candidate for multicasting. (Note to reader: This will be one of my major research areas in preparing the final paper. I hope to include detailed instructions on using multicasting for this purpose or details on why it is not currently possible.)
[Zub99] Zuberi, Asim. "Jumpstart in a Nutshell." Inside Solaris, February 1999, pp. 7-10.