January 2012 - Jonas M Palencia

Performing maintenance tasks on VMware hosts

There are times when you need to perform hardware maintenance on VMware hosts (such as adding a new network interface card [NIC]), or when a host simply disconnects from vCenter. In these cases the only way forward is to shut down or reboot the host. To minimize damage, here’s the procedure I use:

1. Run the vSphere Client on your workstation, not on one of the servers: a server might itself be a virtual machine (VM) on the host, and it will go down with it.
2. Using the vSphere Client, connect directly to the VMware host, *not* to the vCenter server.
3. Log in as user root.
4. Shut down all the VMs by right-clicking each VM and selecting Power, then Shut Down Guest. This is faster than logging in to each machine over RDP and shutting it down. VMware Tools must be installed and up to date in the guest, though, or the Shut Down Guest option will be grayed out. If it is grayed out, log in to the VM and shut it down from inside. Performing “Power Off” on the VM should be the last resort.
5. Once all the VMs are powered down, right-click the VMware host and select Enter Maintenance Mode.
6. Go to the console of the VMware host and press Alt-F11 to get the login prompt.
7. Log in as root.
8. Issue the command “shutdown -h now” to power down the host. If you just want to reboot, issue the command “shutdown -r now” instead.
9. Wait until the machine is powered off.
10. Perform the maintenance.
11. Power on the VMware host and watch the screen for problems. VMware’s equivalent of the blue screen is a purple screen; if you see one, something is very wrong.
12. When the VMware host has finished booting, go back to your workstation and connect to the host with the vSphere Client.
13. Right-click the VMware host and select “Exit Maintenance Mode”.
14. Power on all the VMs.

If there are multiple VMware hosts, and vMotion is licensed and enabled (e.g. with an Enterprise license), you can vMotion the VMs to the other hosts and perform the maintenance without shutting them down. When the host is back up, you can vMotion the VMs back to it and repeat the same maintenance on the next host.

Reinstalling a Node on a Scyld Beowulf cluster

This writeup describes how to restore a node to the cluster after the node’s hard disk has been wiped out by a hardware error.

I was prompted to write these instructions because one of the nodes in our cluster failed. After the hardware was replaced, I tried to put the node back into the cluster but was not able to. I followed the documented instructions to no avail. I also posted a message to the Scyld Beowulf mailing list but did not get any response.

Anyway, I was trying to add the node back to the cluster. Using beosetup, the new MAC address was registered as node 0. I partitioned the disk using the beofdisk tool, ran beoboot-install, and then restarted the node. Here’s the output:

# beofdisk -w -n 0

Disk /dev/hda: 4865 cylinders, 255 heads, 63 sectors/track
Old situation:
Units = cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0

   Device Boot  Start     End   #cyls    #blocks   Id  System
/dev/hda1   *      0+      0       1-      8001    89  Unknown
/dev/hda2          1     516     516    4144770    82  Linux swap
/dev/hda3        517    4864    4348   34925310    83  Linux
/dev/hda4          0       -       0          0     0  Empty
New situation:
Units = sectors of 512 bytes, counting from 0

   Device Boot     Start        End   #sectors  Id  System
/dev/hda1   *         63      16064      16002  89  Unknown
/dev/hda2          16065    8305604    8289540  82  Linux swap
/dev/hda3        8305605   78156224   69850620  83  Linux
/dev/hda4              0          -          0   0  Empty
Successfully wrote the new partition table
Re-reading the partition table ...
If you created or changed a DOS partition, /dev/foo7, say, then use dd (1)
to zero the first 512 bytes: dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
The partition table on node 0 has been modified.
You must reboot each affected node for changes to take effect.

# beoboot-install 0 /dev/hda

Creating boot images...
Installing beoboot on partition 1 of /dev/hda.
mke2fs 1.32 (09-Nov-2002)
/dev/hda1: 11/2000 files (0.0% non-contiguous), 268/8001 blocks
Done
rcp: /boot/boot.b: No such file or directory
Failed to copy boot.b to node 0:/tmp/.beoboot-install.mnt
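
The old and new partition listings in the beofdisk output use different units (cylinders in the first, 512-byte sectors in the second), which makes them hard to compare by eye. As a quick sanity check, the sketch below converts the cylinder figures to sectors using the geometry beofdisk reported (255 heads, 63 sectors/track). The helper name cyl_to_sector is my own and not part of any Scyld tool.

```shell
#!/bin/sh
# Sketch: convert cylinder counts from the "Old situation" table into
# 512-byte sectors, to compare with the "New situation" table.
# cyl_to_sector is a hypothetical helper name.
HEADS=255
SECTORS_PER_TRACK=63
SECTORS_PER_CYL=$((HEADS * SECTORS_PER_TRACK))

cyl_to_sector() {
    echo $(( $1 * SECTORS_PER_CYL ))
}

echo "bytes per cylinder: $((SECTORS_PER_CYL * 512))"   # beofdisk reports 8225280
echo "hda2 start sector:  $(cyl_to_sector 1)"           # beofdisk reports 16065
echo "hda2 size, sectors: $(cyl_to_sector 516)"         # beofdisk reports 8289540
echo "hda3 start sector:  $(cyl_to_sector 517)"         # beofdisk reports 8305605
echo "hda3 size, sectors: $(cyl_to_sector 4348)"        # beofdisk reports 69850620
```

The numbers agree, so the partition table itself was written as intended; the failure that follows has a different cause.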

After rebooting, the node came up in an ERROR state in the BeoSetup window. Here’s the log:

node_up: Initializing cluster node 0 at Wed Mar 9 15:44:55 EST 2005.
node_up: Setting system clock from the master.
node_up: Configuring loopback interface.
node_up: Loading device support modules for kernel version 2.4.27-294r0048.Scyldsmp.
setup_fs: Configuring node filesystems using /etc/beowulf/fstab...
setup_fs: Checking /dev/hda2 (type=swap)...
chkswap: /dev/hda2: Unable to find swap-space signature
setup_fs: FSCK failure. (OK for RAM disks)
setup_fs: Mounting /dev/hda2 on swap (type=swap; options=defaults)
swapon: /dev/hda2: Invalid argument
setup_fs: Failed to mount /dev/hda2 on swap (fatal).
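
The chkswap message in this log suggests that /dev/hda2 has the right partition type but no swap header yet: writing a partition table does not initialize the contents of the partitions. As a rough illustration of what such a check looks for, the sketch below tests a device or image file for the Linux swap signature. It assumes a 4096-byte page size and the “SWAPSPACE2” signature in the last 10 bytes of the first page; has_swap_signature is a hypothetical helper, not a Scyld tool, and I have not checked how chkswap itself is implemented.

```shell
#!/bin/sh
# Sketch only: has_swap_signature is a hypothetical helper, not a Scyld tool.
# Assumption: 4096-byte pages, so the 10-byte signature sits at offset 4086.
PAGE_SIZE=4096
SIG_OFFSET=$((PAGE_SIZE - 10))

has_swap_signature() {
    # Read 10 bytes at the signature offset; drop NUL bytes from blank disks.
    sig=$(dd if="$1" bs=1 skip="$SIG_OFFSET" count=10 2>/dev/null | tr -d '\000')
    [ "$sig" = "SWAPSPACE2" ]
}

# Usage: pass a device or image file as the first argument.
if [ -n "${1:-}" ]; then
    if has_swap_signature "$1"; then
        echo "$1: swap signature found"
    else
        echo "$1: no swap signature (mkswap has not been run)"
    fi
fi
```

A freshly partitioned, otherwise blank partition fails this test, which matches the error in the log.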

So, to solve this problem, you have to do two extra steps before rebooting the node. After executing beoboot-install, use bpsh to run mke2fs -j on the data partitions and mkswap on the swap partition, such as:

# bpsh 0 mke2fs -j /dev/hda3
# bpsh 0 mkswap /dev/hda2
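
Putting the whole sequence together, here is a minimal sketch of the order of operations for one node. The beofdisk, beoboot-install, and bpsh invocations are the ones used in this writeup; the NODE, DISK, and DRY_RUN variables and the run and reinstall_node helpers are hypothetical names of my own. Partition numbers 2 (swap) and 3 (data) assume the default layout that beofdisk wrote above. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch: print (or run) the node reinstall sequence from this writeup.
# NODE, DISK, DRY_RUN, run, and reinstall_node are hypothetical names.
NODE="${NODE:-0}"
DISK="${DISK:-/dev/hda}"
DRY_RUN="${DRY_RUN:-1}"    # 1 = only print the commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@" || exit 1     # stop at the first failing step
    fi
}

reinstall_node() {
    run beofdisk -w -n "$NODE"              # write the default partition table
    run beoboot-install "$NODE" "$DISK"     # install beoboot on partition 1
    run bpsh "$NODE" mke2fs -j "${DISK}3"   # create the ext3 data filesystem
    run bpsh "$NODE" mkswap "${DISK}2"      # write the swap signature
}

reinstall_node
```

Run it as is to review the commands, and set DRY_RUN=0 on the master only once the node number and disk are confirmed. Reboot the node after the last step.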

