Archive for January, 2008

n62: good news/bad news

Thursday, January 31st, 2008

Ah… the joys of experimental hardware.

As many of you know, node n62 has been causing us grief. I’ve been trying to run some large MPI test jobs and this one node has been refusing to boot any MPI ACE files.

The good news is that I have a big clue. After testing several “known to work” ACE files (all of which failed), I built the simplest BIT file I could think of. It worked! The only difference was that I used the on-board DDR memory rather than DDR2 in the DIMM slot. Next I built a standalone C system that runs out of BRAM and exercises the DDR2. Here’s the result:

-- Entering main() --

Starting MemoryTest for DDR2_SDRAM_32Mx64:

 Running 32-bit test...FAILED!

 Running 16-bit test...

The good news is that I think the problem is that the DDR2 DIMM is not seated properly.

The bad news: it is the 63rd of 64 nodes. We’ll be turning a lot of screws to gain physical access to this node. We knew this was coming…
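For anyone curious what a memory test like the one above does, here is a minimal sketch in C. It is not the actual test code that produced the output; the function names (data_bus_test_32, mem_fill_32, mem_verify_32) and the address-derived pattern are assumptions chosen for illustration. The idea is to walk a single bit across one word to catch stuck data lines, then write a known pattern over the whole region and read it back.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not the test that printed the log. */

/* Walks a single 1 bit across one 32-bit word to catch stuck or shorted
 * data lines. Returns 0 on success, or the first pattern that failed. */
uint32_t data_bus_test_32(volatile uint32_t *addr)
{
    uint32_t pattern;
    for (pattern = 1; pattern != 0; pattern <<= 1) {
        *addr = pattern;
        if (*addr != pattern)
            return pattern;
    }
    return 0;
}

/* Fills `words` 32-bit words with a pattern derived from the word index. */
void mem_fill_32(volatile uint32_t *base, size_t words, uint32_t seed)
{
    size_t i;
    for (i = 0; i < words; i++)
        base[i] = seed ^ (uint32_t)i;
}

/* Reads the region back and compares against the expected pattern.
 * Returns 0 if every word matches, otherwise 1 + the index of the
 * first mismatching word. */
size_t mem_verify_32(volatile uint32_t *base, size_t words, uint32_t seed)
{
    size_t i;
    for (i = 0; i < words; i++) {
        if (base[i] != (seed ^ (uint32_t)i))
            return i + 1;
    }
    return 0;
}
```

On real hardware, base would point at the DDR2 address range while the code itself runs out of BRAM; a badly seated DIMM would typically show up as a non-zero return from one of these checks.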

Ron

Sixteen Nodes flashed

Tuesday, January 29th, 2008

Hi,

On Monday, I flashed sixteen nodes of the cluster (n00-n15 and n56-n63, basically the bottom eight of each rack) with the latest boot controller daemon, version 1.1. This new daemon is a little more robust and has features needed for the next version of our remote FPGA session control. New features include:

  • console’s default speed is 115,200 baud
  • display shows the hostname (or IP address if hostname is not available or MAC address if IP is not available)
  • optionally shows user who has allocated the node
  • shows the number of blocks of flash in use, total available, and the percentage used
  • lists 4 slots instead of 8 (8 was overkill)
  • now the first two slots are reserved (0 is the bootcontroller, 1 is for a recovery Linux boot)
  • second two slots (2 and 3) are for general use
  • client can now tell the node to halt (shuts down Linux but does not remove power)
  • the protocol also allows the client to erase a slot (no more uploading of blank ACE files :-) )
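A few of the rules in the list above are easy to show in code. The following is only a sketch in C, not the daemon's source: the function names, the constants, and the assumption that clients may write only to the general-use slots are all made up for illustration. It covers the display fallback (hostname, then IP address, then MAC address), the flash usage percentage, and the slot layout.

```c
#include <stddef.h>

/* Hypothetical names for illustration; not taken from the daemon. */
enum { NUM_SLOTS = 4, FIRST_USER_SLOT = 2 };

/* Slots 0 (boot controller) and 1 (recovery Linux) are reserved. */
int slot_is_reserved(int slot)
{
    return slot >= 0 && slot < FIRST_USER_SLOT;
}

/* Assumed policy: only the general-use slots (2 and 3) accept
 * uploads or erase requests from a client. */
int slot_is_writable(int slot)
{
    return slot >= FIRST_USER_SLOT && slot < NUM_SLOTS;
}

/* Picks what the display shows: the hostname if available,
 * else the IP address, else the MAC address. */
const char *display_label(const char *hostname, const char *ip,
                          const char *mac)
{
    if (hostname != NULL && hostname[0] != '\0')
        return hostname;
    if (ip != NULL && ip[0] != '\0')
        return ip;
    return mac;
}

/* Percentage of flash blocks in use, rounded down; 0 if total is 0. */
unsigned flash_percent_used(unsigned long used, unsigned long total)
{
    if (total == 0)
        return 0;
    return (unsigned)((used * 100UL) / total);
}
```

For example, display_label("", "10.0.0.62", "00:11:22:33:44:55") falls through to the IP address, and flash_percent_used(128, 512) gives 25. The addresses here are placeholders, not real node settings.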

The next release will add LCD support and a purely daemon mode so that it can be combined with MPI builds. After that, I think 2.0 should be a complete rewrite to accommodate the growing protocol.

Ron

Weekly Meeting

Thursday, January 24th, 2008

The first weekly meeting will be tomorrow (Friday 1/25) at 12:30pm. Note the new time! We will have the same room (Woodward Hall 244) and the usual teleconference dial-in number, 704-687-8982.

Agenda:

  • discuss proposal to move meetings from Fridays to Mondays
  • discuss upcoming CSE Special Issue … submit?
  • any lab issues?
  • Spring retreat … start discussing goals

Spirit Lives!

Thursday, January 24th, 2008

Hello All!

Yesterday was a great day. On January 23rd around 3:00pm the cluster ran its first MPI program!

The steps we have taken thus far:

  • assembled 64 ML-410 nodes (mainboard, LCD, front panel, chassis) and racked them in two 7-foot 19″ racks … about 3500 screws :-)
  • wired the Gigabit Ethernet (two 48-port Nortel switches)
  • loaded our custom bootstrap code on each flash drive
  • synthesized a minimal base hardware system for each FPGA
  • cross-compiled a root file system from scratch (busybox for most plus various Internet extras, start-up scripts for IP, hostname, date, etc.)
  • cross-compiled an off-the-shelf OpenMPI system and installed on rootfs and locally

I was able to compile and run a simple MPI program that ran on two nodes.

We’re still waiting on a couple of ML-410 boards that have been RMA’d and the latest version of our boot control programs… but basically it works! Thanks go to all the students.