FLATLAND96: ADAM/NETWORK Messages: 32 Entries

Entry  Date             Title                                      Site   Author            #Graphics
26     Thu 13-Jun-1996  ADAM <-> FreeWave connection               ASTER  Maclean, Gordon
145    Sun 30-Jun-1996  Boulder connection down, back up           none   Semmer, Steve
161    Mon 01-Jul-1996  Cosmos up                                  ASTER  Semmer, Steve
165    Mon 01-Jul-1996  Krypton running on cosmos                  1      Semmer, Steve
168    Tue 02-Jul-1996  Cosmos re-booted                           1      Semmer, Steve
177    Wed 03-Jul-1996  cosmos went down at ~18:00 gmt             1      Semmer, Steve
199    Fri 05-Jul-1996  Cosmos went down                           1      Semmer, Steve
208    Mon 08-Jul-1996  Cosmos restarted                           1      Semmer, Steve
212    Mon 08-Jul-1996  One more cosmos reboot                     ASTER  Maclean, Gordon
221    Wed 10-Jul-1996  Adam slip network jambs                    ASTER  Maclean, Gordon   1
223    Wed 10-Jul-1996  xstrip of adam network jambs               ASTER  Maclean, Gordon   1
225    Wed 10-Jul-1996  Cosmos down due to power failure!          1      Semmer, Steve
233    Thu 11-Jul-1996  Cosmos down                                1      Semmer, Steve
248    Sat 13-Jul-1996  Cosmos is back up                          1      Semmer, Steve
284    Wed 17-Jul-1996  Datel voltage source attached to ADAM      1      Semmer, Steve
295    Fri 19-Jul-1996  Cosmos down                                1      Semmer, Steve
297    Fri 19-Jul-1996  Cosmos rebooted, cycled power              1      Semmer, Steve
314    Sat 20-Jul-1996  Cosmos down for RF test                    1      Semmer, Steve
316    Sat 20-Jul-1996  Cosmos back on the air                     1      Semmer, Steve
329    Tue 23-Jul-1996  cosmos died last night                     1      Oncley, Steve
336    Wed 24-Jul-1996  cosmos crashes                             1      Oncley, Steve
337    Thu 25-Jul-1996  cosmos ingest times out again              1      Martin, Charlie
338    Thu 25-Jul-1996  cosmos ingestor ran all night              1      Martin, Charlie
341    Thu 25-Jul-1996  another cosmos crash                       1      Oncley, Steve
353    Thu 25-Jul-1996  pattern in recent cosmos ingest shutdowns  1      Martin, Charlie
355    Fri 26-Jul-1996  cosmos ran all last night!                 1      Oncley, Steve
364    Sat 27-Jul-1996  cosmos crash                               1      Oncley, Steve
374    Sun 28-Jul-1996  cosmos still up                            1      Oncley, Steve
395    Thu 01-Aug-1996  added Tony's ozone channels                1      Oncley, Steve
419    Mon 05-Aug-1996  Cosmos yoyo up and down                    ASTER  Michaelis, Matt
428    Wed 07-Aug-1996  Changes to ingest & adam code              ASTER  Maclean, Gordon
480    Fri 23-Aug-1996  cosmos is down and out                     2      Knudson, Kurt


26: ADAM/NETWORK, Site ASTER, Thu 13-Jun-1996 16:49:02 GMT, ADAM <-> FreeWave connection
        Adam <--> FreeWave connections

Signal  50-cond. ribbon     Serial A 9-pin      FreeWave
        from matrix         on adam             DB9

RDin    5                   1                   2
TDout   3                   2                   3
GND     13                  4                   5
RTSout  7                   3                   7
CTSin   9                   5                   8

Both FreeWaves should be configured for 38400 baud, point-to-point.
The ASTER FreeWave is to be connected to serial port B.

145: ADAM/NETWORK, Site none, Sun 30-Jun-1996 14:07:53 GMT, Boulder connection down, back up
  The Boulder connection went down sometime this morning.
ISS was also down, so I assume it had something to
do with the local network link. The system was back up at
13:55 GMT.

161: ADAM/NETWORK, Site ASTER, Mon 01-Jul-1996 20:57:27 GMT, Cosmos up
  Cosmos is now running via FreeWave link.


Also there are 3 eggs in the nest at the base
of the PAM tripod.

165: ADAM/NETWORK, Site 1, Mon 01-Jul-1996 23:29:57 GMT, Krypton running on cosmos
  The krypton is running on cosmos.

168: ADAM/NETWORK, Site 1, Tue 02-Jul-1996 16:13:43 GMT, Cosmos re-booted
  Cosmos was re-booted to correct a channel_config error.

177: ADAM/NETWORK, Site 1, Wed 03-Jul-1996 19:36:12 GMT, cosmos went down at ~18:00 gmt
  Noticed cosmos had gone down. I tried an mxreset from base.
This was unsuccessful, so I am heading to the corn fields.

  A reset on cosmos got it going again. At this time the reason for
the crash is unknown. Cosmos was back up around 20:15 GMT.

199: ADAM/NETWORK, Site 1, Fri 05-Jul-1996 21:40:36 GMT, Cosmos went down
  Cosmos went down after Gordon loaded some new code.
Gordon is looking at the code and will modify it.

208: ADAM/NETWORK, Site 1, Mon 08-Jul-1996 16:38:02 GMT, Cosmos restarted
  Cosmos went down for a short period starting at 16:28 GMT.
Based on the console log, it was done by Gordon to download
new software code. It was back up at 16:33 GMT.

212: ADAM/NETWORK, Site ASTER, Mon 08-Jul-1996 23:42:20 GMT, One more cosmos reboot
The latest code was too verbose with error messages from the adam
to the console.  The newest sync will only log an error every 100
lost samples, not every 100 blocked writes, as before.
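
As a rough illustration of logging once per 100 lost samples, here is a
minimal Python sketch. The class and method names are hypothetical and are
not taken from the sync code, which is not shown in this logbook.

```python
class LostSampleLogger:
    """Record one message per `every` lost samples (hypothetical helper)."""

    def __init__(self, every=100):
        self.every = every      # samples lost between log messages
        self.lost = 0           # running count of lost samples
        self.messages = []      # stands in for the console log

    def sample_lost(self):
        self.lost += 1
        # Only every `every`-th lost sample produces a console message.
        if self.lost % self.every == 0:
            self.messages.append(f"{self.lost} samples lost so far")


logger = LostSampleLogger(every=100)
for _ in range(250):
    logger.sample_lost()
print(logger.messages)
```

Counting lost samples rather than blocked writes ties the message rate to
actual data loss, which is the change the entry describes.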

The size of the output sample buffer was increased from 8192 to 16384 bytes.
The write threshold is 4096 bytes, meaning writes of 4096 bytes
are attempted when 4096 bytes are available (or 10 seconds have
passed since the last write).
Samples are not discarded unless the buffer would overflow, at 16384 bytes.
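
The buffering rule above can be sketched in Python as follows. This is an
illustration under stated assumptions, not the sync code itself: the class
name, the `send` callback, and the choice to write whatever is available
(up to 4096 bytes) once 10 seconds have passed are all assumptions.

```python
import time

BUFFER_SIZE = 16384      # bytes; buffer capacity from the entry
WRITE_THRESHOLD = 4096   # bytes; size of each attempted write
MAX_WAIT = 10.0          # seconds since the last write before writing anyway


class SampleBuffer:
    def __init__(self, send, clock=time.monotonic):
        self.send = send            # hypothetical: callable(bytes) -> True if accepted
        self.clock = clock
        self.data = bytearray()
        self.last_write = clock()
        self.lost = 0               # samples discarded on overflow

    def add(self, sample):
        if len(self.data) + len(sample) > BUFFER_SIZE:
            self.lost += 1          # discard only when the buffer would overflow
        else:
            self.data += sample
        self.write_if_due()

    def write_if_due(self):
        timed_out = self.clock() - self.last_write >= MAX_WAIT
        if len(self.data) >= WRITE_THRESHOLD or (timed_out and self.data):
            chunk = bytes(self.data[:WRITE_THRESHOLD])
            if self.send(chunk):    # a blocked write leaves the data buffered
                del self.data[:len(chunk)]
                self.last_write = self.clock()


# Usage: a link that blocks every write eventually forces samples to be lost.
blocked = SampleBuffer(send=lambda chunk: False, clock=lambda: 0.0)
for _ in range(200):
    blocked.add(b"x" * 100)
print(blocked.lost, len(blocked.data))
```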

221: ADAM/NETWORK, Site ASTER, Wed 10-Jul-1996 00:06:12 GMT, Adam slip network jambs
Two new data variables are now available from the adam:
	blocked.cosmos		# of blocked socket writes/second
	lost.cosmos		# of lost samples/second

These can be displayed by cockpit and xstrip.  They are also
averaged in the covar files.

From the attached plot, you can see that network jams are occurring about
every 5 minutes, though not exactly. They occur in groups of 6,
with five spikes spaced 4 min 40 seconds apart, and then a longer gap of
about 6 minutes 30 seconds, so that the entire group takes almost exactly
30 minutes.
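
A quick check of that arithmetic in Python, assuming the spacing means five
4 min 40 s gaps between the six jams of a group followed by one 6 min 30 s
gap before the next group (this reading is an assumption):

```python
from datetime import timedelta

short_gap = timedelta(minutes=4, seconds=40)
long_gap = timedelta(minutes=6, seconds=30)

# Assumed reading: 6 jams per group -> 5 short gaps, then 1 long gap.
period = 5 * short_gap + long_gap
print(period)                        # 0:29:50

# Mean spacing between jams over one full group, in seconds.
print(period.total_seconds() / 6)
```

The group period comes out 10 seconds short of 30 minutes, consistent with
the "about 6 minutes 30 seconds" estimate for the long gap.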

I made this plot from Splus:
	> fun.plot.prep(c("blocked.cosmos","lost.cosmos"),1996,191)

(Day 191 is Jul 9).
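
The day-of-year conversion can be confirmed with Python's standard library
(the helper name is only for illustration):

```python
from datetime import date, timedelta


def from_day_of_year(year, doy):
    """Return the calendar date for a 1-based day of year."""
    return date(year, 1, 1) + timedelta(days=doy - 1)


print(from_day_of_year(1996, 191))            # 1996-07-09
print(date(1996, 7, 9).timetuple().tm_yday)   # 191
```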

Since they are not spaced exactly 5 minutes apart, they do not appear
to be timed with the pam polling process.  To further exclude pam
as a suspect, John suggested that I shut down eve_rf on cocklebur.
I shut it down from 00:21 to 00:34 on Jul 10.  The blockages
still occurred, so pam is off the hook.

Could it be a profiler?

A useful diagnostic is to display "blocked.cosmos" with xstrip, and set
the options->chartwidth to 3000, which results in a grid line every 10
minutes.

223: ADAM/NETWORK, Site ASTER, Wed 10-Jul-1996 15:02:05 GMT, xstrip of adam network jambs
Here is a window dump of an xstrip plot of blocked.cosmos and lost.cosmos

Press "Graphics Viewer" to see it.

225: ADAM/NETWORK, Site 1, Wed 10-Jul-1996 18:15:31 GMT, Cosmos down due to power failure!
  Cosmos will be down for a while due to a large
mechanical rodent chewing up the power cable; i.e.,
a farmer decided to cut the hay to the north of the
site, not knowing about the power cable. The cable
was cut in two places.
  Pam should stay alive since the battery is fully
charged. A backup battery will be put on charge in
case repairs take a while.

233: ADAM/NETWORK, Site 1, Thu 11-Jul-1996 01:30:58 GMT, Cosmos down
  Cosmos is down for the night. It appears to have a
backplane problem. Gordon has been informed of the problem.
We should have a spare set of ADAM boards sent out. Will
try things in the morning.

248: ADAM/NETWORK, Site 1, Sat 13-Jul-1996 15:33:05 GMT, Cosmos is back up
  The matrix card had to be replaced. Cosmos is
looking good now.

284: ADAM/NETWORK, Site 1, Wed 17-Jul-1996 19:29:31 GMT, Datel voltage source attached to ADAM
  The Datel voltage source is attached to analog channel 0 on
cosmos. It is being used to check out the analog channels for the
ozone system.

295: ADAM/NETWORK, Site 1, Fri 19-Jul-1996 15:28:58 GMT, Cosmos down
  Cosmos went down at 14:34 GMT. mxreset would not
get it going again and cosmos does not respond to
ping. Will do a hard reset.

297: ADAM/NETWORK, Site 1, Fri 19-Jul-1996 18:19:45 GMT, Cosmos rebooted, cycled power
We also reset EVE to try to clear the Tc spiking.

314: ADAM/NETWORK, Site 1, Sat 20-Jul-1996 17:55:26 GMT, Cosmos down for RF test
  Cosmos was shut down at 17:55 GMT to see if the
FreeWave communication is the source of profiler
interference.

316: ADAM/NETWORK, Site 1, Sat 20-Jul-1996 19:42:57 GMT, Cosmos back on the air
  Cosmos is back up. During the bootup period there was no
noise seen on the profiler. Will monitor the profiler.

329: ADAM/NETWORK, Site 1, Tue 23-Jul-1996 15:21:25 GMT, cosmos died last night
Cosmos died about 0745 last night with "ingest socket write to aster.8000"
errors.  In the confusion of introducing Matt, Greg arriving, and Lou calling,
I neglected to check_aster or look at the log file to isolate the problem.
We rebooted it about 40 minutes ago and it seems to be running again.

336: ADAM/NETWORK, Site 1, Wed 24-Jul-1996 18:13:35 GMT, cosmos crashes
We came in this morning to find that cosmos had crashed again, with network 
error messages in the "asterlog" file which were similar to those from 
previous crashes.  I rebooted (with the key, since it didn't ping), and
it crashed again in about 2 hours.  We have taken the following steps:
1. "niced" the pam ppf2netcdf process, which may give the network handling
higher relative priority
2. removed a lot of the error messages in "sync.cc" to avoid the network
being "message bombed" when a crash starts.  We have noticed that once
cosmos has one sample that doesn't get through, a crash is imminent.
3. We looked at the statistics on the freewave in the trailer which said
that only 57% of messages were received.  However, we don't trust this
number, since Matt got 9% when he repeated it, and 0% on the pam modem.

After discussions with Steve S., there are many more options - some harder
than others - depending on what the source of these crashes is.  At this
point, we don't know if the problem is aster being overloaded, the RF link
not being reliable (or being interfered with) or a power glitch (our only
guess as to why it is worse now that Greg is here).

P.S. it crashed again while I was typing this!

337: ADAM/NETWORK, Site 1, Thu 25-Jul-1996 01:01:22 GMT, cosmos ingest times out again
I see that cosmos ingest stopped again; cosmos did not produce
the EPIPE errors, and did not "self mxreset". I have done an
mxreset to try to bring it back.

338: ADAM/NETWORK, Site 1, Thu 25-Jul-1996 14:57:44 GMT, cosmos ingestor ran all night
The cosmos ingestor ran happily all night; about 14 hours so
far. Who knows what is causing it to intermittently time out
on the polling?

341: ADAM/NETWORK, Site 1, Thu 25-Jul-1996 15:37:14 GMT, another cosmos crash
This time, cosmos self-recovered.  Wonder why it didn't last night?

353: ADAM/NETWORK, Site 1, Thu 25-Jul-1996 23:11:48 GMT, pattern in recent cosmos ingest shutdowns
I noticed, by grepping on
EOF in asterlog, that the cosmos
problems are mostly between 15Z and 0Z.
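
A tally like this could be scripted along the following lines. The layout
of asterlog is not shown in this logbook, so the timestamp format assumed
here ("<Mon> <DD> <HH>:<MM>:<SS> <message>", times in GMT) and the sample
lines are invented placeholders, not real log output.

```python
import re
from collections import Counter

# Assumed (hypothetical) timestamp layout: first HH:MM:SS field on the line.
TIME_RE = re.compile(r"\b(\d{2}):\d{2}:\d{2}\b")


def eof_hours(lines):
    """Count lines containing 'EOF', keyed by hour of day."""
    hours = Counter()
    for line in lines:
        if "EOF" not in line:
            continue
        match = TIME_RE.search(line)
        if match:
            hours[int(match.group(1))] += 1
    return hours


# Invented placeholder lines, used only to exercise the function.
sample = [
    "Jul 25 15:37:14 placeholder: EOF",
    "Jul 25 15:50:02 placeholder: EOF",
    "Jul 25 23:11:48 placeholder: EOF",
    "Jul 26 03:00:00 placeholder: write ok",
]
print(sorted(eof_hours(sample).items()))   # [(15, 2), (23, 1)]
```

Against the real file, the lines would come from something like
`open("asterlog")`, with the regular expression adjusted to the actual
timestamp format.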

It has successfully reset itself several
times this afternoon, so it looks like
the auto mxreset is working okay
right now.

355: ADAM/NETWORK, Site 1, Fri 26-Jul-1996 13:31:31 GMT, cosmos ran all last night!

364: ADAM/NETWORK, Site 1, Sat 27-Jul-1996 14:32:48 GMT, cosmos crash
cosmos ran flawlessly last night, and then crashed this morning (while I
was at the site).  Apparently, the antenna switch yesterday didn't solve
the problem.  It rebooted by itself as usual.

374: ADAM/NETWORK, Site 1, Sun 28-Jul-1996 14:12:18 GMT, cosmos still up
...but it lost a few samples just as we walked in now.

395: ADAM/NETWORK, Site 1, Thu 01-Aug-1996 00:55:04 GMT, added Tony's ozone channels
We've just added Tony's ozone instrument (conc.O3.secl.3m) and turned on
an additional channel for the CU instrument if it ever comes.  I just
rebooted cosmos to make the change.  (mxreset killed it, so the key was
necessary).

Note that the tower was down for about 2 hours (5pm to 7pm local) while
we did that.  This was mostly due to prop cabling problems - see next
comment.

419: ADAM/NETWORK, Site ASTER, Mon 05-Aug-1996 21:45:14 GMT, Cosmos yoyo up and down
Cosmos has been up and down several times today.  Latest outage was at 21:30.
Had to go out and reset the ADAM.  Other outages rebooted automatically.

428: ADAM/NETWORK, Site ASTER, Wed 07-Aug-1996 19:05:44 GMT, Changes to ingest & adam code
On Monday, August 5th these changes were made to the aster system:

ingest: increased the no-activity timeout from 2 minutes to
	 5 minutes	
sync code on matrix: increased the sample buffer from 16*4096 to
			24 * 4096 bytes.

Ingest was rebuilt, installed and restarted.  The matrix code
was rebuilt.  Since the adam was conveniently crashing every hour or so,
I just let it load the new code and spawn a new ingest on its next reboot,
which happened at 21:36 on Aug 5th.

It has been up since then, so perhaps these changes helped.

480: ADAM/NETWORK, Site 2, Fri 23-Aug-1996 14:35:13 GMT, cosmos is down and out
cosmos was shut down at approx. 14:25; the freewaves are going down imminently,
and everything is close to shutdown. pam 1 will continue to transmit via GOES
until station teardown.