| Graphics file: X Window Dump |
Two new data variables are now available from the adam:
blocked.cosmos # of blocked socket writes/second
lost.cosmos # of lost samples/second
These can be displayed by cockpit and xstrip. They are also
averaged in the covar files.
From the attached plot, you can see that network jams are occurring about every
5 minutes. The spacing is not exactly 5 minutes, however. They occur in groups of 6,
with five spikes spaced 4 min 40 seconds apart, and then a longer gap of
about 6 minutes 30 seconds, so that the entire group takes almost exactly
30 minutes.
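As a rough check on those numbers, here is a minimal Python sketch of the spike timing. It assumes the description means five short intervals between the six spikes of a group, followed by one long gap before the next group starts; the function and constant names are made up for illustration.

```python
# Spacing of the blockage spikes as read off the plot (values quoted in the text).
SHORT_GAP_S = 4 * 60 + 40   # 4 min 40 s between spikes within a group
LONG_GAP_S = 6 * 60 + 30    # longer gap that closes each group
SPIKES_PER_GROUP = 6        # ASSUMPTION: 6 spikes means 5 short gaps + 1 long gap


def group_period_s(short_gap_s=SHORT_GAP_S, long_gap_s=LONG_GAP_S,
                   spikes=SPIKES_PER_GROUP):
    """Seconds from the first spike of one group to the first spike of the next."""
    return (spikes - 1) * short_gap_s + long_gap_s


def spike_offsets_s(n_groups=1):
    """Offsets in seconds of every spike, measured from the first spike."""
    period = group_period_s()
    return [g * period + i * SHORT_GAP_S
            for g in range(n_groups)
            for i in range(SPIKES_PER_GROUP)]


period = group_period_s()
print(divmod(period, 60))          # (29, 50), i.e. 29 min 50 s per group
print(period / SPIKES_PER_GROUP)   # mean spacing in seconds, just under 300
```

Under that reading a group lasts 29 min 50 s and the mean spacing is a little under 5 minutes, which matches the "every 5 minutes" impression from the plot.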
I made this plot from Splus:
> fun.plot.prep(c("blocked.cosmos","lost.cosmos"),1996,191)
(Day 191 is Jul 9).
Since they are not spaced exactly 5 minutes apart, it does not appear
to be timed with the pam polling process. To further exclude pam
as a suspect, John suggested that I shut down eve_rf on cocklebur.
I shut it down from 00:21 to 00:34 on Jul 10. The blockages
still occurred, so pam is off the hook.
Could it be a profiler?
A useful diagnostic is to display "blocked.cosmos" with xstrip, and set
the options->chartwidth to 3000, which results in a grid line every 10
minutes.
223: ADAM/NETWORK, Site ASTER, Wed 10-Jul-1996 15:02:05 GMT, xstrip of adam network jams
| Graphics file: X Window Dump |
Here is a window dump of an xstrip plot of blocked.cosmos and lost.cosmos.
Press "Graphics Viewer" to see it.
- 225: ADAM/NETWORK, Site 1, Wed 10-Jul-1996 18:15:31 GMT, Cosmos down due to power failure!
Cosmos will be down for a while due to a large
mechanical rodent chewing up the power cable; i.e.,
a farmer decided to cut the hay to the north of the
site not knowing about the power cable. The cable
was cut in two places.
Pam should stay alive since the battery is fully
charged. A backup battery will be put on charge in
case repairs take a while.
- 233: ADAM/NETWORK, Site 1, Thu 11-Jul-1996 01:30:58 GMT, Cosmos down
Cosmos is down for the night. It appears to have a
backplane problem. Gordon has been informed of the problem.
We should have a spare set of ADAM boards sent out. Will
try things in the morning.
- 248: ADAM/NETWORK, Site 1, Sat 13-Jul-1996 15:33:05 GMT, Cosmos is back up
The matrix card had to be replaced. Cosmos is
looking good now.
- 284: ADAM/NETWORK, Site 1, Wed 17-Jul-1996 19:29:31 GMT, Datel voltage source attached to ADAM
The Datel voltage source is attached to analog channel 0 on
cosmos. It is being used to check out the analog channels for the
ozone system.
- 295: ADAM/NETWORK, Site 1, Fri 19-Jul-1996 15:28:58 GMT, Cosmos down
Cosmos went down at 14:34 GMT. mxreset would not
get it going again and cosmos does not respond to
ping. Will do a hard reset.
- 297: ADAM/NETWORK, Site 1, Fri 19-Jul-1996 18:19:45 GMT, Cosmos rebooted, cycled power
We also reset EVE to try to reset the Tc spiking.
- 314: ADAM/NETWORK, Site 1, Sat 20-Jul-1996 17:55:26 GMT, Cosmos down for RF test
Cosmos was shut down at 17:55 GMT to see if the
FreeWave communication is the source of profiler
interference.
- 316: ADAM/NETWORK, Site 1, Sat 20-Jul-1996 19:42:57 GMT, Cosmos back on the air
Cosmos is back up. During the bootup period there was no
noise seen on the profiler. Will monitor the profiler.
- 329: ADAM/NETWORK, Site 1, Tue 23-Jul-1996 15:21:25 GMT, cosmos died last night
Cosmos died about 0745 last night with "ingest socket write to aster.8000"
errors. In the confusion of introducing Matt, Greg arriving, and Lou calling,
I neglected to check_aster or look at the log file to isolate the problem.
We rebooted it about 40 minutes ago and it seems to be running again.
- 336: ADAM/NETWORK, Site 1, Wed 24-Jul-1996 18:13:35 GMT, cosmos crashes
We came in this morning to find that cosmos had crashed again, with network
error messages in the "asterlog" file which were similar to those from
previous crashes. I rebooted (with the key, since it didn't ping), and
it crashed again in about 2 hours. We have taken the following steps:
1. "niced" the pam ppf2netcdf process, which may give the network handling
higher relative priority
2. removed a lot of the error messages in "sync.cc" to avoid the network
being "message bombed" when a crash starts. We have noticed that once
cosmos has one sample that doesn't get through, a crash is imminent.
3. We looked at the statistics on the freewave in the trailer which said
that only 57% of messages were received. However, we don't trust this
number, since Matt got 9% when he repeated it, and 0% on the pam modem.
After discussions with Steve S., there are many more options - some harder
than others - depending on what the source of these crashes is. At this
point, we don't know if the problem is aster being overloaded, the RF link
not being reliable (or being interfered with) or a power glitch (our only
guess as to why it is worse now that Greg is here).
P.S. it crashed again while I was typing this!
- 337: ADAM/NETWORK, Site 1, Thu 25-Jul-1996 01:01:22 GMT, cosmos ingest times out again
I see that cosmos ingest stopped again; cosmos did not produce
the EPIPE errors, and did not "self mxreset". I have done an
mxreset to try to bring it back.
- 338: ADAM/NETWORK, Site 1, Thu 25-Jul-1996 14:57:44 GMT, cosmos ingestor ran all night
The cosmos ingestor ran happily all night; about 14 hours so
far. Who knows what is causing it to intermittently time out
on the polling?
- 341: ADAM/NETWORK, Site 1, Thu 25-Jul-1996 15:37:14 GMT, another cosmos crash
This time, cosmos self-recovered. Wonder why it didn't last night?
- 353: ADAM/NETWORK, Site 1, Thu 25-Jul-1996 23:11:48 GMT, pattern in recent cosmos ingest shutdowns
I noticed, by grepping on
EOF in asterlog, that the cosmos
problems are mostly between 15Z and 0Z.
It has successfully reset itself several
times this afternoon, so it looks like
the auto mxreset is working okay
right now.
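For anyone who wants to repeat the time-of-day check, here is a minimal Python sketch of the same idea as the grep: count the lines that mention EOF by GMT hour. The asterlog line format is not given here, so the sketch assumes each line carries a time written as HH:MM:SS; the sample lines are made up for illustration and are not real asterlog output.

```python
import re
from collections import Counter

# ASSUMPTION: each log line contains a GMT time of day written as HH:MM:SS.
TIME_RE = re.compile(r"\b(\d{2}):(\d{2}):(\d{2})\b")


def eof_hours(lines, keyword="EOF"):
    """Count lines containing `keyword`, keyed by GMT hour of day."""
    counts = Counter()
    for line in lines:
        if keyword not in line:
            continue
        match = TIME_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def fraction_in_window(counts, start_hour=15, end_hour=24):
    """Fraction of counted events with start_hour <= hour < end_hour."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    inside = sum(n for hour, n in counts.items()
                 if start_hour <= hour < end_hour)
    return inside / total


# Made-up lines in a hypothetical format, only to exercise the functions.
sample = [
    "Jul 25 15:37:14 ingest: EOF on cosmos socket",
    "Jul 25 18:02:51 ingest: EOF on cosmos socket",
    "Jul 25 03:10:00 ingest: EOF on cosmos socket",
    "Jul 25 19:00:00 ingest: started",
]
counts = eof_hours(sample)
print(sorted(counts.items()))
print(fraction_in_window(counts))
```

With the real file, `eof_hours(open("asterlog"))` would give the per-hour counts, provided the timestamp assumption holds.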
- 355: ADAM/NETWORK, Site 1, Fri 26-Jul-1996 13:31:31 GMT, cosmos ran all last night!
- 364: ADAM/NETWORK, Site 1, Sat 27-Jul-1996 14:32:48 GMT, cosmos crash
cosmos ran flawlessly last night, and then crashed this morning (while I
was at the site). Apparently, the antenna switch yesterday didn't solve
the problem. It rebooted by itself as usual.
- 374: ADAM/NETWORK, Site 1, Sun 28-Jul-1996 14:12:18 GMT, cosmos still up
...but it lost a few samples just as we walked in now.
- 395: ADAM/NETWORK, Site 1, Thu 01-Aug-1996 00:55:04 GMT, added Tony's ozone channels
We've just added Tony's ozone instrument (conc.O3.secl.3m) and turned on
an additional channel for the CU instrument if it ever comes. I just
rebooted cosmos to make the change. (mxreset killed it, so the key was
necessary).
Note that the tower was down for about 2 hours (5pm to 7pm local) while
we did that. This was mostly due to prop cabling problems - see next
comment.
- 419: ADAM/NETWORK, Site ASTER, Mon 05-Aug-1996 21:45:14 GMT, Cosmos yoyo up and down
Cosmos has been up and down several times today. Latest outage was at 21:30.
Had to go out and reset the ADAM. Other outages rebooted automatically.
- 428: ADAM/NETWORK, Site ASTER, Wed 07-Aug-1996 19:05:44 GMT, Changes to ingest & adam code
On Monday, August 5th these changes were made to the aster system:
ingest: increased the no-activity timeout from 2 minutes to
5 minutes
sync code on matrix: increased the sample buffer from 16*4096 to
24 * 4096 bytes.
Ingest was rebuilt, installed and restarted. The matrix code
was rebuilt. Since the adam was conveniently crashing every hour or so
I just let it load the new code and spawn a new ingest on its next reboot,
which happened at 21:36 on aug 5th.
It has been up since then, so perhaps these changes helped.
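To put the two changes in numbers, here is a minimal Python sketch of the arithmetic. The buffer and timeout values are the ones listed above; the helper names and the idea of expressing the buffer as seconds of headroom at a given data rate are illustrative assumptions, and no actual cosmos data rate is implied.

```python
BLOCK_BYTES = 4096  # unit used in the buffer sizes quoted above


def buffer_bytes(n_blocks, block_bytes=BLOCK_BYTES):
    """Total sample buffer size in bytes."""
    return n_blocks * block_bytes


def headroom_s(buf_bytes, rate_bytes_per_s):
    """Seconds of data the buffer can hold at a HYPOTHETICAL data rate."""
    return buf_bytes / rate_bytes_per_s


old_buf = buffer_bytes(16)   # 65536 bytes
new_buf = buffer_bytes(24)   # 98304 bytes
old_timeout_s = 2 * 60       # ingest no-activity timeout before the change
new_timeout_s = 5 * 60       # ingest no-activity timeout after the change

print(new_buf - old_buf, new_buf / old_buf)   # 32768 1.5
print(new_timeout_s / old_timeout_s)          # 2.5
```

So the matrix buffer grew by half (32768 bytes) and ingest now waits 2.5 times longer before giving up. At any fixed data rate, `headroom_s` scales by the same factor of 1.5.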
- 480: ADAM/NETWORK, Site 2, Fri 23-Aug-1996 14:35:13 GMT, cosmos is down and out
cosmos was shut down at approx. 14:25, freewaves are going down imminently,
everything is close to shutdown. pam 1 will continue to xmit via goes until
station teardown.