[geeks] What do you get paid for wearing a pager?

velociraptor velociraptor at gmail.com
Mon Sep 25 13:36:10 CDT 2006


Warning: Long...

On 9/24/06, Patrick Giagnocavo <patrick at zill.net> wrote:
> What are *your* experiences or ideas?

I will second a lot of what others have said, particularly DanD.  Keep
in mind: a) I am assuming you are not an employee of the company, and
b) these first couple of paragraphs are based on some of my own
experiences in dealing with businesses and the way people can abuse
verbal agreements.  "Good fences make good neighbors."  Ditto for
contracts.

If you decide to do this, you should have a very strict agreement with
these folks, and you really need to create a definition *you* can live
with for what constitutes a "response".  I would also suggest setting
a fixed "trial" period or a clause that favors your re-assessment/
break-ability of the contract, so that if it isn't working out, they
can't sue you for terminating your services.  You will also want to
keep very strict account and records of how long you work on a problem
*and* what you do when you are called if you lay hands on equipment/
OSes.  Making suggestions on the phone is one thing; logging into a
server and "fixing" something that might have unintended consequences
is another.  Getting buy-off on anything serious you do wouldn't be a
bad idea.

Having a brief chat with a lawyer might be in order as far as review
of your agreements.  And, really, lawyers don't charge that much for
contract consultations--know what you want when you go in the door,
don't waste his/her time, and the cost should be reasonable.  The a$$
you save may well be your own.

As for my own on-call experiences, they have varied.

At 'hurtling network juggernaut', in a rotation of anywhere from 20-30
SAs, I was paid a flat ~$1100 (before taxes) for ~7pm-7am on-call
service--this was supporting 1000+ servers (Solaris, HP-UX, NetApps,
etc. all configured "sort of" the same--but not really).  The response
time was "within 30 min", which meant I had to call back the customer
(in this case, internal engineers) within 30 minutes.  Engineers had
both the on-call cell number and a pager number to contact the on-call
person directly.  I was also the traffic director to other on-call
folks like networking, 'doze servers, security, etc.  It was generally
understood that if things got out of hand you could call in the
"owner" SA, and if you worked "heroic" hours (i.e. ~2+ hours)
overnight, you would be late going into your "day job" the next day.
Keep in mind we
didn't run shifts--mgmt was just flexible about what constituted
"normal" hours.  In most cases, you would get a smallish add'l bonus
(~$250 after taxes) for working over a holiday period like
Thanksgiving, New Year's, etc.

With a couple of exceptions, this was really no big deal for the 4+
years I worked in that group.  I dodged the bullet the week that NIS+
crapped itself.

More recently, in a division of a 'big evil telco company', we got
paid straight overtime for every hour worked over 40.  My personal
policy was to record 1hr if I was called after 11pm, and the other 2
SAs in the rotation pretty much did the same.  We had 20 minutes to
respond to the hosting folks if the alert was generated by their
monitoring.  If it was generated by our monitoring on the back side,
it would usually be a matter of minutes before the hosting site paged
us, so we just called anyway.  When on call, we had to be within 1hr
of the hosting facility, though this was somewhat relaxed by
agreement among the 3 SAs, as one lived about 10 minutes away from the
data center.  Also, the hosting facility could provide us with remote
hands, so it was really more of an issue of being within 1 hr of
having network connectivity through VPN to the hosts.

The killer for me on this job was there was not enough QA of apps that
the client put into production.  After 3 months of being paged 2x per
night @ 2330 and 0330, and with neither my mgmt nor the client being
willing to do anything about it, I bailed.  It was exacerbated by new
mgmt insisting that I must be in the office at 0900 every morning, in
spite of these pages.  As team lead, I did not feel it was right to
subject someone to this kind of sleep deprivation on a daily basis, so
I took over on-call myself for most of the 3 months.  My understanding
is that it went on for another 6 months after I left, and that the
final solution was *turning off monitoring* rather than fixing the
problem.

At the $ork-2 ('huge defense contractor trying to break into civilian
sector') we had a 4-5 person rotation.  The escalation process was
very formal.  We got called by hosting (or other on-call folks in
other groups), and had to respond w/in 15 minutes to ACK.  At the 15
min. mark it would be escalated to the next level.  After our ACK, we
needed to decide on a ball-park ETA for resolution, and if that was
<30 min., that was the end of it, though we had to send a FIN to
hosting to confirm when we were done.  If it was more than 30 min, we
had to
call the next level and they would decide if it required a response
team ("service restoration call"), depending on your estimation of the
severity, etc.  Keep in mind there were very rigid definitions of what
constituted the Sev 1/Sev 2 problems that received these treatments.
There was no extra on-call pay involved, but if you worked more than
30 min or so you were expected to take comp time off the next week,
and depending on the situation you might get a bonus if it came down
to "heroic" measures.

In theory, this was a great plan.  In reality, the technical team lead
and the administrative team lead became the targets of all the furor.
We were expected to be available 24x7 regardless, answer for any
member of our team who was supposed to be on-call and didn't respond,
and then deal with the outage ourselves; the fact we weren't on mgmt
pay scale be damned.  There was a lot of finger-pointing about
procedures, but very little effort toward fixing the problems in the
environment that led to the outages.  The drama within upper mgmt at
the clients' and w/in the company's mgmt team was painful to watch,
much less have to deal with.  I can only say that I am happy that the
*NIX environment was stable; our two major outages while I was there
happened during normal business hours.  For example, the *NIX team
didn't find out about the 4th of July outage until the team leads from
the 'doze and backup teams complained about the >72hr "service
restoration call".

At $ork-1, an entertainment company, my boss was crazy enough to
filter all pages and call the appropriate person as necessary.  With
only one DBA, one developer/app support person, and one SA, his
wisdom far exceeded his own sanity.

At $ork, I am a lowly contractor without a badge and am not even
allowed on-site after hours at this point.  As far as I am aware,
there is no on-call rotation, as their mandate (in spite of providing
email & other global services for the unit) is that they only work
0700-1900, outside of scheduled maintenance that would require a
significant client outage (e.g. patching the unit-wide NFS server or
similar).  Recently due to serious power outages, they worked over the
weekends to bring things back on-line (the UPSes only last about 15
minutes--to that I say, "what was the point?"), but that's unusual.  I
suspect the after-hours support policy will be changing soon, though,
as they just finished installing a generator for their data center.

So, all over the map.  YMMV; take this burned-out sys admin's
recollections with a healthy grain of salt.  However, I stand by the
business advice, having been burned by not writing formal agreements
with previous business partners, as well as seeing numerous IT
consulting acquaintances get hammered by not doing so either.

=Nadine=
