[TriLUG] Summary of Thomas Limoncelli's talk 14 May 2007, GEC center UNC
Joseph Mack NA3T
jmack at wm7d.net
Fri May 18 08:50:46 EDT 2007
Joe
----
Summary of presentation by Thomas Limoncelli of Google,
at joint NCSA/TriLUG meeting on 14 May 2007 in the GEC Auditorium UNC
The presentation was preceded by dinner consisting of
Chinese food and pizza, arranged by Liyun. The food was
eaten in a large enclosed atrium, with sidewalk cafe type
tables. The various speakers from the podium, including
Thomas, agreed that it was the best food ever at a geek
meeting.
Introductory remarks were given by the head of the UNC GEC
center, welcoming us to their new (2mo old) building.
Thomas has a 20yr history of sysadmining and managing
sysadmins, including about 8 yrs at Bell Labs. He is the
author of several books.
Thomas gave an introduction showing the great lifestyle
working for Google. The employees all seemed to be young.
Presumably few have families or a life outside Google and
are paid salaries characteristic of young people.
comment by Joe:
For 30yrs, I was in businesses dependent on the industrious
efforts of young and low-paid workers, who were promised
great things for great work. The reward for the successful
was more low-paid work and more promises; the ones who
expected pay commensurate with their results (the failures)
were let go.
Thomas then moved into the technical aspects of Google.
Google has several Exabytes of storage (don't remember how
many). He declined to give the number of pages stored,
saying it was a competition that no-one could win: e.g. if
www.foo.com and foo.com are the same page, do you count them
once or twice? If a page has a calendar, do you count it
once for each date till the 64bit counter runs out? The
problem was that whatever page count Google gave, the
competition could give a higher count. He dodged the
question of the number of bytes of webpages stored.
He showed racks of servers with the lights on and with the
lights off. The photo with the lights off showed only
darkness with rows of LEDs converging on the vanishing
point.
The technical part of Google is based on
o cheap hardware
o parallelisation of the searches
In the old days you bought a $25k server from SGI. Of course
it might fail, so you bought another one, powered off, as a
backup. The bean counters didn't like having a $25k machine
turned off, so you put a loadbalancer in front and ran them
both at 50%. The bean counters didn't like this either,
since you were still only using half of your $50k of
computers; they wanted both running at 100%. The bean
counters hadn't considered what would happen, in that case,
when one server failed.
The solution was to buy lots of cheap hardware. It didn't
matter that the cheap hardware wasn't nearly as fast, since
you now had 10-20 x86 servers for the price of 2 SGI servers
and you could afford to have up to two of them out at a time
(only 10-20% of your hardware was unused instead of 50%). If
you had enough servers, with failover you could guarantee
service. (The problem then became power - see below.) Google
buys only cheap hardware; no colored bezels, no frills.
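
To make the bean-counter arithmetic concrete, here is a
back-of-the-envelope sketch (Python, using only the
illustrative prices and server counts from the talk, not
Google's real numbers):

# Rough comparison of the two provisioning styles described above.
# The prices and counts are the illustrative figures from the talk.

def idle_fraction(total_servers, spare_servers):
    """Fraction of the capacity you paid for that sits idle as failover headroom."""
    return spare_servers / total_servers

# Old style: two $25k SGI boxes behind a loadbalancer, each running at 50%.
sgi_idle = idle_fraction(total_servers=2, spare_servers=1)

# Cheap-hardware style: ~20 x86 boxes for the same $50k, provisioned so
# the service survives any two of them being out at a time.
x86_idle = idle_fraction(total_servers=20, spare_servers=2)

print("SGI pair: %.0f%% of capacity idle" % (100 * sgi_idle))   # 50%
print("x86 farm: %.0f%% of capacity idle" % (100 * x86_idle))   # 10%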
The Google parallel filesystem was explained, at least at the
level that it's the same as other parallel file systems. The
metadata is on one set of loadbalanced and redundant
servers, while the data is on another set of loadbalanced
and redundant servers. The data (say a 16 Tbyte file) is
split into 64 Mbyte chunks - 256k of them - spread over the
chunk servers (stored in triplicate, so make that 768k chunks).
To grep the 16 Tbyte file on RAID would take all day, but
instead the 64 Mbyte chunks are grep'ed on each separate chunk
server and the results merged (Google calls it "reducing").
Following the reduction there is another step called Sawzall,
but that was given too fast for me to follow. Unlike most
parallel applications, which are limited in scaling by
Amdahl's law to a small number of servers, shifting the
small amount of metadata onto a separate set of servers
allows the parallel file system to scale without any limit
that Google has run into.
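
Here is a sketch of the grep example as I understood it -
just the idea, run on one machine over local files; it is not
Google's GFS/MapReduce code, and the function names are mine:

from concurrent.futures import ProcessPoolExecutor
from functools import partial
import re

def grep_chunk(pattern, chunk_path):
    """Map step: scan one 64 Mbyte chunk and return its matching lines."""
    regex = re.compile(pattern)
    with open(chunk_path, errors="ignore") as f:
        return [line for line in f if regex.search(line)]

def parallel_grep(pattern, chunk_paths):
    """Grep every chunk in parallel, then merge ("reduce") the matches."""
    with ProcessPoolExecutor() as pool:
        per_chunk = list(pool.map(partial(grep_chunk, pattern), chunk_paths))
    matches = []
    for chunk_matches in per_chunk:   # the reduce step: concatenate results
        matches.extend(chunk_matches)
    return matches

# In the real system each chunk lives on a different chunk server (in
# triplicate); here the "chunks" are just local files.
# matches = parallel_grep(r"error", ["chunk-000", "chunk-001"])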
Thomas said to look on Google's own website for papers on
how any of this works.
Google monitors every data fetch. Thomas showed a diagram
with server number (1..n) on the x-axis and % completion on
the y-axis. As a request is processed, you could see a
couple of servers lagging by about 25% - these would be
pulled for remedial attention - maybe a disk had to retry a
few times. As well, about 20% extra servers would be thrown
into the mix to duplicate 20% of the chunk requests. The
first reply home would win, but the main point is that the
duplicates were used to check the speed of the other
machines for health problems.
The health of servers is monitored 10/sec with an "are you
there?" request to do some relatively minor processing.
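
A sketch of the "duplicate the request" idea (the fetch
function and replica names are placeholders of mine, not
Google's interfaces):

from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def fetch_with_backup(fetch, replicas, chunk_id):
    """Send the same chunk request to several replicas and return
    whichever reply comes back first."""
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(fetch, replica, chunk_id) for replica in replicas]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    # Don't wait for the stragglers; their timings relative to the winner
    # are what flags a machine as needing remedial attention.
    pool.shutdown(wait=False)
    return next(iter(done)).result()

# Example (hypothetical): fetch_with_backup(read_chunk, ["serverA", "serverB"], 42)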
With Moore's Law sending the cost of hardware to $0 and the
use of F/OSS sending the cost of software to $0, the main
cost of running Google, Thomas said, was power. The cost of
power was the reason for siting the new installation in
Oregon at a closed aluminum smelter near a hydroelectric
plant, and for the choice of Lenoir, NC
http://en.wikipedia.org/wiki/Lenoir,_North_Carolina
for another new Google installation. I found the power
explanation a little hard to believe. The amount of hardware
and clocks/sec being bought increases with time, at least
partially offsetting the decrease in cost coming from
Moore's Law. From watching businesses over the past 40yrs,
it seems that the capital invested in computing has stayed
approximately constant, with Moore's Law giving an increase
in compute power rather than a decrease in cost. (Joe: the
cost of the 4004 microprocessor when first released was about
the same as that of a top-of-the-line CPU nowadays; however,
after inflation, current processors cost less than the 4004
did.)
I expect that, rather than power, the major cost of running
Google would be salaries, and hence the young work force.
Lenoir is not known for its surplus of power, but for its
poverty, its furniture workers laid off from jobs sent
overseas, and the absence of a technically skilled work
force. From reports in "The Independent", it seems that
Google muscled its way into a town desperate for jobs. If
Lenoir really is only a server farm, and Google really is
there for the power, then they'll only need to employ enough
people to pull disks and reboot servers; the upgrades etc.
will be done remotely.
Google has standardised on C++ for compiled code, Python for
scripting and Java for GUIs. All code is checked in to
Perforce, but only after a review by a 2nd person. When a
new product (eg Gmail, Google Maps) is rolled out, the
developers handle the first six months of production before
handing it over (with documentation) to the production team.
Developers can spend 20% of their time on a personal project
- Gmail came from one of these projects.
Thomas spent some time explaining the mechanics of upgrading
the servers. When a new feature is added, it is added in
such a way that it can be turned off without affecting the
original functionality. The servers will then be run
# application --new-feature=off
and tested against the specs for the previous generation of
code. Then the new feature is turned on and tested. Then a
few servers are put on line with the new feature off. After
a while the new feature is turned on and some random people
will see the new feature, but the rest of the world won't.
Then they'll bring on-line 100 servers and watch them for a
while. When they're happy with the new code, they'll do a
rolling upgrade at 1 server/sec. (I assume kernel upgrades
are done with something like PXE boot.) (All upgrades can be
rolled back.) After several rounds of upgrades, the command
line option list for invoking applications becomes quite
long.
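
A minimal sketch of what such a flag-gated binary might look
like (the flag name and handlers are only illustrative, and
Google's production code would be C++ rather than this Python):

import argparse

def handle_query_old(query):
    return "old-style results for %r" % query

def handle_query_new(query):
    return "new-style results for %r" % query

def main():
    parser = argparse.ArgumentParser()
    # Every new feature ships behind a flag that defaults to off, so the
    # new binary can first be tested against the previous generation's specs.
    parser.add_argument("--new-feature", choices=["on", "off"], default="off")
    args = parser.parse_args()

    handler = handle_query_new if args.new_feature == "on" else handle_query_old
    print(handler("example query"))

if __name__ == "__main__":
    main()

Run with --new-feature=off it should behave exactly like the
previous release; flipping the flag to on on a handful of
servers is what exposes the feature to a slice of the traffic.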
Google uses two separate networks for replies to requests:
o its own world wide private WAN
expensive
secure
relatively low bandwidth
low latency
o the internet
cheap
insecure
high bandwidth
high latency (Google buys low priority QoS service)
The first part of the page presented to the user (eg from
Google Maps) comes over Google's WAN, and arrives quickly.
The background information (the map tiles) comes over the
internet and is filled in gradually while the user is
scanning the initial material.
The network structure of Google is a central node,
surrounded by a ring of subservers. These central nodes are
themselves part of a ring around another central server. The
standard reply to a Google search has 3 areas on the page -
at the top is "news" (the icon is a folded newspaper), at the
right are the advertisements and below is the reply to the
request. These 3 sections of the page each come from a
different central server, itself served by a ring of
subservers.
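
A sketch of assembling such a page from three independent
services (the three fetch functions are placeholders of mine,
not Google's interfaces):

from concurrent.futures import ThreadPoolExecutor

def fetch_news(query):    return "<news section for %s>" % query
def fetch_ads(query):     return "<ad section for %s>" % query
def fetch_results(query): return "<search results for %s>" % query

def build_page(query):
    """Ask each central server (and its ring of subservers) independently,
    then stitch the three sections into one reply page."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        news    = pool.submit(fetch_news, query)
        ads     = pool.submit(fetch_ads, query)
        results = pool.submit(fetch_results, query)
    # Each section is produced independently, so a slow or failed section
    # need not hold up the design of the others.
    return "\n".join([news.result(), ads.result(), results.result()])

print(build_page("linux"))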
(comment by Joe)
This is a star of stars and is the preferred design for a
cheap transportation system (eg public transport, moving of
heavy goods, where say trains/planes are used for the long
haul to a hub, then cars/trucks locally). The internet
network design is for an unreliable transport mechanism.
With geo-mapping of IPs, Google is able to monitor requests
by location and time of day. Thomas showed a map of the
earth like this one which shows light pollution
http://www-static.cc.gatech.edu/~pesti/night/
except that it shows lights where the requests are coming
from. The lights rolled across the earth with the time of
day. He showed 14 Aug 2003, the date of the NY blackout,
when NY city went black for a frame. Thomas pointed out that
Africa was completely black on his map (ie a great
opportunity for business).
Since Google makes its money by clicks on advertising,
customers whose advertisements don't get many clicks are
told to pull them and figure out why no-one is clicking on
them.
At the end of the talk Thomas showed this image
http://www.thedatafarm.com/blog/content/binary/answergoogle.gif
Note the "www.mrburns.nl" near the upper left. The person
sitting next to me pointed it out. There's nothing obviously
connected to Google at this URL and I don't know the
connection.
--
Joseph Mack NA3T EME(B,D), FM05lw North Carolina
jmack (at) wm7d (dot) net - azimuthal equidistant map
generator at http://www.wm7d.net/azproj.shtml
Homepage http://www.austintek.com/ It's GNU/Linux!