[TriLUG] Summary of Thomas Limoncelli's talk 14 May 2007, GEC center UNC

Joseph Mack NA3T jmack at wm7d.net
Fri May 18 08:50:46 EDT 2007


Joe

----

Summary of presentation by Thomas Limoncelli of Google,
at joint NCSA/TriLUG meeting on 14 May 2007 in the GEC Auditorium UNC

The presentation was preceded by dinner consisting of 
Chinese food and pizza, arranged by Liyun. The food was 
eaten in a large enclosed atrium, with sidewalk cafe type 
tables. The various speakers from the podium, including 
Thomas, agreed that it was the best food ever at a geek 
meeting.

Introductory remarks were given by the head of the UNC GEC 
center, welcoming us to their new (2mo old) building.

Thomas has a 20yr history of sysadmining and managing 
sysadmins, including about 8 yrs at Bell Labs. He is the 
author of several books.

Thomas gave an introduction showing the great lifestyle 
working for Google. The employees all seemed to be young. 
Presumably few have families or a life outside Google and 
are paid salaries characteristic of young people.

comment by Joe:
For 30yrs, I was in businesses dependant on the industrious 
efforts of young and low paid workers, who were promised 
great things for great work. The rewards for the successful 
was more low paid work and more promises; the ones who 
expected pay commensurate with their results (the failures) 
were let go.

Thomas then moved into the technical aspects of Google. 
Google has several Exabytes of storage (don't remember how 
many). He declined to give the number of pages stored, 
saying it was a competition that no-one could win: e.g. if 
www.foo.com and foo.com are the same page, do you count them 
once or twice? If a page has a calendar, do you count it 
once for each date till the 64-bit counter runs out? The 
problem was that whatever page count Google gave, the 
competition could give a higher count. He dodged the 
question of the number of bytes of webpages stored.

He showed racks of servers with the lights on and with the 
lights off. The photo with the lights off showed only 
darkness with rows of LEDs converging on the vanishing 
point.

The technical part of Google is based on

o cheap hardware

o parallelisation of the searches

In the old days you bought a 25k$ server from SGI. Of course 
it may fail, so you bought another one powered off as a 
backup. The bean counters didn't like having a $25k machine 
turned off, so you put a loadbalancer in front and ran them 
both at 50%. The bean counters didn't like this either, 
since you were still only using half of your $50k of 
computers. They wanted both running at 100%. The bean 
counters didn't consider what would happen, in that case, 
when one server failed.

The solution was to buy lots of cheap hardware. It didn't 
matter that the cheap hardware wasn't nearly as fast, since 
you now had 10-20 x86 servers for the price of 2 SGI servers 
and you could afford to have up to two of them out at a time 
(only 10-20% of your hardware was unused instead of 50%). If 
you had enough servers, with failover you could guarantee 
service. (The problem then became power - see below.) Google 
buys only cheap hardware; no colored bezels, no frills.
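
Here's a back-of-the-envelope sketch in python of that 
spare-capacity arithmetic (the numbers are just the ones 
quoted above, nothing more):

# fraction of the hardware held idle in each setup
setups = {
    "two $25k SGI servers, one as spare": (2, 1),
    "20 cheap x86 servers, two as spares": (20, 2),
}
for name, (total, spares) in setups.items():
    print(f"{name}: {spares/total:.0%} of the hardware is unused")
# prints 50% for the SGI pair, 10% for the cheap farm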

The Google parallel filesystem was explained at least at the 
level that it's the same as other parallel file systems. The 
metadata is on one lot of loadbalanced and redundant 
servers, while the data is on another set of loadbalanced 
and redundant servers. The data (say a 16Tbyte file) is 
split into 64Mbyte chunks - 256k chunks for such a file, and 
in triplicate make that 768k chunk copies - spread over the 
chunk servers.
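
Checking the chunk arithmetic (a worked example of mine, not 
from the talk):

# 16 Tbyte file in 64 Mbyte chunks, stored in triplicate
TBYTE, MBYTE = 2**40, 2**20
chunks = (16 * TBYTE) // (64 * MBYTE)   # 262144, the "256k" figure
copies = 3 * chunks                     # 786432, roughly "768k"
print(chunks, copies)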

To grep the 16Tbyte file on RAID would take all day, but 
instead the 64Mbyte chunks are grep'ed on each separate 
chunk server and the results merged (Google calls it 
"reducing"). Following the reduction there is another step, 
called Sawzall, but that was given too fast for me to 
follow. Unlike most parallel applications, which are limited 
in scaling by Amdahl's law to a small number of servers, the 
shifting of the small amount of metadata to a separate set 
of servers allows the parallel file system to scale without 
any limit that Google has run into.
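
Here's a minimal sketch in python of the grep-then-merge 
idea; the chunk contents and the pattern are made up, and 
this is only the flavour of the scheme, not Google's code:

# each chunk is grep'ed independently (the "map"), then the
# per-chunk hit counts are merged (the "reduce")
from multiprocessing import Pool
import re

CHUNKS = [
    "the quick brown fox",       # stands in for one 64Mbyte chunk
    "jumps over the lazy dog",
    "the end",
]

def grep_chunk(chunk, pattern="the"):
    """Map step: count pattern hits in one chunk, as a chunk server would."""
    return len(re.findall(pattern, chunk))

if __name__ == "__main__":
    with Pool() as pool:
        per_chunk = pool.map(grep_chunk, CHUNKS)   # all chunks in parallel
    print(per_chunk, "->", sum(per_chunk))         # reduce: merge the results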

Thomas said to look on google's own website for papers on 
how any of this works.

Google monitors every data fetch. Thomas showed a diagram 
with server number (1..n) on the x-axis and % completion on 
the y-axis. As a request is processed, you could see a 
couple of servers lagging by about 25% - these would be 
pulled for remedial attention - maybe a disk had to retry a 
few times. In addition, about 20% extra servers would be 
thrown into the mix to duplicate 20% of the chunk requests. 
The first reply home would win, but the main point is that 
the duplicates were used to check the speed of the other 
machines for health problems.
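
A sketch of that duplicate-request trick in python (the 
fetch function, the timings and the 20% figure here are 
stand-ins, not Google's implementation):

import concurrent.futures, random, time

def fetch_chunk(server_id, chunk_id):
    # pretend chunk fetch; server 2 is "sick" and retries its disk
    time.sleep(random.uniform(0.01, 0.05) * (5 if server_id == 2 else 1))
    return f"chunk {chunk_id} from server {server_id}"

pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)

def fetch_with_duplicate(chunk_id, primary, backup):
    # issue the same request to two servers and keep whichever
    # answers first; the loser's time shows up a slow machine
    futures = [pool.submit(fetch_chunk, s, chunk_id) for s in (primary, backup)]
    winner = next(concurrent.futures.as_completed(futures))
    return winner.result()

if __name__ == "__main__":
    for chunk in range(5):
        if random.random() < 0.2:              # duplicate ~20% of requests
            print(fetch_with_duplicate(chunk, primary=2, backup=7))
        else:
            print(fetch_chunk(2, chunk))
    pool.shutdown()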

The health of the servers is monitored 10 times/sec with an 
"are you there?" request that asks the server to do some 
relatively minor processing.

With Moore's Law sending the cost of hardware to $0 and the 
use of F/OSS sending the cost of software to $0, the main 
cost of running Google, Thomas said, was power. The cost of 
power was the reason for siting the new installation in 
Oregon at a closed aluminum smelter near a hydro electric 
plant, and in the choice of Lenoir, NC

http://en.wikipedia.org/wiki/Lenoir,_North_Carolina

for another new Google installation. I found the power 
explanation a little hard to believe. The amount of hardware 
and clocks/sec being bought increases with time, at least 
partially offsetting the decrease in cost coming from 
Moore's Law. From watching businesses over the past 40yrs, 
it seems that the capital tied up in computing has stayed 
approximately constant, with Moore's Law giving an increase 
in compute power rather than a decrease in cost. (Joe: the 
nominal cost of the 4004 microprocessor when first released 
is about the same as a top of the line CPU nowadays. 
However, after inflation, current processors cost less than 
the 4004 did.)

I expect rather than power, the major cost of running Google 
would be salaries, and hence the young work force. Lenoir is 
not known for its surplus of power, but for its poverty, 
laid-off furniture workers from jobs sent overseas and the 
absence of a technically skilled work force. From reports in 
"The Independent", it seems that Google muscled its way into 
a town desperate for jobs. If Lenoir really is only a server 
farm, and Google really is there for the power, then they'll 
only need to employ enough people to pull disks and reboot 
servers; the upgrades etc will be done remotely.

Google has standardised on C++ for compiled code, python for 
scripting and java for GUIs. All code is checked in to 
Perforce, but only after a review by a 2nd person. When a 
new product (eg gmail, google maps) is rolled out, the 
developers handle the first six months of production before 
handing it over (with documentation) to the production team. 
Developers can spend 20% of their time on a personal project 
- gmail came from one of these projects.

Thomas spent some time explaining the mechanics of upgrading 
the servers. When a new feature is added, it is added in a 
way that it can be turned off without affecting the original 
functionality. The servers will then be run

# application --new-feature=off

and tested against the specs for the previous generation of 
code. Then the new feature is turned on and tested. Then a 
few servers are put on line with the new feature off. After 
a while the new feature is turned on and some random people 
will see the new feature, but the rest of the world won't. 
Then they'll bring on-line 100 servers and watch them for a 
while. When they're happy with the new code, they'll do a 
rolling upgrade at 1 server/sec. (I assume kernel upgrades 
are done with something like PXE boot.) (All upgrades can be 
rolled back.) After several rounds of upgrades, the command 
line option list for invoking applications becomes quite 
long.
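
A toy illustration of the --new-feature=off pattern in 
python (the flag name and the two code paths are invented 
for the example):

import argparse

def old_search(query):
    return f"old results for {query!r}"

def new_search(query):
    return f"new results for {query!r}"

def main():
    parser = argparse.ArgumentParser(description="toy app with a gated feature")
    parser.add_argument("--new-feature", choices=["on", "off"], default="off",
                        help="with 'off' the app must behave exactly like the previous release")
    parser.add_argument("query")
    args = parser.parse_args()
    if args.new_feature == "on":
        print(new_search(args.query))
    else:
        print(old_search(args.query))  # unchanged behaviour, testable against the old specs

if __name__ == "__main__":
    main()

Run it with --new-feature=off and the output can be checked 
against the previous generation's specs; flip the flag to on 
for the test servers.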

Google uses two lots of networks for replies to requests

o its own world wide private WAN
   expensive
   secure
   relatively low bandwidth
   low latency

o the internet
   cheap
   insecure
   high bandwidth
   high latency (Google buys low priority QoS service)

The first part of the page presented to the user (eg from 
google maps) comes over Google's WAN, and arrives quickly. 
The background info (tiling) comes over the internet and is 
filled in gradually while the user is scanning the initial 
material.
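
A rough sketch of that idea in python, with asyncio delays 
standing in for the two networks (the timings and function 
names are invented):

import asyncio

async def send_over_wan(payload):
    await asyncio.sleep(0.02)          # low latency, low bandwidth
    print("page skeleton delivered:", payload)

async def send_over_internet(tile):
    await asyncio.sleep(0.5)           # high bandwidth, high latency
    print("tile filled in:", tile)

async def serve_map_page():
    # the skeleton goes out first and is not held up by the tiles
    await send_over_wan("controls + first screenful")
    await asyncio.gather(*(send_over_internet(t) for t in range(4)))

if __name__ == "__main__":
    asyncio.run(serve_map_page())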

The network structure of Google is a central node 
surrounded by a ring of subservers; these central nodes are 
themselves arranged in a ring around another central server. 
The standard reply to a Google search has 3 areas on the 
page - at the top is "news" (icon is a folded newspaper), at 
the right are the advertisements and below is the reply to 
the request. These 3 sections of the page each come from a 
different central server, itself served by its own ring of 
subservers.

(comment by Joe)
This is a star of stars and is the preferred design for a 
cheap transportation system (eg public transport, moving of 
heavy goods, where say trains/planes are used for the long 
haul to a hub, then cars/trucks locally). The internet 
network design is for an unreliable transport mechanism.

With geo-mapping of IPs, Google is able to monitor requests 
by location and time of day. Thomas showed a map of the 
earth like this one which shows light pollution

http://www-static.cc.gatech.edu/~pesti/night/

except it shows lights where the requests are coming from. 
The lights rolled across the earth with the time of day. He 
showed 14 Aug 2003, the date of the NY blackout, when NY 
city went black for a frame. Thomas pointed out that Africa 
was completely black on his map (ie a great opportunity for 
business).

Since Google makes its money by clicks on advertising, 
customers whose advertisements don't get many clicks are 
told to pull them and figure out why no-one is clicking on 
them.

At the end of the talk Thomas showed this image

http://www.thedatafarm.com/blog/content/binary/answergoogle.gif

note the "www.mrburns.nl" near the upper left. The person 
sitting next to me pointed it out. There's nothing obviously 
connected to google on this URL and I don't know the 
connection.

-- 
Joseph Mack NA3T EME(B,D), FM05lw North Carolina
jmack (at) wm7d (dot) net - azimuthal equidistant map
generator at http://www.wm7d.net/azproj.shtml
Homepage http://www.austintek.com/ It's GNU/Linux!


