[TriLUG] Another take on SCO vs Linux

Thu Jun 12 16:39:23 EDT 2003

Hi,

After reading some of the SCO vs Linux threads and articles, I'd like to
propose a theory that I haven't seen suggested before.  It may sound
strange and I may be on shaky ground, but give it a chance.

What I have seen of the debate is mostly (and probably rightly) just
SCO bashing.  Some people assume the code was copied from SCO into
Linux, either directly or through some complicated path of intermediate
points.  Others assume the reverse: SCO took some open source code, used
it, and it propagated in.  I wonder whether people working totally
independently, on code that accomplishes the same function in the same
language (i.e. C), could end up with fairly large stretches of code that
are exactly the same.

You may be thinking that this is ridiculous.  If two pages in two
different books were identical then yes, it might be a copyright
violation.  But this might not be true if it were a couple of
sentences.  For example:

"It was a dark and stormy night.  It was one of those nights where
Homer Simpson, PI, knew that nothing good could possibly walk into his
office.  But then she walked in.  She looked good and she had the kind
of expression on her face that told you she knew it.  And Homer
certainly knew it ......."

Well, I am sure many of those sentences have been used before.  And yes,
I took Homer Simpson from the Simpsons.  It was intended as a joke.  But
the point is that the length of the matching passage is important, and
how long it has to be depends on the number of possible permutations.
In terms of books and English, the number of permutations can be
relatively high (it won't be 26^n though, obviously).  Even so, we don't
necessarily consider the equivalence of a single sentence to be a
copyright violation.  The match has to be longer.  It could be argued
that the number of permutations of a computer program written in a given
language to accomplish a given task is relatively low.

Here are some things to consider that argue for similarities between
programs written in the same language to perform the same function.

1. Most C coders use K&R as a guide.  K&R defines and lays out some
standards for writing code.  There are other standards that may be
taught in school or learned in certain work environments (DoD, govt,
IBM, or homegrown standards).

2. They are implementing the same standard.  The System V unix
interfaces are published (the SVID), and networking and other standards
are open.  They define data structures, methods, and functions that are
to be used.  These definitions are incorporated into any program that
implements the standards.

3. C does not have a large vocabulary like English.  In a computer
program, there are very few things you can do.  Most of them are:
set/declare variables (and in C89 you must declare them all at the
beginning of a block), branch (an if test), loop (for, while, or
whatever), or call a function.  Contrast this with the 50-100 thousand
words in English.

4. Variable names.  Point 1 relates to this.  Many variable names will
be the same.  Doesn't everyone use 'i', 'j', or 'k' to go through a
loop?  Using current and next for pointers in a loop is common.  The
standard will sometimes define the nature of the data structure you are
using.  For example, let's say you had a structure called sk_buff.  Why
wouldn't you call the variable skb, or cur_skb, or something like that?

5. Algorithmic similarities.  In kernel development, there is very
strong pressure to get the exact best algorithm for any task.  This is
because the kernel is so central to the efficiency of the computer.  In
programming applications, efficiency is typically measured in O(f(n))
notation.  That is, if something is O(n) it takes time proportional to
n.  The constants are often ignored, and this can be a good
approximation.  In kernel programming, it is often known that the best
algorithm is O(n) (i.e. time = kn, where k is a constant).  The kernel
programmers strive to use their tricks and insight to minimize k.  It
may be that there is only one best way to do this and that it is easy
for any intelligent person to see.  Therefore, I would argue that much
of the similarity in some of these programs is algorithmic and is not a
copyright violation.  It's just that two intelligent people will often
think alike.

6. Function names.  Many function names may be defined by the standard.
Other function names are just based on the implementation.  Many
developers use getXX() or something similar.  The function names are
often the developers' best attempt (at least for open source
developers) to make the code understandable and to follow from the
algorithm.  If the algorithm is similar, often the function names will
be too.  Function names also often come from the language of the domain
being modeled/programmed.  If it's networking, networking terms are
used by everyone.

7. File names.  Similar argument to function names.  They arise from
standards, the domain being programmed, and the algorithm.

8. Include file ordering.  This may look suspicious, but there can be
reasons for it.  Sometimes include files have a precedence order.  That
is, with some compilers, if you change the order the code won't
compile.  Maybe limits.h or init.h needs to be first since it defines
macros or variables used by other .h files.  This may only apply to
older compilers, but the code should build with all compilers with
minimal adjustment.  And this may give rise to a particular ordering of
the include files.

9. Release of program resources.  This can also look suspicious, but
there are some rules.  If many resources need to be released, what I do
is release them in LIFO order.  That is, the last resource I allocated
is the first one released.  This is just in case there are dependencies
in the way in which the resources were allocated.

10. Order of variable declarations.  This one may be a little harder to
deal with.  But many times, there aren't that many variables declared
and there may be a standard way of declaring them (ints before pointers
maybe).  Usually if a function gets too complicated it is broken down
anyway, reducing the number of variables in it.
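
As a rough illustration of points 4, 9, and 10, here is a minimal sketch
in C of the kind of routine two people could plausibly write almost
identically without ever seeing each other's code.  Everything in it
(struct node, sum_list, free_list, sum_to) is a hypothetical name made
up for this example and is not taken from Linux, SCO, or any standard.
It assumes C89-style declarations at the top of each function.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical list node, invented for this example. */
struct node {
    int value;
    struct node *next;
};

/* Walk the list with a cur pointer and add up the values. */
long sum_list(const struct node *head)
{
    const struct node *cur;
    long total = 0;

    for (cur = head; cur != NULL; cur = cur->next)
        total += cur->value;
    return total;
}

/* Free the list; next must be saved before cur is released. */
void free_list(struct node *head)
{
    struct node *cur, *next;

    for (cur = head; cur != NULL; cur = next) {
        next = cur->next;
        free(cur);
    }
}

/*
 * Fill a scratch buffer with 1..n, copy it into a list, and sum the list.
 * Resources are acquired as buf, then list, and released in LIFO order.
 * Returns the sum, or -1 if an allocation fails.
 */
long sum_to(int n)
{
    int i;                          /* ints first ... */
    long result = -1;
    int *buf;                       /* ... then pointers */
    struct node *head = NULL;
    struct node *cur;

    assert(n > 0);

    buf = malloc((size_t)n * sizeof *buf);      /* resource 1 */
    if (buf == NULL)
        goto out;
    for (i = 0; i < n; i++)
        buf[i] = i + 1;

    for (i = 0; i < n; i++) {                   /* resource 2 */
        cur = malloc(sizeof *cur);
        if (cur == NULL)
            goto out_free;
        cur->value = buf[i];
        cur->next = head;
        head = cur;
    }

    result = sum_list(head);

out_free:
    free_list(head);    /* allocated last, released first */
    free(buf);          /* allocated first, released last */
out:
    return result;
}
```

Calling sum_to(4) returns 10.  Given the same task, there is not much
room to vary the loop variable, the cur/next idiom, or the cleanup
order, which is the kind of overlap the points above are getting at.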

None of these points alone is sufficient to eliminate the accusation of
copyright violations.  But consider them all together.  Then consider
what code was compared.  There are thousands of C files (about 5000 or
so for Linux; I don't know about SCO).  They also could have taken the
entire CVS vault for both systems and compared all versions against all
versions.  That's a lot of code in which to expect that there are no
similarities.
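
To get a feel for the scale of an all-versions-against-all-versions
comparison, here is a small sketch in C that just multiplies out the
number of file pairs.  The function name and the figures used with it
are assumptions for illustration only; in particular the SCO file count
and both version counts are made up, since the real numbers are not
known here.

```c
/*
 * Number of file-to-file comparisons when every version of every file
 * in tree A is compared against every version of every file in tree B.
 * All four inputs are hypothetical placeholders.
 */
unsigned long long comparisons(unsigned long long files_a,
                               unsigned long long versions_a,
                               unsigned long long files_b,
                               unsigned long long versions_b)
{
    return files_a * versions_a * files_b * versions_b;
}
```

With 5000 files on each side and a made-up 10 versions of each file,
comparisons(5000, 10, 5000, 10) comes to 2,500,000,000 pairs, so even a
tiny chance of a coincidental match per pair leaves room for some hits.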

It's also interesting to note that if my explanation is correct, it
would equally explain any similarities of SCO's code to Linux code.
You might not want to tell SCO that if they are sued :-)

This post turned out longer than I thought it would be.  I just started
thinking about factors that could occur by chance that would induce
similarities, and the list just kept getting longer.

So make your own judgements, and post to the list if you think I made
any major errors, or if you agree for that matter.

Later,

Bill Gooding
