WebGraph
WebGraph is a framework to study the web graph. It provides simple ways to manage
very large graphs, exploiting modern compression techniques. More precisely,
it is currently made of:
- A set of simple codes, called ζ codes, which are
particularly suitable for storing web graphs (or, in general, integers
with a power-law distribution in a certain exponent range).
- Algorithms for compressing web graphs that exploit gap compression and
differential compression (à la LINK),
intervalisation and ζ codes to provide a high compression ratio (see our datasets). The
algorithms are controlled by several parameters, which provide
different tradeoffs between access speed and compression ratio.
- Algorithms for accessing a compressed graph without actually decompressing it,
using lazy techniques that delay the decompression until it is actually necessary.
- This package, providing a complete, documented implementation of
the algorithms above in Java. It is free software
distributed under the GNU General Public License.
- Data sets for very large graphs (e.g., a billion links). These are either
gathered from public sources (such as WebBase),
or gathered by UbiCrawler.
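To give a feel for the gap compression mentioned in the list above, here is a minimal, self-contained sketch. It is a simplification and not WebGraph's actual encoder: it assumes a strictly increasing successor list, keeps the first successor as is, and stores every later successor as its distance from the previous one, minus one. The names GapSketch, toGaps and fromGaps are hypothetical and are not part of the WebGraph API.

```java
import java.util.Arrays;

/** Illustrative only: a simplified gap transform, not part of WebGraph. */
final class GapSketch {
    /** Turns a strictly increasing successor list into a list of gaps. */
    static int[] toGaps(int[] successors) {
        int[] gaps = new int[successors.length];
        for (int i = 0; i < successors.length; i++) {
            // The first entry is kept as is; each later entry stores the distance
            // from the previous successor, minus one (distances are at least 1).
            gaps[i] = i == 0 ? successors[0] : successors[i] - successors[i - 1] - 1;
        }
        return gaps;
    }

    /** Inverts toGaps(), rebuilding the original successor list. */
    static int[] fromGaps(int[] gaps) {
        int[] successors = new int[gaps.length];
        for (int i = 0; i < gaps.length; i++) {
            successors[i] = i == 0 ? gaps[0] : successors[i - 1] + gaps[i] + 1;
        }
        return successors;
    }

    public static void main(String[] args) {
        int[] successors = { 13, 15, 16, 17, 18, 19, 23, 24, 203 };
        int[] gaps = toGaps(successors);
        System.out.println(Arrays.toString(gaps)); // [13, 1, 0, 0, 0, 0, 3, 0, 178]
        System.out.println(Arrays.equals(successors, fromGaps(gaps))); // true
    }
}
```

Most of the resulting values are small, which is the situation in which codes that favour small integers, such as ζ codes, pay off.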
In the end, with WebGraph you can access and analyse very large web graphs. Using WebGraph is as easy as installing a few
jar files and downloading a data set. This makes it very easy to study phenomena such as PageRank or the
distribution of graph properties of the web graph.
You are welcome to use and improve WebGraph! If you find our software useful for your research, please cite
our paper “The WebGraph Framework I: Compression Techniques”, by Paolo Boldi and
Sebastiano Vigna, in Proc. of the Thirteenth World Wide Web
Conference, pages 595–601, 2004, ACM Press.
Looking around
Warning: WebGraph 2+ is not fully compatible with previous versions and
requires some minor code refactoring. Please
refer to the documentation of {@link it.unimi.dsi.webgraph.ImmutableGraph}.
For in-depth information on the WebGraph framework, you should have
a look at its home page,
where you can find some papers about the compression techniques it uses.
Datasets are available at the
LAW web site.
The classes of interest for the casual WebGraph user are {@link
it.unimi.dsi.webgraph.ImmutableGraph}, which specifies the access
methods for an immutable graph, {@link it.unimi.dsi.webgraph.BVGraph},
which allows you to retrieve or recompress a graph stored in the format
described in The WebGraph
Framework I: Compression Techniques, and {@link it.unimi.dsi.webgraph.Transform}, which
provides several ways to transform an {@link it.unimi.dsi.webgraph.ImmutableGraph}.
If you plan on building your graphs dynamically, the class
{@link it.unimi.dsi.webgraph.ArrayListMutableGraph} makes it possible
to create a graph incrementally and then extract an {@linkplain
it.unimi.dsi.webgraph.ArrayListMutableGraph#immutableView() immutable view}.
The package {@link it.unimi.dsi.webgraph.examples} contains useful
examples that show how to access an immutable graph both sequentially
and randomly.
Importing your data
If you want to import your own data into WebGraph, you must write
an implementation of {@link it.unimi.dsi.webgraph.ImmutableGraph} that
exposes your data. A simple example is given in {@link it.unimi.dsi.webgraph.examples.IntegerListImmutableGraph},
a stub class exposing a simple, noncompressed binary format as an {@link it.unimi.dsi.webgraph.ImmutableGraph}.
Once your data is exposed in that way, you can get a compressed version
using the store()
method of your class of interest. Often, there
is a main method (see, e.g., {@link it.unimi.dsi.webgraph.BVGraph}) that
will load your class and invoke store()
for you.
As an alternative, the class {@link it.unimi.dsi.webgraph.ASCIIGraph}
can be used to read graphs specified in a very simple ASCII format. The class
implements {@link it.unimi.dsi.webgraph.ASCIIGraph#loadOnce(java.io.InputStream)}, so
the file can simply be piped into any class offering a main method that supports
loadOnce()
(e.g., {@link it.unimi.dsi.webgraph.BVGraph}).
You can also generate a graph in ASCII format and read it using
{@link it.unimi.dsi.webgraph.ASCIIGraph#loadOffline(CharSequence)}—the
graph will not be loaded into main memory.
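As a sketch of how such an ASCII file might be generated, the following self-contained program writes a three-node graph as successor lists. It assumes that the first line holds the number of nodes and that each following line lists the successors of one node in increasing order, separated by single spaces; please check the documentation of {@link it.unimi.dsi.webgraph.ASCIIGraph} before relying on these details. The class name AsciiGraphWriterSketch, its methods and the file name example.graph-txt are hypothetical and are not part of WebGraph.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Illustrative only: writes successor lists in an assumed ASCII layout. */
final class AsciiGraphWriterSketch {
    /** Formats one sorted successor array per node; node i is successors[i]. */
    static String format(int[][] successors) {
        StringBuilder sb = new StringBuilder();
        // Assumed layout: the first line is the number of nodes.
        sb.append(successors.length).append('\n');
        for (int[] list : successors) {
            for (int i = 0; i < list.length; i++) {
                if (i > 0) sb.append(' ');
                sb.append(list[i]);
            }
            // A node without successors yields an empty line.
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Arcs: 0 -> 1, 0 -> 2, 1 -> 2; node 2 has no successors.
        int[][] successors = { { 1, 2 }, { 2 }, {} };
        Path out = Paths.get("example.graph-txt"); // hypothetical file name
        Files.write(out, format(successors).getBytes(StandardCharsets.US_ASCII));
    }
}
```

The resulting file could then be piped into, or named as the input of, a class that reads this format.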
{@link it.unimi.dsi.webgraph.ASCIIGraph} requires listing the successors of each
node on a separate line. If your graph is specified arc by arc (one arc per line) you
can use {@link it.unimi.dsi.webgraph.ArcListASCIIGraph} instead.
{@link it.unimi.dsi.webgraph.ShiftedByOneArcListASCIIGraph} can be used if your input
data (rather insensibly) numbers nodes starting from one.
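The relation between the two layouts can be illustrated with a small, self-contained sketch that turns an arc list into successor lists and optionally shifts node numbers down by one. It assumes that each input line holds a source and a target separated by whitespace. The class ArcListSketch and its method are hypothetical: this is not how the WebGraph classes are implemented, only a picture of what the conversion amounts to.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Illustrative only: converts an arc list into sorted successor lists. */
final class ArcListSketch {
    /**
     * Parses one arc per line and subtracts shift from every node number
     * (use 1 for input numbered from one, 0 for input numbered from zero).
     */
    static List<TreeSet<Integer>> toSuccessors(String arcList, int shift) {
        List<TreeSet<Integer>> successors = new ArrayList<>();
        for (String line : arcList.split("\n")) {
            line = line.trim();
            if (line.isEmpty()) continue; // skip blank lines
            String[] fields = line.split("\\s+");
            int source = Integer.parseInt(fields[0]) - shift;
            int target = Integer.parseInt(fields[1]) - shift;
            if (source < 0 || target < 0) {
                throw new IllegalArgumentException("Negative node after shifting: " + line);
            }
            // Grow the list so that both endpoints exist as nodes.
            while (successors.size() <= Math.max(source, target)) {
                successors.add(new TreeSet<>());
            }
            // A TreeSet keeps successors sorted and drops duplicate arcs.
            successors.get(source).add(target);
        }
        return successors;
    }

    public static void main(String[] args) {
        String oneBased = "1 2\n1 3\n2 3\n";
        System.out.println(toSuccessors(oneBased, 1)); // [[1, 2], [2], []]
    }
}
```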
Importing your labelled data
Arc-labelled graphs are represented using implementations of {@link it.unimi.dsi.webgraph.labelling.ArcLabelledImmutableGraph}.
Most arc-labelled graphs are based on an underlying {@link it.unimi.dsi.webgraph.ImmutableGraph}, and
the {@link it.unimi.dsi.webgraph.labelling.ArcLabelledImmutableGraph} implementation just provides
label handling. The example {@link it.unimi.dsi.webgraph.examples.IntegerTriplesArcLabelledImmutableGraph}
shows how to expose your data as an instance of {@link it.unimi.dsi.webgraph.labelling.ArcLabelledImmutableGraph},
so you can save your data using your preferred combination of implementations.
Dependencies
WebGraph requires Java ≥6 and relies on fastutil 6.4 or greater
for high-performance containers and algorithms,
on the COLT
distribution for statistics, on the DSI utilities for bit-level I/O, on
Sux4J for succinct data structures, on JSAP for command-line parsing and on
log4j for logging.
Note that in principle the DSI utilities depend on a number of additional useful libraries from
the Jakarta commons project,
including collections,
lang,
configuration and
io.