Version 0.2
- Performance has been improved by parallelizing shard loading better (Oct 22, 2013)
- GraphChi was moved to GitHub from Google Code (July 24). Please report/fix any broken links.
- GraphChi's Java version has a new cool random walk simulation engine: https://github.com/GraphChi/graphchi-java/wiki/Personalized-Pagerank-with-DrunkardMob
Project for developing the Java version of GraphChi ( http://www.graphchi.org ), the disk-based graph computation engine. To learn more about GraphChi, visit the C++ version's project page: https://github.com/GraphChi/graphchi-cpp
NEW: GraphChi can be used in Hadoop/Pig scripts: GraphChi for Pig.
Read the README.txt for information on how to build and run the example applications. You are going to need Maven for building.
It is a very good idea to study the example applications carefully. There are currently three example applications in the package edu.cmu.graphchi.apps:
- PageRank for computing the famous PageRank ranking
- Connected Components for computing the weakly connected components
- Alternative Least Squares Matrix Factorization
GraphChi-java supports edge-list and: adjacency list formats.
To preprocess the graph, you need to call "sharder" in the beginning of your program. For example, if you graph has floating-point input values for each edge, you can call it as follows:
protected static FastSharder createSharder(String graphName, int numShards) throws IOException {
return new FastSharder<Float, Float>(graphName, numShards, new VertexProcessor<Float>() {
public Float receiveVertexValue(int vertexId, String token) {
return (token == null ? 0.0f : Float.parseFloat(token));
}
}, new EdgeProcessor<Float>() {
public Float receiveEdge(int from, int to, String token) {
return (token == null ? 0.0f : Float.parseFloat(token));
}
}, new FloatConverter(), new FloatConverter());
}
public static void main(String[] args) throws Exception {
String baseFilename = args[0];
int nShards = Integer.parseInt(args[1]);
String filetype= args[2]; // "edgelist" or "adjacency"
/* Create shards */
FastSharder sharder = createSharder(baseFilename, nShards);
if (new File(ChiFilenames.getFilenameIntervals(baseFilename, nShards)).exists()) {
sharder.shard(new FileInputStream(new File(baseFilename)), filetype);
} else {
logger.info("Found shards -- no need to preprocess");
}
....Note: (For edge-lists only) initial values for vertices are also now supported. If you define a self-edge (i.e edge with same source and destination), the value of the 'edge' is interpreted as value for the vertex. Thus you need to provide both a vertex-value and edge-value parser to the FastSharder.
Since version 0.2 (January 2013), GraphChi Java programs can be directly invoked from Pig. More information will on page GraphChiForPig.
Example PIG script:
REGISTER graphchi-java-0.2-jar-with-dependencies.jar;
pagerank = LOAD '/user/akyrola/graphs/soc-LiveJournal1.txt' USING edu.cmu.graphchi.apps.pig.PigPagerank as (vertex:int, rank:float);
STORE pagerank INTO '/user/akyrola/pagerank';
An early Scala-wrapper is also provided. See PagerankScala for example.
Following features are not implemented in the Java-version:
- dynamic graphs
- dynamic edge data
GraphChi implements a preprocessing step called "sharding", which reads an input graph and stores it in efficient binary format on the disk (see Introduction to GraphChi for more information). Java-version now includes its own sharding code (called "FastSharder"), which differs from the C++ version: FastSharder shuffles the order of vertices to guarantee (with high probability) an even distribution of edges over shards. Thus, internally GraphChi's Java-version uses different vertex IDs than in the original graph.
Translating between the internal ids and original ids is easy using the VertexIdTranslate class:
VertexIdTranslate trans = engine.getVertexIdTranslate();
for(int i=0; i < engine.numVertices(); i++) {
System.out.println("Internal id " + i + " = original id " + trans.backward(i));
}On a new MacBook Pro (2012), I can run Pagerank on a Twitter-graph with 1.5 B edges (available from http://an.kaist.ac.kr/traces/WWW2010.html/) in about 10 minutes / iteration. It is 2-3x slower than the C++ version on an SSD disk. I expect this performance to be sufficient for many users, especially researchers and analysts who do need real-time performance but favor a convenient programming environment.
I want to thank Twitter and particularly the Personalization and Recommendations Team for the my internship during Fall 2012. M any of the improvements to mature the GraphChi's Java-version were done during the course of the internship.
Aapo Kyrola akyrola----at----cs.cmu.edu
-- Aapo Kyrola, akyrola@cs.cmu.edu