GraphChi-java

Version 0.2

News

Performance has been improved by parallelizing shard loading better (Oct 22, 2013)
GraphChi was moved to GitHub from Google Code (July 24). Please report/fix any broken links.
GraphChi's Java version has a new cool random walk simulation engine: https://github.com/GraphChi/graphchi-java/wiki/Personalized-Pagerank-with-DrunkardMob

Introduction

Project for developing the Java version of GraphChi ( http://www.graphchi.org ), the disk-based graph computation engine. To learn more about GraphChi, visit the C++ version's project page: https://github.com/GraphChi/graphchi-cpp

NEW: GraphChi can be used in Hadoop/Pig scripts: GraphChi for Pig.

How to use

Read the README.txt for information on how to build and run the example applications. You are going to need Maven for building.

It is a very good idea to study the example applications carefully. There are currently three example applications in the package edu.cmu.graphchi.apps:

PageRank for computing the famous PageRank ranking
Connected Components for computing the weakly connected components
Alternative Least Squares Matrix Factorization

Input data

GraphChi-java supports edge-list and: adjacency list formats.

To preprocess the graph, you need to call "sharder" in the beginning of your program. For example, if you graph has floating-point input values for each edge, you can call it as follows:

    protected static FastSharder createSharder(String graphName, int numShards) throws IOException {
            return new FastSharder<Float, Float>(graphName, numShards, new VertexProcessor<Float>() {
                public Float receiveVertexValue(int vertexId, String token) {
                    return (token == null ? 0.0f : Float.parseFloat(token));
                }
            }, new EdgeProcessor<Float>() {
                public Float receiveEdge(int from, int to, String token) {
                    return (token == null ? 0.0f : Float.parseFloat(token));
                }
            }, new FloatConverter(), new FloatConverter());
        }
    
        public static void main(String[] args) throws  Exception {
            String baseFilename = args[0];
            int nShards = Integer.parseInt(args[1]);
            String filetype= args[2]; // "edgelist" or "adjacency"
    
            /* Create shards */
            FastSharder sharder = createSharder(baseFilename, nShards);
            if (new File(ChiFilenames.getFilenameIntervals(baseFilename, nShards)).exists()) {
                    sharder.shard(new FileInputStream(new File(baseFilename)), filetype);
                } else {
                    logger.info("Found shards -- no need to preprocess");
                }
        ....

Note: (For edge-lists only) initial values for vertices are also now supported. If you define a self-edge (i.e edge with same source and destination), the value of the 'edge' is interpreted as value for the vertex. Thus you need to provide both a vertex-value and edge-value parser to the FastSharder.

Hadoop / PIG

Since version 0.2 (January 2013), GraphChi Java programs can be directly invoked from Pig. More information will on page GraphChiForPig.

Example PIG script:

    REGISTER graphchi-java-0.2-jar-with-dependencies.jar;
    
    pagerank = LOAD '/user/akyrola/graphs/soc-LiveJournal1.txt' USING edu.cmu.graphchi.apps.pig.PigPagerank as (vertex:int, rank:float);
    
    STORE pagerank INTO '/user/akyrola/pagerank';

Scala

An early Scala-wrapper is also provided. See PagerankScala for example.

Differences to the C++ version

Following features are not implemented in the Java-version:

dynamic graphs
dynamic edge data

GraphChi implements a preprocessing step called "sharding", which reads an input graph and stores it in efficient binary format on the disk (see Introduction to GraphChi for more information). Java-version now includes its own sharding code (called "FastSharder"), which differs from the C++ version: FastSharder shuffles the order of vertices to guarantee (with high probability) an even distribution of edges over shards. Thus, internally GraphChi's Java-version uses different vertex IDs than in the original graph.

Translating between the internal ids and original ids is easy using the VertexIdTranslate class:

         VertexIdTranslate trans = engine.getVertexIdTranslate();
         for(int i=0; i < engine.numVertices(); i++) {
              System.out.println("Internal id " + i + " = original id " + trans.backward(i));
          }

Performance

On a new MacBook Pro (2012), I can run Pagerank on a Twitter-graph with 1.5 B edges (available from http://an.kaist.ac.kr/traces/WWW2010.html/) in about 10 minutes / iteration. It is 2-3x slower than the C++ version on an SSD disk. I expect this performance to be sufficient for many users, especially researchers and analysts who do need real-time performance but favor a convenient programming environment.

Acknowledgements

I want to thank Twitter and particularly the Personalization and Recommendations Team for the my internship during Fall 2012. M any of the improvements to mature the GraphChi's Java-version were done during the course of the internship.

Aapo Kyrola akyrola----at----cs.cmu.edu

-- Aapo Kyrola, akyrola@cs.cmu.edu

Name		Name	Last commit message	Last commit date
Latest commit History 362 Commits
conf		conf
example_pig		example_pig
src		src
test/edu/cmu/graphchi		test/edu/cmu/graphchi
.gitignore		.gitignore
NOTES.txt		NOTES.txt
README.md		README.md
README.txt		README.txt
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GraphChi-java

News

Introduction

How to use

Input data

Hadoop / PIG

Scala

Differences to the C++ version

Performance

Acknowledgements

About

Uh oh!

Releases

Packages

Xuvasi/graphchi-java

Folders and files

Latest commit

History

Repository files navigation

GraphChi-java

News

Introduction

How to use

Input data

Hadoop / PIG

Scala

Differences to the C++ version

Performance

Acknowledgements

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Packages