Skip to content

GraphChi's Java version

Notifications You must be signed in to change notification settings

Xuvasi/graphchi-java

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GraphChi-java

Version 0.2

News

Introduction

Project for developing the Java version of GraphChi ( http://www.graphchi.org ), the disk-based graph computation engine. To learn more about GraphChi, visit the C++ version's project page: https://github.com/GraphChi/graphchi-cpp

NEW: GraphChi can be used in Hadoop/Pig scripts: GraphChi for Pig.

How to use

Read the README.txt for information on how to build and run the example applications. You are going to need Maven for building.

It is a very good idea to study the example applications carefully. There are currently three example applications in the package edu.cmu.graphchi.apps:

Input data

GraphChi-java supports edge-list and: adjacency list formats.

To preprocess the graph, you need to call "sharder" in the beginning of your program. For example, if you graph has floating-point input values for each edge, you can call it as follows:

    protected static FastSharder createSharder(String graphName, int numShards) throws IOException {
            return new FastSharder<Float, Float>(graphName, numShards, new VertexProcessor<Float>() {
                public Float receiveVertexValue(int vertexId, String token) {
                    return (token == null ? 0.0f : Float.parseFloat(token));
                }
            }, new EdgeProcessor<Float>() {
                public Float receiveEdge(int from, int to, String token) {
                    return (token == null ? 0.0f : Float.parseFloat(token));
                }
            }, new FloatConverter(), new FloatConverter());
        }
    
        public static void main(String[] args) throws  Exception {
            String baseFilename = args[0];
            int nShards = Integer.parseInt(args[1]);
            String filetype= args[2]; // "edgelist" or "adjacency"
    
            /* Create shards */
            FastSharder sharder = createSharder(baseFilename, nShards);
            if (new File(ChiFilenames.getFilenameIntervals(baseFilename, nShards)).exists()) {
                    sharder.shard(new FileInputStream(new File(baseFilename)), filetype);
                } else {
                    logger.info("Found shards -- no need to preprocess");
                }
        ....

Note: (For edge-lists only) initial values for vertices are also now supported. If you define a self-edge (i.e edge with same source and destination), the value of the 'edge' is interpreted as value for the vertex. Thus you need to provide both a vertex-value and edge-value parser to the FastSharder.

Hadoop / PIG

Since version 0.2 (January 2013), GraphChi Java programs can be directly invoked from Pig. More information will on page GraphChiForPig.

Example PIG script:

    REGISTER graphchi-java-0.2-jar-with-dependencies.jar;
    
    pagerank = LOAD '/user/akyrola/graphs/soc-LiveJournal1.txt' USING edu.cmu.graphchi.apps.pig.PigPagerank as (vertex:int, rank:float);
    
    STORE pagerank INTO '/user/akyrola/pagerank';

Scala

An early Scala-wrapper is also provided. See PagerankScala for example.

Differences to the C++ version

Following features are not implemented in the Java-version:

  • dynamic graphs
  • dynamic edge data

GraphChi implements a preprocessing step called "sharding", which reads an input graph and stores it in efficient binary format on the disk (see Introduction to GraphChi for more information). Java-version now includes its own sharding code (called "FastSharder"), which differs from the C++ version: FastSharder shuffles the order of vertices to guarantee (with high probability) an even distribution of edges over shards. Thus, internally GraphChi's Java-version uses different vertex IDs than in the original graph.

Translating between the internal ids and original ids is easy using the VertexIdTranslate class:

         VertexIdTranslate trans = engine.getVertexIdTranslate();
         for(int i=0; i < engine.numVertices(); i++) {
              System.out.println("Internal id " + i + " = original id " + trans.backward(i));
          }

Performance

On a new MacBook Pro (2012), I can run Pagerank on a Twitter-graph with 1.5 B edges (available from http://an.kaist.ac.kr/traces/WWW2010.html/) in about 10 minutes / iteration. It is 2-3x slower than the C++ version on an SSD disk. I expect this performance to be sufficient for many users, especially researchers and analysts who do need real-time performance but favor a convenient programming environment.

Acknowledgements

I want to thank Twitter and particularly the Personalization and Recommendations Team for the my internship during Fall 2012. M any of the improvements to mature the GraphChi's Java-version were done during the course of the internship.

Aapo Kyrola akyrola----at----cs.cmu.edu

-- Aapo Kyrola, akyrola@cs.cmu.edu

About

GraphChi's Java version

Big Data - small machine

Resources

Stars

Watchers

Forks

Packages

No packages published