bigjocker's den

Running Pig with Cassandra for MapReduce … wait for 1.2.6 if you use CQL3

I wanted to test Pig integration in a small Cassandra cluster. In theory, running MapReduce jobs with Pig against Cassandra should be easy, but I kept running into problems when trying to access my CQL3 tables from Pig. It turns out this is a known issue, and a patch is ready for Cassandra 1.2.6.

The problem is that CQL3 tables are not accessible via Thrift, and Thrift is what the Pig integration uses to talk to Cassandra. Since CQL3 is the recommended way to create, alter and access data in newer Cassandra versions, that’s what we are using.

We have a small cluster of 3 nodes, a test keyspace and a words table:

Connected to test at 127.0.0.1:9160.
[cqlsh 3.0.2 | Cassandra 1.2.5-SNAPSHOT | CQL spec 3.0.0 | Thrift protocol 19.36.0]
Use HELP for help.
cqlsh> CREATE KEYSPACE test WITH replication = {
        'class': 'SimpleStrategy',
        'replication_factor': 2
};
cqlsh:test> CREATE TABLE words (word text PRIMARY KEY, dummy int);
cqlsh:test>

I’m using the JDBC driver to populate this table:

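import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Register the Cassandra JDBC driver and connect to the test keyspace
Class.forName("org.apache.cassandra.cql.jdbc.CassandraDriver");
Connection conn = DriverManager.getConnection("jdbc:cassandra://127.0.0.1:9160/test");
// In CQL an UPDATE is an upsert, so it creates each row that does not exist yet
PreparedStatement statement = conn.prepareStatement("UPDATE words SET dummy = ? WHERE word = ?");
for (int i = 0 ; i < 100000 ; i++) {
        statement.setInt(1, i);
        statement.setString(2, String.valueOf(i));
        statement.executeUpdate();
}
statement.close();
conn.close();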

We now have 100,000 records to test our cluster:

cqlsh:test> SELECT count(*) from words LIMIT 120000;

count
--------
100000

cqlsh:test>
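
The same check can be made from Java, which is handy if the load program should verify its own work. This is only a sketch, not tested against a live cluster: it assumes the same driver class and JDBC URL as the load code, and CountWords and countQuery are made-up names. The explicit LIMIT is kept because, as far as I can tell, CQL3 in Cassandra 1.2 applies a default limit of 10,000 rows to a SELECT, count(*) included, unless a larger one is given.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Hypothetical helper for illustration; only the driver class and JDBC URL come from the post.
final class CountWords {
    // Builds the same count query that was typed into cqlsh, with an explicit LIMIT.
    static String countQuery(String table, int limit) {
        return "SELECT count(*) FROM " + table + " LIMIT " + limit;
    }

    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.cassandra.cql.jdbc.CassandraDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:cassandra://127.0.0.1:9160/test");
             Statement statement = conn.createStatement();
             ResultSet rs = statement.executeQuery(countQuery("words", 120000))) {
            // count(*) comes back as a single row with a single numeric column
            if (rs.next()) {
                System.out.println("rows: " + rs.getLong(1));
            }
        }
    }
}
```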

We should now be able to use Pig to run MapReduce jobs in our cluster, as Pig integration with Cassandra is quite simple:

ngranek@trantor:~/1.2.5/examples/pig$ export PIG_HOME=/Users/ngranek/apps/pig-0.11.1
ngranek@trantor:~/1.2.5/examples/pig$ export PIG_INITIAL_ADDRESS=localhost
ngranek@trantor:~/1.2.5/examples/pig$ export PIG_RPC_PORT=9160
ngranek@trantor:~/1.2.5/examples/pig$ export PIG_PARTITIONER=org.apache.cassandra.dht.Murmur3Partitioner
ngranek@trantor:~/1.2.5/examples/pig$ bin/pig_cassandra -x local
Using /Users/ngranek/apps/pig-0.11.1/pig-0.11.1.jar.
grunt> rows = LOAD 'cassandra://test/words' USING CassandraStorage();
2013-06-14 11:00:19,784 [main] ERROR org.apache.pig.tools.grunt.Grunt - 
ERROR 2999: Unexpected internal error. Column family 'words' not found in keyspace 'test'
grunt>
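
The location string passed to LOAD names both the keyspace and the column family, which is why the error above mentions 'words' and 'test' separately. As a rough illustration only (this is not CassandraStorage's own parsing code, and the class and method names are made up), the location can be pulled apart with java.net.URI:

```java
import java.net.URI;

// Hypothetical helper for illustration only; it is not part of Cassandra or Pig.
final class CassandraLocation {
    final String keyspace;
    final String columnFamily;

    private CassandraLocation(String keyspace, String columnFamily) {
        this.keyspace = keyspace;
        this.columnFamily = columnFamily;
    }

    // Splits a location such as cassandra://test/words into its two names.
    static CassandraLocation parse(String location) {
        URI uri = URI.create(location);
        if (!"cassandra".equals(uri.getScheme())) {
            throw new IllegalArgumentException("Expected a cassandra:// location: " + location);
        }
        String keyspace = uri.getAuthority();   // the part right after "cassandra://"
        String path = uri.getPath();            // "/words" for the location used above
        if (keyspace == null || path == null || path.length() < 2) {
            throw new IllegalArgumentException(
                    "Expected cassandra://<keyspace>/<column family>: " + location);
        }
        return new CassandraLocation(keyspace, path.substring(1));
    }

    public static void main(String[] args) {
        CassandraLocation loc = parse("cassandra://test/words");
        System.out.println("keyspace=" + loc.keyspace + ", columnFamily=" + loc.columnFamily);
    }
}
```

So the location itself is well formed: the keyspace and the table both exist, and the lookup fails only because the CQL3 table is hidden from the Thrift side.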

But that’s as far as we can get with Cassandra 1.2.5. As it turns out, tables created via CQL3 are not supported by the Pig integration in the 1.2.5 branch. I think I’ll wait for 1.2.6, as I really prefer the simplicity of CQL3 to the CLI.
