Tuesday, 8 July 2014

Generating ORC files using MapReduce

The ORC file format was introduced in Hive 0.11. It offers excellent compression ratios through the use of run-length encoding. Data stored in ORC format can be read through HCatalog, so any Pig or MapReduce program can work with it seamlessly.

Now we will see how to output ORC files from a MapReduce program.

The ORCMapper class is as shown below:

ORCMapper.java
1:  public class ORCMapper extends Mapper<LongWritable,Text,NullWritable,Writable> {  
2:    
3:    private final OrcSerde serde = new OrcSerde();  
4:    
5:    //Define the struct which will represent each row in the ORC file  
6:    private final String typeString = "struct<air_temp:double,station_id:string,lat:double,lon:double>";  
7:    
8:    private final TypeInfo typeInfo = TypeInfoUtils  
9:        .getTypeInfoFromTypeString(typeString);  
10:    private final ObjectInspector oip = TypeInfoUtils  
11:        .getStandardJavaObjectInspectorFromTypeInfo(typeInfo);  
12:    
13:    private final NCDCParser parser = new NCDCParser();  
14:    
15:    
16:    @Override  
17:    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {  
18:    
19:        parser.parse(value.toString());  
20:    
21:        if(parser.isValid()) {  
22:    
23:          List<Object> struct = new ArrayList<Object>(4);  
24:          struct.add(0, parser.getAirTemp());  
25:          struct.add(1, parser.getStationId());  
26:          struct.add(2, parser.getLatitude());  
27:          struct.add(3, parser.getLongitude());  
28:    
29:          Writable row = serde.serialize(struct, oip);  
30:    
31:          context.write(NullWritable.get(), row);  
32:        }  
33:    }  
34:  }  
35:    

Each row in an ORC file is a struct, so we define one via a type string (line no: 6). The OrcSerde uses the TypeInfo and ObjectInspector derived from that type string to serialize each row into the ORC format.
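The mapper relies on an NCDCParser helper that is not shown in the post. Below is a minimal sketch of what such a parser could look like, assuming the fixed-width NCDC surface-record layout; the character offsets and the missing-value convention are assumptions, so verify them against your copy of the dataset.

```java
// Minimal sketch of the NCDCParser used by ORCMapper. The character
// offsets below are assumptions based on the NCDC fixed-width surface
// record layout; verify them against your dataset before relying on them.
public class NCDCParser {

    private String stationId;
    private double latitude;
    private double longitude;
    private double airTemp;
    private boolean valid;

    public void parse(String record) {
        valid = false;
        if (record == null || record.length() < 93) {
            return;                                      // too short to hold all fields
        }
        stationId = record.substring(4, 15);             // USAF + WBAN station ids
        latitude  = Integer.parseInt(record.substring(28, 34)) / 1000.0;
        longitude = Integer.parseInt(record.substring(34, 41)) / 1000.0;

        String temp = record.substring(87, 92);          // signed, tenths of a degree
        if (temp.startsWith("+")) {
            temp = temp.substring(1);
        }
        airTemp = Integer.parseInt(temp) / 10.0;
        valid = airTemp != 999.9;                        // 9999 marks a missing reading
    }

    public boolean isValid()     { return valid; }
    public String getStationId() { return stationId; }
    public double getLatitude()  { return latitude; }
    public double getLongitude() { return longitude; }
    public double getAirTemp()   { return airTemp; }
}
```

The getters line up with the four struct fields the mapper emits (air_temp, station_id, lat, lon).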

Now onto the driver class.

RunORC.java
1:  public class RunORC extends Configured implements Tool {  
2:    
3:    
4:    public static void main(String[] args) throws Exception {  
5:    
6:      int res = ToolRunner.run(new Configuration(), new RunORC(), args);  
7:      System.exit(res);  
8:    
9:    }  
10:    
11:    public int run(String[] arg) throws Exception {  
12:      Configuration conf = getConf();  
13:    
14:      //Set ORC configuration parameters  
15:      conf.set("orc.create.index", "true");  
16:    
17:    
18:      Job job = Job.getInstance(conf);  
19:      job.setJarByClass(RunORC.class);  
20:      job.setJobName("ORC Output");  
21:      job.setMapOutputKeyClass(NullWritable.class);  
22:      job.setMapOutputValueClass(Writable.class);  
23:    
24:    
25:      job.setMapperClass(ORCMapper.class);  
26:    
27:      job.setNumReduceTasks(Integer.parseInt(arg[2]));  
28:      job.setOutputFormatClass(OrcNewOutputFormat.class);  
29:    
30:      FileInputFormat.addInputPath(job, new Path(arg[0]));  
31:      Path output = new Path(arg[1]);  
32:    
33:      OrcNewOutputFormat.setCompressOutput(job,true);  
34:      OrcNewOutputFormat.setOutputPath(job,output);  
35:    
36:      return job.waitForCompletion(true) ? 0 : 1;  
37:    }  
38:    
39:    
40:  }  
41:    

We specify OrcNewOutputFormat as the output format class (line no: 28).
We can also set various ORC configuration parameters on the job configuration (line no: 15).
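Besides orc.create.index, the ORC writer exposes settings such as the compression codec and stripe size. Which keys OrcNewOutputFormat actually honours depends on your Hive version, so treat the key names in this sketch as assumptions and verify them against the Hive configuration documentation before relying on them:

```java
// Illustrative ORC writer settings; the key names are assumptions taken
// from Hive's configuration properties -- verify them against the
// documentation for your Hive release.
conf.set("orc.create.index", "true");                        // row-group indexes, as in the driver above
conf.set("hive.exec.orc.default.compress", "SNAPPY");        // compression codec
conf.set("hive.exec.orc.default.stripe.size", "67108864");   // 64 MB stripes
conf.set("hive.exec.orc.default.row.index.stride", "10000"); // rows per index entry
```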

Dependencies
To run this program, add the following dependencies to your pom.xml:
1:   <dependencies>  
2:      <dependency>  
3:        <groupId>org.apache.hadoop</groupId>  
4:        <artifactId>hadoop-client</artifactId>  
5:        <version>2.2.0</version>  
6:      </dependency>  
7:    
8:      <dependency>  
9:        <groupId>org.apache.hive</groupId>  
10:        <artifactId>hive-exec</artifactId>  
11:        <version>0.13.1</version>  
12:      </dependency>  
13:    </dependencies>  
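Since the Hive classes are usually not on the cluster's MapReduce classpath, it helps to package them into an uberjar. One way to do that is with the Maven assembly plugin; the fragment below is a sketch (the plugin version is omitted, so pick one appropriate for your build):

```xml
<build>
  <plugins>
    <!-- Bundle hive-exec and the other dependencies into a single runnable jar -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

Running mvn package then produces a *-jar-with-dependencies.jar that can be submitted with hadoop jar.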

Creating a table in Hive
Once the job has run, you can create an external Hive table over the output as follows:

  create external table ncdc (temp double, station_id string, lat double, lon double) stored as orc location "/weatherout";  

The source for this can be downloaded from:
https://github.com/gautamphegde/HadoopCraft

Note: This code has been tested on Hortonworks Data Platform 2.0

19 comments:

  1. I tried to do the same on a different use case but got the following error: "Error: java.io.IOException: Type mismatch in value from map: expected org.apache.hadoop.io.Writable, received org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow". Any clue why this might be happening? Thanks in advance.

    1. Hey anon,
      OrcSerdeRow implements Writable; can you show me the snippet that is throwing this error?

    2. I just changed the mapper to output constant values and it doesn't seem to work.
      Any clue why this is happening?

      public class ORCMapper extends Mapper {

        private final OrcSerde serde = new OrcSerde();

        //Define the struct which will represent each row in the ORC file
        private final String typeString = "struct<air_temp:string,station_id:string,lat:string,lon:string>";

        private final TypeInfo typeInfo = TypeInfoUtils
            .getTypeInfoFromTypeString(typeString);
        private final ObjectInspector oip = TypeInfoUtils
            .getStandardJavaObjectInspectorFromTypeInfo(typeInfo);

        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

          List<Object> struct = new ArrayList<Object>(4);
          struct.add(0, "a");
          struct.add(1, "b");
          struct.add(2, "c");
          struct.add(3, "d");

          Writable row = serde.serialize(struct, oip);

          context.write(NullWritable.get(), row);
        }
      }

    3. I tried the same thing out and it worked fine for me. Is it giving any exception for you?

  2. I tested your program without changes on the NCDC sample dataset and got the same error. You might have posted a version of the code which had bugs.

    1. I have tested the code with the NCDC weather data before committing it to GitHub. I am fairly confident that it works.

      I downloaded the dataset from the link below
      ftp://ftp.ncdc.noaa.gov/pub/data/noaa/2009/

  3. I got the same error as before: "Type mismatch in value from map: expected org.apache.hadoop.io.Writable, received org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow".

  4. Hey, I found the mistake: I didn't set my number of reducers to zero. Sorry for the trouble. Thanks for a great article.

  5. Hi, I followed your example and it worked great, but what I really want to do is read in an ORC file and convert it to text. I have tried using just serde.deserialize(values) but it doesn't work. I read that you need to use serde.getObjectInspector(), but I don't know how to tie the two together to get a text file output. Any advice would be much appreciated!

  6. Hi, When I run the job, I got the exception

    Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://service-10-0.local:8020/user/ppp/tmp/weatherout already exists

    I created table at the location /user/ppp/tmp/weatherout

    How can I fix this?

  7. Hi, I managed to run without errors, but no result file is generated in the output path, only _SUCCESS. I added System.out.println in the mapper to debug, and it really does emit row data. Any ideas?

  8. We are facing the exact same issue as mentioned above. If the output path mentioned is /user/xyz/data/hive/testData, the directory just contains the _SUCCESS file and the actual data is written under /usr/xyz/part-00000

    Did anyone else face the same issue? Thanks.

    1. This comment has been removed by the author.

    2. I am facing the same issue now, I was searching it and I found your comment here. Have you fixed it? If so, please let me know. It may help me.

  9. ORC files have configuration such as stripe size and compression buffer size. Plus, it is often desirable to sort within each stripe based on some columns. How would you extend the code to handle these?

    1. Stripe size etc. can be set through the configuration: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-ORCFileFormat

      As for sorting within a stripe, I guess you would have to do a secondary sort so that the records are in the order you want while inserting into the ORC file.

  10. Could you post how to insert into a partitioned ORC table from a partitioned Hive managed table? I am failing to do so. And even after inserting the new partitions into the Hive ORC table, I could not query it using SELECT and WHERE...!
    It should be in a Hive query...

  11. I have copy-pasted the same code with some changes and I am getting a NoClassDefFoundError. Please suggest what the reason might be.

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/io/orc/OrcNewOutputFormat
    at com.hpi.css.telephony.validation.ORCTest.run(ORCTest.java:44)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at com.hpi.css.telephony.validation.ORCTest.main(ORCTest.java:22)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
    Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.io.orc.OrcNewOutputFormat
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 9 more

    1. That means that OrcNewOutputFormat is not on the classpath at runtime.

      One way to fix this is to create an uberjar. If you check my example, I am creating an uberjar using the Maven assembly plugin with all dependencies bundled, which makes sure that OrcNewOutputFormat is on the classpath.
