Introduction to Avro and Its Integration with Hadoop: A Comprehensive Guide
Avro is a compact serialization framework developed within Apache's Hadoop project. Using schemas, it turns unstructured and semi-structured data into a structured format. This guide explores Avro's features, including primitive types, records, enums, arrays, and maps, and provides step-by-step instructions for creating your first Avro schema, building Avro records, and using the schema parser to manage data efficiently. Discover how Avro can enhance Hadoop data processing.
Introduction to Avro and Its Integration with Hadoop
What is Avro? • Avro is a serialization framework developed within Apache's Hadoop project. It uses JSON to define data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it provides both a serialization format for persistent data and a wire format for communication between Hadoop nodes. • Avro provides a good way to convert unstructured and semi-structured data into a structured form using schemas
Creating your first Avro schema • Schema description:
{
  "name": "User",
  "type": "record",
  "fields": [
    {"name": "FirstName", "type": "string", "doc": "First Name"},
    {"name": "LastName", "type": "string"},
    {"name": "isActive", "type": "boolean", "default": true},
    {"name": "Account", "type": "int", "default": 0}
  ]
}
Avro schema features • Primitive types (null, boolean, int, long, float, double, bytes, string) • Records:
{
  "type": "record",
  "name": "LongList",
  "fields": [
    {"name": "value", "type": "long"},
    {"name": "description", "type": "string"}
  ]
}
• Others (Enums, Arrays, Maps, Unions, Fixed), as sketched below
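The complex types listed above follow the same JSON pattern as records; here is a minimal sketch combining them in one schema (the field names and symbols are illustrative, not from the slides):
{
  "type": "record",
  "name": "UserProfile",
  "fields": [
    {"name": "status", "type": {"type": "enum", "name": "Status", "symbols": ["ACTIVE", "INACTIVE"]}},
    {"name": "emails", "type": {"type": "array", "items": "string"}},
    {"name": "attributes", "type": {"type": "map", "values": "string"}},
    {"name": "nickname", "type": ["null", "string"], "default": null},
    {"name": "md5", "type": {"type": "fixed", "name": "MD5", "size": 16}}
  ]
}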
How to create Avro record?
String schemaDescription = " { \n"
    + " \"name\": \"User\", \n"
    + " \"type\": \"record\",\n"
    + " \"fields\": [\n"
    + "   {\"name\": \"FirstName\", \"type\": \"string\", \"doc\": \"First Name\"},\n"
    + "   {\"name\": \"LastName\", \"type\": \"string\"},\n"
    + "   {\"name\": \"isActive\", \"type\": \"boolean\", \"default\": true},\n"
    + "   {\"name\": \"Account\", \"type\": \"int\", \"default\": 0} ]\n"
    + "}";
Schema.Parser parser = new Schema.Parser();
Schema s = parser.parse(schemaDescription);
GenericRecordBuilder builder = new GenericRecordBuilder(s);
How to create Avro record? (cont. 2) The first step in creating an Avro record is to define a JSON-based schema. Avro provides a parser that takes an Avro schema string and returns a Schema object. Once the Schema object is created, we create a builder that lets us build records populated with default values.
How to create Avro record? (cont. 3)
GenericRecord r = builder.build();
System.out.println("Record" + r);
r.put("FirstName", "Joe");
r.put("LastName", "Hadoop");
r.put("Account", 12345);
System.out.println("Record" + r);
System.out.println("FirstName:" + r.get("FirstName"));
Output:
Record{"FirstName": null, "LastName": null, "isActive": true, "Account": 0}
Record{"FirstName": "Joe", "LastName": "Hadoop", "isActive": true, "Account": 12345}
FirstName:Joe
How to create Avro schema dynamically?
String[] fields = {"FirstName", "LastName", "Account"};
Schema s = Schema.createRecord("Ex2", "desc", "namespace", false);
List<Schema.Field> lstFields = new LinkedList<Schema.Field>();
for (String f : fields) {
    lstFields.add(new Schema.Field(f, Schema.create(Schema.Type.STRING), "doc", new TextNode("")));
}
s.setFields(lstFields);
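A record can then be built against this dynamically created schema in the usual way; a minimal sketch reusing the GenericRecordBuilder pattern from earlier (note that in this schema Account is a string field, unlike the first schema):
GenericRecordBuilder builder = new GenericRecordBuilder(s);
GenericRecord r = builder.build();   // every field starts at its "" default
r.put("FirstName", "Joe");
r.put("LastName", "Hadoop");
r.put("Account", "12345");           // a string here, not an int
System.out.println(r);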
How to sort Avro records? You can also specify which fields to order on and in which direction. Options: ascending, descending, ignore
{ "name": "isActive", "type": "boolean", "default": true, "order": "ignore" },
{ "name": "Account", "type": "int", "default": 0, "order": "descending" }
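Avro's generic runtime honors these declared orderings when comparing records; here is a minimal sketch (assuming the list of records and the parsed schema from the earlier examples) that sorts a list accordingly:
import java.util.Collections;
import java.util.Comparator;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

// Sort in place using the schema-declared orderings: "ignore" fields
// are skipped and "descending" fields invert the comparison.
Collections.sort(list, new Comparator<GenericRecord>() {
    public int compare(GenericRecord r1, GenericRecord r2) {
        return GenericData.get().compare(r1, r2, schema);
    }
});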
How to write Avro records to a file?
File file = new File("<file-name>");
DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(writer);
dataFileWriter.create(schema, file);
for (Record rec : list) {
    dataFileWriter.append(rec);
}
dataFileWriter.close();
How to read Avro records from a file?
File file = new File("<file-name>");
DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(file, reader);
while (dataFileReader.hasNext()) {
    Record r = (Record) dataFileReader.next();
    System.out.println(r.toString());
}
Running MapReduce Jobs on Avro Data
1. Set the input schema on AvroJob based on the schema from the input path
File file = new File(DATA_PATH);
DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(file, reader);
Schema s = dataFileReader.getSchema();
AvroJob.setInputSchema(job, s);
Running MapReduce Jobs on Avro Data - Mapper
public static class MapImpl extends AvroMapper<GenericRecord, Pair<String, GenericRecord>> {
    public void map(GenericRecord datum,
                    AvroCollector<Pair<String, GenericRecord>> collector,
                    Reporter reporter) throws IOException {
        ....
    }
}
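The slide leaves the map body elided; as a purely hypothetical illustration (keying on FirstName is an assumption, not from the slides), a pass-through map body could look like:
public void map(GenericRecord datum,
                AvroCollector<Pair<String, GenericRecord>> collector,
                Reporter reporter) throws IOException {
    // Hypothetical: key each record by its FirstName field and emit it unchanged.
    String key = datum.get("FirstName").toString();
    collector.collect(new Pair<String, GenericRecord>(key, datum));
}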
Running MapReduce Jobs on Avro Data - Reducer
public static class ReduceImpl extends AvroReducer<Utf8, GenericRecord, GenericRecord> {
    public void reduce(Utf8 key, Iterable<GenericRecord> values,
                       AvroCollector<GenericRecord> collector,
                       Reporter reporter) throws IOException {
        collector.collect(values.iterator().next());
        return;
    }
}
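For completeness, a minimal driver sketch wiring the mapper and reducer together, assuming the older org.apache.avro.mapred API these classes come from; the paths and the choice of output schema are assumptions:
JobConf job = new JobConf();
AvroJob.setInputSchema(job, s);
// The map output is Pair<String, GenericRecord>, so the intermediate
// schema is a pair schema built from the two component schemas.
AvroJob.setMapOutputSchema(job, Pair.getPairSchema(Schema.create(Schema.Type.STRING), s));
AvroJob.setOutputSchema(job, s);          // assumed: reducer emits the input schema
AvroJob.setMapperClass(job, MapImpl.class);
AvroJob.setReducerClass(job, ReduceImpl.class);
FileInputFormat.setInputPaths(job, new Path(inPath));    // inPath/outPath are placeholders
FileOutputFormat.setOutputPath(job, new Path(outPath));
JobClient.runJob(job);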
Running Avro MapReduce Jobs on Data with Different Schemas
List<Schema> schemas = new ArrayList<Schema>();
schemas.add(schema1);
schemas.add(schema2);
Schema schema3 = Schema.createUnion(schemas);
This allows data from different sources to be read and processed in the same mapper
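Putting the union to use is a small step; a minimal sketch (assuming the job setup from the previous slides): the union becomes the job's input schema, and the mapper can branch on each datum's concrete schema. The dispatch shown is hypothetical:
// Use the union as the input schema so files written with either
// schema1 or schema2 can be consumed by the same job.
AvroJob.setInputSchema(job, schema3);

// Hypothetical dispatch inside map(): branch on the record's actual schema.
if (datum.getSchema().getFullName().equals(schema1.getFullName())) {
    // handle schema1 records
} else {
    // handle schema2 records
}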
Summary • Avro is a great tool for semi-structured and structured data • Simplifies MapReduce development • Provides a good compression mechanism • A great tool for conversion from existing SQL code • Questions?