Data Transformations with Apache Pig

Pig is an open source engine for executing parallelized data transformations that run on Hadoop. This course shows you how Pig can help you work on incomplete data with an inconsistent schema, or perhaps no schema at all.
Course info
Rating
(12)
Level
Beginner
Updated
May 12, 2017
Duration
3h 15m
Table of contents
Introducing Pig
20m 29s
Description

Pig is open source software that is part of the Hadoop ecosystem of technologies. Pig is great at working with data that lies beyond the reach of traditional data warehouses. It deals well with missing, incomplete, and inconsistent data that has no schema. In this course, Data Transformations with Apache Pig, you'll learn about data transformations with Apache Pig. First, you'll start with the very basics, which will show you how to get Pig installed and get started working with the Grunt shell. Next, you'll discover how to load data into relations in Pig and store transformed results to files via the load and store commands. Then, you'll work on a real-world dataset where you analyze accidents in NYC using collision data from the City of New York. Finally, you'll explore advanced constructs such as the nested foreach and get a brief glimpse into the world of MapReduce, seeing how easy it is to implement this construct in Pig. By the end of this course, you'll have a better understanding of data transformations with Apache Pig.

About the author

A problem solver at heart, Janani has a Master's degree from Stanford and worked for 7+ years at Google. She was one of the original engineers on Google Docs and holds 4 patents for its real-time collaborative editing framework.

Transcript

Hi, my name is Janani Ravi and welcome to this course on performing data transformations using Apache Pig. To introduce myself: I have a Master's in EE from Stanford and have worked at companies such as Microsoft, Google and Flipkart. At Google I was one of the first engineers working on real-time collaborative editing in Google Docs, and I hold 4 patents for its underlying technologies. I currently work on my own startup, Loonycorn, a studio for high quality video content.

Pig is an open source engine which is part of the Hadoop ecosystem of technologies. Pig is great at working with data that lies beyond the reach of traditional data warehouses. It deals well with missing, incomplete, and inconsistent data that has no schema. Pig has its own language for expressing data manipulations, called Pig Latin.

This course starts from the very basics with an overview of Pig, then shows you how to get Pig installed and get started working with the Grunt shell. You’ll see how you can load data into relations and store transformed results to files via the load and store commands.

The main focus of the course is on how this data can be transformed to make it more useful for analysis. It'll cover the foreach-generate command along with evaluation and filter functions.

You'll also work on a real-world dataset where you analyze accidents in NYC using collision data from the City of New York.

And finally we’ll cover advanced constructs such as the nested foreach and also get a brief glimpse into the world of MapReduce, the parallel programming paradigm.