Apache Hive
What Is Hive
Hive is a data warehousing infrastructure built on Apache Hadoop. Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing on commodity hardware.
Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data.
What Hive Is NOT
Hive is not designed for online transaction processing. It is best used for traditional data warehousing tasks.
Data Units
In order of granularity, Hive data is organized into:
- Databases: Namespaces that avoid naming conflicts for tables, views, partitions, columns, and so on. Databases can also be used to enforce security for a user or group of users.
- Tables: Homogeneous units of data which have the same schema.
- Partitions: Each table can have one or more partition keys, which determine how the data is stored. Partitions, apart from being storage units, also allow the user to efficiently identify the rows that satisfy specified criteria; for example, a date_partition of type STRING and a country_partition of type STRING. Each unique value of the partition keys defines a partition of the table. For example, all "US" data from "2009-12-23" is a partition of the page_views table. Therefore, if you run analysis on only the "US" data for 2009-12-23, you can run that query on just the relevant partition of the table, speeding up the analysis significantly.
- Buckets (or Clusters): Data in each partition may in turn be divided into buckets based on the value of a hash function of some column of the table. For example, the page_views table may be bucketed by userid, which is one of the columns of the table other than the partition columns.
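As a sketch of how these units fit together, the page_views table mentioned above might be declared with partition keys and buckets like this (the non-partition column names are assumptions for illustration, not a fixed schema):

```sql
-- Hypothetical page_views table: each (dt, country) pair is a partition,
-- and within each partition rows are hashed on userid into 32 buckets.
CREATE TABLE page_views (
  view_time INT,
  userid    BIGINT,
  page_url  STRING
)
PARTITIONED BY (dt STRING, country STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;
```

All rows with dt = '2009-12-23' and country = 'US' then land in a single partition, so a query filtering on those values reads only that partition's data.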
Primitive Types
Types are associated with the columns in the tables. The following primitive types are supported:
- Integers
- TINYINT—1 byte integer
- SMALLINT—2 byte integer
- INT—4 byte integer
- BIGINT—8 byte integer
- Boolean type
- BOOLEAN—TRUE/FALSE
- Floating point numbers
- FLOAT—single precision
- DOUBLE—double precision
- Fixed point numbers
- DECIMAL—a fixed point value of user-defined precision and scale
- String types
- STRING—sequence of characters in a specified character set
- VARCHAR—sequence of characters in a specified character set with a maximum length
- CHAR—sequence of characters in a specified character set with a defined length
- Date and time types
- TIMESTAMP—a specific point in time, up to nanosecond precision
- DATE—a date
- Binary types
- BINARY—a sequence of bytes
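A single table declaration can exercise most of these primitive types; the following sketch uses made-up table and column names purely to show the syntax:

```sql
CREATE TABLE orders (
  order_id   BIGINT,         -- 8-byte integer
  quantity   SMALLINT,       -- 2-byte integer
  price      DECIMAL(10, 2), -- fixed point: precision 10, scale 2
  shipped    BOOLEAN,        -- TRUE/FALSE
  notes      VARCHAR(255),   -- string with a maximum length
  created_at TIMESTAMP,      -- point in time, up to nanosecond precision
  ship_date  DATE,           -- calendar date
  payload    BINARY          -- sequence of bytes
);
```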
Hive DDL Commands:
Create Database Statement:
A database in Hive is a namespace or a collection of tables.
Drop database
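For example, a database can be created and later dropped as follows (demo_db is a placeholder name):

```sql
CREATE DATABASE IF NOT EXISTS demo_db;
USE demo_db;
-- ... create tables, run queries ...
DROP DATABASE IF EXISTS demo_db;
```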
Creating Hive Tables:
Create a table called Sonoo with two columns, the first being an integer and the other a string.
Create a table called HIVE_TABLE with two columns and a partition column called ds. The partition column is a virtual column: it is not part of the data itself but is derived from the partition that a particular dataset is loaded into. By default, tables are assumed to be in text input format, with ^A (Ctrl-A) as the delimiter.
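The two tables described above could be created along these lines (the column names are assumptions, since the text does not specify them):

```sql
-- Two-column table: an integer and a string.
CREATE TABLE Sonoo (id INT, name STRING);

-- Two data columns plus a partition column ds; ds is a virtual column
-- derived from the partition a dataset is loaded into, not stored in the data.
CREATE TABLE HIVE_TABLE (foo INT, bar STRING)
PARTITIONED BY (ds STRING);
```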
Browse the table
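For example, the tables and their contents can be browsed with:

```sql
SHOW TABLES;                   -- list tables in the current database
SELECT * FROM Sonoo LIMIT 10;  -- peek at the Sonoo table created above
```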
Altering and Dropping Tables
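Typical ALTER and DROP statements look like this (the names reuse the Sonoo example above):

```sql
ALTER TABLE Sonoo RENAME TO Sonoo2;        -- rename the table
ALTER TABLE Sonoo2 ADD COLUMNS (age INT);  -- append a new column
DROP TABLE Sonoo2;                         -- drop the table (and its data, if internal)
```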
Hive: Internal Tables
There are two types of tables in Hive: internal and external. This case study describes creating an internal table, loading data into it, creating views and indexes, and dropping the table, using weather data.
Creating Internal Table
Internal tables are like normal database tables: data can be stored in them and queried. When an internal table is dropped, the data stored in it is also deleted and lost forever, so be careful: a single DROP command can destroy all the data. Open a new terminal, start Hive by typing hive, and create a table for the weather data:
CREATE TABLE weather (wban INT, date STRING, precip INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/hive/data/weather';
ROW FORMAT specifies the delimiters that terminate fields and lines; in the example above, fields are terminated by a comma (','). LOCATION overrides the table's default location in the Hive warehouse, so the data is now stored under /hive/data/weather.
Load the Data in Table
Data can be loaded into Hive in two ways: from a local file, or from HDFS. To copy the data from the local file system into the table's HDFS location, use the following command in a NEW terminal:
hadoop fs -copyFromLocal /home/user/data/weather/2012.txt hdfs://hname:10001/hive/data/weather
Here the HDFS path was created earlier by the LOCATION '/hive/data/weather' clause of the CREATE statement.
Another way is to load data that is already on HDFS using the following command:
LOAD DATA INPATH 'hdfs:/data/2012.txt' INTO TABLE weather;
Views
Views create virtual tables. A view stores no data of its own, so creating one is cheaper than creating an actual table, and it can be used like a table in any other query. For example, a precipitation view can be created as follows:
CREATE VIEW precipitation_data AS SELECT * FROM weather WHERE precip IS NOT NULL AND precip > 0;
The following query retrieves data from precipitation_data:
SELECT * FROM precipitation_data;
Indexing
Indexes are built on top of tables to speed up queries. Columns that appear frequently in WHERE clauses are good candidates for indexing. An index can be built on the weather table's date column as follows:
CREATE INDEX date_index ON TABLE weather (date) AS 'COMPACT' WITH DEFERRED REBUILD;
Once the index has been built, queries that filter on the date column of the weather table can run faster than before the index existed.
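Hive's CREATE INDEX statement is normally written WITH DEFERRED REBUILD, in which case the index is empty until it is built explicitly; a rebuild step along these lines (using the index and table names from above) is then required:

```sql
-- Populate the index; rerun this whenever the underlying table changes.
ALTER INDEX date_index ON weather REBUILD;
```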
Drop table
When a table loaded by the second method (LOAD DATA INPATH from HDFS) is dropped, the data is deleted and no copy remains on HDFS: loading into an internal table moves the file from its HDFS location into the table's directory, and dropping the table deletes that directory. The table can be dropped using:
DROP TABLE weather;
Hive: External Tables
Creating external table
Open a new terminal and start Hive by typing hive. Create a table for the weather data:
CREATE EXTERNAL TABLE weatherext (wban INT, date STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/hive/data/weatherext';
ROW FORMAT specifies the delimiters that terminate fields and lines; in the example above, fields are terminated by a comma (','). LOCATION overrides the table's default location in the Hive warehouse, so the data is now stored under /hive/data/weatherext.
Load the data in table
Load the data from HDFS to Hive using the following command:
LOAD DATA INPATH 'hdfs:/data/2012.txt' INTO TABLE weatherext;
Partitioning of table
Hive can store a table's data in partitions. Partitions divide a table into related parts and make querying more efficient. For example, the weather data above can be partitioned by year and month, and those partition columns can then be used like ordinary columns when querying the table.
CREATE EXTERNAL TABLE IF NOT EXISTS weatherext (wban INT, date STRING)
PARTITIONED BY (year INT, month STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/hive/data/weatherext';
Loading data into a partitioned table differs from the non-partitioned case: the target partition must be specified manually. Data can be loaded into the partitions for year 2012, months 01 and 02, as follows:
LOAD DATA INPATH 'hdfs:/data/2012.txt' INTO TABLE weatherext PARTITION (year=2012, month='01');
LOAD DATA INPATH 'hdfs:/data/2012.txt' INTO TABLE weatherext PARTITION (year=2012, month='02');
This creates a separate folder for each partition of the table, which helps when querying the data.
Querying of partitioned table
Partitioned tables can use the partition columns like ordinary columns in queries. To retrieve all the data for month '02', the following query can be used on the weather table:
SELECT * FROM weatherext WHERE month = '02';
Drop table
When an external table is dropped, the data is not deleted from HDFS; external tables are just pointers to data on HDFS. The table can be dropped using:
DROP TABLE weatherext;
Hive Partitioning:
In Hive's partitioning method, all the data in a table is divided into multiple partitions. Each partition corresponds to specific value(s) of the partition column(s) and is kept as a subdirectory inside the table's directory on HDFS. When the table is queried with a filter on the partition columns, only the matching partitions are read, which reduces the I/O required by the query and improves performance.
Partitioning Example
Let's understand this with an example. Consider a table named record that contains student data such as rollno, name, class, and batch. Suppose you need to retrieve the details of all students who left the school in the year 2011. Without partitioning, the whole table is scanned whenever a query is raised. If you partition the student table by year, however, query processing time is greatly reduced. The example below shows how to partition a file and its data:
The file named file1 contains the student data table:
record/studentdata/file1
rollno, name, class, batch
1, monika, 12, 2010
2, yash, 12, 2010
3, shilpa, 11, 2011
4, parv, 11, 2011
Now we partition the above data into two files by year:
record/studentdata/2010/file2
1, monika, 12, 2010
2, yash, 12, 2010
record/studentdata/2011/file3
3, shilpa, 11, 2011
4, parv, 11, 2011
While retrieving data from the table, only the data of the specified partition will be queried. Create a partitioned table:
CREATE TABLE studentTab (rollno INT, name STRING, class INT, batch INT) PARTITIONED BY (year STRING);
LOAD DATA LOCAL INPATH 'record/studentdata/2010/file2' OVERWRITE INTO TABLE studentTab PARTITION (year='2010');
LOAD DATA LOCAL INPATH 'record/studentdata/2011/file3' OVERWRITE INTO TABLE studentTab PARTITION (year='2011');
Hive Bucketing:
* Tables are organized into buckets to enable more efficient queries; bucketing imposes extra structure on the table.
* Bucketing also makes sampling more efficient.
Partitioning v/s Bucketing:
Both partitioning and bucketing slice the data for executing the query much more efficiently than non-sliced data.
In partitioning, the number of slices keeps changing as the data is modified.
In bucketing, the number of slices is fixed and is specified when the table is created.
Enable bucketing in hive:
To enable bucketing, set the following property:
set hive.enforce.bucketing=true;
Syntax:
create table bucketed_users(id int, name string) clustered by (id) into 4 buckets;
The CLUSTERED BY clause specifies the column(s) to bucket on and the number of buckets.
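The sampling benefit mentioned above comes from the TABLESAMPLE clause, which can read a single bucket instead of scanning the whole table; for the bucketed_users table this might look like:

```sql
-- Read only bucket 1 of 4, i.e. roughly a quarter of the users,
-- without scanning the other three buckets.
SELECT * FROM bucketed_users TABLESAMPLE (BUCKET 1 OUT OF 4 ON id);
```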
HIVE BUCKETING EXAMPLE:
1. create external table cc_bucket(cdate string, issue string, state string, zipcode string, cid string) partitioned by (product string, submitted_via string) clustered by (zipcode) into 3 buckets row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile;
(Note that the partition columns product and submitted_via must not be repeated in the column list.)
2. set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
3. insert overwrite table cc_bucket partition(product,submitted_via) select cdate,issue,state,zipcode,cid,product,submitted_via from consumercomp;
You can also crosscheck by browsing the HDFS directory at localhost:50070 (the NameNode web UI).
There are 2 types of partitioning:
1. Static
2. Dynamic
Static: in DML/DDL involving multiple partition columns, the columns whose values are known at compile time.
Dynamic: the partition values are known only at execution time, not at compile time.
Static partition Example:
steps:
1. create table clientdetails(client_id int, client_name string) partitioned by (country string) row format delimited fields terminated by ',';
2. describe clientdetails;
3. create 2 input files to give the path while loading data.
4. load data inpath '/client_us' into table clientdetails partition(country="us");
5. load data inpath '/client_india' into table clientdetails partition(country="india");
6. select * from clientdetails;
Dynamic Partition Example:
In INSERT ... SELECT queries, the dynamic partition columns must be specified last among the selected columns.
Fully dynamic partitioning (no static partition column) is only allowed in non-strict mode.
steps:
1. create external table consumercomp(cdate string, product string, issue string, state string, zipcode string, submitted_via string, cid string) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile;
2. load data local inpath '/path' into table consumercomp;
(there must be a file at the given path to load)
3. To view the data:
select * from consumercomp;
Create the partitioned table:
4. create external table cc_partition(cdate string, product string, issue string, state string, zipcode string, cid string) partitioned by (submitted_via string) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile;
Two modes are available:
Strict mode: at least one partition column must be static.
Non-strict mode: all partition columns may be dynamic.
So, change to non-strict mode, i.e.:
5. set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
6.Load data to partitioned table(cc_partition)
insert overwrite table cc_partition partition(submitted_via) select cdate, product, issue, state, zipcode, cid, submitted_via from consumercomp;
7. select * from cc_partition;