Apache Hive

What Is Hive

Hive is a data warehousing infrastructure based on Apache Hadoop. Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing on commodity hardware. Hive is designed to enable easy data summarization, ad-hoc querying, and analysis of large volumes of data.

What Hive Is NOT

Hive is not designed for online transaction processing. It is best used for traditional data warehousing tasks.

Data Units

In order of granularity, Hive data is organized into:

Databases: Namespaces that avoid naming conflicts for tables, views, partitions, columns, and so on. Databases can also be used to enforce security for a user or group of users.

Tables: Homogeneous units of data which have the same schema.

Partitions: Each table can have one or more partition keys, which determine how the data is stored. Partitions, apart from being storage units, also allow the user to efficiently identify the r...
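
To make the idea of partition keys more concrete, here is a small Python sketch of a toy in-memory table whose rows are grouped by partition key values. It is not Hive code and does not use any Hive API. The class `PartitionedTable`, the table name `page_views`, and the keys `ds` and `country` are all hypothetical names chosen for illustration. The `key=value` path format mirrors the directory naming Hive commonly uses for partitions, but treat the rest as a simplified model under those assumptions.

```python
from collections import defaultdict


class PartitionedTable:
    """Toy model (hypothetical, not a Hive API): rows grouped by partition key values."""

    def __init__(self, name, partition_keys):
        self.name = name
        self.partition_keys = tuple(partition_keys)
        # Maps a tuple of partition values to the rows stored in that partition.
        self.partitions = defaultdict(list)

    def partition_path(self, values):
        """Build a key=value path for one partition, e.g. table/ds=.../country=..."""
        parts = [f"{key}={value}" for key, value in zip(self.partition_keys, values)]
        return "/".join([self.name] + parts)

    def insert(self, row):
        """Store a row in the partition selected by its partition key values."""
        values = tuple(row[key] for key in self.partition_keys)
        self.partitions[values].append(row)

    def scan(self, **wanted):
        """Return (rows, partitions_read), reading only partitions that match `wanted`."""
        unknown = set(wanted) - set(self.partition_keys)
        if unknown:
            raise ValueError(f"not partition keys: {sorted(unknown)}")
        rows, partitions_read = [], []
        for values, part_rows in self.partitions.items():
            spec = dict(zip(self.partition_keys, values))
            # Skip whole partitions whose key values cannot satisfy the filter.
            if all(spec[key] == value for key, value in wanted.items()):
                partitions_read.append(self.partition_path(values))
                rows.extend(part_rows)
        return rows, partitions_read


# Hypothetical sample data: page views partitioned by date (ds) and country.
views = PartitionedTable("page_views", ["ds", "country"])
views.insert({"ds": "2009-01-01", "country": "US", "user": "alice"})
views.insert({"ds": "2009-01-01", "country": "CA", "user": "bob"})
views.insert({"ds": "2009-01-02", "country": "US", "user": "carol"})

rows, partitions_read = views.scan(ds="2009-01-01")
print(partitions_read)
print([row["user"] for row in rows])
```

In this sketch, filtering on `ds` touches only the two partitions for that date and never looks at the rows stored under `ds=2009-01-02`, which is the sense in which a partition acts as both a storage unit and a way to narrow down a scan.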