
Partitioning in Apache Hive

 2 years ago
source link: https://blog.knoldus.com/partitioning-in-apache-hive/

Partitions

Hive is a good tool for performing queries on large datasets, especially datasets that require full table scans. But quite often users need to filter the data on specific column values. That's where partitioning comes into play. A partition is nothing but a directory that contains a chunk of the data. When we partition a table, we create a partition for each unique value of the partition column.

Let's run a simple example to see how it works.

The syntax to create a partitioned table is:

create table tablename(colname type) partitioned by (partcolname type);

If hive.exec.dynamic.partition.mode is set to strict, then you need at least one static partition. In nonstrict mode, all partitions are allowed to be dynamic.
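As a rough illustration of the two modes, the sketch below switches to nonstrict mode and lets Hive derive every partition from the data. The tables emp_staging and emp_dynamic are hypothetical names used only for this example, and the column types are assumed.

```sql
-- Hypothetical tables for illustration; names and types are assumptions.
CREATE TABLE emp_staging (name STRING, address STRING, id INT);
CREATE TABLE emp_dynamic (name STRING, address STRING) PARTITIONED BY (id INT);

-- Enable dynamic partitioning and allow all partition columns to be dynamic.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- No value is given for id in the PARTITION clause, so Hive creates one
-- partition per distinct id returned by the SELECT. The partition column
-- must come last in the SELECT list.
INSERT INTO TABLE emp_dynamic PARTITION (id)
SELECT name, address, id FROM emp_staging;
```

In strict mode the same insert should be rejected, because the only partition column is dynamic and no static partition is given.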


Here we create a table named empinfo with two fields, name and address. We partition the table by the column id of type int and then insert values into it.
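A minimal sketch of that table definition could look like the following. The text only names the fields, so the STRING types for name and address are an assumption.

```sql
-- Column types for name and address are assumed; the article only names the fields.
CREATE TABLE empinfo (
  name    STRING,
  address STRING
)
PARTITIONED BY (id INT);  -- id is the partition column, not a regular column
```

Note that id appears only in the PARTITIONED BY clause. Listing it again among the regular columns makes Hive reject the statement.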

It's important to consider the cardinality of the column that will be partitioned on. Selecting a column with high cardinality will fragment the data into many small partitions. Do not over-partition the data: with too many small partitions, the task of recursively scanning the directories becomes more expensive than a full table scan.

The syntax for inserting values into a partition is:

insert into table tablename partition(partcolname=value) values(colvalue);
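One way to check the cardinality before committing to a partition column is to count its distinct values in the source data. The sketch below assumes a hypothetical unpartitioned table emp_staging that has an id column.

```sql
-- emp_staging is a hypothetical unpartitioned source table with an id column.
-- Number of partitions that partitioning by id would create:
SELECT COUNT(DISTINCT id) AS partition_count FROM emp_staging;

-- Rows that would land in each partition, smallest first:
SELECT id, COUNT(*) AS rows_in_partition
FROM emp_staging
GROUP BY id
ORDER BY rows_in_partition
LIMIT 20;
```

A very large partition_count, or many partitions holding only a handful of rows, suggests the column is a poor partition key.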

First we insert a record with id=1, then we insert another record with id=2.
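The two inserts might be written as below. The names and addresses are made-up sample values, and INSERT ... VALUES assumes Hive 0.14 or later.

```sql
-- Static partition inserts: the partition value is fixed in the PARTITION clause,
-- so VALUES lists only the regular columns (name, address).
-- The sample names and addresses are made up.
INSERT INTO TABLE empinfo PARTITION (id=1) VALUES ('Asha', 'Delhi');
INSERT INTO TABLE empinfo PARTITION (id=2) VALUES ('Ravi', 'Noida');
```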

Now go to the /user/hive/warehouse/default/empinfo directory in your HDFS.
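If you would rather stay in the Hive shell, the partitions and the table's storage location can also be inspected with the statements below. This is a sketch that assumes the empinfo table from the example exists.

```sql
-- Lists one line per partition of empinfo.
SHOW PARTITIONS empinfo;

-- Prints table metadata, including the Location of the table directory in HDFS.
DESCRIBE FORMATTED empinfo;
```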


The directory contains two partitions, one named id=1 and the other id=2. Now when a select query is fired with a where clause on the partition column, it will not scan the full table; it will scan only the required partition.
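A query of that kind, together with a way to check which partitions it would read, could be sketched as follows. It assumes the empinfo table from the example.

```sql
-- Filter on the partition column: only the id=1 directory needs to be read.
SELECT name, address FROM empinfo WHERE id = 1;

-- Lists the tables and partitions the query would read, without running it.
EXPLAIN DEPENDENCY SELECT name, address FROM empinfo WHERE id = 1;
```

The filter has to be on the partition column for pruning to apply. A filter on name or address alone would still read every partition.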


If you try the same query on a non-partitioned table holding a large dataset, it will take more time in comparison, because it has to scan the entire table.

I hope this blog is helpful. Happy coding!

