

Partitioning in Apache Hive
source link: https://blog.knoldus.com/partitioning-in-apache-hive/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Partitions
Hive is a good tool for performing queries on large datasets, especially datasets that require full table scans. But quite often there are instances where users need to filter the data on specific column values.thats where Partitioning comes into play a partition is nothing but a directory which contains the chunk of data when we do partitioning, we create a partition for each unique value of the column
lets run a simple example to see what it is
syntax to create partition table is
create table tablename(colname type) partitioned by(colname type);
if hive.exec.dynamic.partition.mode
is set to strict
, then you need to do at least one static partition. In non-strict mode, all partitions are allowed to be dynamic
here we create a table named emp info with two fields name and addresss we partitioned the table by column ID of type int and then we insert the value in this table
it’s important to consider the cardinality of the column that will be partitioned on. Selecting a column with high cardinality will result in fragmentation of data Do not over-partition the data. With too many small partitions, the task of recursively scanning the directories becomes more expensive than a full table scan of the table.
syntax for inserting values is
insert into partition values();
first we insert record with id=1 now insert another record with id=2
now got to /user/hive
/warehouse/default/empinfo directory in your hdfs
as we can see there are two partitions one with name id=1 and other one as id =2 now when a select query is fired with where clause it will not scan the full table it will only scan the required partition
if you will tried it with a non partition table with large dataset it will take more time in comparison because it will have to go through entire table scan
i hope this blog will be helpful happy coding
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK