

How To Use Vectorized Reader In Hive
source link: https://blog.knoldus.com/how-to-use-vectorized-reader-in-hive/

I wrote this blog because I tried to use the vectorized reader in Hive but ran into problems with its documentation, so I decided to write down what worked.
Introduction
Vectorized query execution is a Hive feature that greatly reduces the CPU usage for typical query operations like scans, filters, aggregates, and joins. A standard query execution system processes one row at a time. This involves long code paths and significant metadata interpretation in the inner loop of execution. Vectorized query execution streamlines operations by processing a block of 1024 rows at a time. Within the block, each column is stored as a vector (an array of a primitive data type). Simple operations like arithmetic and comparisons are done by quickly iterating through the vectors in a tight loop, with no or very few function calls or conditional branches inside the loop.
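To make the difference concrete, here is a minimal sketch in Python of the two processing styles for a filter like id >= threshold. It is only an illustration of the data layout, not Hive's actual Java implementation, and it will not show the CPU savings by itself. The names filter_row_at_a_time, filter_vectorized, and BATCH_SIZE are hypothetical and chosen for this example.

```python
from array import array

BATCH_SIZE = 1024  # rows per block, matching the block size described above


def filter_row_at_a_time(rows, threshold):
    """Row-at-a-time style: every row is a separate object, and each
    iteration pays for a field lookup before the comparison."""
    result = []
    for row in rows:
        if row["id"] >= threshold:
            result.append(row["id"])
    return result


def filter_vectorized(id_column, threshold):
    """Vectorized style: walk one column in blocks of BATCH_SIZE values.

    `id_column` is an array of primitive ints (a single column of the table).
    """
    result = array("i")
    for start in range(0, len(id_column), BATCH_SIZE):
        vector = id_column[start:start + BATCH_SIZE]  # one column vector
        # Tight loop over primitives: record the positions that pass the filter.
        selected = [i for i, value in enumerate(vector) if value >= threshold]
        # Downstream operators would only need to visit the selected positions.
        result.extend(vector[i] for i in selected)
    return result


# Both styles return the same rows; only the processing layout differs.
ids = array("i", range(3000))
rows = [{"id": value} for value in ids]
assert list(filter_vectorized(ids, 2990)) == filter_row_at_a_time(rows, 2990)
```

The point of the second function is that the inner loop touches only a flat array of ints, which is the property that lets a real engine keep the loop free of per-row function calls and metadata lookups.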
Enabling vectorized execution
To use vectorized query execution, you must store your data in ORC format and set the following property in your Hive session:
set hive.vectorized.execution.enabled = true;
How To Query
As noted above, the data must be stored in ORC format. Follow the steps below:
- Start the Hive CLI and create an ORC table with some data:
    hive> create table vectortable(id int) stored as orc;
    OK
    Time taken: 0.487 seconds
    hive> set hive.vectorized.execution.enabled = true;
    hive> insert into vectortable values(1);
    Query ID = hduser_20170713203731_09db3954-246b-4b23-8d34-1d9d7b62965c
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    Job running in-process (local Hadoop)
    2017-07-13 20:37:33,237 Stage-1 map = 100%, reduce = 0%
    Ended Job = job_local722393542_0002
    Stage-4 is selected by condition resolver.
    Stage-3 is filtered out by condition resolver.
    Stage-5 is filtered out by condition resolver.
    Moving data to: hdfs://localhost:54310/user/hive/warehouse/vectortable/.hive-staging_hive_2017-07-13_20-37-31_172_3262390557269287245-1/-ext-10000
    Loading data to table default.vectortable
    Table default.vectortable stats: [numFiles=1, numRows=1, totalSize=199, rawDataSize=4]
    MapReduce Jobs Launched:
    Stage-Stage-1: HDFS Read: 321 HDFS Write: 545 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    Time taken: 2.672 seconds
- Now run the query with the explain command to see whether Hive uses vectorized execution.
- Note: when the plan uses a Fetch task instead of a Map task, the query is not vectorized, so first set hive.fetch.task.conversion=none. (This is the big one to catch.)
    hive> explain select id from vectortable where id>=1;
    OK
    STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-0 depends on stages: Stage-1

    STAGE PLANS:
      Stage: Stage-1
        Map Reduce
          Map Operator Tree:
              TableScan
                alias: vectortable
                Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                Filter Operator
                  predicate: (id >= 1) (type: boolean)
                  Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: id (type: int)
                    outputColumnNames: _col0
                    Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                      table:
                          input format: org.apache.hadoop.mapred.TextInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
          Execution mode: vectorized

      Stage: Stage-0
        Fetch Operator
          limit: -1
          Processor Tree:
            ListSink

    Time taken: 0.081 seconds, Fetched: 33 row(s)
As you can see, the explain output prints Execution mode: vectorized under the Map stage, which means vectorization is enabled for the query.
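If you would rather check this from a script than by eye, a small helper can scan the explain output for that marker. The sketch below is in Python and assumes you have already captured the explain output as text (for example, by saving what the CLI printed). The function name plan_is_vectorized is hypothetical, and the pattern relies only on the Execution mode: vectorized line shown above; other Hive versions or execution engines may append extra text to that line.

```python
import re

# Matches the marker that EXPLAIN prints for a vectorized stage.
_VECTORIZED = re.compile(r"Execution mode:\s*vectorized\b")


def plan_is_vectorized(explain_output: str) -> bool:
    """Return True if the EXPLAIN text reports a vectorized stage."""
    return _VECTORIZED.search(explain_output) is not None


# A shortened plan in the shape shown above (Map stage, vectorized).
map_plan = """
Stage: Stage-1
  Map Reduce
    Map Operator Tree:
        TableScan
          alias: vectortable
    Execution mode: vectorized
"""

# A plan that only has a Fetch stage carries no such marker.
fetch_only_plan = """
Stage: Stage-0
  Fetch Operator
    limit: -1
"""

assert plan_is_vectorized(map_plan)
assert not plan_is_vectorized(fetch_only_plan)
```

If the helper returns False for a query you expected to be vectorized, the fetch-task note above is the first thing to check.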