[Hive - LanguageManual] Archiving for File Count Reduction
Published: 2019-06-06


Archiving for File Count Reduction

Note: Archiving should be considered an advanced command due to the caveats involved.

 

 

Overview

Due to the design of HDFS, the number of files in the filesystem directly affects memory consumption in the NameNode. While this is normally not a problem for small clusters, memory usage may hit the limit of the memory available on a single machine when there are more than 50-100 million files. In such situations, it is advantageous to have as few files as possible.

The use of Hadoop Archives is one approach to reducing the number of files in partitions. Hive has built-in support to convert files in existing partitions to a Hadoop Archive (HAR), so that a partition that may once have consisted of hundreds of files can occupy just ~3 files (depending on settings). However, the trade-off is that queries may be slower due to the additional overhead of reading from the HAR.

Note that archiving does NOT compress the files – HAR is analogous to the Unix tar command.

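Archiving does not compress files; it is very similar to the Unix tar command (as I understand it: it only bundles files, without compressing them).

tar -zcvf /tmp/etc.tar.gz /etc     <== archive, then compress with gzip
tar -jcvf /tmp/etc.tar.bz2 /etc    <== archive, then compress with bzip2
tar -zxvf /tmp/etc.tar.gz          <== extract
tar -jxvf /tmp/etc.tar.bz2         <== extract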

 

Settings

There are 3 settings that should be configured before archiving is used. (Example values are shown.)

hive> set hive.archive.enabled=true;
hive> set hive.archive.har.parentdir.settable=true;
hive> set har.partfile.size=1099511627776;

hive.archive.enabled controls whether archiving operations are enabled.

hive.archive.har.parentdir.settable informs Hive whether the parent directory can be set while creating the archive. In recent versions of Hadoop the -p option can specify the root directory of the archive. For example, if /dir1/dir2/file is archived with /dir1 as the parent directory, then the resulting archive file will contain the directory structure dir2/file. In older versions of Hadoop (prior to 2011), this option was not available, and therefore Hive must be configured to accommodate this limitation.
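The effect of the parent directory can be illustrated with a small path calculation. This is only a sketch in Python of how the stored path relates to the chosen parent directory; the function name path_inside_archive is hypothetical and is not part of Hadoop or Hive.

```python
import posixpath


def path_inside_archive(file_path: str, parent_dir: str) -> str:
    """Return the path a file would have inside the archive,
    relative to the parent directory chosen at archive time."""
    return posixpath.relpath(file_path, start=parent_dir)


# The example from the text: /dir1/dir2/file archived with /dir1 as parent.
print(path_inside_archive("/dir1/dir2/file", "/dir1"))  # dir2/file
```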

har.partfile.size controls the size of the files that make up the archive. The archive will contain size_of_partition/har.partfile.size files, rounded up. Higher values mean fewer files, but will result in longer archiving times due to the reduced number of mappers.
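As a rough illustration of the formula above, the following Python sketch computes the number of part files for a given partition size. The function name har_part_file_count is made up for this example, and the sketch counts only the data files described by the formula, not any index files the archive may also hold.

```python
def har_part_file_count(partition_bytes: int, partfile_size: int) -> int:
    """size_of_partition / har.partfile.size, rounded up."""
    if partfile_size <= 0:
        raise ValueError("partfile_size must be positive")
    # Integer ceiling division avoids floating-point error on large sizes.
    return (partition_bytes + partfile_size - 1) // partfile_size


ONE_TIB = 1099511627776  # the example value of har.partfile.size shown above

print(har_part_file_count(10 * 1024**3, ONE_TIB))    # 10 GiB partition -> 1
print(har_part_file_count(2 * ONE_TIB + 1, ONE_TIB))  # just over 2 TiB -> 3
```

With the example value of 1 TiB, most partitions end up with a single part file; lowering the value yields more part files.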

Usage

Archive

Once the configuration values are set, a partition can be archived with the command:

ALTER TABLE table_name ARCHIVE PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)

For example:

ALTER TABLE srcpart ARCHIVE PARTITION(ds='2008-04-08', hr='12')

Once the command is issued, a MapReduce job will perform the archiving. Unlike Hive queries, there is no output on the CLI to indicate progress.

Unarchive

The partition can be reverted to its original files with the unarchive command:

ALTER TABLE srcpart UNARCHIVE PARTITION(ds='2008-04-08', hr='12')
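When many partitions need to be archived or unarchived, the statements can be generated by a script. The sketch below is one possible helper in Python; the function name archive_statement is hypothetical, and it assumes the table name and partition values are trusted, since it does no quoting or escaping.

```python
from typing import Mapping


def archive_statement(table: str, partition: Mapping[str, str],
                      unarchive: bool = False) -> str:
    """Build an ALTER TABLE ... ARCHIVE/UNARCHIVE PARTITION statement.

    Assumption: all inputs are trusted; values are wrapped in single
    quotes without any escaping.
    """
    action = "UNARCHIVE" if unarchive else "ARCHIVE"
    spec = ", ".join(f"{col}='{value}'" for col, value in partition.items())
    return f"ALTER TABLE {table} {action} PARTITION({spec})"


part = {"ds": "2008-04-08", "hr": "12"}
print(archive_statement("srcpart", part))
print(archive_statement("srcpart", part, unarchive=True))
```

The generated strings match the two examples above and could be passed to whichever Hive client is in use.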

Cautions and Limitations

  • In some older versions of Hadoop, HAR had a few bugs that could cause data loss or other errors. Be sure that these patches are integrated into your version of Hadoop:

 (fixed in Hadoop 0.21.0)

 (fixed in Hadoop 0.22.0)

 (fixed in Hadoop 0.22.0)

 (fixed in Hadoop 0.23.0)

  • The HarFileSystem class still has a bug that has yet to be fixed:

 (moved to  in 2014)

Hive comes with the HiveHarFileSystem class that addresses some of these issues, and is by default the value for fs.har.impl. Keep this in mind if you're rolling your own version of HarFileSystem:

  • The default HiveHarFileSystem.getFileBlockLocations() has no locality. That means it may introduce higher network loads or reduced performance.
  • Archived partitions cannot be overwritten with INSERT OVERWRITE. The partition must be unarchived first.
  • If two processes attempt to archive the same partition at the same time, bad things could happen. (Need to implement concurrency support.)
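Because an archived partition cannot be overwritten directly, a rewrite has to follow a fixed order of statements. The Python sketch below only assembles that order. The function name overwrite_archived_partition and the source table src are hypothetical, and re-archiving at the end is an assumption of this sketch, not a requirement.

```python
from typing import List


def overwrite_archived_partition(table: str, partition_spec: str,
                                 overwrite_sql: str) -> List[str]:
    """Return the statements, in order, for overwriting an archived partition."""
    return [
        # The partition must be unarchived before INSERT OVERWRITE.
        f"ALTER TABLE {table} UNARCHIVE PARTITION({partition_spec})",
        overwrite_sql,
        # Optional: archive again once the new data is in place.
        f"ALTER TABLE {table} ARCHIVE PARTITION({partition_spec})",
    ]


spec = "ds='2008-04-08', hr='12'"
steps = overwrite_archived_partition(
    "srcpart",
    spec,
    # Hypothetical overwrite query; the table `src` is made up for illustration.
    f"INSERT OVERWRITE TABLE srcpart PARTITION({spec}) SELECT key, value FROM src",
)
for statement in steps:
    print(statement)
```

Given the concurrency caveat above, these statements should be run by a single process at a time for any given partition.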

Under the Hood

Internally, when a partition is archived, a HAR is created using the files from the partition's original location (such as /warehouse/table/ds=1). The parent directory of the partition is specified to be the same as the original location and the resulting archive is named 'data.har'. The archive is moved under the original directory (such as /warehouse/table/ds=1/data.har), and the partition's location is changed to point to the archive.
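The resulting layout can be sketched with plain path arithmetic. The Python snippet below assumes the parent directory can be set to the partition's original location, as described above; the function name archived_layout and the file name 000000_0 are hypothetical.

```python
import posixpath

ARCHIVE_NAME = "data.har"  # archive name given in the text above


def archived_layout(partition_location: str, file_path: str):
    """Return (archive path, file path relative to the archive root)."""
    archive_path = posixpath.join(partition_location, ARCHIVE_NAME)
    # With the partition directory as parent, files sit at the archive root.
    name_in_archive = posixpath.relpath(file_path, start=partition_location)
    return archive_path, name_in_archive


archive, inner = archived_layout("/warehouse/table/ds=1",
                                 "/warehouse/table/ds=1/000000_0")
print(archive)  # /warehouse/table/ds=1/data.har
print(inner)    # 000000_0
```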

Reposted from: https://www.cnblogs.com/tmeily/p/4248040.html
