Sample records from the access log:

220.181.108.151 - - [31/Jan/2012:00:02:32 +0800] "GET /home.php?mod=space&uid=158&do=album&view=me&from=space HTTP/1.1" 200 8784 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
208.115.113.82 - - [31/Jan/2012:00:07:54 +0800] "GET /robots.txt HTTP/1.1" 200 582 "-" "Mozilla/5.0 (compatible; Ezooms/1.0; ezooms.bot@gmail.com)"
220.181.94.221 - - [31/Jan/2012:00:09:24 +0800] "GET /home.php?mod=spacecp&ac=pm&op=showmsg&handlekey=showmsg_3&touid=3&pmid=0&daterange=2&pid=398&tid=66 HTTP/1.1" 200 10070 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
112.97.24.243 - - [31/Jan/2012:00:14:48 +0800] "GET /data/cache/style_2_common.css?AZH HTTP/1.1" 200 57752 "http://f.dataguru.cn/forum-58-1.html" "Mozilla/5.0 (iPhone; CPU iPhone OS 5_0_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Mobile/9A406"

1. Pig download:

Download from: http://www.apache.org/dyn/closer.cgi/pig

2. Pig installation:

Unpack the archive

[grid@hadoop1 ~]$ tar -zxf pig-0.14.0.tar.gz

Set the environment variables

[grid@hadoop1 ~]$ vi .bash_profile
PIG_INSTALL=/home/grid/pig-0.14.0
PIG_CLASSPATH=/home/grid/hadoop-1.2.1/conf/
PATH=$PATH:$PIG_INSTALL/bin
export PIG_INSTALL PATH PIG_CLASSPATH

Set JAVA_HOME

Edit the hosts file

Verify the installation

[grid@hadoop1 ~]$ pig -help

Connect to the Hadoop cluster

[grid@hadoop1 ~]$ pig
grunt> ls
hdfs://hadoop1:9000/user/grid/in <dir>
hdfs://hadoop1:9000/user/grid/out <dir>

3. Running the job

Load the data. Each line is split on spaces and only the first two fields are kept. Only ip is used afterwards; in this log format the second space-delimited field is actually the "-" placeholder, so the name page is just a label.

grunt> A = LOAD 'in/8/access_log.txt' USING PigStorage(' ') AS (ip, page);
grunt> DESCRIBE A;
A: {ip: bytearray,page: bytearray}

Drop the fields that are not needed

grunt> B = FOREACH A GENERATE ip;

Group by IP address

grunt> C = GROUP B BY ip;
grunt> DESCRIBE C;
C: {group: bytearray,B: {(ip: bytearray)}}

Count the records in each group

grunt> D = FOREACH C GENERATE group AS ip, COUNT(B) AS count;

View the result

grunt> DUMP D;
(127.0.0.1,2)
(1.59.65.67,2)
(112.4.2.19,9)
(112.4.2.51,80)
(60.2.99.33,42)
(69.28.58.5,1)
(69.28.58.6,9)
(69.28.58.8,5)
(1.193.3.227,3)
(1.202.221.3,6)
(117.136.9.4,6)
(121.31.62.3,26)
(182.204.8.4,59)
(183.9.112.2,25)
(221.12.37.6,25)
(223.4.16.88,2)
(27.9.110.75,122)
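
To sanity-check the Pig result on a small sample without a cluster, the same load / project / group / count steps can be mirrored in a few lines of Python. This is only a rough sketch and not part of the Pig workflow itself: the function name count_hits_per_ip and the SAMPLE_LOG list are made up for illustration, and the four records are the ones shown at the top of this post. It assumes, as the Pig script does, that the IP address is the first space-delimited field.

```python
from collections import Counter

# Four sample records copied from the log excerpt at the top of this post.
SAMPLE_LOG = [
    '220.181.108.151 - - [31/Jan/2012:00:02:32 +0800] "GET /home.php?mod=space&uid=158&do=album&view=me&from=space HTTP/1.1" 200 8784 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"',
    '208.115.113.82 - - [31/Jan/2012:00:07:54 +0800] "GET /robots.txt HTTP/1.1" 200 582 "-" "Mozilla/5.0 (compatible; Ezooms/1.0; ezooms.bot@gmail.com)"',
    '220.181.94.221 - - [31/Jan/2012:00:09:24 +0800] "GET /home.php?mod=spacecp&ac=pm&op=showmsg&handlekey=showmsg_3&touid=3&pmid=0&daterange=2&pid=398&tid=66 HTTP/1.1" 200 10070 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"',
    '112.97.24.243 - - [31/Jan/2012:00:14:48 +0800] "GET /data/cache/style_2_common.css?AZH HTTP/1.1" 200 57752 "http://f.dataguru.cn/forum-58-1.html" "Mozilla/5.0 (iPhone; CPU iPhone OS 5_0_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Mobile/9A406"',
]


def count_hits_per_ip(lines):
    """Hypothetical helper: count records per IP, like GROUP ... BY ip + COUNT."""
    counts = Counter()
    for line in lines:
        # Same delimiter as PigStorage(' '); the IP is the first field.
        fields = line.strip().split(" ")
        ip = fields[0]
        if ip:  # skip blank lines
            counts[ip] += 1
    return counts


if __name__ == "__main__":
    # Print in the same (ip,count) tuple style that DUMP uses.
    for ip, count in count_hits_per_ip(SAMPLE_LOG).items():
        print(f"({ip},{count})")
```

Run against the full access_log.txt (for example by passing an open file object instead of SAMPLE_LOG), the per-IP counts should match what DUMP D prints, although the order of the tuples may differ.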