
Mahout download and installation


1. Download location: https://www.doczj.com/doc/d115006049.html,/lucene/mahout/

I downloaded mahout-0.3.tar.gz (17-Mar-2010, 47 MB).

2. Extract the archive

tar -xvf mahout-0.3.tar.gz

3. Configure the environment

export HADOOP_HOME=/home/hadoopuser/hadoop-0.19.2
export HADOOP_CONF_DIR=/home/hadoopuser/hadoop-0.19.2/conf
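Optionally, you can also point MAHOUT_HOME at the unpacked directory and put its bin directory on the PATH so the mahout script can be run from anywhere. A minimal sketch, assuming Mahout was unpacked to /home/hadoopuser/mahout-0.3:

export MAHOUT_HOME=/home/hadoopuser/mahout-0.3
export PATH=$PATH:$MAHOUT_HOME/bin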

4. Give it a try

bin/mahout --help

This lists the many algorithms that are available.

5. Try kmeans clustering

bin/mahout kmeans --input /user/hive/warehouse/tmp_data/complex.seq --clusters 5 --output /home/hadoopuser/1.txt

The parameters that kmeans requires can be viewed with the following command:

bin/mahout kmeans --help

Files processed by Mahout must be in SequenceFile format, so plain text files first need to be converted to SequenceFiles.

SequenceFile is a Hadoop class that lets us write binary key-value pairs to a file; for a detailed introduction see

the write-up by eyjian: https://www.doczj.com/doc/d115006049.html,/viewthread.php?tid=144&highlight=sequencefile
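If you just want to peek inside an existing SequenceFile, Hadoop's fs shell can decode it to text. For example, using the sample path from the kmeans command above (substitute your own file):

$HADOOP_HOME/bin/hadoop fs -text /user/hive/warehouse/tmp_data/complex.seq | head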

Mahout provides a way to convert all files under a given directory into SequenceFiles.

(You may find Tika (https://www.doczj.com/doc/d115006049.html,/tika) helpful in converting binary documents to text.)

Usage is as follows:

$MAHOUT_HOME/bin/mahout seqdirectory \
--input <parent directory containing the documents> --output <output directory> \
<-c <charset of the input documents> {UTF-8|cp1252|ascii...}> \
<-chunk <max chunk size in MB> 64> \
<-prefix <prefix to add to the document id>>

For example:

bin/mahout seqdirectory --input /hive/hadoopuser/ --output /mahout/seq/ --charset UTF-8
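The converted SequenceFile chunks end up under the --output directory, so a quick listing confirms the job did something. For example, for the output path used above:

$HADOOP_HOME/bin/hadoop fs -ls /mahout/seq/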

A simple example of running kmeans:

1. Put the sample dataset into the designated directory in HDFS; it should go under the testdata directory.

$HADOOP_HOME/bin/hadoop fs -put <path to your data> testdata

For example:

bin/hadoop fs -put /home/hadoopuser/mahout-0.3/test/synthetic_control.data /user/hadoopuser/testdata/
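If the target directory does not exist yet, create it first and then verify the upload, for example:

bin/hadoop fs -mkdir /user/hadoopuser/testdata
bin/hadoop fs -ls /user/hadoopuser/testdata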

2. Run the kmeans algorithm

$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

For example:

bin/hadoop jar /home/hadoopuser/mahout-0.3/mahout-examples-0.1.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

3. Run the canopy algorithm

$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.canopy.Job

For example:

bin/hadoop jar /home/hadoopuser/mahout-0.3/mahout-examples-0.1.job org.apache.mahout.clustering.syntheticcontrol.canopy.Job

4. Run the dirichlet algorithm

$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job

5. Run the meanshift algorithm

$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.meanshift.Job

6. Take a look at the results

bin/mahout vectordump --seqFile /user/hadoopuser/output/data/part-00000

This prints the results directly to the console.
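Because vectordump writes to the console, a plain shell redirect is enough to keep a copy for later inspection (points.txt here is just an arbitrary local filename):

bin/mahout vectordump --seqFile /user/hadoopuser/output/data/part-00000 > points.txt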

Get the data out of HDFS and have a look

All example jobs use testdata as input and write their results to the output directory.

Use bin/hadoop fs -lsr output to view all outputs

Output:

KMeans results are placed into output/points

Canopy and MeanShift results are placed into output/clustered-points
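To examine the results with local tools, you can copy them out of HDFS, for example (./clustered-points is an arbitrary local directory):

bin/hadoop fs -get output/clustered-points ./clustered-points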

English reference link:

https://www.doczj.com/doc/d115006049.html,/MAHOUT/syntheticcontroldata.html

TriJUG: Intro to Mahout Slides and Demo examples

First off, big thank you to TriJUG and all the attendees for allowing me to present Apache Mahout last night. Also a big thank you to Red Hat for providing a most excellent meeting space. Finally, to Manning Publications for providing vouchers for Taming Text and Mahout In Action for the end of the night raffle. Overall, I think it went well, but that’s not for me to judge. There were a lot of good questions and a good sized audience.

The slides for the Monday, Feb. 15 TriJUG talk are at: Intro to Mahout Slides (Intro Mahout (PDF)).

For the “ugly demos”, below is a history of the commands I ran for setup, etc. Keep in mind that you can almost always run bin/mahout --help to get syntax help for any given command.

Here’s the preliminary setup stuff I did:

1. Get and preprocess the Reuters content per https://www.doczj.com/doc/d115006049.html,/lucene-boot-camp-preclass-training/

2. Create the sequence files: bin/mahout seqdirectory --input /content/reuters/reuters-out --output /content/reuters/seqfiles --charset UTF-8

3. Convert the Sequence Files to Sparse Vectors, using the Euclidean norm and the TF weight (for LDA): bin/mahout seq2sparse --input /content/reuters/seqfiles --output /content/reuters/seqfiles-TF --norm 2 --weight TF

4. Convert the Sequence Files to Sparse Vectors, using the Euclidean norm and the TF-IDF weight (for Clustering): bin/mahout seq2sparse --input /content/reuters/seqfiles --output /content/reuters/seqfiles-TF-IDF --norm 2 --weight TFIDF
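Before running the algorithms, it can help to check what seq2sparse produced; the later commands expect a vectors directory and a dictionary.file-0 under each output directory. An illustrative check, assuming (as the paths above suggest) the demo runs against the local filesystem:

ls /content/reuters/seqfiles-TF
ls /content/reuters/seqfiles-TF-IDF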

For Latent Dirichlet Allocation I then ran:

1. ./mahout lda --input /content/reuters/seqfiles-TF/vectors/ --output /content/reuters/seqfiles-TF/lda-output --numWords 34000 --numTopics 20

2. ./mahout org.apache.mahout.clustering.lda.LDAPrintTopics --input /content/reuters/seqfiles-TF/lda-output/state-19 --dict /content/reuters/seqfiles-TF/dictionary.file-0 --words 10 --output /content/reuters/seqfiles-TF/lda-output/topics --dictionaryType sequencefile

For K-Means Clustering I ran:

1. ./mahout kmeans --input /content/reuters/seqfiles-TFIDF/vectors/part-00000 --k 15 --output /content/reuters/seqfiles-TFIDF/output-kmeans --clusters /content/reuters/seqfiles-TFIDF/output-kmeans/clusters

2. Print out the clusters: ./mahout clusterdump --seqFileDir /Volumes/Content/grantingersoll/content/reuters/seqfiles-TFIDF/output-kmeans/clusters-15/ --pointsDir /Volumes/Content/grantingersoll/content/reuters/seqfiles-TFIDF/output-kmeans/points/ --dictionary /Volumes/Content/grantingersoll/content/reuters/seqfiles-TFIDF/dictionary.file-0 --dictionaryType sequencefile --substring 20

For Frequent Pattern Mining:

1. Download http://fimi.cs.helsinki.fi/data/

2. ./mahout fpg -i /content/freqitemset/accidents.dat -o patterns -k 50 -method mapreduce -g 10 -regex [\ ]

3. ./mahout seqdump --seqFile patterns/fpgrowth/part-r-00000

Original post:

https://www.doczj.com/doc/d115006049.html,/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/
