1: Download
Download address: https://www.doczj.com/doc/d115006049.html,/lucene/mahout/
I downloaded mahout-0.3.tar.gz (17-Mar-2010 02:12, 47 MB).
2: Extract
tar -xvf mahout-0.3.tar.gz
3: Set up the environment
export HADOOP_HOME=/home/hadoopuser/hadoop-0.19.2
export HADOOP_CONF_DIR=/home/hadoopuser/hadoop-0.19.2/conf
4: Try it out
bin/mahout --help
This lists the many available algorithms.
5: Try k-means clustering
bin/mahout kmeans --input /user/hive/warehouse/tmp_data/complex.seq --clusters 5 --output /home/hadoopuser/1.txt
The parameters k-means requires can be viewed with:
bin/mahout kmeans --help
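For background on what this command computes: k-means alternates between assigning points to their nearest centroid and moving each centroid to the mean of its assigned points. A minimal pure-Python sketch of that procedure (this is an illustration only, not Mahout's implementation; all names and data are hypothetical):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm over points given as tuples of floats."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k initial centers from the data
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(xs) / len(xs) for xs in zip(*members))
    return centroids

pts = [(0.0,), (0.2,), (0.1,), (10.0,), (10.2,), (9.9,)]
centers = sorted(kmeans(pts, 2))
print(centers)  # two centers, one near each group of points
```

Mahout runs the same assign/update loop, but as MapReduce jobs over SequenceFiles of vectors so it scales past a single machine's memory.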
Files processed by Mahout must be in SequenceFile format, so text files need to be converted to SequenceFiles first.
SequenceFile is a Hadoop class that lets us write binary key-value pairs to a file; for a detailed introduction, see
the post by eyjian: https://www.doczj.com/doc/d115006049.html,/viewthread.php?tid=144&highlight=sequencefile
Mahout provides a way to convert all the files under a given directory into SequenceFiles.
(You may find Tika (https://www.doczj.com/doc/d115006049.html,/tika) helpful in converting binary documents to text.)
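The core idea of a SequenceFile, binary key-value records written one after another, can be illustrated with a small Python sketch. This mimics the concept only; it is not Hadoop's actual on-disk format:

```python
import io
import struct

def write_records(stream, records):
    """Write (key, value) byte pairs as length-prefixed binary records."""
    for key, value in records:
        stream.write(struct.pack(">II", len(key), len(value)))  # 4-byte big-endian lengths
        stream.write(key)
        stream.write(value)

def read_records(stream):
    """Read the records back until the stream is exhausted."""
    out = []
    while True:
        header = stream.read(8)
        if len(header) < 8:
            break
        klen, vlen = struct.unpack(">II", header)
        out.append((stream.read(klen), stream.read(vlen)))
    return out

buf = io.BytesIO()
write_records(buf, [(b"doc1", b"hello world"), (b"doc2", b"mahout")])
buf.seek(0)
print(read_records(buf))
```

`seqdirectory` does the analogous thing at scale: the key is a document ID derived from the file path, and the value is the file's text content.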
Usage:
$MAHOUT_HOME/bin/mahout seqdirectory \
--input
<-c
<-chunk
<-prefix
For example:
bin/mahout seqdirectory --input /hive/hadoopuser/ --output /mahout/seq/ --charset UTF-8
A simple example of running k-means:
1: Put the sample dataset into the designated directory in HDFS; it should go under the testdata directory.
$HADOOP_HOME/bin/hadoop fs -put
For example:
bin/hadoop fs -put /home/hadoopuser/mahout-0.3/test/synthetic_control.data /user/hadoopuser/testdata/
2: Run the k-means algorithm
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job
org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
For example:
bin/hadoop jar /home/hadoopuser/mahout-0.3/mahout-examples-0.1.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
3: Run the canopy algorithm
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job
org.apache.mahout.clustering.syntheticcontrol.canopy.Job
For example:
bin/hadoop jar /home/hadoopuser/mahout-0.3/mahout-examples-0.1.job org.apache.mahout.clustering.syntheticcontrol.canopy.Job
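For intuition about what the canopy job computes: canopy clustering makes a single cheap pass over the data with two distance thresholds, a loose T1 and a tight T2 (T1 > T2). A hedged pure-Python sketch (1-D distances for brevity; not Mahout's implementation):

```python
def canopy(points, t1, t2):
    """One-pass canopy clustering. t1 = loose threshold, t2 = tight threshold, t1 > t2."""
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)      # pick an arbitrary remaining point as a canopy center
        members = [center]
        still_available = []
        for p in remaining:
            d = abs(p - center)        # cheap 1-D distance for this sketch
            if d < t1:
                members.append(p)      # inside the loose radius: joins this canopy
            if d >= t2:
                still_available.append(p)  # outside the tight radius: can seed/join others
        remaining = still_available
        canopies.append((center, members))
    return canopies

result = canopy([1.0, 1.1, 1.2, 8.0, 8.1], t1=3.0, t2=0.5)
print(len(result))  # 2 — one canopy around 1.0, one around 8.0
```

Canopies are typically used as fast, overlapping pre-clusters, e.g. to seed the initial centroids for a subsequent k-means run.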
4: Run the Dirichlet algorithm
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job
org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job
5: Run the mean-shift algorithm
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job
org.apache.mahout.clustering.syntheticcontrol.meanshift.Job
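Mean-shift works by repeatedly moving each point toward the mean of the data points inside a window around it, so points drift to the modes of the density. A minimal 1-D sketch under assumed parameters (illustration only, not Mahout's code):

```python
def mean_shift(points, window=2.0, iters=30):
    """Shift every point to the mean of the data points within `window` of it, repeatedly."""
    shifted = list(points)
    for _ in range(iters):
        new_positions = []
        for x in shifted:
            neighbors = [q for q in points if abs(q - x) <= window]
            # Move x to the mean of its neighborhood (neighbors is never empty:
            # every point is within the window of at least itself).
            new_positions.append(sum(neighbors) / len(neighbors))
        shifted = new_positions
    return shifted

data = [1.0, 1.5, 2.0, 9.0, 9.5]
modes = mean_shift(data)
print(sorted(set(round(m, 1) for m in modes)))  # the five points collapse to two modes
```

Points that converge to the same mode form one cluster; unlike k-means, the number of clusters is not specified in advance but falls out of the window size.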
6: View the results
bin/mahout vectordump --seqFile /user/hadoopuser/output/data/part-00000
This prints the results directly to the console.
Get the data out of HDFS and have a look
All example jobs use testdata as input and output to directory output
Use bin/hadoop fs -lsr output to view all outputs
Output:
KMeans is placed into output/points
Canopy and MeanShift results are placed into output/clustered-points
English reference link:
https://www.doczj.com/doc/d115006049.html,/MAHOUT/syntheticcontroldata.html
TriJUG: Intro to Mahout Slides and Demo examples
First off, big thank you to TriJUG and all the attendees for allowing me to present Apache Mahout last night. Also a big thank you to Red Hat for providing a most excellent meeting space. Finally, to Manning Publications for providing vouchers for Taming Text and Mahout In Action for the end of the night raffle. Overall, I think it went well, but that’s not for me to judge. There were a lot of good questions and a good sized audience.
The slides for the Monday, Feb. 15 TriJUG talk are at: Intro to Mahout Slides (Intro Mahout (PDF)).
For the “ugly demos”, below is a history of the commands I ran for setup, etc. Keep in mind that you can almost always run bin/mahout
Here’s the preliminary setup stuff I did:
1. Get and preprocess the Reuters content
per https://www.doczj.com/doc/d115006049.html,/lucene-boot-camp-preclass-training/
2. Create the sequence files: bin/mahout seqdirectory --input
3. Convert the Sequence Files to Sparse Vectors, using the Euclidean
norm and the TF weight (for LDA): bin/mahout seq2sparse --input
4. Convert the Sequence Files to Sparse Vectors, using the Euclidean
norm and the TF-IDF weight (for Clustering): bin/mahout seq2sparse
--input
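The difference between the TF weighting (step 3) and the TF-IDF weighting (step 4) that seq2sparse produces can be sketched in a few lines of Python. These are the textbook formulas, simplified; Mahout's exact normalization may differ:

```python
import math

# A tiny hypothetical corpus: each document is a list of tokens.
docs = [["mahout", "hadoop", "cluster"],
        ["mahout", "lda", "topic"],
        ["hadoop", "hdfs"]]

def tf(term, doc):
    """Raw term frequency: how often the term occurs in this document."""
    return doc.count(term)

def idf(term, corpus):
    """Inverse document frequency: rare terms across the corpus score higher."""
    df = sum(1 for d in corpus if term in d)  # number of docs containing the term
    return math.log(len(corpus) / df)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(tfidf("mahout", docs[0], docs))   # common term (2 of 3 docs): lower weight
print(tfidf("cluster", docs[0], docs))  # rare term (1 of 3 docs): higher weight
```

This is why TF-IDF vectors suit clustering (distinctive terms dominate the distance measure), while LDA wants plain TF counts, since its probabilistic model is defined over raw term occurrences.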
For Latent Dirichlet Allocation I then ran:
1. ./mahout lda
--input
--output
--numWords 34000 --numTopics 20
2. ./mahout org.apache.mahout.clustering.lda.LDAPrintTopics
--input
--words 10 --output
--dictionaryType sequencefile
For K-Means Clustering I ran:
1. ./mahout kmeans --input
/Volumes/Content/grantingersoll/content/reuters/seqfiles-TFIDF/output-kmeans/clusters-15/ --pointsDir
/Volumes/Content/grantingersoll/content/reuters/seqfiles-TFIDF/output-kmeans/points/ --dictionary
/Volumes/Content/grantingersoll/content/reuters/seqfiles-TFIDF/dictionary.file-0 --dictionaryType sequencefile --substring 20
For Frequent Pattern Mining:
1. Download http://fimi.cs.helsinki.fi/data/
2. ./mahout fpg -i
patterns -k 50 -method mapreduce -g 10 -regex [\ ]
3. ./mahout seqdump --seqFile patterns/fpgrowth/part-r-00000
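To see what the fpg job is computing: frequent pattern mining finds itemsets that occur in at least some minimum number of transactions. A naive brute-force sketch (FP-growth computes the same answer far more efficiently via a prefix tree; this illustration stops at pairs):

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Brute-force frequent-itemset mining over sets of items."""
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    for size in (1, 2):  # singletons and pairs are enough for the sketch
        for combo in combinations(items, size):
            # Support = how many transactions contain every item in the combo.
            support = sum(1 for t in transactions if set(combo) <= set(t))
            if support >= min_support:
                frequent[combo] = support
    return frequent

# Hypothetical market-basket data.
baskets = [{"beer", "chips"}, {"beer", "chips", "salsa"}, {"beer", "bread"}]
print(frequent_itemsets(baskets, min_support=2))
```

In the fpg command above, -g sets the number of groups the items are partitioned into for the parallel MapReduce run, -k keeps the top patterns per item, and -regex defines the transaction delimiter.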
Original article:
https://www.doczj.com/doc/d115006049.html,/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/