IOS Press Supervised model-based visualization of high-dimensional data

格式：pdf
大小：797.10 KB
文档页数：15

下载文档原格式

商务智能：数据分析的管理视角_04

▪ Understand the objectives and benefits of business analytics and data mining
▪ Recognize the wide range of applications of data midardized data mining processes
Databases
Management Science & Information Systems
Copyright © 2014 Pearson Education, Inc.
Slide 4- 8
Data Mining Characteristics/Objectives
▪ Source of data for DM is often a consolidated data warehouse (not always!).
3. What was their implementation strategy? Why is it important to produce results as early as possible in data mining studies?
Copyright © 2014 Pearson Education, Inc.
Copyright © 2014 Pearson Education, Inc.
Slide 4- 6
Definition of Data Mining
▪ The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases.

资料科学基础英文版课件

Semi supervised learning is a type of machine learning that combines both labeled and unlabeled data for training
04
Spread: Standard deviation, variation
05
Shape: Skewness, kurtosis
06
Relationships: Correlation, regression
01
02
03
04
05
45%
50%
75%
85%
95%
Purpose: Draw conclusions about population from sample data
Overview of Data Collection and Sampling: In data science, data collection and sampling are important steps in obtaining data for analysis and modeling.
Data Science also helps in identifying risks, reducing uncertainties, and making better predictions about future trends and outcomes
The field of Data Science has evolved over the years, starting from the early days of data management and analysis to the current era of big data, machine learning, and artificial intelligence

data mining 1

Wang Daling
Northeastern University
2009-12-3 1
About Book
原版
Morgan Kaufmann Publishers, Inc.
影印版
高等教育出版社
中译版
机械工业出版社
2009-12-3
2
About Author
Professor (2001-) Department of Computer Science Univ. of Illinois at Urbana-Champaign Professor (1987-2001) Intelligent Database Systems Research Lab. School of Computing Science Simon Fraser University, Canada Ph.D. (1985) Department of Computing Science University Of Wisconsin-Madison Current Research Data mining, data warehousing, and knowledge discovery in databases Spatial and multi-media data mining WWW Technology: Weblog mining and Web Mining DNA data mining and bio-informatics Deductive and object_oriented databases
1991-1994 Workshops on Knowledge Discovery in Databases
Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)

数据挖掘导论第4课数据分类和预测

II.
Issues Regarding Classification and Prediction (1): Data Preparation
Data cleaning Preprocess data in order to reduce noise and handle missing values Relevance analysis (feature selection) Remove the irrelevant or redundant attributes Data transformation Generalize and/or normalize data
I.
Classification vs. Prediction
Classification predicts categorical class labels (discrete or nominal) classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data Prediction models continuous-valued functions, i.e., predicts unknown or missing values Typical applications Credit approval Target marketing Medical diagnosis Fraud detection
Issues regarding classification and prediction (2): Evaluating classification methods

PAD维度下的深度情感关联模型研究

association model. Four groups of experimental schemes were compared according to whether the
predicted value was used as the weight and whether the genetic algorithm was used. The experimental
到了情感之间的相互关联，该模型使用多层限制玻
尔兹曼机与关联认知网络 (Interactive Cognitive
1.2
深度情感关联模型
PAD 情感维度预测
离散的情感类别只是对基本情感的分类，忽视
Network,ICN)，在 TYUT1.0 情感语料库和 CASIA 数据
了不同情感之间的关系。PAD 三维情感空间可以连
performance.
Keywords: PAD affective dimension predictive value；deep affective association model；Interactive
Cognitive Network；Genetic Algorithm
收稿日期：2021-02-21
稿件编号：202102087
基金项目：山西省自然科学基金面上项目（201901D111096）
作者简介：孙颖（1981—），女，山西太原人，博士，副教授。研究方向：语音情感识别。
- 47 -
《电子设计工程》2022 年第 7 期
情感识别是人工智能领域不可或缺的一部分，
现权值动态计算；使用遗传算法(GA)优化深度情感
6.39% ，and that with PAD predictive value and GA increased by 8.51% . The data show that the deep

Matlab的第三方工具箱大全

Matlab的第三方工具箱大全(按住CTRL点击连接就可以到达每个工具箱的主页面来下载了)Matlab Toolboxes∙ADCPtools - acoustic doppler current profiler data processing∙AFDesign - designing analog and digital filters∙AIRES - automatic integration of reusable embedded software∙Air-Sea - air-sea flux estimates in oceanography∙Animation - developing scientific animations∙ARfit - estimation of parameters and eigenmodes of multivariate autoregressive methods∙ARMASA - power spectrum estimation∙AR-Toolkit - computer vision tracking∙Auditory - auditory models∙b4m - interval arithmetic∙Bayes Net - inference and learning for directed graphical models∙Binaural Modeling - calculating binaural cross-correlograms of sound∙Bode Step - design of control systems with maximized feedback∙Bootstrap - for resampling, hypothesis testing and confidence interval estimation ∙BrainStorm - MEG and EEG data visualization and processing∙BSTEX - equation viewer∙CALFEM - interactive program for teaching the finite element method∙Calibr - for calibrating CCD cameras∙Camera Calibration∙Captain - non-stationary time series analysis and forecasting∙CHMMBOX - for coupled hidden Markov modeling using max imum likelihood EM ∙Classification - supervised and unsupervised classification algorithms∙CLOSID∙Cluster - for analysis of Gaussian mixture models for data set clustering∙Clustering - cluster analysis∙ClusterPack - cluster analysis∙COLEA - speech analysis∙CompEcon - solving problems in economics and finance∙Complex - for estimating temporal and spatial signal complexities∙Computational Statistics∙Coral - seismic waveform analysis∙DACE - kriging approximations to computer models∙DAIHM - data assimilation in hydrological and hydrodynamic models∙Data Visualization∙DBT - radar array processing∙DDE-BIFTOOL - bifurcation analysis of delay differential equations∙Denoise - for removing noise from signals∙DiffMan - solv ing differential equations on manifolds∙Dimensional Analysis -∙DIPimage - scientific image processing∙Direct - Laplace transform inversion via the direct integration method∙DirectSD - analysis and design of computer controlled systems with process-oriented models∙DMsuite - differentiation matrix suite∙DMTTEQ - design and test time domain equalizer design methods∙DrawFilt - drawing digital and analog filters∙DSFWAV - spline interpolation with Dean wave solutions∙DWT - discrete wavelet transforms∙EasyKrig∙Econometrics∙EEGLAB∙EigTool - graphical tool for nonsymmetric eigenproblems∙EMSC - separating light scattering and absorbance by extended multiplicative signal correction∙Engineering Vibration∙FastICA - fixed-point algorithm for ICA and projection pursuit∙FDC - flight dynamics and control∙FDtools - fractional delay filter design∙FlexICA - for independent components analysis∙FMBPC - fuzzy model-based predictive control∙ForWaRD - Fourier-wavelet regularized deconvolution∙FracLab - fractal analysis for signal processing∙FSBOX - stepwise forward and backward selection of features using linear regression∙GABLE - geometric algebra tutorial∙GAOT - genetic algorithm optimization∙Garch - estimating and diagnosing heteroskedasticity in time series models∙GCE Data - managing, analyzing and displaying data and metadata stored using the GCE data structure specification∙GCSV - growing cell structure visualization∙GEMANOVA - fitting multilinear ANOVA models∙Genetic Algorithm∙Geodetic - geodetic calculations∙GHSOM - growing hierarchical self-organizing map∙glmlab - general linear models∙GPIB - wrapper for GPIB library from National Instrument∙GTM - generative topographic mapping, a model for density modeling and data visualization∙GVF - gradient vector flow for finding 3-D object boundaries∙HFRadarmap - converts HF radar data from radial current vectors to total vectors ∙HFRC - importing, processing and manipulating HF radar data∙Hilbert - Hilbert transform by the rational eigenfunction expansion method∙HMM - hidden Markov models∙HMMBOX - for hidden Markov modeling using maximum likelihood EM∙HUTear - auditory modeling∙ICALAB - signal and image processing using ICA and higher order statistics∙Imputation - analysis of incomplete datasets∙IPEM - perception based musical analysisJMatLink - Matlab Java classesKalman - Bayesian Kalman filterKalman Filter - filtering, smoothing and parameter estimation (using EM) for linear dynamical systemsKALMTOOL - state estimation of nonlinear systemsKautz - Kautz filter designKrigingLDestimate - estimation of scaling exponentsLDPC - low density parity check codesLISQ - wavelet lifting scheme on quincunx gridsLKER - Laguerre kernel estimation toolLMAM-OLMAM - Levenberg Marquardt with Adaptive Momentum algorithm for training feedforward neural networksLow-Field NMR - for exponential fitting, phase correction of quadrature data and slicing LPSVM - Newton method for LP support vector machine for machine learning problems LSDPTOOL - robust control system design using the loop shaping design procedure LS-SVMlabLSVM - Lagrangian support vector machine for machine learning problemsLyngby - functional neuroimagingMARBOX - for multivariate autogressive modeling and cross-spectral estimation MatArray - analysis of microarray dataMatrix Computation- constructing test matrices, computing matrix factorizations, visualizing matrices, and direct search optimizationMCAT - Monte Carlo analysisMDP - Markov decision processesMESHPART - graph and mesh partioning methodsMILES - maximum likelihood fitting using ordinary least squares algorithmsMIMO - multidimensional code synthesisMissing - functions for handling missing data valuesM_Map - geographic mapping toolsMODCONS - multi-objective control system designMOEA - multi-objective evolutionary algorithmsMS - estimation of multiscaling exponentsMultiblock - analysis and regression on several data blocks simultaneously Multiscale Shape AnalysisMusic Analysis - feature extraction from raw audio signals for content-based music retrievalMWM - multifractal wavelet modelNetCDFNetlab - neural network algorithmsNiDAQ - data acquisition using the NiDAQ libraryNEDM - nonlinear economic dynamic modelsNMM - numerical methods in Matlab textNNCTRL - design and simulation of control systems based on neural networks NNSYSID - neural net based identification of nonlinear dynamic systemsNSVM - newton support vector machine for solv ing machine learning problems NURBS - non-uniform rational B-splinesN-way - analysis of multiway data with multilinear modelsOpenFEM - finite element developmentPCNN - pulse coupled neural networksPeruna - signal processing and analysisPhiVis- probabilistic hierarchical interactive visualization, i.e. functions for visual analysis of multivariate continuous dataPlanar Manipulator - simulation of n-DOF planar manipulatorsPRT ools - pattern recognitionpsignifit - testing hyptheses about psychometric functionsPSVM - proximal support vector machine for solving machine learning problems Psychophysics - vision researchPyrTools - multi-scale image processingRBF - radial basis function neural networksRBN - simulation of synchronous and asynchronous random boolean networks ReBEL - sigma-point Kalman filtersRegression - basic multivariate data analysis and regressionRegularization ToolsRegularization Tools XPRestore ToolsRobot - robotics functions, e.g. kinematics, dynamics and trajectory generation Robust Calibration - robust calibration in statsRRMT - rainfall-runoff modellingSAM - structure and motionSchwarz-Christoffel - computation of conformal maps to polygonally bounded regions SDH - smoothed data histogramSeaGrid - orthogonal grid makerSEA-MAT - oceanographic analysisSLS - sparse least squaresSolvOpt - solver for local optimization problemsSOM - self-organizing mapSOSTOOLS - solving sums of squares (SOS) optimization problemsSpatial and Geometric AnalysisSpatial RegressionSpatial StatisticsSpectral MethodsSPM - statistical parametric mappingSSVM - smooth support vector machine for solving machine learning problems STATBAG - for linear regression, feature selection, generation of data, and significance testingStatBox - statistical routinesStatistical Pattern Recognition - pattern recognition methodsStixbox - statisticsSVM - implements support vector machinesSVM ClassifierSymbolic Robot DynamicsTEMPLAR - wavelet-based template learning and pattern classificationTextClust - model-based document clusteringTextureSynth - analyzing and synthesizing visual texturesTfMin - continous 3-D minimum time orbit transfer around EarthTime-Frequency - analyzing non-stationary signals using time-frequency distributions Tree-Ring - tasks in tree-ring analysisTSA - uni- and multivariate, stationary and non-stationary time series analysisTSTOOL - nonlinear time series analysisT_Tide - harmonic analysis of tidesUTVtools - computing and modifying rank-revealing URV and UTV decompositions Uvi_Wave - wavelet analysisvarimax - orthogonal rotation of EOFsVBHMM - variation Bayesian hidden Markov modelsVBMFA - variational Bayesian mixtures of factor analyzersVMT- VRML Molecule Toolbox, for animating results from molecular dynamics experimentsVOICEBOXVRMLplot - generates interactive VRML 2.0 graphs and animationsVSVtools - computing and modifying symmetric rank-revealing decompositions WAFO - wave analysis for fatique and oceanographyWarpTB - frequency-warped signal processingWAVEKIT - wavelet analysisWaveLab - wavelet analysisWeeks - Laplace transform inversion via the Weeks methodWetCDF - NetCDF interfaceWHMT - wavelet-domain hidden Markov tree modelsWInHD - Wavelet-based inverse halftoning via deconvolutionWSCT - weighted sequences clustering toolkitXMLTree - XML parserYAADA - analyze single particle mass spectrum dataZMAP - quantitative seismicity analysis。

大数据分析课程教学大纲

learn the contents through a series of practical data analysis projects. In each project,
the students implement and experience the data analysis operations and process, then the teacher generalizes the knowledge, methods used in the project, and the
专业方向选修 A 组-服务领域
授课对象（Audience）
授课语言
(Language of Instruction) *开课院系（School）先修课程
（Prerequisite）授课教师
（Instructor）
工业工程全英文（English）
机动学院 (School of Mechanical Engineering)
Transactions
数据聚类方法
Data clustering
1
基因芯片样本分类
项目 5
Classifying
Microarray Samples
教学方式
作业及要基本要考查方式
求
求
数据降维方法
Data dimension
1
reduction
大数据分析及商务
智能技术介绍/学生项目报告
Big data analysis 4
model; time series data analysis and prediction, data classification methods; anomaly detection, data clustering methods, semi-supervised prediction model; data

Scikit-Learn工具库介绍

Scikit-Learn⼯具库介绍A Gentle Introduction to Scikit-Learn: A Python Machine Learning Library如果你正在寻找⼀个将机器学习的种种⾼深知识从象⽛塔中带⼊⽣产环境的⼯具，那么Scikit-Learn将是你的不⼆选择。

这篇⽂章将介绍⼀下Scikit-Learn⼯具库的使⽤⽅法，并且提供⼀些有帮助的资源给⼤家。

0x01 Scikit-Learn从何⽽来？Scikit-Learn来⾃于Google summer of code project，是由David Cournapeau在2007年开始开发的。

随后Matthieu Brucher加⼊，并将该项⽬来作为其理论研究的⼀部分。

2010年时INRIA参与并发布了第⼀个release版本（v0.1 beta）。

到⽬前Scikit-Learn已经发展为⼀个由, Google, 和赞助的项⽬，长期活跃的贡献者有30多⼈。

0x02 Scikit-Learn是什么？Scikit-Learn提供了⼀系列的监督与⾮监督学习算法的Python接⼝，它使⽤了简单⼜友好的BSD license，⿎励学术使⽤或商⽤。

Scikit-Learn类库是基于SciPy (Scientific Python)构建的，所以在使⽤时必须安装SciPy，其中包含：NumPy: Base n-dimensional array packageSciPy: Fundamental library for scientific computingMatplotlib: Comprehensive 2D/3D plottingIPython: Enhanced interactive consoleSympy: Symbolic mathematicsPandas: Data structures and analysis这些依赖的类库集⼀般被称为。

WEB MINING

A KDD Process
Steps of a KDD Process（1） Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes
• Data transformation • Data reduction
Why Data Preprocessing?
• No quality data, no quality mining results!
T a x a b le In c o m e 75K 50K 150K 90K 40K 80K
C heat ? ? ? ? ? ?
Yes No No Yes No No Yes No No No
Single Married Single Married
No Yes No Yes No No
Divorced 95K Married 60K
Data Mining Models and Tasks
Model Classifer
Patterns
Two Types of Data Mining
• Supervised
– Know specifically what we are looking for
• Who is likely to respond to our offer?
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

label matrix

label matrixLabel Matrix: An Ultimate GuideIntroductionIn the world of data analysis and machine learning, label matrix is an essential concept that plays a crucial role in various tasks such as classification, clustering, and supervised learning. This guide aims to delve into the depths of label matrix, explaining its definition, properties, and applications.What is a Label Matrix?A label matrix, also known as a target matrix or classification matrix, is a table-like data structure that represents the labels or categories to which data points belong. It is often used in supervised machine learning tasks, where the objective is to predict the label or category of unseen data based on a trained model.Properties of a Label Matrix1. Dimensions: A label matrix has two dimensions - rows and columns. The rows represent the unique data points or instances, while the columns represent the distinct labels or categories.2. Binary or Multiclass: A label matrix can be binary or multiclass, depending on the nature of the classification task. In binary classification, there are only two possible labels, often denoted as 0 or 1. On the other hand, multiclass classification involves more than two labels.3. Sparse or Dense: Label matrices can be sparse, meaning that a majority of the entries are empty or zero, or dense, where most of the entries have non-zero values. The sparsity of a label matrix depends on the distribution of the labels in the dataset.4. Class Imbalance: Class imbalance refers to the scenario where one or more labels have significantly more instances compared to others. This property is common in real-world datasets and can affect the model's performance. Handling class imbalance is crucial in machine learning tasks.Applications of Label Matrix1. Classification: The primary application of label matrix is in classification tasks. Given a labeled dataset, a model is trained using an algorithm such as logistic regression, support vector machines, or deep learning techniques. The label matrix is used during model training and evaluation, allowing the model to learn the relationship between the input features and the corresponding labels.2. Evaluation Metrics: Label matrix is essential in evaluating the performance of a classification model. Various evaluation metrics such as accuracy, precision, recall, and F1 score are calculated based on the values in the label matrix. These metrics provide insights into the model's predictive power and its ability to correctly classify different labels.3. Imbalanced Data Analysis: As mentioned earlier, label matrices often exhibit class imbalance. This property requires special attention to ensure the model performs well on minority classes. Various techniques such as oversampling, undersampling, and cost-sensitive learning can be applied to tackle class imbalance.4. Interpretation and Visualization: Label matrices can also be used for visualization and interpretation purposes.Techniques such as confusion matrices, heatmaps, and precision-recall curves can provide insights into the model's strengths and weaknesses in classifying different labels. These visualizations aid in identifying patterns and making informed decisions.Tips for Working with Label Matrices1. Preprocessing: Before working with label matrices, it is essential to preprocess the data. This may involve handling missing values, encoding categorical variables, and scaling numerical features. The quality of the label matrix greatly depends on the preprocessing steps performed.2. Model Selection: Choosing an appropriate model for a classification task is critical. Consider factors such as the dataset size, label imbalance, and the complexity of the problem. Different algorithms have their strengths and weaknesses, and it is crucial to select the one that suits the problem at hand.3. Cross-validation: To ensure the robustness of the model, it is advisable to use techniques like cross-validation. Cross-validation involves splitting the data into training and validation sets, allowing the model's performance to beevaluated on multiple partitions of the dataset. This technique helps in estimating the model's ability to generalize to unseen data.ConclusionLabel matrix is a fundamental concept in the field of data analysis and machine learning. It provides a structured representation of labels or categories associated with data points. Understanding the properties and applications of a label matrix is crucial for successful classification tasks, model evaluation, and handling class imbalance. By utilizing label matrix techniques, practitioners and researchers can enhance the accuracy and effectiveness of their machine learning models.。

1、下载文档前请自行甄别文档内容的完整性，平台不提供额外的编辑、内容补充、找答案等附加服务。
2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
3、如文档侵犯您的权益，请联系客服反馈,我们会尽快为您处理(人工客服工作时间：9:00-18:30)。

IntelligentDataAnalysis4(2000)213–227IOSPress

Supervisedmodel-basedvisualizationofhigh-dimensionaldata

PetriKontkanen,JussiLahtinen,PetriMyllym¨aki,TomiSilanderandHenryTirriComplexSystemsComputationGroup(CoSCo)1P.O.Box26,DepartmentofComputerScience,FIN-00014UniversityofHelsinki,Finland

Received8October1999Revised22November1999Accepted24December1999

Abstract.Whenhigh-dimensionaldatavectorsarevisualizedonatwo-orthree-dimensionaldisplay,thegoalisthattwovectorsclosetoeachotherinthemulti-dimensionalspaceshouldalsobeclosetoeachotherinthelow-dimensionalspace.Traditionally,closenessisdeﬁnedintermsofsomestandardgeometricdistancemeasure,suchastheEuclideandistance,basedonamoreorlessstraightforwardcomparisonbetweenthecontentsofthedatavectors.However,suchdistancesdonotgenerallyreﬂectproperlythepropertiesofcomplexproblemdomains,wherechangingonebitinavectormaycompletelychangetherelevanceofthevector.Whatismore,inreal-worldsituationsthesimilarityoftwovectorsisnotauniversalproperty:eveniftwovectorscanberegardedassimilarfromonepointofview,fromanotherpointofviewtheymayappearquitedissimilar.Inordertocapturetheserequirementsforbuildingapragmaticandﬂexiblesimilaritymeasure,weproposeadatavisualizationschemewherethesimilarityoftwovectorsisdeterminedindirectlybyusingaformalmodeloftheproblemdomain;inourcase,aBayesiannetworkmodel.Inthisscheme,twovectorsareconsideredsimilariftheyleadtosimilarpredictions,whengivenasinputtoaBayesiannetworkmodel.Theschemeissupervisedinthesensethatdiﬀerentperspectivescanbetakenintoaccountbyusingdiﬀerentpredictivedistributions,i.e.,bychangingwhatistobepredicted.Inaddition,themodelingframeworkcanalsobeusedforvalidatingtherationalityoftheresultingvisualization.Thismodel-basedvisualizationschemehasbeenimplementedandtestedonreal-worlddomainswithencouragingresults.

Keywords:Datavisualization,Multidimensionalscaling,Bayesiannetworks,Sammon’smapping

1IntroductionWhendisplayinghigh-dimensionaldataonatwo-orthree-dimensionaldevice,eachdatavectorhastobeprovidedwiththecorresponding2Dor3Dcoordinateswhichdetermineitsvisuallocation.Intraditionalstatisticaldataanalysis(see,e.g.,[8,6]),thistaskisknownasmul-tidimensionalscaling.Fromoneperspective,multidimensionalscalingcanbeseenasadatacompressionordatareductiontask,wherethegoalistoreplacetheoriginalhigh-dimensionaldatavectorswithmuchshortervectors,whilelosingaslittleinformationaspossible.Conse-quently,apragmaticallysensibledatareductionschemeissuchthattwovectorsclosetoeachotherinthemultidimensionalspacearealsoclosetoeachotherinthethree-dimensionalspace.Thisraisesthequestionofadistancemeasure—whatisameaningfuldeﬁnitionofsimilaritywhendealingwithhigh-dimensionalvectorsincomplexdomains?

1URL:http://www.cs.Helsinki.FI/research/cosco/,E-mail:cosco@cs.Helsinki.FI

1088-467X/00/$8.00©–2000IOSPress.Allrightsreserved214P.Kontkanenetal./Supervisedmodel-basedvisualizationofhigh-dimensionaldataTraditionally,similarityisdeﬁnedintermsofsomestandardgeometricdistancemeasure,suchastheEuclideandistance.However,suchdistancesdonotgenerallyproperlyreﬂectthepropertiesofcomplexproblemdomains,wherethedatatypicallyisnotcodedinageometricorspatialform.Inthistypeofdomains,changingonebitinavectormaytotallychangetherelevanceofthevector,andmakeitinsomesensequiteadiﬀerentvector,althoughgeometricallythediﬀerenceisonlyonebit.Inaddition,inreal-worldsituationsthesimilarityoftwovectorsisnotauniversalproperty,butdependsonthespeciﬁcfocusoftheuser—eveniftwovectorscanberegardedassimilarfromonepointofview,theymayappearquitedissimilarfromanotherpointofview.Inordertocapturetheserequirementsforbuildingapragmaticallysensiblesimilaritymeasure,inthispaperweproposeandanalyzeadatavisualizationschemewherethesimilarityoftwovectorsisnotdirectlydeﬁnedasafunctionofthecontentsofthevectors,butratherdeterminedindirectlybyusingaformalmodeloftheproblemdomain;inourcase,aBayesiannetworkmodeldeﬁningajointprobabilitydistributiononthedomain.Fromonepointofview,theBayesiannetworkmodelcanberegardedasamodulewhichtransformsthevectorsintoprobabilitydistributions,andthedistanceofthevectorsisthendeﬁnedinprobabilitydistributionspace,notinthespaceoftheoriginalvectors.Ithasbeensuggestedthattheresultsofmodel-basedvisualizationtechniquescanbecom-promisediftheunderlyingmodelisbad.Thisisofcoursetrue,butweclaimthatthesameholdsalsofortheso-calledmodel-freeapproaches,aseachoftheseapproachescanbeinterpretedasamodel-basedmethodwherethemodelisjustnotexplicitlyrecognized.Forexample,thestan-dardEuclideandistancemeasureisindirectlybasedonadomainmodelassumingindependentandnormallydistributedattributes.Thisisofcourseaverystrongassumptionthatdoesnotusuallyholdinpractice.Wearguethatbyusingformalmodelsitbecomespossibletorecog-nizeandrelaxtheunderlyingassumptionsthatdonothold,whichleadstobettermodels,andmoreover,tomorereasonableandreliablevisualizations.ABayesian(belief)network[20,21]isarepresentationofaprobabilitydistributionoverasetofrandomvariables,consistingofanacyclicdirectedgraph,wherethenodescorrespondtodomainvariables,andthearcsdeﬁneasetofindependenceassumptionswhichallowthejointprobabilitydistributionforadatavectortobefactorizedasaproductofsimpleconditionalprobabilities.Techniquesforlearningsuchmodelsfromsampledataarediscussedin[11].OneofthemainadvantagesoftheBayesiannetworkmodelisthefactthatwithcertaintechnicalassumptionsitispossibletomarginalize(integrate)overallparameterinstantiationsinordertoproducethecorrespondingpredictivedistribution[7,13].Asdemonstratedin,e.g.,[18],suchmarginalizationimprovesthepredictiveaccuracyoftheBayesiannetworkmodel,especiallyincaseswheretheamountofsampledataissmall.PracticalalgorithmsforperformingpredictiveinferenceingeneralBayesiannetworksarediscussedforexamplein[20,19,14,5].Inoursupervisedmodel-basedvisualizationscheme,twovectorsareconsideredsimilariftheyleadtosimilarpredictions,whengivenasinputtothesameBayesiannetworkmodel.Theschemeissupervised,sincediﬀerentperspectivesofdiﬀerentusersaretakenintoaccountbyusingdiﬀerentpredictivedistributions,i.e.,bychangingwhatistobepredicted.TheideaisrelatedtotheBayesiandistancemetricsuggestedin[17]asamethodfordeﬁningsimilarityinthecase-basedreasoningframework.ThereasonforustouseBayesiannetworksinthiscontextisthefactthatthisisthemodelfamilywearemostfamiliarwith,andaccordingtoourexperience,itproducesmoreaccuratepredictivedistributionsthanalternativetechniques.Nevertheless,thereisnoreasonwhysomeothermodelfamilycouldnotalsobeusedinthesuggestedvisualizationscheme,aslongasthemodelsaresuchthattheyproduceapredictivedistribution.TheprinciplesofthegenericvisualizationschemearedescribedinmoredetailinSection2.Visualizationsofhigh-dimensionaldatacanbeexploitedinﬁndingregularitiesincomplex