ui.korea.ac.kr

Chapter 8. Clustering Analysis 2008-04-05 Dept. of Industrial Systems & Information Engineering ui.korea.ac.kr

8.1 Statistical Method: Variables
Variable
• Independent variable
• Control variable
• Dependent variable
Chapter 8 - Clustering Analysis
8.1 Statistical Method
8.2 Basic Concept
8.3 CA/FA/MDS/DA
8.4 Clustering Analysis
8.5 Analysis Process
8.6 Cluster Decision Framework
8.7 Considerations
8.8 Validity
8.9 Hierarchical Clustering Method
8.10 K-Means Clustering Method
8.11 SAS & SPSS Program

8.1 Statistical Method: Scale
Scale
• A standard of length, as measured with a ruler
• The criterion relied on when evaluating or measuring
Discrete levels of measurement
• Nominal scale
• Ordinal scale
Continuous levels of measurement
• Interval scale
• Ratio scale
Why distinguish them? Statistical techniques are classified by the type of scale:
• Parametric (metric, measurable, quantitative)
• Nonparametric (non-metric, categorical, classified)

8.1 Statistical Method: Characteristics of scales and relationships among them
• Data obtained on an ordinal scale allow only a very limited set of analysis methods, so researchers try to obtain data at the interval level or higher whenever possible.
• Depending on the scale, statistical techniques divide broadly into parametric and nonparametric statistics.
- Parametric statistics: techniques applicable when variables are measured on an interval or ratio scale
- Nonparametric statistics: techniques applicable when variables are measured on a nominal or ordinal scale

8.1 Statistical Method: Criteria for classifying statistical techniques
Criteria
• Number of variables
• Scale
• Purpose of the analysis
Classification by number of variables
• Univariate statistics
• Multivariate statistics
Classification by scale
• Parametric statistics
• Nonparametric statistics
Classification by nature of the analysis
• Dependence techniques: the variables are divided into dependent and independent variables, and the analysis examines how changes in the independent variables affect the dependent variable (analysis of variance, regression analysis, discriminant analysis, etc.)
• Interdependence techniques: all variables are used together, either to identify the relationships among the variables or to classify the subjects into homogeneous groups
Classification by purpose of the analysis
• Descriptive statistics
• Inferential statistics

8.1 Statistical Method: Types of statistical techniques
Statistical analysis is divided by the number of variables (univariate vs. multivariate) and by the relationship among the variables.
Analysis of dependence relationships
  Technique               Independent variable   Dependent variable
  Chi-square              Non-metric             Non-metric
  Analysis of variance    Non-metric             Metric
  Discriminant analysis   Metric                 Non-metric
  Logit analysis          Metric                 Non-metric
  Regression analysis     Metric                 Metric
Analysis of interdependence relationships
• Factor analysis
• Cluster analysis
• MDS

8.2 Basic Concepts
• Cluster analysis assigns objects to groups so that the objects (respondents) in the same group are similar to one another, which makes the differences between groups clear.
• It is an analysis technique for classifying subjects or objects into meaningful subgroups.
• It groups similar observations among the given data into a small number of groups and characterizes each group, in order to help understand the structure of the data as a whole.
• The aim is to maximize heterogeneity between clusters while maximizing the homogeneity of the objects within each cluster.
The basic intuition behind C.A.: minimize within-cluster variance, maximize between-cluster variance.
[Figure: clusters plotted on axes x1 and x2, marking within-cluster variation and between-cluster variation]
* Main goal: maximize differences between clusters relative to variation within clusters
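The stated goal (small within-cluster variation, large between-cluster variation) can be checked numerically. The following is a minimal Python sketch, not part of the original slides: the six two-dimensional points and their labels are made-up illustration data, and `ss_decomposition` is a hypothetical helper name.

```python
from statistics import mean

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def ss_decomposition(points, labels):
    """Return (within-cluster SS, between-cluster SS) for a labelled data set."""
    dims = range(len(points[0]))
    grand = [mean(p[d] for p in points) for d in dims]
    within = between = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        centroid = [mean(p[d] for p in members) for d in dims]
        within += sum(sq_dist(p, centroid) for p in members)
        between += len(members) * sq_dist(centroid, grand)
    return within, between

# Illustration data (assumed): two well-separated groups of three points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = [0, 0, 0, 1, 1, 1]
within, between = ss_decomposition(points, labels)
print(f"within = {within:.3f}, between = {between:.3f}")
```

A good partition leaves most of the total variation in the between-cluster term; relabelling the points badly would move it into the within-cluster term.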

8.3 CA/FA/MDS/DA

8.4 Clustering Analysis: Cluster Analysis
• Classifies objects obtained through observation or experiment
- N objects described by P variables are N points scattered in P-dimensional space
- Similarity/dissimilarity exists among the objects
• Cluster analysis characterizes the data when the number, content, and structure of the clusters are completely unknown
• Types of clusters
- Disjoint clusters: each object belongs to exactly one of the clusters
- Hierarchical clusters: one cluster may be contained in another, but clusters otherwise do not overlap
- Overlapping clusters: an object may belong to two or more clusters at the same time
- Fuzzy clusters: membership is specified by a probability or grade of membership (may take the disjoint, hierarchical, or overlapping form)
[Figure: data matrix of objects O1 to O5 by variables X1 to X4]
Possible bases for segmentation
- Dimensions that are outputs of factor analysis
- Exploratory research
- Price sensitivities
- Heavy-light users
- Demographic variables
- Psychographic variables

8.4 Clustering Analysis: Purpose, principles, and assumptions
Purpose: to partition the whole data set well by means of clusters
Principles of cluster analysis
• Objects in the same cluster should have similar characteristics
• Objects in different clusters should have different characteristics
Assumptions of cluster analysis
• It does not estimate population parameters from sample statistics; it identifies and describes the structure of the given data, so it is a descriptive technique
• In some cases multicollinearity can strongly affect the results
[Figure: customer segmentation based on income and brand loyalty]

8.4 Clustering Analysis: Cluster Method
• With three or fewer variables (P ≤ 3), use scatter plots and similar displays: visual examination
• Commonly used clustering methods (as the number of variables grows, visual judgment becomes subjective and inconsistent)
- Hierarchical clustering (HCA): forms clusters step by step from closely related objects
- Optimal partitioning (K-MCA, K-means clustering)
• Measures of similarity and distance
- Similarity measure: the similarity of two objects is commonly measured by the correlation coefficient between their values on the variables
- Distance measure: a measure of the dissimilarity of two objects
Determine Similarity Measures
• Correlational measures
• Distance measures (Euclidean, city-block)
• Impact of unstandardized data

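8.5 Analysis Process: Procedure of cluster analysis
1. Variable selection: on which characteristics are the differences in measurements to be compared?
- Select regression variables or important variables (otherwise errors arise)
- Criteria for removing variables and tests of statistical significance are hard to establish
2. Distance measure selection: how are differences in similarity to be measured? (similarity measurement method)
• Distance as a measure of the similarity/dissimilarity of objects $i$ and $j$ observed on $p$ variables
- Euclidean distance: $d_{ij} = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}$
- Squared Euclidean distance: $d_{ij}^2 = \sum_{k=1}^{p} (x_{ik} - x_{jk})^2$
- Mahalanobis distance: $d_{ij} = \sqrt{(x_i - x_j)^\top S^{-1} (x_i - x_j)}$, where $S$ is the covariance matrix of the variables
- Minkowski distance: $d_{ij} = \left( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^r \right)^{1/r}$
[As a similarity measure, the correlation coefficient between the two objects' values on the variables is commonly used]
Example) CASS beer is rated 5 on freshness and 6 on smoothness, and HITE beer is rated 5 on freshness and 7 on smoothness → the distance is $\sqrt{(5-5)^2 + (6-7)^2} = 1$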
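The beer example can be reproduced with a few lines of Python. This is an illustrative sketch rather than code from the slides; the function names are assumptions, and the Minkowski form is used as the general case (r = 1 gives city-block distance, r = 2 gives Euclidean distance). Mahalanobis distance is omitted because it additionally needs the inverse covariance matrix.

```python
def minkowski(x, y, r):
    """Minkowski distance of order r between two equal-length vectors."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

def euclidean(x, y):
    return minkowski(x, y, 2)

def city_block(x, y):
    return minkowski(x, y, 1)

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

# Ratings from the slide: (freshness, smoothness)
cass = (5, 6)
hite = (5, 7)

d_euclid = euclidean(cass, hite)
d_city = city_block(cass, hite)
d_sq = squared_euclidean(cass, hite)
print(d_euclid, d_city, d_sq)
```

Because the two beers differ on only one attribute, and by exactly one point, all three measures coincide here; they diverge as soon as several attributes differ.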

8.5 Analysis Process: Procedure of cluster analysis (continued)
3. Choose the clustering method: two types of algorithms
• Hierarchical algorithms
- Agglomerative (build-up) methods: results from earlier stages are always nested within the results at later stages
- Divisive methods: start with one big cluster and break it apart
- Dendrograms or tree graphs: read left to right, or vice versa
• Nonhierarchical algorithms
- K-means clustering method
4. Examine the validity of the results
5. Interpret the results
- Choose the number of clusters that shows the differences among clusters most significantly, give each cluster an appropriate name, and interpret the results

8.6 Cluster Decision Framework
[Flowchart] Research problem → Research design → Similarity measure (metric data: pattern or proximity? pattern → correlation, proximity → distance; non-metric data: associations) → Assumptions → Algorithm? (hierarchical, non-hierarchical, or combination) → How many clusters formed? → Cluster respecification? (yes: loop back; no: continue) → Interpret clusters → Validate and profile

8.7 Considerations: Choosing the number of clusters and the analysis method
• Determining the number of clusters
- Use a hierarchical clustering method
- Hierarchical clustering is mainly concerned with the hierarchical structure of the data
- Judge the number of clusters from the relative change in distance as clusters are merged in the tree structure
- However, this approach lacks satisfactory validity
• Choosing the analysis method
- Interpretation can be difficult
- Points to consider: 1. mathematical and computational problems, 2. the basic assumptions and what they mean, 3. the characteristics of the variables
- Two variables: scatter plot → preliminary search for clusters (highly subjective, but avoids the mathematical problems)
- Hierarchical clustering: used when the data as a whole have some hierarchy (e.g., biological specimens: species, genus, family)
- Optimal partitioning: handles relatively large data sets but is affected by outliers (examine the data carefully beforehand)

8.8 Validity: Examining the validity of a cluster analysis (very important)
• A cluster analysis without validity and reliability is not acceptable
- Checking reliability and validity is difficult
• When the variables are measured on different scales, standardize them before the analysis
• When the objects follow a nonlinear pattern with no separation between groups, cluster analysis is difficult in practice
• Remedies
1. Results are sensitive to outliers: check for outliers before running the analysis
2. Examine the stability of the clusters
- Split the data at random into two parts and cluster each part independently
- Examine the stability of the clusters produced by the algorithm: remove a few variables and study the effect on the clusters
3. Apply several clustering methods to the same data and compare how similar the results are

8.9 Hierarchical Clustering Method (HCA)
• Agglomerative Hierarchical Method (AHM): polythetic method
- Joins the nearest objects together sequentially (all variables are naturally taken into account in the process)
• Divisive method: monothetic method
- Starts with all objects in a single cluster and splits it step by step
• Distance calculation methods
- Single linkage method
- Complete linkage method
- Average linkage method
- Centroid linkage method
- Median linkage method
- Ward's method
[Figure: AHM merges N objects step by step into one cluster; the divisive method proceeds in the opposite direction]

8.9 Hierarchical Clustering Method: Methods of Clustering
• Minimum distance (single linkage)
• Maximum distance (complete linkage)
• Average distance (average linkage): the most common
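The three linkage rules differ only in how the distance between two clusters is derived from the distances between their objects. A hedged restatement in standard notation follows; the symbols $A$, $B$ (clusters) and $d(x, y)$ (object-level distance) are assumptions introduced here, not taken from the slides.

```latex
\begin{aligned}
d_{\text{single}}(A, B)   &= \min_{x \in A,\; y \in B} d(x, y) \\
d_{\text{complete}}(A, B) &= \max_{x \in A,\; y \in B} d(x, y) \\
d_{\text{average}}(A, B)  &= \frac{1}{|A|\,|B|} \sum_{x \in A} \sum_{y \in B} d(x, y)
\end{aligned}
```

At every agglomeration step the pair of clusters with the smallest such distance is merged, whichever rule is used.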

8.9 Hierarchical Clustering Method: Other Agglomerative Methods of Clustering
• Ward's method
• Centroid method
[Figure: illustrations of Ward's method and the centroid method; "c.g." marks each cluster's center of gravity]

8.9 Hierarchical Clustering Method: Example, Single Linkage Method
1. Merge the two closest clusters: d13 = 1.0 is the minimum → recompute the distances to the remaining objects
2. d24 = 3.0 is now the minimum → recompute the distances to the remaining objects
3. Cluster (2, 4) and object 5 are joined to form cluster (2, 4, 5)
Finally, all objects form a single cluster.
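The merge sequence can be traced with a short Python sketch. The slide's distance matrix did not survive in the transcript, so the matrix `D` below is hypothetical: only d13 = 1.0 and d24 = 3.0 mirror the quoted values, and the remaining entries were chosen so that the same merge order results. The function names are likewise assumptions.

```python
from itertools import combinations

# Hypothetical distances among objects 1..5 (only d13 and d24 come from the slide).
D = {(1, 2): 7.0, (1, 3): 1.0, (1, 4): 9.0, (1, 5): 8.0,
     (2, 3): 6.0, (2, 4): 3.0, (2, 5): 5.0,
     (3, 4): 8.0, (3, 5): 7.0, (4, 5): 4.0}

def dist(i, j):
    return D[(min(i, j), max(i, j))]

def cluster_distance(a, b):
    """Single linkage: the smallest distance between any two members."""
    return min(dist(i, j) for i in a for j in b)

def single_linkage(objects):
    """Merge the closest pair repeatedly; return [(merged members, height), ...]."""
    clusters = [frozenset([o]) for o in objects]
    history = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        history.append((sorted(a | b), cluster_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return history

history = single_linkage([1, 2, 3, 4, 5])
for members, height in history:
    print(members, height)
```

With this matrix the merges are (1, 3), then (2, 4), then (2, 4, 5), and finally all five objects, which is the order described on the slide.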

8.9 Hierarchical Clustering Method: Example, Single Linkage Method
[Figure: dendrogram]

8.9 Hierarchical Clustering Method: Centroid Linkage Method (N = 5, 2 variables)
1. D = squared Euclidean distance
2. d12 = 1.0 is the minimum, so objects 1 and 2 are joined to form cluster (1, 2). The centroid of cluster (1, 2) is the mean of its members on each variable.
3. Repeat
[Tables: subject-by-variable data and the resulting distance matrices]
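A minimal Python sketch of the centroid method follows. The slide's data table is missing from the transcript, so the five two-variable points are hypothetical, chosen only so that the first merge joins objects 1 and 2 at squared distance 1.0 as stated; `centroid_linkage` is an assumed name.

```python
def sq_dist(p, q):
    """Squared Euclidean distance, the measure named on the slide."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid_linkage(points):
    """Agglomerate by smallest squared distance between cluster centroids."""
    clusters = [([i + 1], tuple(p)) for i, p in enumerate(points)]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sq_dist(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        (mi, ci), (mj, cj) = clusters[i], clusters[j]
        members = mi + mj
        # New centroid = size-weighted mean of the two old centroids.
        centroid = tuple((len(mi) * a + len(mj) * b) / len(members)
                         for a, b in zip(ci, cj))
        history.append((sorted(members), d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((members, centroid))
    return history

# Hypothetical data: subjects 1..5 measured on two variables.
subjects = [(1, 1), (1, 2), (6, 3), (8, 2), (8, 0)]
merges = centroid_linkage(subjects)
for members, height in merges:
    print(members, round(height, 3))
```

Unlike single linkage, the distance after a merge is measured from the new centroid, so it has to be recomputed from the member means rather than read off the original distance matrix.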

8.9 Hierarchical Clustering Method: Centroid Linkage Method (N = 5, 2 variables)
[Figure: dendrogram]

8.9 Hierarchical Clustering Method: Linkage methods and SAS options
• Single linkage method: fast to compute (short processing time), but the chaining effect can produce inappropriate clusters
- SAS code: PROC CLUSTER METHOD=SINGLE
• Complete linkage method: good at finding isolated clusters; emphasizes the cohesion of each cluster
- SAS code: PROC CLUSTER METHOD=COMPLETE
• Average linkage method
- SAS code: PROC CLUSTER METHOD=AVERAGE
• Centroid linkage method: relatively unaffected by outliers
- SAS code: PROC CLUSTER METHOD=CENTROID
• Median linkage method
- SAS code: PROC CLUSTER METHOD=MEDIAN
• Ward's method
- SAS code: PROC CLUSTER METHOD=WARD

8.10 K-Means Clustering Method
K-means cluster analysis = sequential threshold method
• Widely used in nonhierarchical cluster analysis
• Groups the objects into K clusters
• K is set before the analysis or determined during it
• The steps are repeated until no object is reassigned
Steps: partition into K clusters → compute the mean (centroid) of each cluster → compute the distance from each object to each cluster centroid → assign each object to a cluster
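The loop described above can be sketched in plain Python. This is an illustrative implementation under stated assumptions, not the algorithm of any particular package: the data, the starting seeds, and the name `kmeans` are made up, squared Euclidean distance is assumed, and the iteration cap of 10 simply echoes the MAXITER value used later in the SAS example.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, seeds, max_iter=10):
    """Assign points to the nearest centroid and update centroids until stable."""
    centroids = [tuple(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:   # no object was reassigned: stop
            break
        assignment = new_assignment
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:                    # keep the old seed if a cluster is empty
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids

# Hypothetical data and seeds, with K = 2.
data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
assignment, centroids = kmeans(data, seeds=[(1, 1), (8, 8)])
print(assignment)
print(centroids)
```

The result depends on the starting seeds, which is one reason the slides recommend checking stability and comparing several methods.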

8.11 SAS & SPSS Program: Hierarchical Clustering Method, SAS
• Korea Department Store (고려백화점) wants to cluster its customers according to their attitudes toward shopping. Based on earlier research, six variables were measured (10 subjects).
(X1) Shopping is fun
(X2) Shopping affects your income
(X3) I enjoy eating out while shopping
(X4) I make an effort to buy the best products when shopping
(X5) I am not interested in shopping
(X6) I save a lot of money by comparing prices when shopping
7-point Likert scale: strongly disagree (1) ... neutral (4) ... strongly agree (7)

8.11 SAS & SPSS Program: Hierarchical Clustering Method

8.11 SAS & SPSS Program: Hierarchical Clustering Method, SAS code
Program 1: Likert data, standardized inside PROC CLUSTER with the STD option
    DATA QUEST;
      INPUT X1-X6;
      CARDS;
    6 4 7 3 2 3
    2 3 1 4 5 4
    7 2 6 4 1 3
    4 6 4 5 3 6
    1 3 2 2 6 4
    6 4 6 3 3 4
    5 3 6 3 3 4
    7 3 7 4 1 4
    2 4 3 3 6 3
    3 5 3 6 4 6
    ;
    RUN;
    /* STD standardizes each variable to mean 0 and variance 1 */
    PROC CLUSTER DATA=QUEST STD METHOD=CENTROID OUTTREE=TWO;
      VAR X1-X6;
    RUN;
    PROC TREE DATA=TWO HORIZONTAL;
    RUN;
Program 2: variables on different scales, standardized first with PROC STANDARD
    DATA QUEST;
      INPUT X1-X6;
      CARDS;
    0.06 40 7 3 2 3
    0.02 30 1 4 5 4
    0.07 20 6 4 1 3
    0.04 60 4 5 3 6
    0.01 30 2 2 6 4
    0.06 40 6 3 3 4
    0.05 30 6 3 3 4
    0.07 30 7 4 1 4
    0.02 40 3 3 6 3
    0.03 50 3 6 4 6
    ;
    RUN;
    /* Standardize the variables to mean 0 and variance 1 */
    PROC STANDARD DATA=QUEST MEAN=0 STD=1 OUT=TWO;
      VAR X1-X6;
    RUN;
    /* TREE is an illustrative name for the output tree data set */
    PROC CLUSTER DATA=TWO METHOD=CENTROID OUTTREE=TREE;
      VAR X1-X6;
    RUN;
Other values for METHOD=
• METHOD=SINGLE (single linkage)
• METHOD=COMPLETE (complete linkage)
• METHOD=AVERAGE (average linkage)

8.11 SAS & SPSS Program: Hierarchical Clustering Method, SAS output
• The two ways of standardizing give the same result.
Centroid Hierarchical Cluster Analysis
The data have been standardized to mean 0 and variance 1
Root-Mean-Square Total-Sample Standard Deviation = 1
Root-Mean-Square Distance Between Observations = 3.464102
    Number of                        Frequency of   Normalized
    Clusters    Clusters Joined      New Cluster    Centroid Distance   Tie
    9           OB6      OB7         2              0.281052
    8           OB1      CL9         3              0.361764
    7           OB3      OB8         2              0.385276
    6           OB4      OB10        2              0.428126
    5           OB5      OB9         2              0.476894
    4           OB2      CL5         3              0.490703
    3           CL8      CL7         5              0.497510
    2           CL3      CL4         8              1.016941
    1           CL2      CL6         10             1.029886
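The agglomeration schedule can be replayed to read off cluster membership at any level. The Python sketch below is an illustration, not SAS functionality: `clusters_at` is a hypothetical helper, and it relies on the naming rule visible in the output, where the cluster formed on the row labelled k is subsequently referred to as CLk.

```python
# "Clusters Joined" column of the schedule above, from 9 clusters down to 1.
schedule = [
    ("OB6", "OB7"), ("OB1", "CL9"), ("OB3", "OB8"), ("OB4", "OB10"),
    ("OB5", "OB9"), ("OB2", "CL5"), ("CL8", "CL7"), ("CL3", "CL4"),
    ("CL2", "CL6"),
]

def clusters_at(k, schedule, n=10):
    """Replay merges until k clusters remain; return sorted member lists."""
    clusters = {f"OB{i}": {i} for i in range(1, n + 1)}
    remaining = n
    for a, b in schedule:
        if remaining == k:
            break
        remaining -= 1
        # The merged cluster takes the name CL<number of clusters remaining>.
        clusters[f"CL{remaining}"] = clusters.pop(a) | clusters.pop(b)
    return sorted(sorted(members) for members in clusters.values())

three = clusters_at(3, schedule)
two = clusters_at(2, schedule)
print(three)
print(two)
```

Cutting at three clusters yields (1, 3, 6, 7, 8), (2, 5, 9), and (4, 10); cutting at two merges the first two of these and leaves (4, 10) separate.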

8.11 SAS & SPSS Program: Hierarchical Clustering Method, SAS output
[Figure: dendrogram of the 10 objects; leaf order 1, 6, 7, 3, 8, 5, 9, 2, 4, 10]

8.11 SAS & SPSS Program: Hierarchical Clustering Method, SAS results and interpretation
• Explaining the clustering with the dendrogram
- Objects 6 and 7 are joined first, object 1 then joins them, and after that 3 and 8, 4 and 10, and 5 and 9 are paired. Cluster (4, 10) joins the remaining objects only at the final stage.
- Three groups: (6, 7, 1, 3, 8), (5, 9, 2), (4, 10)
- Two groups: (6, 7, 1, 3, 8, 5, 9, 2), (4, 10)
• Cluster 1: shopping enthusiasts
- Shopping is fun (6.20), enjoy eating out while shopping (6.40), not interested in shopping (2.00)
• Cluster 2: apathetic consumers
- Low scores on shopping being fun and on eating out while shopping; high score on not being interested in shopping
• Cluster 3: economical consumers
- Shopping is bad for the household budget, make an effort to buy the best products when shopping, save a lot of money by comparing prices
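The cluster profiles quoted above can be recomputed from the raw Likert data. This is a small Python sketch for checking the interpretation, not part of the original SAS workflow; the helper name `profile` is an assumption, while the data rows and the three-cluster membership come from the slides.

```python
from statistics import mean

# Ten subjects, variables X1..X6 (7-point Likert data from the SAS program).
rows = [
    (6, 4, 7, 3, 2, 3), (2, 3, 1, 4, 5, 4), (7, 2, 6, 4, 1, 3),
    (4, 6, 4, 5, 3, 6), (1, 3, 2, 2, 6, 4), (6, 4, 6, 3, 3, 4),
    (5, 3, 6, 3, 3, 4), (7, 3, 7, 4, 1, 4), (2, 4, 3, 3, 6, 3),
    (3, 5, 3, 6, 4, 6),
]

# Three-cluster solution read from the dendrogram (subject numbers are 1-based).
clusters = {
    "Cluster 1": [1, 3, 6, 7, 8],
    "Cluster 2": [2, 5, 9],
    "Cluster 3": [4, 10],
}

def profile(rows, members):
    """Mean of each variable over the listed subjects."""
    return [mean(rows[i - 1][v] for i in members) for v in range(6)]

profiles = {name: profile(rows, members) for name, members in clusters.items()}
for name, values in profiles.items():
    print(name, [round(v, 2) for v in values])
```

Cluster 1 averages 6.20 on X1, 6.40 on X3, and 2.00 on X5, matching the figures on the slide; the other two rows show the high X5 mean of Cluster 2 and the high X2, X4, and X6 means of Cluster 3.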

8.11 SAS & SPSS Program: Hierarchical Clustering Method, SAS results: standardized vs. unstandardized
Standardized result
    Number of                        Frequency of   Normalized
    Clusters    Clusters Joined      New Cluster    Centroid Distance
    9           OB6      OB7         2              0.281052
    8           OB1      CL9         3              0.361764
    7           OB3      OB8         2              0.385276
    6           OB4      OB10        2              0.428126
    5           OB5      OB9         2              0.476894
    4           OB2      CL5         3              0.490703
    3           CL8      CL7         5              0.497510
    2           CL3      CL4         8              1.016941
    1           CL2      CL6         10             1.029886
Unstandardized result
    Number of                        Frequency of   Normalized
    Clusters    Clusters Joined      New Cluster    Centroid Distance
    9           OB1      OB6         2              0.101674
    8           OB2      OB5         2              0.143790
    7           OB7      OB8         2              0.143794
    6           CL9      OB9         3              0.292047
    5           CL8      CL7         4              0.359483
    4           CL6      CL5         7              0.593705
    3           OB4      OB10        2              0.595757
    2           CL4      OB3         8              0.860206
    1           CL2      CL3         10             1.336713

8.11 SAS & SPSS Program: Hierarchical Clustering Method, SPSS
• Hierarchical (versus K-means)
• Cluster
- Cases
• Display
- Stats / plots
• Stats
- Agglomeration schedule (distance between clusters)
- Proximity matrix
• Cluster Membership (none, single, range: from -- to --- clusters)
• Plot (dendrograms or icicle plots)
• Method (****)
- Cluster method
- Measure (interval, counts, binary)
- Transform Values or Measures
• Save

8.11 SAS & SPSS Program: Hierarchical Clustering Method, SPSS
• Enter the data
• Choose the analysis method: Analyze → Classify

8.11 SAS & SPSS Program: Hierarchical Clustering Method, SPSS
• Specify the variables and select options
[Screenshot: dialog with the buttons Statistics…, Plots…, Method, Save]

8.11 SAS & SPSS Program: Hierarchical Clustering Method, SPSS
[Screenshots: the Statistics…, Plots…, Method, and Save dialogs]

8.11 SAS & SPSS Program: Hierarchical Clustering Method, SPSS
[Figure: determining the optimal number of clusters]
[Figure: dendrogram]

8.11 SAS & SPSS Program: Non-Hierarchical Clustering Method

8.11 SAS & SPSS Program: K-Means Clustering, SAS code
Program 1: Likert data
    DATA QUEST;
      INPUT X1-X6;
      CARDS;
    6 4 7 3 2 3
    2 3 1 4 5 4
    7 2 6 4 1 3
    4 6 4 5 3 6
    1 3 2 2 6 4
    6 4 6 3 3 4
    5 3 6 3 3 4
    7 3 7 4 1 4
    2 4 3 3 6 3
    3 5 3 6 4 6
    ;
    RUN;
    PROC STANDARD MEAN=0 STD=1 OUT=TWO;
    PROC FASTCLUS DATA=TWO LIST MAXCLUSTERS=3 MAXITER=10;
      VAR X1-X6;
    RUN;
Program 2: variables on different scales
    DATA QUEST;
      INPUT X1-X6;
      CARDS;
    0.06 40 7 3 2 3
    0.02 30 1 4 5 4
    0.07 20 6 4 1 3
    0.04 60 4 5 3 6
    0.01 30 2 2 6 4
    0.06 40 6 3 3 4
    0.05 30 6 3 3 4
    0.07 30 7 4 1 4
    0.02 40 3 3 6 3
    0.03 50 3 6 4 6
    ;
    RUN;
    PROC STANDARD MEAN=0 STD=1 OUT=TWO;
    PROC FASTCLUS DATA=TWO LIST MAXCLUSTERS=3 MAXITER=10;
      VAR X1-X6;
    RUN;
Options
• LIST: prints the cluster number assigned to each object and the distance between the object and the final cluster seed
• MAXCLUSTERS=: maximum number of clusters
• MAXITER=: maximum number of iterations for recomputing the seeds

8.11 SAS & SPSS Program: K-Means Clustering, results
FASTCLUS Procedure: Replace=FULL Radius=0 Maxclusters=3 Maxiter=10
Initial Seeds
    Cluster   X1         X2         X3         X4         X5         X6
    1         -0.13553    1.98361   -0.23009    1.12117   -0.21764    1.72648
    2         -1.49079   -0.60371   -1.15045   -1.46615    1.41468   -0.09087
    3          1.21974   -1.46615    0.69027    0.25873   -1.30586   -0.99954
Minimum Distance Between Initial Seeds = 4.694619
Relative Change in Cluster Seeds
    Iteration   Criterion   1        2        3
    1           0.6465      0.1580   0.2174   0.3084
    2           0.4157      0        0        0
Convergence criterion is satisfied.
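The initial seeds in this listing are standardized observations, which can be verified by standardizing the Likert data by hand. The Python sketch below is an illustration under one assumption: each variable is scaled with the sample standard deviation (divisor n - 1), which reproduces the printed values. The name `standardize` is hypothetical.

```python
from statistics import mean, stdev

# Likert data from the SAS program (10 subjects, X1..X6).
rows = [
    (6, 4, 7, 3, 2, 3), (2, 3, 1, 4, 5, 4), (7, 2, 6, 4, 1, 3),
    (4, 6, 4, 5, 3, 6), (1, 3, 2, 2, 6, 4), (6, 4, 6, 3, 3, 4),
    (5, 3, 6, 3, 3, 4), (7, 3, 7, 4, 1, 4), (2, 4, 3, 3, 6, 3),
    (3, 5, 3, 6, 4, 6),
]

def standardize(rows):
    """Return z-scores: each column rescaled to mean 0 and variance 1."""
    columns = list(zip(*rows))
    mu = [mean(c) for c in columns]
    sd = [stdev(c) for c in columns]   # sample standard deviation (n - 1)
    return [[(v - m) / s for v, m, s in zip(row, mu, sd)] for row in rows]

z = standardize(rows)
# Observations 4, 5, and 3 match initial seeds 1, 2, and 3 respectively.
for seed, obs in ((1, 4), (2, 5), (3, 3)):
    print(seed, [round(v, 5) for v in z[obs - 1]])
```

Observation 4, for example, standardizes to -0.13553 on X1 and 1.98361 on X2, the first two entries of seed 1.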

8.11 SAS & SPSS Program: K-Means Clustering, standardized vs. unstandardized results
Cluster Listing
           Standardized                    Unstandardized
    Obs    Cluster   Distance from Seed    Cluster   Distance from Seed
    1      3         0.98828               1         5.6886
    2      2         1.13323               1         6.7350
    3      3         1.44798               3         6.7495
    4      1         0.74154               2         5.0744
    5      2         1.02068               1         6.5544
    6      3         1.03211               1         4.7917
    7      3         0.95115               3         3.6818
    8      3         0.96568               3         3.4960
    9      2         0.98229               1         4.4227
    10     1         0.74154               2         5.0744
Criterion Based on Final Seeds: 0.41566 (standardized), 2.1834 (unstandardized)
Standardized result
• Cluster 1: (4, 10)
• Cluster 2: (2, 5, 9)
• Cluster 3: (1, 3, 6, 7, 8)
Unstandardized result
• Cluster 1: (1, 2, 5, 6, 9)
• Cluster 2: (4, 10)
• Cluster 3: (3, 7, 8)
