Preparing the Data

Preparing the Data. What is Data? Attributes. Data is a collection of data objects and their attributes. An attribute is a property or characteristic of an object. Examples: eye color, temperature, etc. An attribute is also known as a variable, field, or characteristic. A collection of attributes describes an object.

yorick

Presentation Transcript


  1. Preparing the Data

  2. What is Data? • A collection of data objects and their attributes • An attribute is a property or characteristic of an object • Examples: eye color, temperature, etc. • An attribute is also known as a variable, field, or characteristic • A collection of attributes describes an object • An object is also known as a record, point, case, sample, or entity (figure labels: Attributes, Objects)

  3. Attribute Values • Attribute values are numbers or symbols assigned to an attribute • Distinction between attributes and attribute values: • The same attribute can be mapped to different attribute values • Example: height can be measured in feet or meters • Different attributes can be mapped to the same set of values • Example: the attribute values for ID and age are both integers • But the properties of the attribute values can differ: • ID has no maximum or minimum bound on its values, whereas age does

  4. Attribute Types • There are different types of attributes: • Nominal • Examples: ID numbers, eye color, zip codes • Ordinal • Examples: rankings (e.g., the taste of potato chips on a scale of 1-10), grades, height in {tall, medium, short} • Interval • Examples: calendar dates, temperature in Celsius or Fahrenheit • Ratio • Examples: temperature in Kelvin, length, time, counts

  5. Properties of Attribute Values /1 • The type of an attribute depends on which of the following properties it possesses: • Distinctness: = ≠ • Order: < > • Addition: + - • Multiplication: * / • Nominal attribute: distinctness • Ordinal attribute: distinctness & order • Interval attribute: distinctness, order & addition • Ratio attribute: all 4 properties

  6. Properties of Attribute Values /2

     | Attribute Type | Description | Examples | Operations |
     |---|---|---|---|
     | Nominal | The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female} | mode, entropy, contingency correlation, χ² test |
     | Ordinal | The values of an ordinal attribute provide enough information to order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests |
     | Interval | For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests |
     | Ratio | For ratio variables, both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current | geometric mean, harmonic mean, percent variation |

  7. Properties of Attribute Values /3

     | Attribute Level | Transformation | Comments |
     |---|---|---|
     | Nominal | Any permutation of values | If all employee ID numbers were reassigned, would it make any difference? |
     | Ordinal | An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function. | An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}. |
     | Interval | new_value = a * old_value + b, where a and b are constants | Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree). |
     | Ratio | new_value = a * old_value | Length can be measured in meters or feet. |
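
The interval and ratio transformations can be checked numerically. The sketch below uses made-up temperature and length readings; the function and variable names are illustrative only. It shows that an interval-scale change (new_value = a * old_value + b) preserves ratios of differences but not ratios of values, while a ratio-scale change (new_value = a * old_value) preserves ratios of values.

```python
def celsius_to_fahrenheit(celsius):
    """Interval-scale transformation: new_value = a * old_value + b."""
    return 1.8 * celsius + 32.0


def meters_to_feet(meters):
    """Ratio-scale transformation: new_value = a * old_value."""
    return 3.28084 * meters


# Made-up readings, for illustration only.
temps_c = [10.0, 20.0, 40.0]
temps_f = [celsius_to_fahrenheit(t) for t in temps_c]

# Ratios of values are not preserved on an interval scale.
value_ratio_c = temps_c[2] / temps_c[1]
value_ratio_f = temps_f[2] / temps_f[1]

# Ratios of differences are preserved on an interval scale.
diff_ratio_c = (temps_c[2] - temps_c[1]) / (temps_c[1] - temps_c[0])
diff_ratio_f = (temps_f[2] - temps_f[1]) / (temps_f[1] - temps_f[0])

# Ratios of values are preserved on a ratio scale.
lengths_m = [2.0, 6.0]
lengths_ft = [meters_to_feet(x) for x in lengths_m]
length_ratio_m = lengths_m[1] / lengths_m[0]
length_ratio_ft = lengths_ft[1] / lengths_ft[0]

print("value ratios (C, F):", value_ratio_c, value_ratio_f)
print("difference ratios (C, F):", diff_ratio_c, diff_ratio_f)
print("length ratios (m, ft):", length_ratio_m, length_ratio_ft)
```

This is why a statement such as "twice as hot" is meaningful in Kelvin but not in Celsius or Fahrenheit.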

  8. Discrete and Continuous Attributes • Discrete Attribute • Has a finite or countably infinite set of values • Examples: zip codes, the set of words in a collection of documents • Often represented as integer variables • Note: binary attributes are a special case of discrete attributes • Continuous Attribute • Has real numbers as attribute values • Examples: temperature, height, or weight • Can only be measured and represented using a finite number of digits • Typically represented as floating-point variables

  9. Asymmetric Attributes • Only presence (a non-zero attribute value) is regarded as important • Examples: • Words that occur in documents • Items that occur in customer transactions

  10. Types of data sets • Record: Data Matrix, Document Data, Transaction Data • Graph: World Wide Web, Molecular Structures • Ordered: Spatial Data, Temporal Data, Sequential Data, Genetic Sequence Data

  11. Important characteristics of structured data • Dimensionality • Sparsity • Only presence counts • Resolution • Patterns depend on the scale

  12. Record Data • Data that consists of a collection of records, each of which consists of a fixed set of attributes.

  13. Data Matrix • If the data objects all have the same fixed set of numeric attributes, then they can be thought of as points in a multidimensional space, where each dimension represents a distinct attribute. • Such a data set can be represented by an m by n matrix, with m rows, one for each object, and n columns, one for each attribute.
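
As a minimal sketch of the m by n representation, the snippet below turns a few hypothetical records (attribute names and values are invented for illustration) into a matrix stored as a list of rows, with one row per object and one column per attribute.

```python
# Hypothetical records; every object has the same numeric attributes.
records = [
    {"height_cm": 170.0, "weight_kg": 65.0, "age": 30.0},
    {"height_cm": 158.0, "weight_kg": 52.0, "age": 24.0},
    {"height_cm": 181.0, "weight_kg": 80.0, "age": 41.0},
    {"height_cm": 165.0, "weight_kg": 70.0, "age": 36.0},
]

# Fix the column order so each column always means the same attribute.
attributes = ["height_cm", "weight_kg", "age"]

# m rows (one per object), n columns (one per attribute).
data_matrix = [[record[name] for name in attributes] for record in records]
m = len(data_matrix)
n = len(attributes)


def column(matrix, j):
    """Return column j, i.e. all values of one attribute."""
    return [row[j] for row in matrix]


print(f"{m} objects x {n} attributes")
print("ages:", column(data_matrix, attributes.index("age")))
```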

  14. Document Data • Each document becomes a 'term' vector • Each term is a component (attribute) of the vector • The value of each component is the number of times the corresponding term occurs in the document
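
A term vector can be sketched with the standard library's `collections.Counter`. The two documents and the whitespace tokenization below are simplifying assumptions for illustration; real text usually needs more careful tokenization.

```python
from collections import Counter

# Made-up documents, for illustration only.
documents = [
    "data mining finds patterns in data",
    "mining text data",
]

# Naive tokenization: lowercase and split on whitespace.
tokenized = [doc.lower().split() for doc in documents]

# The vocabulary fixes which vector component belongs to which term.
vocabulary = sorted({term for tokens in tokenized for term in tokens})


def term_vector(tokens, vocabulary):
    """Count how many times each vocabulary term occurs in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


vectors = [term_vector(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Terms that do not occur in a document get a count of 0, which is why document data is typically sparse.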

  15. Transaction Data • A special type of record data, where • Each record (transaction) involves a set of items • Example: a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products purchased are the items
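
One common way to work with transaction data is to encode each transaction as a row of binary presence indicators, which ties in with the asymmetric attributes described earlier. The sketch below assumes a few invented grocery baskets; the item names are placeholders.

```python
# Invented shopping baskets; each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer"},
]

# All distinct items, in a fixed order, become the columns.
items = sorted(set().union(*transactions))

# 1 means the item is present in the transaction, 0 means absent.
binary_rows = [
    [1 if item in transaction else 0 for item in items]
    for transaction in transactions
]

# Number of transactions that contain each item.
item_counts = {
    item: sum(row[j] for row in binary_rows) for j, item in enumerate(items)
}

print(items)
for row in binary_rows:
    print(row)
print(item_counts)
```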

  16. Graph Data • Examples: generic graph and HTML links

  17. Chemical Data • Benzene Molecule: C6H6

  18. Ordered Data /1 • Sequences of transactions (figure labels: Items/Events; An element of the sequence)

  19. Ordered Data /2 • Genomic sequence data

  20. Ordered Data /3 • Spatio-temporal data (figure: Average Monthly Temperature of land and ocean)

  21. Data Quality • What kinds of data quality problems are there? • How can we detect problems with the data? • What can we do about these problems? • Examples of data quality problems: • Noise & outliers • Missing values • Duplicate data

  22. Noise • Refers to the modification of original values • Example: distortion of a person's voice while speaking (figure panels: Two Sine Waves; Two Sine Waves + Noise)

  23. Outliers /1 • Outliers are data objects whose characteristics differ from those of most other data objects in the data set.

  24. Outliers /2 • Example: a data set represents the age attribute with 20 values: • Age = {3, 56, 23, 39, 156, 52, 41, 22, 9, 28, 139, 31, 55, 20, -67, 37, 11, 55, 45, 37} • The corresponding statistical parameters are: • Mean = 39.6 • Standard deviation (sample) = 45.89 • If we select the threshold for normally distributed data as Threshold = Mean ± 2 × Standard deviation, then all data outside the range [-52.2, 131.4] are potential outliers. • Because age > 0, the range can be reduced to [0, 131.4]. • Three values are therefore outliers under this criterion: 156, 139, and -67. • With high probability, these three values are typos (data entered with an extra digit or an added '-' sign).
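
The threshold rule from this slide can be sketched with Python's `statistics` module. The snippet recomputes the mean and the sample standard deviation from the 20 listed ages rather than hard-coding them, then applies Mean ± 2 × Standard deviation together with the age > 0 constraint. Using the sample (n - 1) standard deviation is an assumption; the slide does not say which variant it uses.

```python
import statistics

ages = [3, 56, 23, 39, 156, 52, 41, 22, 9, 28,
        139, 31, 55, 20, -67, 37, 11, 55, 45, 37]

mean = statistics.mean(ages)
std = statistics.stdev(ages)  # sample standard deviation (assumption)

# Threshold = mean +/- 2 * standard deviation.
lower = mean - 2 * std
upper = mean + 2 * std

# Ages cannot be negative, so the lower bound is raised to 0.
lower = max(0.0, lower)

# Keep the original order so flagged values are easy to locate.
outliers = [age for age in ages if age < lower or age > upper]

print(f"mean = {mean:.1f}, std = {std:.2f}")
print(f"accepted range = [{lower:.1f}, {upper:.1f}]")
print("potential outliers:", outliers)
```

Note that the extreme values themselves inflate the mean and standard deviation, so this rule is best treated as a first screening step.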

  25. Missing Values • Reasons for missing values: • Information was not collected (e.g., people decline to give their age and weight) • Attributes may not be applicable to all cases (e.g., income is not applicable to children) • Handling missing values: • Eliminate data objects • Estimate the missing value during analysis • Replace with all possible values (weighted by their probabilities)
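
Two of the handling strategies above can be sketched as follows. The records are invented, `None` marks a missing value, and replacing a missing value with the attribute mean is only one possible estimator, chosen here for simplicity.

```python
import statistics

# Invented records; None marks a missing value.
records = [
    {"age": 25, "weight": 60.0},
    {"age": None, "weight": 72.5},
    {"age": 41, "weight": None},
    {"age": 33, "weight": 80.0},
]


def drop_incomplete(records):
    """Strategy 1: eliminate data objects that have any missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def impute_mean(records, attribute):
    """Strategy 2: estimate a missing value with the attribute's mean."""
    observed = [r[attribute] for r in records if r[attribute] is not None]
    fill = statistics.mean(observed)
    return [
        dict(r, **{attribute: fill}) if r[attribute] is None else dict(r)
        for r in records
    ]


complete = drop_incomplete(records)
imputed = impute_mean(records, "age")

print(len(complete), "complete records")
print("imputed ages:", [r["age"] for r in imputed])
```

Dropping objects is simple but discards the values that were observed; estimation keeps every object at the cost of introducing values that were never measured.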

  26. Duplicate Data • A data set may include data objects that are duplicates, or almost duplicates, of one another • A major issue when merging data from different sources • Example: the same person with multiple email addresses • Data cleaning • The process of dealing with duplicate data issues
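
A very simple form of duplicate handling is sketched below for the email example. It assumes, unrealistically, that a normalized name identifies a person; the names, addresses, and helper functions are hypothetical, and real record linkage needs stronger matching criteria.

```python
# Hypothetical records merged from different sources.
records = [
    {"name": "Ana Putri", "email": "ana@example.com"},
    {"name": "ana  putri", "email": "a.putri@example.org"},
    {"name": "Budi Santoso", "email": "budi@example.com"},
    {"name": "Ana Putri", "email": "ana@example.com"},
]


def normalize_name(name):
    """Lowercase the name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def merge_duplicates(records):
    """Group records by normalized name and collect distinct emails."""
    merged = {}
    for record in records:
        key = normalize_name(record["name"])
        entry = merged.setdefault(key, {"name": key, "emails": []})
        if record["email"] not in entry["emails"]:
            entry["emails"].append(record["email"])
    return list(merged.values())


people = merge_duplicates(records)
for person in people:
    print(person["name"], person["emails"])
```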

  27. Data Preprocessing: Why Is It Needed? • Data in the real world tends to be dirty • Incomplete: lacking attribute values, lacking attributes of interest, or containing only aggregate data • Noisy: containing errors or outliers • Inconsistent: containing differing formats in codes and names • Without quality data, there are no quality mining results • Quality decisions must be based on quality data • A data warehouse needs consistent integration of quality data

  28. Major Tasks in Data Preprocessing • Data Cleaning • Data Integration • Data Transformation • Data Reduction • Data Discretization

  29. Forms of Data Preprocessing

  30. Transforming Data • Centering • Subtract the mean of each attribute from every data value • Normalization • Divide the centered result by the standard deviation • Scaling • Transform the data so that it lies within a specified range
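
The three transformations can be sketched for a single attribute as follows. The sample values are invented, the sample (n - 1) standard deviation is an assumption, and scaling is shown as a linear min-max mapping into a target range, which is one common choice. The sketch assumes the attribute is not constant, since otherwise the divisions are undefined.

```python
import statistics


def center(values):
    """Centering: subtract the attribute mean from every value."""
    mean = statistics.mean(values)
    return [v - mean for v in values]


def normalize(values):
    """Normalization: divide the centered values by the standard deviation."""
    std = statistics.stdev(values)  # sample standard deviation (assumption)
    return [c / std for c in center(values)]


def scale(values, new_min=0.0, new_max=1.0):
    """Scaling: map the values linearly into [new_min, new_max]."""
    low, high = min(values), max(values)
    span = high - low
    return [new_min + (v - low) * (new_max - new_min) / span for v in values]


# Invented values of one attribute.
values = [2.0, 4.0, 6.0, 8.0]

centered = center(values)
normalized = normalize(values)
scaled = scale(values)

print("centered:  ", centered)
print("normalized:", [round(v, 3) for v in normalized])
print("scaled:    ", [round(v, 3) for v in scaled])
```

After centering the attribute has mean 0, after normalization it also has standard deviation 1, and after scaling its minimum and maximum equal the bounds of the chosen range.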
