
A Methodology for Direct and Indirect Discrimination Prevention in Data Mining




Presentation Transcript


  1. A Methodology for Direct and Indirect Discrimination Prevention in Data Mining Presented By: Rucha Bhutada Guided By: Prof. M. R. Wanjari

  2. Outline: • Introduction • Challenges • Discrimination analysis • Why discrimination • Papers read • Findings of the base paper • Future plans

  3. Introduction: • Data mining is an increasingly important technology for extracting useful knowledge hidden in large collections of data. • However, mining can also have negative social consequences, such as: • Potential privacy invasion • Potential discrimination • If the training data sets are biased with regard to discriminatory attributes such as gender, race, or religion, discriminatory decisions may follow.

  4. Challenges: • Handling both direct and indirect discrimination instead of only direct discrimination • Finding a good tradeoff between discrimination removal and the quality of the resulting training data sets and data mining models.

  5. Why this topic: • It is an extension of association rule mining, and a novel application of association rule mining in a social setting. • It is more than obvious that most people do not want to be discriminated against on any sensitive issue. • It can be useful in deriving a discrimination-free rule base for decision-making systems such as insurance, loans, and jobs.

  6. Example: • U.S. federal laws prohibit discrimination on the basis of: Race , Color, Religion, Nationality, Marital status, Age • In a number of settings: • Credit/insurance scoring • Sale, rental, and financing of housing • Personnel selection and wage • Access to public accommodations, education, nursing homes, adoptions, and health care.

  7. Papers read:

  8. Discussion on Findings of the Base Paper • Discrimination is unfair or unequal treatment of people based on membership in a category or a minority, without regard to individual merit. • Discrimination can be either direct or indirect: • Direct discrimination occurs when decisions are made based on sensitive attributes. • Indirect discrimination occurs when decisions are made based on non-sensitive attributes that are strongly correlated with biased sensitive ones.

  9. Approach: • Anti-discrimination techniques have been introduced in data mining: - Discrimination discovery: Consists of supporting the discovery of discriminatory decisions hidden, either directly or indirectly, in a dataset of historical decision records. - Discrimination Prevention: Consists of inducing patterns that do not lead to discriminatory decisions even if the original data sets are biased.

  10. Approach: (cont’d) • Preprocessing approach • A data set is a collection of data objects (records). • An item is an attribute-value pair; an item set X is a collection of items. • The support of an item set, supp(X), is the fraction of records that contain the item set X. We say that a rule X → C is completely supported by a record if both X and C appear in the record. • The confidence of a rule, conf(X → C), measures how often the class item C appears in records that contain X. Hence, if supp(X) > 0, then conf(X → C) = supp(X, C) / supp(X). • Support and confidence range over [0, 1].
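The support and confidence definitions above can be sketched in Python. This is a minimal illustration on a toy data set (the records and attribute names are invented for the example, not taken from the paper):

```python
# Toy data set: each record is a set of attribute-value items.
records = [
    {"gender=female", "city=NYC", "hire=no"},
    {"gender=female", "city=NYC", "hire=yes"},
    {"gender=male", "city=NYC", "hire=yes"},
    {"gender=male", "city=LA", "hire=no"},
]

def supp(itemset, db):
    """Fraction of records that contain every item in `itemset`."""
    return sum(1 for r in db if itemset <= r) / len(db)

def conf(premise, conclusion, db):
    """conf(X -> C) = supp(X, C) / supp(X), defined when supp(X) > 0."""
    s = supp(premise, db)
    return supp(premise | conclusion, db) / s if s > 0 else 0.0

print(supp({"city=NYC"}, records))                    # 0.75
print(conf({"gender=female"}, {"hire=no"}, records))  # 0.5
```

Both measures land in [0, 1], matching the slide: the rule {gender=female} → {hire=no} is completely supported by the first record only, so its confidence is 1/2.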

  11. Approach: (cont’d): • A frequent classification rule is a classification rule with support and confidence greater than respective specified lower bounds. • The negated item set ¬X is an item set with the same attributes as X, but each attribute in ¬X takes any value except the one it takes in X.

  12. Approach: (cont’d): • Potentially Discriminatory and Nondiscriminatory Classification Rules • Let DI be the set of predetermined discriminatory items in DB (e.g., DI = {Foreign worker = yes, Race = black, Gender = female}). Frequent classification rules in FR (the set of frequent classification rules) fall into one of the following two classes: • A classification rule X → C is potentially discriminatory (PD) when X = A,B with A a nonempty discriminatory item set (a subset of DI) and B a nondiscriminatory item set. For example, {Foreign worker = yes, City = NYC} → Hire = no. • A classification rule X → C is potentially nondiscriminatory (PND) when X = D,B with both D and B nondiscriminatory item sets. For example, {Zip = 10451, City = NYC} → Hire = no or {Experience = low, City = NYC} → Hire = no. • The word “potentially” means that a PD rule could potentially lead to discriminatory decisions. Also, a PND rule could lead to discriminatory decisions in combination with some background knowledge.
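The PD/PND split is a simple set-intersection test. A minimal sketch, using the DI set from the slide (item spellings and rule examples are illustrative):

```python
# Predetermined discriminatory items.
DI = {"foreign_worker=yes", "race=black", "gender=female"}

# Candidate frequent rules as (premise item set, class item) pairs.
rules = [
    ({"foreign_worker=yes", "city=NYC"}, "hire=no"),   # PD
    ({"zip=10451", "city=NYC"}, "hire=no"),            # PND
    ({"experience=low", "city=NYC"}, "hire=no"),       # PND
]

def is_pd(premise, di=DI):
    """X -> C is PD when X contains a nonempty discriminatory item set."""
    return bool(premise & di)

pd_rules = [r for r in rules if is_pd(r[0])]
pnd_rules = [r for r in rules if not is_pd(r[0])]
print(len(pd_rules), len(pnd_rules))  # 1 2
```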

  13. Approach: (cont’d) • Direct Discrimination Measure • Definition 1. Let A,B → C be a classification rule such that conf(B → C) > 0. The extended lift of the rule is elift(A,B → C) = conf(A,B → C) / conf(B → C). The idea here is to evaluate the discrimination of a rule as the gain of confidence due to the presence of the discriminatory items. • Definition 2. Let α ∈ ℝ be a fixed threshold and let A be a discriminatory item set. A PD classification rule c = A,B → C is α-protective w.r.t. elift if elift(c) < α. Otherwise, c is α-discriminatory.
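Definitions 1 and 2 can be sketched directly. The toy records and the threshold α = 1.2 below are invented for illustration:

```python
# Toy data set where the discriminatory item gender=female co-occurs
# with the negative decision more often than the base rate.
records = [
    {"gender=female", "city=NYC", "hire=no"},
    {"gender=female", "city=NYC", "hire=no"},
    {"gender=male", "city=NYC", "hire=no"},
    {"gender=male", "city=NYC", "hire=yes"},
]

def conf(premise, conclusion, db):
    den = sum(1 for r in db if premise <= r)
    return sum(1 for r in db if premise | conclusion <= r) / den if den else 0.0

def elift(A, B, C, db):
    """elift(A,B -> C) = conf(A,B -> C) / conf(B -> C)."""
    return conf(A | B, C, db) / conf(B, C, db)

alpha = 1.2
e = elift({"gender=female"}, {"city=NYC"}, {"hire=no"}, records)
# e = 1.0 / 0.75, about 1.33 >= alpha, so the rule is alpha-discriminatory.
print("alpha-discriminatory" if e >= alpha else "alpha-protective")
```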

  14. Approach: (cont’d) • Indirect Discrimination Measure: • Definition 3. A PND classification rule r: D,B → C is a redlining rule if it could yield an α-discriminatory rule r′: A,B → C in combination with currently available background knowledge rules of the form rb1: A,B → D and rb2: D,B → A, where A is a discriminatory item set. • For example: {Zip = 10451, City = NYC} → Hire = no.
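A rough sketch of the redlining idea on toy data, where the zip code is perfectly correlated with a discriminatory item. Note this follows Definition 3 only in spirit: it measures elift of the implied rule directly on the data, whereas the paper bounds conf(A,B → C) from the background rules rb1 and rb2. Data and threshold are invented:

```python
# zip=10451 stands in for race=black in every record, so a rule on the
# zip code can hide discrimination on race.
records = [
    {"zip=10451", "race=black", "city=NYC", "hire=no"},
    {"zip=10451", "race=black", "city=NYC", "hire=no"},
    {"zip=10452", "race=white", "city=NYC", "hire=yes"},
    {"zip=10452", "race=white", "city=NYC", "hire=no"},
]

def conf(X, C, db):
    den = sum(1 for r in db if X <= r)
    return sum(1 for r in db if X | C <= r) / den if den else 0.0

def elift(A, B, C, db):
    return conf(A | B, C, db) / conf(B, C, db)

# r: {zip=10451, city=NYC} -> hire=no looks PND, but background knowledge
# (rb2: zip=10451, city=NYC -> race=black holds with confidence 1 here)
# lets us check the implied rule {race=black, city=NYC} -> hire=no:
alpha = 1.2
e = elift({"race=black"}, {"city=NYC"}, {"hire=no"}, records)
print(e >= alpha)  # True -> the zip-based rule behaves as a redlining rule
```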

  15. Approach: (cont’d) • Data Transformation for Direct Discrimination: • Direct Rule Protection: converts an α-discriminatory rule into an α-protective rule. • Data Transformation for Indirect Discrimination: • Indirect Rule Protection: turns a redlining rule into a nonredlining rule.
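A naive illustration of the direct rule protection idea, not the paper's exact perturbation scheme: relabel the class item in records supporting the α-discriminatory rule until its elift drops below α. The data, the rule, and the greedy record choice are all assumptions for the sketch:

```python
import copy

def conf(X, C, db):
    den = sum(1 for r in db if X <= r)
    return sum(1 for r in db if X | C <= r) / den if den else 0.0

def direct_rule_protection(A, B, C, not_C, db, alpha=1.2):
    """Flip class items on supporting records until elift(A,B -> C) < alpha."""
    db = copy.deepcopy(db)
    def elift():
        return conf(A | B, C, db) / conf(B, C, db)
    for r in db:
        if elift() < alpha:
            break
        if A | B | C <= r:          # record supports the PD rule
            r.difference_update(C)  # replace its class item C ...
            r.update(not_C)         # ... with the opposite class item
    return db

records = [
    {"gender=female", "city=NYC", "hire=no"},
    {"gender=female", "city=NYC", "hire=no"},
    {"gender=male", "city=NYC", "hire=no"},
    {"gender=male", "city=NYC", "hire=yes"},
]
clean = direct_rule_protection({"gender=female"}, {"city=NYC"},
                               {"hire=no"}, {"hire=yes"}, records)
```

After the transformation, elift of {gender=female, city=NYC} → hire=no on `clean` is below the threshold, i.e., the rule has become α-protective on the transformed data.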

  16. Data sets: • Adult data set: This data set consists of 48,842 records, split into a “train” part with 32,561 records and a “test” part with 16,281 records. The data set has 14 attributes (without class attribute). • German credit data set: We also used the German Credit data set. This data set consists of 1,000 records and 20 attributes (without class attribute) of bank account holders. This is a well-known real-life data set, containing both numerical and categorical attributes.

  17. Result: (table 1) • Misses cost (MC). This measure quantifies the percentage of rules among those extractable from the original data set that cannot be extracted from the transformed data set (a side effect of the transformation process). • Ghost cost (GC). This measure quantifies the percentage of rules among those extractable from the transformed data set that were not extractable from the original data set (a side effect of the transformation process).
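The two side-effect measures reduce to set differences over the rule sets. A small sketch with made-up rule identifiers:

```python
# MC: % of original rules lost; GC: % of transformed-set rules that are new.
def mc_gc(original_rules, transformed_rules):
    orig, trans = set(original_rules), set(transformed_rules)
    mc = len(orig - trans) / len(orig) * 100
    gc = len(trans - orig) / len(trans) * 100
    return mc, gc

orig = {"r1", "r2", "r3", "r4"}    # rules mined from the original data
trans = {"r1", "r2", "r5"}         # rules mined after transformation
print(mc_gc(orig, trans))  # MC = 50.0 (r3, r4 lost), GC about 33.3 (r5 is new)
```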

  18. Result: (table 2)

  19. Result: (tables 3 and 4) Tables 3 and 4 show lower information loss in terms of the GC measure in the Adult data set than in the German Credit data set.

  20. Future plans: • This could be applied in the Indian scenario: • To check for corruption • To detect gender discrimination

  21. References: • S. Hajian, J. Domingo-Ferrer, and A. Martínez-Ballesté, “Rule Protection for Indirect Discrimination Prevention in Data Mining,” Proc. Eighth Int’l Conf. Modeling Decisions for Artificial Intelligence (MDAI ’11), pp. 211-222, 2011. • D. Pedreschi, S. Ruggieri, and F. Turini, “Discrimination-Aware Data Mining,” Proc. 14th ACM Int’l Conf. Knowledge Discovery and Data Mining (KDD ’08), pp. 560-568, 2008. • S. Ruggieri, D. Pedreschi, and F. Turini, “Data Mining for Discrimination Discovery,” ACM Trans. Knowledge Discovery from Data, vol. 4, no. 2, article 9, 2010. • S. Ruggieri, D. Pedreschi, and F. Turini, “DCUBE: Discrimination Discovery in Databases,” Proc. ACM Int’l Conf. Management of Data (SIGMOD ’10), pp. 1127-1130, 2010.

  22. THANK YOU…!!!
