Page 40 - Demo

All rights reserved. Infringing the author's rights by copying or printing exposes the offender to legal liability.

range

e- Discretization: the raw values of a numeric attribute are replaced by interval labels or conceptual labels.

III. Data Cleaning:
Data cleaning is a form of data preprocessing. It addresses several kinds of problems in the data, presented below.

1- Missing and incomplete values: the data lack attribute values, lack certain attributes of interest, or contain only aggregate data. We can overcome this problem in the following ways:

a- Ignore the tuple: This is usually done when the class label is missing. This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably. By ignoring the tuple, we do not make use of the remaining attributes’ values in the tuple, and such data could have been useful to the task at hand.

b- Fill in the missing value manually: In general, this approach is time consuming and may not be feasible for a large data set with many missing values.

c- Use a global constant to fill in the missing value: Replace all missing attribute values with the same constant, such as a label like “Unknown” or −∞.

d- Use a measure of central tendency for the attribute: Use the mean, the median, or the mode of the attribute to fill in the missing value.

e- Use the most probable value to fill in the missing value: This value may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction.

2- Noisy Data: Noise is a random error or variance in a measured variable. The data can be smoothed to remove noise using the following techniques:

a- Binning: First sort the data and partition them into equal-frequency bins. Then smooth by bin means, by bin medians, or by bin boundaries.
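The fill-in strategies (c) and (d) for missing values can be illustrated with a minimal sketch. The function name `fill_missing`, its parameters, and the sample values are hypothetical and chosen only for illustration; missing entries are assumed to be represented as `None`.

```python
from statistics import mean, median, mode


def fill_missing(values, strategy="mean", constant="Unknown"):
    """Return a copy of `values` with every None replaced by a fill value.

    strategy: "constant" (global constant, method c) or
              "mean" / "median" / "mode" (central tendency, method d).
    """
    # Only the observed (non-missing) values take part in the statistics.
    observed = [v for v in values if v is not None]

    if strategy == "constant":
        fill = constant
    elif strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "mode":
        fill = mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    return [fill if v is None else v for v in values]


# Illustrative attribute with two missing values (made-up data).
ages = [20, None, 30, 30, None, 60]

by_constant = fill_missing(ages, "constant")
by_mean = fill_missing(ages, "mean")
by_median = fill_missing(ages, "median")
by_mode = fill_missing(ages, "mode")
```

The mean is sensitive to extreme values, so for a skewed attribute the median is often the safer choice of the three.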
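The binning procedure can be sketched as follows: sort the values, split them into equal-frequency bins, then replace each value by its bin's mean, its bin's median, or the closer of the two bin boundaries. The function names and the sample prices are assumptions made for illustration, and the sketch assumes the number of values is a multiple of the bin size.

```python
from statistics import mean, median


def equal_frequency_bins(values, bin_size):
    """Sort the values and partition them into bins of `bin_size` items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace every value in a bin by the bin median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer bin boundary (min or max)."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # each bin is already sorted
        # Ties go to the lower boundary.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


# Illustrative price data (made-up values), three values per bin.
prices = [21, 4, 28, 15, 34, 8, 24, 25, 21]
bins = equal_frequency_bins(prices, 3)

by_means = smooth_by_means(bins)
by_medians = smooth_by_medians(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Here the sorted bins are [4, 8, 15], [21, 21, 24], and [25, 28, 34]. Smoothing by bin means gives 9, 22, and 29 for the three bins, while smoothing by boundaries turns the first bin into [4, 4, 15].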