Data Mining: Concepts and Techniques, 2nd Edition, Solution Manual. Jiawei Han and Micheline Kamber. Morgan Kaufmann Publishers.
Our ability to generate and collect data has been increasing rapidly. Not only are all of our business, scientific, and government transactions now computerized, but the widespread use of digital cameras, publication tools, and bar codes also generate data. On the collection side, scanned text and image platforms, satellite remote sensing systems, and the World Wide Web have flooded us with a tremendous amount of data. This explosive growth has generated an even more urgent need for new techniques and automated tools that can help us transform this data into useful information and knowledge. Like the first edition, voted the most popular data mining book by KD Nuggets readers, this book explores concepts and techniques for the discovery of patterns hidden in large data sets, focusing on issues relating to their feasibility, usefulness, effectiveness, and scalability. However, since the publication of the first edition, great progress has been made in the development of new data mining methods, systems, and applications.
If the predicted value for a data point differs greatly from the given value, then the given value may be considered an outlier.
Outlier detection based on clustering techniques may be more reliable. Because clustering is unsupervised, we do not need to make any assumptions regarding the data distribution. In contrast, regression-based prediction methods require us to make assumptions about the data distribution, which may be inaccurate when there is insufficient data. Recent applications pay special attention to spatiotemporal data streams. A spatiotemporal data stream contains spatial information that changes over time and arrives in the form of stream data, i.e., the data flow in and out like streams. Examples include sequences of sensor images of a geographical region over time.
The climate images from satellites. Data that describe the evolution of natural phenomena, such as forest coverage, forest fire, and so on. The knowledge that can be mined from spatiotemporal data streams really depends on the application.
However, one unique type of knowledge about stream data is the pattern of spatial change with respect to time. For example, the changing traffic status of several highway junctions in a city, from the early morning to rush hours and back to off-peak hours, can show clearly where the traffic comes from and goes to, and hence would help traffic officers plan effective alternative lanes to reduce the traffic load. As another example, the sudden appearance of a point in a spectrum space image may indicate that a new planet is being formed.
The changing of humidity, temperature, and pressure in climate data may reveal patterns of how a new typhoon is created. One major challenge is how to deal with the continually arriving large-scale data.
Since the data keep flowing in and each snapshot of data is usually huge, it is often impossible to store the complete data set. Some aggregation or compression techniques may have to be applied, and old raw data may have to be dropped. Mining under such aggregated or lossy data is challenging. In addition, some patterns may occur with respect to a long time period, but it may not be possible to keep the data for such a long duration.
Thus, these patterns may not be uncovered. The spatial data sensed may not be very accurate, so the algorithms must have a high tolerance for noise. Take mining space images as the application: we seek to observe whether any new planet is being created or any old planet is disappearing. This is a change-detection problem. Since the image frames keep coming, that is, f1, f2, . . . , each new frame can be compared with the previous one. The algorithm can be sketched as follows.
Match the planets detected in the new frame against those detected in the previous frame, and check whether any planet is unmatched. If yes, report a planet appearance if an unmatched planet appears in the new frame, or a planet disappearance if an unmatched planet appears in the old frame. In fact, matching between two frames may not be easy, because the Earth is rotating and thus the sensed data may have slight variations.
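A minimal sketch of this matching step, assuming each frame has already been reduced to a set of detected object positions; the point coordinates and the tolerance `eps` are hypothetical:

```python
# Sketch of frame-to-frame change detection, assuming each frame has
# already been reduced to a set of detected object positions (2-D points).
# The matching tolerance `eps` absorbs slight sensing variations.

def match_objects(old_frame, new_frame, eps=1.0):
    """Greedy nearest-neighbor matching between two frames.

    Returns (appeared, disappeared): objects only in the new frame,
    and objects only in the old frame.
    """
    unmatched_new = list(new_frame)
    disappeared = []
    for p in old_frame:
        # Find the closest still-unmatched object in the new frame.
        best = min(unmatched_new,
                   key=lambda q: (p[0]-q[0])**2 + (p[1]-q[1])**2,
                   default=None)
        if best is not None and (p[0]-best[0])**2 + (p[1]-best[1])**2 <= eps**2:
            unmatched_new.remove(best)   # matched: same object, slight drift
        else:
            disappeared.append(p)        # no counterpart: object vanished
    return unmatched_new, disappeared    # leftovers in the new frame appeared

f1 = [(10.0, 10.0), (50.0, 50.0)]
f2 = [(10.2, 9.9), (80.0, 80.0)]        # first object drifted; a new object at (80, 80)
appeared, disappeared = match_objects(f1, f2)
```

Only two consecutive frames are ever held in memory, which is what makes the approach compatible with the stream setting described above.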
Some advanced techniques from image processing may be applied. The overall skeleton of the algorithm is simple: each new incoming image frame is compared only with the previous one, satisfying the time and resource constraints, and the reported change would be useful.

Describe the differences between the following approaches for the integration of a data mining system with a database or data warehouse system: no coupling, loose coupling, semitight coupling, and tight coupling. State which approach you think is the most popular, and why.
The differences between the following architectures for the integration of a data mining system with a database or data warehouse system are as follows.

No coupling: the data mining system uses sources such as flat files to obtain the initial data set to be mined, since no database system or data warehouse system functions are implemented as part of the process. Thus, this architecture represents a poor design choice.

Loose coupling: the data mining system is not integrated with the database or data warehouse system beyond using them as the source of the initial data set to be mined and as possible storage for the results. Thus, this architecture can take advantage of the flexibility, efficiency, and features (such as indexing) that the database and data warehousing systems may provide. However, it is difficult for loose coupling to achieve high scalability and good performance with large data sets, as many such systems are memory-based.

Semitight coupling: some of the data mining primitives, such as aggregation, sorting, or precomputation of statistical functions, are efficiently implemented in the database or data warehouse system for use by the data mining system during mining-query processing. Also, some frequently used intermediate mining results can be precomputed and stored in the database or data warehouse system, thereby enhancing the performance of the data mining system.

Tight coupling: the database or data warehouse system is fully integrated as part of the data mining system and thereby provides optimized data mining query processing. Thus, the data mining subsystem is treated as one functional component of an information system. This is a highly desirable architecture, as it facilitates efficient implementations of data mining functions, high system performance, and an integrated information processing environment.
From the descriptions of the architectures provided above, it can be seen that tight coupling is the best alternative, without regard to technical or implementation issues. However, as much of the technical infrastructure needed in a tightly coupled system is still evolving, implementation of such a system is nontrivial. Therefore, the most popular architecture is currently semitight coupling, as it provides a compromise between loose and tight coupling.
Describe three challenges to data mining regarding data mining methodology and user interaction issues.

Challenges to data mining regarding data mining methodology and user interaction issues include mining different kinds of knowledge in databases, interactive mining of knowledge at multiple levels of abstraction, and incorporation of background knowledge, among others. Below are descriptions of these first three challenges. Different users are interested in different kinds of knowledge and will require a wide range of data analysis and knowledge discovery tasks, such as data characterization, discrimination, association, classification, clustering, trend and deviation analysis, and similarity analysis.
Each of these tasks will use the same database in different ways and will require different data mining techniques. Interactive mining, with the use of OLAP operations on a data cube, allows users to focus the search for patterns, providing and refining data mining requests based on returned results. The user can then interactively view the data and discover patterns at multiple granularities and from different angles.
Background knowledge, or information regarding the domain under study such as integrity constraints and deduction rules, may be used to guide the discovery process and allow discovered patterns to be expressed in concise terms and at different levels of abstraction. This helps to focus and speed up a data mining process or judge the interestingness of discovered patterns.
What are the major challenges of mining a huge amount of data (such as billions of tuples), in comparison with mining a small amount of data (such as a data set of a few hundred tuples)?

One challenge to data mining regarding performance issues is the efficiency and scalability of data mining algorithms.
Data mining algorithms must be efficient and scalable in order to effectively extract information from large amounts of data in databases within predictable and acceptable running times. Another challenge is the parallel, distributed, and incremental processing of data mining algorithms. The need for parallel and distributed data mining algorithms has been brought about by the huge size of many databases, the wide distribution of data, and the computational complexity of some data mining methods.
Due to the high cost of some data mining processes, incremental data mining algorithms incorporate database updates without having to mine the entire data set again from scratch.

Students must research their answers for this question. Major data mining challenges for two applications, data streams and bioinformatics, are addressed here.

Data streams: data stream analysis presents multiple challenges. First, data streams are continuously flowing in and out as well as changing dynamically.
A data analysis system that is to successfully handle this type of data needs to operate in real time and be able to adapt to changing patterns that might emerge. Another major challenge is that the size of stream data can be huge or even infinite. Because of this size, only a single scan or a small number of scans are typically allowed. For further details on mining data streams, please consult Chapter 8.
Bioinformatics: the field of bioinformatics encompasses many other subfields, such as genomics, proteomics, molecular biology, and chemoinformatics. Each of these subfields has many research challenges.
Some of the major challenges of data mining in the field of bioinformatics are outlined as follows. Due to limitations of space, some of the terminology used here may not be explained. Biological data are growing at an exponential rate.
It has been estimated that genomic and proteomic data are doubling every 12 months. Most of these data are scattered in unstructured and nonstandard forms across various databases throughout the research community. Many biological experiments do not yield exact results and are prone to errors, because it is very difficult to model exact biological conditions and processes. For example, the structure of a protein is not rigid and depends on its environment.
Hence, the structures determined by nuclear magnetic resonance (NMR) or crystallography experiments may not represent the exact structure of the protein. Since these experiments are performed in parallel by many institutions and scientists, they may each yield slightly different structures.
The consolidation and validation of these conflicting data is a difficult challenge. Public biological databases have become very popular in the past few years; however, due to intellectual property concerns, a great deal of useful biological information remains buried in proprietary databases within large pharmaceutical companies. Most of the data generated in the biological research community come from experiments. Most of the results are published, but they are seldom recorded in databases along with the experimental details (who, when, how, etc.).
Hence, a great deal of useful information is buried in published and unpublished literature. This has given rise to the need for the development of text mining systems. For example, many experimental results regarding protein interactions have been published.
Mining this information may provide crucial insight into biological pathways and help predict potential interactions. The extraction and development of domain-specific ontologies is another related research challenge. The major steps in drug discovery include target identification, target validation, lead discovery, and lead optimization. The most time-consuming step is the lead discovery phase. In this step, large databases of compounds need to be mined to identify potential lead candidates that will suitably interact with the potential target.
Currently, due to the lack of effective data mining systems, this step involves many trial-and-error iterations of wet lab or protein assay experiments. These experiments are highly time-consuming and costly. Hence, one of the current challenges in bioinformatics includes the development of intelligent and computational data mining systems that can eliminate false positives and generate more true positives before the wet lab experimentation stage.
The docking problem is an especially tricky problem, because it is governed by many physical interactions at the molecular level. The main problem is the large solution space generated by the complex interactions at the molecular level.
The molecular docking problem remains largely unsolved. Other related research areas include protein classification systems based on structure and function. A great deal of progress has been made in the past decade in the development of algorithms for the analysis of genomic data. Statistical and other methods are available.
A large research community in data mining is focusing on adapting these pattern analysis and classification methods for mining microarray and gene expression data.

Chapter 2: Data Preprocessing

Data quality can be assessed in terms of accuracy, completeness, and consistency.
Propose two other dimensions of data quality.

Other dimensions that can be used to assess the quality of data include timeliness, believability, value added, interpretability, and accessibility, described as follows. Timeliness: data must be available within a time frame that allows them to be useful for decision making. Believability: data values must be within the range of possible results in order to be useful for decision making. Value added: data must provide additional value in terms of information that offsets the cost of collecting and accessing them.
Interpretability: data must not be so complex that the effort to understand the information they provide exceeds the benefit of their analysis. Accessibility: data must be accessible, so that the effort to collect them does not exceed the benefit of their use.

Suppose that the values for a given set of data are grouped into intervals. The intervals and corresponding frequencies are as follows.
Using the chapter's approximate-median equation for grouped data, an approximate median can be obtained. Give three additional commonly used statistical measures (i.e., ones not illustrated in this chapter) for the characterization of data dispersion, and discuss how they can be computed efficiently in large databases.

Data dispersion, also known as variance analysis, is the degree to which numeric data tend to spread, and it can be characterized by statistical measures such as the mean deviation, measures of skewness, and the coefficient of variation. The mean deviation is defined as the arithmetic mean of the absolute deviations from the mean, calculated as (1/N) Σ |xi − x̄|; this value will be greater for distributions with a larger spread.
A common measure of skewness is (x̄ − mode)/s, the distance between the mean and the mode expressed in units of the standard deviation s. The coefficient of variation is the standard deviation expressed as a percentage of the arithmetic mean, calculated as (s/x̄) × 100. Note that all of the input values used to calculate these three statistical measures are algebraic measures. Thus, the value for the entire database can be efficiently calculated by partitioning the database, computing the values for each of the separate partitions, and then merging these values into an algebraic equation that can be used to calculate the value for the entire database.
The measures of dispersion described here were obtained from Statistical Methods in Research and Production, 4th ed., edited by O. L. Davies and Peter L. Goldsmith.

Suppose that the data for analysis include the attribute age. The age values for the data tuples are, in increasing order: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70. What is the median? The median (middle value of the ordered set) is 25. This data set has two values that occur with the same highest frequency and is, therefore, bimodal; the modes (the values occurring most frequently) are 25 and 35. The midrange (the average of the largest and smallest values) is (13 + 70)/2 = 41.5. The first quartile (corresponding to the 25th percentile) is 20, and the third quartile (corresponding to the 75th percentile) is 35. The five-number summary of a distribution consists of the minimum value, first quartile, median, third quartile, and maximum value.
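These values can be verified with a short script using only Python's standard library; the quartile method is chosen to reproduce the textbook convention:

```python
# Verifying the summary statistics for the age data with Python's stdlib.
import statistics

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

median = statistics.median(ages)          # middle value of the ordered list
modes = statistics.multimode(ages)        # all values tied for highest frequency
midrange = (min(ages) + max(ages)) / 2    # average of the extremes

# The default "exclusive" method places quartiles at positions (N+1)/4 and
# 3(N+1)/4, which for these 27 values lands exactly on 20 and 35.
q1, _, q3 = statistics.quantiles(ages, n=4)
five_number_summary = (min(ages), q1, median, q3, max(ages))
```

Different quantile conventions can yield slightly different quartiles; the point of pinning the method here is only to match the values stated above.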
It provides a good summary of the shape of the distribution, and for this data it is: 13, 20, 25, 35, 70. The boxplot itself is omitted here; please refer to the corresponding figure in Chapter 2. A quantile plot is a graphical method used to show the approximate percentage of values below or equal to the independent variable in a univariate distribution. Thus, it displays quantile information for all the data, where the values measured for the independent variable are plotted against their corresponding quantiles. A quantile-quantile plot, however, graphs the quantiles of one univariate distribution against the corresponding quantiles of another univariate distribution.
Both axes display the range of values measured for their corresponding distribution, and points are plotted that correspond to the quantile values of the two distributions. Points that lie above such a line indicate a correspondingly higher value for the distribution plotted on the y-axis than for the distribution plotted on the x-axis at the same quantile.
The opposite effect is true for points lying below this line.

In many applications, new data sets are incrementally added to existing large data sets. Thus, an important consideration for computing descriptive data summaries is whether a measure can be computed efficiently in an incremental manner. Use count, standard deviation, and median as examples to show that a distributive or algebraic measure facilitates efficient incremental computation, whereas a holistic measure does not.
Count is a distributive measure and is easily updated under incremental additions. For the standard deviation, if we store the sum of the squared existing values, the sum of the existing values, and the count of the existing values, we can easily generate the new standard deviation using the formula provided in the book. We simply need to calculate the squared sum of the new numbers, add it to the existing squared sum, update the sum and the count, and plug these into the calculation to obtain the new standard deviation.
All of this is done without looking at the whole data set and is thus easy to compute incrementally. The median, in contrast, is a holistic measure: to calculate it accurately, we have to look at every value in the data set. When we add a new value or values, we have to sort the new set and then find the median of that new sorted set. This is much harder, and it makes the incremental addition of new values difficult.

In real-world data, tuples with missing values for some attributes are a common occurrence.
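A small sketch contrasting the two cases: count and standard deviation are maintained from three running aggregates, while the median still needs the full data:

```python
# Sketch of incremental maintenance of count and standard deviation using
# only three running aggregates (count, sum, sum of squares), versus the
# median, which requires re-examining the whole data set.
import math
import statistics

class RunningStats:
    def __init__(self):
        self.n = 0          # count (distributive)
        self.s = 0.0        # sum (distributive)
        self.sq = 0.0       # sum of squares (distributive)

    def add(self, values):
        # Incremental update: only the aggregates are touched,
        # never the previously seen raw values.
        for v in values:
            self.n += 1
            self.s += v
            self.sq += v * v

    def std(self):
        # Population standard deviation from the algebraic identity
        # var = E[x^2] - (E[x])^2.
        mean = self.s / self.n
        return math.sqrt(self.sq / self.n - mean * mean)

rs = RunningStats()
rs.add([13, 15, 16, 16, 19])
rs.add([20, 20, 21])            # new batch: no rescan of old values needed

# The incrementally maintained std matches a full recomputation...
full = [13, 15, 16, 16, 19, 20, 20, 21]
# ...but the median has no such shortcut: it needs all the raw values.
med = statistics.median(sorted(full))
```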
Describe various methods for handling this problem.

The various methods for handling the problem of missing values in data tuples include the following. Ignoring the tuple: this is usually done when the class label is missing (assuming the mining task involves classification or description). This method is not very effective unless the tuple contains several attributes with missing values; it is especially poor when the percentage of missing values per attribute varies considerably.
Filling in the missing value manually: in general, this approach is time-consuming and may not be reasonable for large data sets with many missing values, especially when the value to be filled in is not easily determined. Using the attribute mean: for example, compute the average income of all customers and use this value to replace any missing values for income. Using the attribute mean for all samples belonging to the same class as the given tuple: for example, if classifying customers according to credit risk, replace a missing value with the average income of customers in the same credit-risk category as the given tuple. Using the most probable value: this may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction.
For example, using the other customer attributes in the data set, we can construct a decision tree to predict the missing values for income.
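A minimal sketch of the class-mean fill described above; the records and field names are hypothetical:

```python
# Filling a missing income with the mean income of tuples in the same
# credit-risk class. Records and field names are hypothetical.
from statistics import mean

customers = [
    {"risk": "low",  "income": 60000},
    {"risk": "low",  "income": 80000},
    {"risk": "high", "income": 30000},
    {"risk": "low",  "income": None},   # missing value to fill
]

def fill_income_by_class(records):
    # Mean income per class, computed over the non-missing values only.
    classes = {r["risk"] for r in records}
    class_mean = {c: mean(r["income"] for r in records
                          if r["risk"] == c and r["income"] is not None)
                  for c in classes}
    for r in records:
        if r["income"] is None:
            r["income"] = class_mean[r["risk"]]  # replace with the class mean
    return records

fill_income_by_class(customers)
```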
Using the data for age given above, use smoothing by bin means (with a bin depth of 3) to smooth the data. Illustrate your steps and comment on the effect of this technique for the given data.

The following steps are required to smooth the above data using smoothing by bin means with a bin depth of 3.
Sort the data. This step is not required here as the data are already sorted. Partition the data into equal-frequency bins of size 3. Calculate the arithmetic mean of each bin.
Replace each of the values in each bin by the arithmetic mean calculated for that bin. For this data, the nine bins and their means are: Bin 1: 13, 15, 16 (mean 14.67); Bin 2: 16, 19, 20 (mean 18.33); Bin 3: 20, 21, 22 (mean 21); Bin 4: 22, 25, 25 (mean 24); Bin 5: 25, 25, 30 (mean 26.67); Bin 6: 33, 33, 35 (mean 33.67); Bin 7: 35, 35, 35 (mean 35); Bin 8: 36, 40, 45 (mean 40.33); Bin 9: 46, 52, 70 (mean 56). The technique smooths out much of the noise in the data, although the last bin's mean is pulled up by the outlying value 70.

Values that fall outside of the set of clusters may be considered outliers. Alternatively, a combination of computer and human inspection can be used, where a predetermined data distribution is implemented to allow the computer to identify possible outliers. These possible outliers can then be verified by human inspection with much less effort than would be required to verify the entire initial data set.
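The binning steps above can be checked with a short script:

```python
# Equal-frequency binning and smoothing by bin means on the age data,
# with a bin depth of 3.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

depth = 3
# Step 1: the data are already sorted. Step 2: partition into bins of 3.
bins = [ages[i:i + depth] for i in range(0, len(ages), depth)]

# Steps 3-4: replace every value in a bin by that bin's arithmetic mean.
smoothed = [round(sum(b) / len(b), 2) for b in bins for _ in b]
```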
Other methods that can be used for data smoothing include alternate forms of binning, such as smoothing by bin medians or smoothing by bin boundaries. Alternatively, equal-width bins can be used to implement any of these forms of binning, where the interval range of values in each bin is constant. Methods other than binning include using regression techniques to smooth the data by fitting it to a function, such as through linear or multiple regression.
Classification techniques can be used to implement concept hierarchies that smooth the data by rolling up lower-level concepts to higher-level concepts.

Discuss issues to consider during data integration.

Data integration involves combining data from multiple sources into a coherent data store. Issues that must be considered during such integration include the following. Schema integration: the metadata from the different data sources must be integrated in order to match up equivalent real-world entities.
This is referred to as the entity identification problem. Redundancy: derived attributes may be redundant, and inconsistent attribute naming may also lead to redundancies in the resulting data set. Duplications at the tuple level may also occur and thus need to be detected and resolved.
Detection and resolution of data value conflicts: differences in representation, scaling, or encoding may cause the same real-world entity's attribute values to differ across the data sources being integrated.
Are these two variables positively or negatively correlated? For the variable age, the mean and the intermediate calculations are omitted here; see the corresponding figure in Chapter 2. The correlation coefficient is positive; therefore, the variables are positively correlated.

What are the value ranges of the following normalization methods? Use the two methods below to normalize the following group of data. For readability, let A be the attribute age.
Using the corresponding normalization equations, the transformed values can be computed. Given the data, one may prefer decimal scaling for normalization, because such a transformation would maintain the data distribution and be intuitive to interpret, while still allowing mining on specific age groups.
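A sketch of three normalization methods discussed in the chapter, applied to the age data; the function names are ours:

```python
# Min-max normalization to [0, 1], z-score normalization, and decimal
# scaling, applied to the age data.
import math
import statistics

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    # Maps [lo, hi] linearly onto [new_lo, new_hi]; values outside the
    # observed range would fall outside the target range.
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mu, sigma):
    # Unbounded range; values beyond the observed min/max remain valid.
    return (v - mu) / sigma

def decimal_scale(v, max_abs):
    # Divide by 10^j, with the smallest j such that max(|v'|) < 1.
    j = math.floor(math.log10(max_abs)) + 1 if max_abs >= 1 else 0
    return v / (10 ** j)

mu, sigma = statistics.mean(ages), statistics.pstdev(ages)
mm = min_max(35, min(ages), max(ages))              # (35 - 13) / 57
z35 = z_score(35, mu, sigma)                        # z-score of age 35
ds = decimal_scale(35, max(abs(a) for a in ages))   # 35 / 100
```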
As values outside the current minimum and maximum may be present in future data, min-max normalization is less appropriate in that case. The z-score transformation may not be as intuitive to the user in comparison with decimal scaling.

Suppose a group of 12 sales price records has been sorted as follows (the list of values is garbled in this copy). Partition them into three bins by each of the following methods: equal-frequency partitioning, equal-width partitioning, and clustering.

Use a flow chart to summarize the following procedures for attribute subset selection: (a) stepwise forward selection;
(b) stepwise backward elimination.

Propose several methods for median approximation. Analyze their respective complexity under different parameter settings, and decide to what extent the real value can be approximated. Moreover, suggest a heuristic strategy to balance accuracy and complexity, and then apply it to all of the methods you have given.
This question can be dealt with either theoretically or empirically, but doing some experiments to get the result is perhaps more interesting. One can use data sets sampled from different distributions, two of them symmetric and two of them skewed. For example, using the interval-based median approximation formula, we can partition the data into k equal-width intervals, record only the interval frequencies, and interpolate the median within the interval holding the middle of the data. Obviously, the error incurred will decrease as k becomes larger; however, the time used in the whole procedure will also increase. The product of the error made and the time used is a good optimality measure.
(c) A combination of forward selection and backward elimination.

In practice, the parameter value can be chosen to improve system performance. There are also other approaches for median approximation; the student may suggest a few, analyze the best trade-off point, and compare the results of the different approaches. A possible such approach is to hierarchically divide the whole data set into intervals: first partition the data into several regions and locate the region in which the median resides, then partition that region further, and so on. This iterates until the width of the subregion reaches a predefined threshold, and then the median approximation formula as stated above is applied.
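The interval-based approximation can be sketched as follows; the interpolation is the standard grouped-data median formula, and the choice of k is the accuracy/complexity knob discussed above:

```python
# Interval-based median approximation: partition the data into k
# equal-width intervals, keep only the interval counts, and interpolate
# within the interval that contains the middle of the data.
def approx_median(data, k):
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    freq = [0] * k
    for v in data:
        # Clamp so that v == hi falls into the last interval.
        i = min(int((v - lo) / width), k - 1)
        freq[i] += 1

    half = len(data) / 2
    cum = 0                                    # frequency below the current interval
    for i, f in enumerate(freq):
        if cum + f >= half:
            left = lo + i * width              # lower boundary of the median interval
            # median ~= L1 + ((N/2 - cum_below) / freq_median) * width
            return left + (half - cum) / f * width
        cum += f

data = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
est = approx_median(data, 6)    # true median is 25; only 6 counts are stored
```

With k = 6 the estimate is 29.15; raising k tightens the estimate at the cost of more counters, illustrating the error/time trade-off.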
In this way, we can confine the median to a smaller area without globally partitioning all of the data into shorter intervals, which would be expensive. The cost is proportional to the number of intervals.

However, there is no commonly accepted subjective similarity measure, and using different similarity measures may produce different results. Nonetheless, some apparently different similarity measures may be equivalent after some transformation. Suppose we have the following two-dimensional data set: A1 A2 x1 1.
Use Euclidean distance on the transformed data to rank the data points. For age, an equiwidth histogram of width 10 can be used. Using these definitions, we obtain the distance from each point to the query point. Based on the cosine similarity, the order is x1, x3, x4, x2, x5. After normalizing the data, each vector has unit length; conceptually, the norm is the length of the vector.
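The data values are garbled in this copy, so the following sketch uses stand-in points, chosen to reproduce the ordering stated in the text, to illustrate the equivalence: for unit vectors, d² = 2 − 2·cos, so ranking by Euclidean distance after normalization agrees with ranking by cosine similarity.

```python
# Hypothetical 2-D points illustrating that, after normalizing all vectors
# to unit length, the Euclidean-distance ranking to the normalized query
# equals the cosine-similarity ranking (d^2 = 2 - 2*cos for unit vectors).
import math

points = {"x1": (1.5, 1.7), "x2": (2.0, 1.9), "x3": (1.6, 1.8),
          "x4": (1.2, 1.5), "x5": (1.5, 1.0)}
query = (1.4, 1.6)

def cos_sim(a, b):
    dot = a[0]*b[0] + a[1]*b[1]
    return dot / (math.hypot(*a) * math.hypot(*b))

def unit(a):
    n = math.hypot(*a)               # the norm: the length of the vector
    return (a[0]/n, a[1]/n)

by_cos = sorted(points, key=lambda k: cos_sim(points[k], query), reverse=True)

uq = unit(query)
by_dist = sorted(points, key=lambda k: math.dist(unit(points[k]), uq))
```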
Examples of sampling include simple random sampling (with or without replacement), cluster sampling, and stratified sampling. Based on the Euclidean distance of the normalized points, the order is x1, x3, x4, x2, x5, which is the same as the cosine-similarity order.

ChiMerge [Ker92] is a supervised, bottom-up (i.e., merge-based) data discretization method. Perform data discretization for each of the four numerical attributes using the ChiMerge method.
Let the stopping criteria be as specified in the exercise. You need to write a small program to do this to avoid clumsy numerical computation. Submit your simple analysis and your test results. The basic algorithm of ChiMerge is: sort the values of the attribute in ascending order; initially place each distinct value in its own interval; then repeatedly compute the χ² value for each pair of adjacent intervals and merge the pair with the lowest χ² value, until the stopping criterion is met. The resulting final intervals and split points for sepal length, sepal width, petal length, and petal width are omitted here.

Propose an algorithm, in pseudocode or in your favorite programming language, for the automatic generation of a concept hierarchy for numerical data based on the equal-width partitioning rule. Also, an alternative binning method could be implemented, such as smoothing by bin modes.
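One way to sketch equal-width concept-hierarchy generation; the fanout and depth are arbitrary parameters:

```python
# Automatic concept-hierarchy generation for numerical data by recursive
# equal-width partitioning: each level splits every interval of the
# previous level into `fanout` equal-width subintervals.
def build_hierarchy(lo, hi, fanout=3, levels=2):
    """Return a list of levels; each level is a list of (low, high) bins."""
    hierarchy = [[(lo, hi)]]                     # level 0: the whole range
    for _ in range(levels):
        prev = hierarchy[-1]
        nxt = []
        for (a, b) in prev:
            w = (b - a) / fanout                 # equal width within the parent
            nxt.extend((a + i * w, a + (i + 1) * w) for i in range(fanout))
        hierarchy.append(nxt)
    return hierarchy

h = build_hierarchy(0, 90, fanout=3, levels=2)
# h[1] partitions [0, 90] into three width-30 bins; h[2] into nine width-10 bins.
```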
The user can again specify more meaningful names for the concept hierarchy levels generated by reviewing the maximum and minimum values of the bins with respect to background knowledge about the data.
Robust data loading poses a challenge in database systems, because the input data are often dirty. In many cases, an input record may have several missing values, and some records could be contaminated (i.e., contain values of the wrong type or outside the expected range).
Work out an automated data cleaning and loading algorithm so that the erroneous data will be marked and contaminated data will not be mistakenly inserted into the database during data loading.
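One possible shape of such a cleaning-and-loading pass; the field names, validator, and policy (mark missing, reject contaminated) are all hypothetical:

```python
# Cleaning-and-loading sketch: each record is validated field by field;
# records with missing values are marked for later repair, and contaminated
# records (wrong type or out-of-range values) are diverted to a reject list
# instead of being inserted. Field names and thresholds are hypothetical.
def load(records, table, rejects):
    def valid_age(v):
        return isinstance(v, int) and 0 <= v <= 130

    for rec in records:
        age = rec.get("age")
        if age is None:
            rec["flags"] = ["age_missing"]       # mark, but still load
            table.append(rec)
        elif not valid_age(age):
            rejects.append(rec)                  # contaminated: do not insert
        else:
            table.append(rec)

table, rejects = [], []
load([{"age": 35}, {"age": None}, {"age": -5}], table, rejects)
```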
We can, for example, use the data already in the database to construct a decision tree to induce missing values for a given attribute, and at the same time apply human-entered rules describing how to correct wrong data types.

Chapter 3: Data Warehouse and OLAP Technology: An Overview

State why, for the integration of multiple heterogeneous information sources, many companies in industry prefer the update-driven approach (which constructs and uses data warehouses) rather than the query-driven approach (which applies wrappers and integrators).
Describe situations where the query-driven approach is preferable over the update-driven approach.

For decision-making queries and frequently asked queries, the update-driven approach is preferable.
This is because expensive data integration and aggregate computation are done before query processing time. For the data collected in multiple heterogeneous databases to be used in decision-making processes, any semantic heterogeneity problems among multiple databases must be analyzed and solved so that the data can be integrated and summarized.
If the query-driven approach is employed, these queries will be translated into multiple often complex queries for each individual database. The translated queries will compete for resources with the activities at the local sites, thus degrading their performance. In addition, these queries will generate a complex answer set, which will require further filtering and integration. Thus, the query-driven approach is, in general, inefficient and expensive.
The update-driven approach employed in data warehousing is faster and more efficient, since most of the queries needed can be answered off-line. The query-driven approach is preferable, however, when queries rely on the most current data, because data warehouses do not contain the most current information. Briefly compare the following concepts.
You may use an example to explain your point(s). The snowflake schema and fact constellation are both variants of the star schema model, which consists of a fact table and a set of dimension tables; the snowflake schema contains some normalized dimension tables, whereas the fact constellation contains a set of fact tables that share some common dimension tables.
A starnet query model is a query model (not a schema model) that consists of a set of radial lines emanating from a central point. Each step away from the center represents stepping down a concept hierarchy of the dimension. The starnet query model, as suggested by its name, is used for querying and provides users with a global view of OLAP operations. Data transformation is the process of converting data from heterogeneous sources into a unified data warehouse format or semantics.
Refresh is the function propagating the updates from the data sources to the warehouse. An enterprise warehouse provides corporate-wide data integration, usually from one or more operational systems or external information providers, and is cross-functional in scope, whereas the data mart is confined to specific selected subjects such as customer, item, and sales for a marketing data mart.
An enterprise warehouse typically contains detailed data as well as summarized data, whereas the data in a data mart tend to be summarized. The implementation cycle of an enterprise warehouse may take months or years, whereas that of a data mart is more likely to be measured in weeks. A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized.
A virtual warehouse is easy to build but requires excess capacity on operational database servers. Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two measures count and charge, where charge is the fee that a doctor charges a patient for a visit.
Three classes of schemas popularly used for modeling data warehouses are the star schema, the snowflake schema, and the fact constellation schema. A star schema is shown in Figure 3. The operations to be performed are: (A star schema for the data warehouse of Exercise 3.) Suppose that a data warehouse for Big University consists of the following four dimensions: When at the lowest conceptual level (e.g., for a given combination of values from all four dimensions), avg grade stores the actual grade.
At higher conceptual levels, avg grade stores the average grade for the given combination. A snowflake schema is shown in Figure 3. (A snowflake schema for the data warehouse of Exercise 3.) The specific OLAP operations to be performed are: Suppose that a data warehouse consists of the four dimensions date, spectator, location, and game, and the two measures count and charge, where charge is the fare that a spectator pays when watching a game on a given date.
Spectators may be students, adults, or seniors, with each category having its own charge rate. Taking this cube as an example, briefly discuss the advantages and problems of using a bitmap index structure. Bitmap indexing is advantageous for low-cardinality domains.
For example, in this cube, if the dimension location is bitmap indexed, then comparison, join, and aggregation operations over location are reduced to bit arithmetic, which substantially reduces the processing time. For dimensions with high cardinality, such as date in this example, the vector used to represent the bitmap index could be very long.
For example, a 10-year collection of daily data would yield 3,650 distinct date values, meaning that every tuple in the fact table would require 3,650 bits, or approximately 456 bytes, to hold its bitmap index.
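The bit-arithmetic advantage can be illustrated with a minimal sketch that uses Python integers as bit vectors; this is an illustrative toy, not a real database index:

```python
def build_bitmap_index(column):
    """Build one bit vector (stored as a Python int) per distinct column value.

    Bit `row` of index[value] is set when the row holds that value."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index

def count_rows(bitmap):
    """Aggregation (here, COUNT) reduces to counting set bits."""
    return bin(bitmap).count("1")
```

Selections combine with plain bitwise OR/AND; e.g., counting rows whose location is either of two cities is one OR followed by a popcount, with no scan of the fact table.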
Briefly describe the similarities and the differences of the two models, and then analyze their advantages and disadvantages with regard to one another. Give your opinion of which might be more empirically useful, and state the reasons behind your answer. They are similar in that both have a fact table as well as some dimension tables. The major difference is that some dimension tables in the snowflake schema are normalized, thereby further splitting the data into additional tables.
The advantage of the star schema is its simplicity, which enables efficiency, but it requires more space. The snowflake schema reduces some redundancy by sharing common tables; however, it is less efficient, and the space saving is negligible in comparison with the typical magnitude of the fact table.
Therefore, empirically, the star schema is better, simply because efficiency typically has higher priority than space as long as the space requirement is not too large. Another option is to use a snowflake schema to maintain dimensions, and then present users with the same data collapsed into a star.
Design a data warehouse for a regional weather bureau. The weather bureau has about 1,000 probes, which are scattered throughout various land and ocean locations in the region to collect basic weather data, including air pressure, temperature, and precipitation at each hour. All data are sent to the central station, which has collected such data for over 10 years.
Your design should facilitate efficient querying and on-line analytical processing, and derive general weather patterns in multidimensional space. Since the weather bureau has about 1,000 probes scattered throughout various land and ocean locations, we need to construct a spatial data warehouse so that a user can view weather patterns on a map by month, by region, and by different combinations of temperature and precipitation, and can dynamically drill down or roll up along any dimension to explore desired patterns.
The star schema of this weather spatial data warehouse can be constructed as shown in Figure 3. A star schema for a weather spatial data warehouse of Exercise 3. To construct this spatial data warehouse, we may need to integrate spatial data from heterogeneous sources and systems. Fast and flexible on-line analytical processing in spatial data warehouses is an important factor.
There are three types of dimensions in a spatial data cube: We distinguish two types of measures in a spatial data cube: A nonspatial data cube contains only nonspatial dimensions and numerical measures. If a spatial data cube contains spatial dimensions but no spatial measures, then its OLAP operations such as drilling or pivoting can be implemented in a manner similar to that of nonspatial data cubes. If a user needs to use spatial measures in a spatial data cube, we can selectively precompute some spatial measures in the spatial data cube.
Which portion of the cube should be selected for materialization depends on the utility (such as access frequency or access priority), the sharability of merged regions, and the balanced overall cost of space and on-line computation. A popular data warehouse implementation is to construct a multidimensional database, known as a data cube. Unfortunately, this may often generate a huge, yet very sparse multidimensional matrix. Present an example illustrating such a huge and sparse data cube.
For the telephone company, it would be very expensive to keep detailed call records for every customer for longer than three months. Therefore, it would be beneficial to remove that information from the database, keeping only the total number of calls made, the total minutes billed, and the amount billed, for example. The resulting computed data cube for the billing database would have large amounts of missing or removed data, resulting in a huge and sparse data cube.
Regarding the computation of measures in a data cube: describe how to compute the variance if the cube is partitioned into many chunks. (Hint: the three categories of measures are distributive, algebraic, and holistic; the variance function is algebraic, since var = (1/N)·Σ xi^2 − ((1/N)·Σ xi)^2 can be computed from the distributive quantities N, Σ xi^2, and Σ xi.) If the cube is partitioned into many chunks, the variance can be computed as follows: read in the chunks one by one, keeping track of (1) the accumulated number of tuples, (2) the sum of xi^2, and (3) the sum of xi.
Use the formula as shown in the hint to obtain the variance. For each cuboid, use 10 units to register the top 10 sales found so far. Read the data in each cuboid once. If the sales amount in a tuple is greater than an existing one in the top-10 list, insert the new sales amount from the new tuple into the list, and discard the smallest one in the list.
The computation of a higher level cuboid can be performed similarly by propagation of the top cells of its corresponding lower level cuboids.
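The chunk-by-chunk computation of variance described above can be sketched as follows, accumulating only the three distributive quantities; a minimal sketch, assuming `chunks` is any iterable of numeric sequences:

```python
def chunked_variance(chunks):
    """Compute the (population) variance by reading chunks one at a time,
    accumulating only the tuple count, the sum of x_i, and the sum of x_i**2."""
    n = s = sq = 0
    for chunk in chunks:
        n += len(chunk)                      # (1) accumulated number of tuples
        s += sum(chunk)                      # (3) sum of x_i
        sq += sum(x * x for x in chunk)      # (2) sum of x_i**2
    mean = s / n
    return sq / n - mean * mean              # var = E[x^2] - (E[x])^2
```

Because count, sum, and sum of squares are distributive, the result is identical however the data are split into chunks.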
Suppose that we need to record three measures in a data cube: Design an efficient computation and storage method for each measure, given that the cube allows data to be deleted incrementally (i.e., in small portions at a time).
For min, keep the ⟨min_val, count⟩ pair for each cuboid to register the smallest value and its count. For each deleted tuple, if its value is greater than min_val, do nothing. Otherwise, decrement the count of the corresponding node. If a count goes down to zero, recalculate the structure. For sum, for each deleted node N, decrement the count and subtract value(N) from the sum. For median, keep a small number, p, of centered values (i.e., values around the current median).
Each removal may change the count or remove a centered value. If the median no longer falls among these centered values, recalculate the set. Otherwise, the median can easily be calculated from the above set.

Describe how each of the following can be implemented using ROLAP and MOLAP techniques: (i) the generation of a data warehouse (including aggregation); (ii) roll-up; (iii) drill-down; (iv) incremental updating. Which implementation techniques do you prefer, and why?

A ROLAP technique for implementing a multidimensional view consists of intermediate servers that stand in between a relational back-end server and client front-end tools, thereby using a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces.
A MOLAP implementation technique consists of servers that support multidimensional views of data through array-based multidimensional storage engines, which map multidimensional views directly to data cube array structures. The fact tables can store aggregated data and the data at the abstraction levels indicated by the join keys in the schema for the given data cube.
In generating a data warehouse, the MOLAP technique uses multidimensional array structures to store data and multiway array aggregation to compute the data cubes. To roll-up on a dimension using the summary fact table, we look for the record in the table that contains a generalization on the desired dimension. For example, to roll-up the date dimension from day to month, select the record for which the day field contains the special value all.
The value of the measure field, dollars sold, for example, given in this record will contain the subtotal for the desired roll-up. To perform a roll-up in a data cube, simply climb up the concept hierarchy for the desired dimension. For example, one could roll-up on the location dimension from city to country, which is more general. To drill-down on a dimension using the summary fact table, we look for the record in the table that contains a generalization on the desired dimension.
For example, to drill-down on the location dimension from country to province or state, select the record for which only the next lowest field in the concept hierarchy for location contains the special value all. In this case, the city field should contain the value all. The value of the measure field, dollars sold, for example, given in this record will contain the subtotal for the desired drill-down.
To perform a drill-down in a data cube, simply step down the concept hierarchy for the desired dimension. For example, one could drill-down on the date dimension from month to day in order to group the data by day rather than by month.
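The summary-fact-table lookup with the special value "all" can be sketched as follows; the table rows, dates, and field names here are hypothetical sample data, not from the exercise:

```python
# Hypothetical summary fact table: each row records dollars_sold at some level
# of aggregation; the special value "all" marks a field that has been
# generalized (aggregated away).
SUMMARY_FACTS = [
    {"day": "all", "month": "2004-05", "city": "Vancouver", "dollars_sold": 1000},
    {"day": "2004-05-01", "month": "2004-05", "city": "Vancouver", "dollars_sold": 40},
    {"day": "all", "month": "all", "city": "Vancouver", "dollars_sold": 12000},
]

def roll_up(facts, generalized_field, **fixed):
    """Find the record whose generalized_field holds the special value "all"
    and whose remaining fields match the requested coordinates."""
    for row in facts:
        if row[generalized_field] == "all" and all(row[k] == v for k, v in fixed.items()):
            return row["dollars_sold"]
    return None
```

A drill-down is the same lookup one level lower: ask for the rows where only the next-lowest field in the hierarchy holds "all".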
Incremental updating: in ROLAP, to perform incremental updating, check whether the corresponding tuple is in the summary fact table; if not, insert it into the summary table and propagate the result up; otherwise, update the value and propagate the result up. In MOLAP, check whether the corresponding cell is in the data cube; if not, insert it into the cuboid and propagate the result up; otherwise, update the value and propagate the result up. If the data are sparse and the dimensionality is high, there will be too many cells due to exponential growth and, in this case, it is often desirable to compute iceberg cubes instead of materializing the complete cubes.
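The check-insert-or-update step above can be sketched with a summary table held as a dictionary keyed by dimension tuples; a toy sketch, in which propagation merely reports which parent cells need the same update applied:

```python
def incremental_update(summary, key, delta):
    """Upsert a measure delta into a summary table (dict keyed by a tuple of
    dimension values) and return the keys of the immediate parent cells,
    i.e., the generalizations obtained by replacing one field with "all"."""
    if key in summary:
        summary[key] += delta       # tuple already present: update in place
    else:
        summary[key] = delta        # tuple absent: insert it
    return [key[:i] + ("all",) + key[i + 1:]
            for i in range(len(key)) if key[i] != "all"]
```

Calling the same function on each returned parent key (with the same delta) propagates the change up the lattice one level at a time.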
Suppose that a data warehouse contains 20 dimensions, each with about five levels of granularity. How would you design a data cube structure to efficiently support this preference? How would you support this feature? An efficient data cube structure to support this preference would be to use partial materialization, or selected computation of cuboids. By computing only the proper subset of the whole set of possible cuboids, the total amount of storage space required would be minimized while maintaining a fast response time and avoiding redundant computation.
Since the user may want to drill through the cube for only one or two dimensions, this feature could be supported by computing the required cuboids on the fly. Since the user may only need this feature infrequently, the time required for computing aggregates on those one or two dimensions on the fly should be acceptable.
A data cube, C, has n dimensions, and each dimension has exactly p distinct values in the base cuboid. Assume that there are no concept hierarchies associated with the dimensions.
This is the maximum number of distinct tuples that you can form with p distinct values per dimension. You need at least p tuples to contain p distinct values per dimension; in this case, no tuple shares any value on any dimension. The minimum number of cells occurs when each cuboid contains only p cells, except for the apex, which contains a single cell. What are the differences between the three main types of data warehouse usage: information processing, analytical processing, and data mining? Information processing involves using queries to find and report useful information using crosstabs, tables, charts, or graphs.
Analytical processing uses basic OLAP operations such as slice-and-dice, drill-down, roll-up, and pivoting on historical data in order to provide multidimensional analysis of data warehouse data.
Data mining uses knowledge discovery to find hidden patterns and associations, construct analytical models, perform classification and prediction, and present the mining results using visualization tools. The motivations behind OLAP mining are the following: the high quality of data (i.e., cleaned, consistent, and integrated data) in data warehouses; and the available information processing infrastructure surrounding data warehouses, which means that comprehensive information processing and data analysis infrastructures will not need to be constructed from scratch.
On-line selection of data mining functions allows users who may not know what kinds of knowledge they would like to mine the flexibility to select desired data mining functions and dynamically swap data mining tasks. Assume a base cuboid of 10 dimensions contains only three base cells: The measure of the cube is count. A closed cube is a data cube consisting of only closed cells. How many closed cells are in the full cube?
Briefly describe these three methods. Note that the textbook adopts the application worldview of a data cube as a lattice of cuboids, where a drill-down moves from the apex (all) cuboid downward in the lattice.
Star-Cubing works better than BUC for highly skewed data sets. The closed-cube and shell-fragment approaches should be explored. Here, we have two cases, which represent two possible extremes.
The k tuples are organized like the following: However, this scheme is not effective if we keep dimension A and instead drop B, because obviously there would still be k tuples remaining, which is not desirable. It seems that case 2 is always better. A heuristic way to think this over is as follows: Obviously, this can generate the greatest number of cells. We assume that we can always do placement as proposed, disregarding the fact that the dimensionality D and the cardinality ci of each dimension i may place some constraints.
The same assumption is kept throughout for this question. If we fail to do so, the proposed placement cannot be achieved. The question does not mention how the cardinalities of the dimensions are set. To answer this question, we have a core observation. Minimum case: the distinct condition no longer holds here, since c tuples now have to be in one identical base cell. Thus, we can put all k tuples in one base cell, which results in 2^D cells in all.
Maximum case: we will replace k with ⌊k/c⌋ and follow the procedure in part (b), since we can get at most that many base cells in all. From the analysis in part (c), we will not consider the threshold, c, as long as k can be replaced by a new value. Considering the number of closed cells, 1 is the minimum if we put all k tuples together in one base cell. How can we reach this bound? We assume that this is the case. We also assume that cardinalities cannot be increased (as in part (b)) to satisfy the condition.
Suppose that a base cuboid has three dimensions A, B, C, with the following number of cells: Suppose that each dimension is evenly partitioned into 10 portions for chunking. The complete lattice is shown in Figure 4. (A complete lattice for the cube of Exercise 4.) The total size of the computed cube is as follows. The total amount of main memory space required for computing the 2-D planes is: Often, the aggregate measure value of many cells in a large data cuboid is zero, resulting in a huge, yet sparse, multidimensional matrix.
Note that you need to explain your data structures in detail and discuss the space needed, as well as how to retrieve data from your structures. Give the reasoning behind your new design. A way to overcome the sparse matrix problem is to use multiway array aggregation. The first step consists of partitioning the array-based cube into chunks or subcubes that are small enough to fit into the memory available for cube computation. Each of these chunks is first compressed to remove cells that do not contain any valid data, and is then stored as an object on disk.
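The chunk-compression step can be sketched as follows, assuming a 2-D chunk given as nested lists with zero marking an empty (invalid) cell; the coordinate-dictionary representation is one illustrative choice, not the only one:

```python
def compress_chunk(chunk):
    """Compress a 2-D array-based chunk: keep only the cells that hold valid
    (nonzero) data, stored as a {(row, col): value} dictionary."""
    return {(i, j): v
            for i, row in enumerate(chunk)
            for j, v in enumerate(row)
            if v != 0}

def cell(sparse, i, j):
    """Retrieve a cell; an absent coordinate means an empty (zero) cell."""
    return sparse.get((i, j), 0)
```

Space is proportional to the number of valid cells rather than to the full chunk volume, and lookup by coordinate remains constant-time.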
The second step involves computing the aggregates by visiting cube cells in an order that minimizes the number of times that each cell must be revisited, thereby reducing memory access and storage costs. By first sorting and computing the planes of the data cube according to their size in ascending order, a smaller plane can be kept in main memory while fetching and computing only one chunk at a time for a larger plane.
In order to handle incremental data updates, the data cube is first computed as described in a. Subsequently, only the chunk that contains the cells with the new data is recomputed, without needing to recompute the entire cube.
This is because, with incremental updates, only one chunk at a time can be affected. The recomputed value needs to be propagated to its corresponding higher-level cuboids. Thus, incremental data updates can be performed efficiently. When computing a cube of high dimensionality, we encounter the inherent curse of dimensionality problem: Compute the number of nonempty aggregate cells.
Comment on the storage space and time required to compute these cells. If the minimum support count in the iceberg condition is two, how many aggregate cells will there be in the iceberg cube? Show the cells. However, even with iceberg cubes, we could still end up having to compute a large number of trivial, uninteresting cells. Suppose that a database has 20 tuples that map to (or cover) the two following base cells in a multidimensional base cuboid, each with a cell count of: Let the minimum support be: How many distinct aggregate cells will there be, like the following: What are the cells?
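To make "aggregate cells covering a base cell" concrete, the covering cells can be enumerated by replacing every nonempty subset of the base cell's dimension values with a wildcard '*'; a generic sketch, not the textbook's notation:

```python
from itertools import combinations

def aggregate_cells(base_cell):
    """Enumerate every aggregate (non-base) cell that covers a base cell:
    for each nonempty subset of dimensions, replace those values with '*'."""
    n = len(base_cell)
    cells = []
    for k in range(1, n + 1):
        for dims in combinations(range(n), k):
            cells.append(tuple("*" if i in dims else base_cell[i]
                               for i in range(n)))
    return cells
```

An n-dimensional base cell is covered by 2^n − 1 aggregate cells, which is exactly why the counting in this exercise subtracts the cells shared between the two base cells only once.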
We subtract 1 because, for example, a1, a2, a3, . . . These four cells are: They are 4: There are only three distinct cells left: Propose an algorithm that computes closed iceberg cubes efficiently. We base our answer on the algorithm presented in the paper: Let the cover of a cell be the set of base tuples that are aggregated in the cell. Cells with the same cover can be grouped into the same class if they share the same measure. Each class will have an upper bound, which consists of the most specific cells in the class, and a lower bound, which consists of the most general cells in the class.
The set of closed cells corresponds to the upper bounds of all of the distinct classes that compose the cube. We can compute the classes by following a depth-first search strategy: Let the cells making up this bound be u1, u2, . . . Finding the upper bounds will depend on the measure.
Incorporating iceberg conditions is not difficult. Show the BUC processing tree which shows the order in which the BUC algorithm explores the lattice of a data cube, starting from all for the construction of the above iceberg cube. We know that dimensions should be processed in the order of decreasing cardinality, that is, use the most discriminating dimensions first in the hope that we can prune the search space as quickly as possible.
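The cardinality-based ordering heuristic can be sketched as follows, assuming each dimension's column of fact-table values is available as a list (the dictionary layout is illustrative):

```python
def buc_dimension_order(columns):
    """Order dimensions for BUC: most discriminating (highest cardinality)
    first, so that minimum-support pruning can cut the search space early.

    `columns` maps dimension name -> list of that dimension's values
    in the fact table."""
    return sorted(columns, key=lambda d: len(set(columns[d])), reverse=True)
```

With high-cardinality dimensions processed first, partitions shrink below the minimum support sooner, and BUC can skip the corresponding subtrees of the lattice.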
In this case we should then compute the cube in the order D, C, B, A. The order in which the lattice is traversed is presented in Figure 4. BUC processing order for Exercise 4. Discuss how you might extend the Star-Cubing algorithm to compute iceberg cubes where the iceberg condition tests for avg that is no bigger than some value, v.
Instead of using the average, we can use the bottom-k average of each cell, which is antimonotonic. To reduce the amount of space required to check the bottom-k average condition, we can store a few statistics, such as count and sum, for the base tuples that fall within a certain range of v. This is analogous to the optimization presented in Section 4. A flight data warehouse for a travel agent consists of six dimensions: traveller, departure, departure time, arrival, arrival time, and flight. Starting with the base cuboid [traveller, departure, departure time, arrival, arrival time, flight], what specific OLAP operations (e.g., roll-up or drill-down) should one perform?
Outline an efficient cube computation method based on common sense about flight data distribution. The OLAP operations are: There are two constraints: Use an iceberg cubing algorithm, such as BUC, and use binning plus min_sup to prune the computation of the cube. (Implementation project) There are four typical data cube computation methods: Find another student who has implemented a different algorithm on the same platform.
An iceberg condition: Output: (i) the set of computed cuboids that satisfy the iceberg condition, in the order of your output generation; (ii)