Data Mining: Concepts and Techniques, 3rd Edition. Jiawei Han, Micheline Kamber, Jian Pei. The increasing volume of data in modern business and science calls for more complex and sophisticated tools.
Data Mining: Concepts and Techniques, Second Edition. Jiawei Han and Micheline Kamber. The Morgan Kaufmann Series in Data Management Systems, Jim Gray, Series. Imprint: Morgan Kaufmann.
Data Mining: Concepts and Techniques 2nd Edition Solution Manual.
Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber. About data mining and data warehousing.
Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman, Jeff Ullman. The focus of this book is to provide the tools and knowledge needed to manage, manipulate, and consume large volumes of information in databases.
The Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani, Jerome Friedman. A conceptual book on data mining and prediction from a statistical point of view. Covers many machine learning subjects too.
An Introduction to Statistical Learning. Exploratory techniques for data are discussed using the R programming language.
Data Science for Business, Foster Provost, Tom Fawcett. An introduction to data science principles and theory, explaining the analytical thinking needed to approach these kinds of problems. It discusses various data mining techniques for exploring information.
Modeling With Data. This book focuses on processes for solving analytical problems applied to data.
In particular, it explains the theory needed to create tools for exploring big data sets.

The opposite effect is true for points lying below this line.

In many applications, new data sets are incrementally added to the existing large data sets. Thus, an important consideration when computing a descriptive data summary is whether a measure can be computed efficiently in an incremental manner. Use count, standard deviation, and median as examples to show that a distributive or algebraic measure facilitates efficient incremental computation, whereas a holistic measure does not.

Count: This is a distributive measure and is easily updated for incremental additions.

Standard deviation: If we store the sum of the existing values, the sum of the squared existing values, and the count of the existing values, we can easily generate the new standard deviation using the formula provided in the book. We simply need to calculate the sum and the squared sum of the new numbers, add those to the existing sum and squared sum, update the count of the numbers, and plug the results into the calculation to obtain the new standard deviation.
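The incremental update for count and standard deviation can be sketched as follows. This is a minimal illustration, not the book's own code: the class name `RunningStats` and the sample values are hypothetical, and it assumes the population form of the standard deviation, sqrt(E[x^2] - (E[x])^2).

```python
import math


class RunningStats:
    """Keeps count, sum, and sum of squares so that new values can be
    merged in without rescanning the values seen earlier."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, values):
        # Only the new values are touched; the stored summaries carry the rest.
        for v in values:
            self.count += 1
            self.total += v
            self.total_sq += v * v

    def std(self):
        # Assumption: population standard deviation.
        if self.count == 0:
            raise ValueError("no values have been added")
        mean = self.total / self.count
        variance = self.total_sq / self.count - mean * mean
        return math.sqrt(max(variance, 0.0))  # guard against tiny negative rounding


stats = RunningStats()
stats.add([2, 4, 4, 4])   # existing data (hypothetical values)
stats.add([5, 5, 7, 9])   # incremental addition
print(stats.count, stats.std())
```

Only three numbers are stored regardless of how much data has been seen, which is what makes the measure cheap to maintain.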
None of this requires looking at the whole data set, so the update is easy to compute.

Median: To calculate the median exactly, we have to look at every value in the data set. When we add a new value or values, we have to sort the new set and then find the median based on that new sorted set. This is much more expensive and thus makes the incremental addition of new values difficult.

In real-world data, tuples with missing values for some attributes are a common occurrence. Describe various methods for handling this problem.

The various methods for handling the problem of missing values in data tuples include:

Ignoring the tuple: This is usually done when the class label is missing (assuming the mining task involves classification or description).
This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.

Manually filling in the missing value: In general, this approach is time-consuming and may not be feasible for large data sets with many missing values, especially when the value to be filled in is not easily determined.

Using the attribute mean: For example, compute the average income over all customers and use this value to replace any missing values for income.

Using the attribute mean for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.

Using the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using Bayesian formalism, or decision tree induction. For example, using the other customer attributes in the data set, we can construct a decision tree to predict the missing values for income.

Using the data for age given in Exercise 2. Illustrate your steps. Comment on the effect of this technique for the given data.
The following steps are required to smooth the above data using smoothing by bin means with a bin depth of 3.

Step 1: Sort the data. (This step is not required here as the data are already sorted.)
Step 2: Partition the data into equal-frequency bins of size 3.
Step 3: Calculate the arithmetic mean of each bin.
Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the bin.

Bin 1:

Values that fall outside of the set of clusters may be considered outliers. Alternatively, a combination of computer and human inspection can be used where a predetermined data distribution is implemented to allow the computer to identify possible outliers.
These possible outliers can then be verified by human inspection with much less effort than would be required to verify the entire initial data set.

Other methods that can be used for data smoothing include alternate forms of binning such as smoothing by bin medians or smoothing by bin boundaries. Alternatively, equal-width bins can be used to implement any of the forms of binning, where the interval range of values in each bin is constant. Methods other than binning include using regression techniques to smooth the data by fitting it to a function, such as through linear or multiple regression. Classification techniques can be used to implement concept hierarchies that can smooth the data by rolling up lower-level concepts to higher-level concepts.
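The four steps of smoothing by bin means described earlier can be sketched in a few lines. The function name `smooth_by_bin_means` and the input values are illustrative assumptions; they are not the age data of the exercise, which is not reproduced here.

```python
def smooth_by_bin_means(values, depth):
    """Equal-frequency binning followed by replacement with the bin mean."""
    ordered = sorted(values)                          # step 1: sort
    smoothed = []
    for start in range(0, len(ordered), depth):       # step 2: bins of `depth` values
        bin_values = ordered[start:start + depth]
        mean = sum(bin_values) / len(bin_values)      # step 3: mean of the bin
        smoothed.extend([mean] * len(bin_values))     # step 4: replace the values
    return smoothed


# Hypothetical input values, chosen only to illustrate the procedure.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, depth=3)
print(smoothed)
```

If the number of values is not a multiple of the depth, the last bin in this sketch is simply shorter; other conventions are possible.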
Discuss issues to consider during data integration.

Data integration involves combining data from multiple sources into a coherent data store. Issues that must be considered during such integration include:

Schema integration: The metadata from the different data sources must be integrated in order to match up equivalent real-world entities. This is referred to as the entity identification problem.

Redundancy: Derived attributes may be redundant, and inconsistent attribute naming may also lead to redundancies in the resulting data set. Duplications at the tuple level may occur and thus need to be detected and resolved.

Detection and resolution of data value conflicts: Differences in representation, scaling, or encoding may cause the same real-world entity attribute values to differ in the data sources being integrated.

Are these two variables positively or negatively correlated?

For the variable age the mean is See Figure 2.
The correlation coefficient is 0. The variables are positively correlated.

What are the value ranges of the following normalization methods? Use the two methods below to normalize the following group of data:

For readability, let A be the attribute age. Using Equation 2.

Given the data, one may prefer decimal scaling for normalization because such a transformation would maintain the data distribution and be intuitive to interpret, while still allowing mining on specific age groups.
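To make the comparison of normalization methods concrete, here is a minimal sketch of min-max normalization, z-score normalization, and normalization by decimal scaling. The function names and the sample data are assumptions made for illustration, and the z-score version assumes the population standard deviation.

```python
import math


def min_max(values, new_min=0.0, new_max=1.0):
    # Output range is [new_min, new_max]; assumes the values are not all equal.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    # Output is unbounded; assumes a nonzero (population) standard deviation.
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


def decimal_scaling(values):
    # Divide by 10**j for the smallest j that makes every |value| < 1.
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


data = [200, 300, 400, 600, 1000]  # hypothetical values
print(min_max(data))
print(z_score(data))
print(decimal_scaling(data))
```

Min-max depends on the observed minimum and maximum, so a later value outside that range maps outside the target interval; decimal scaling only moves the decimal point, which is why it is easy to read back.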
As such values may be present in future data, this method is less appropriate. This type of transformation may not be as intuitive to the user in comparison with decimal scaling.

Use a flow chart to summarize the following procedures for attribute subset selection:

Stepwise forward selection.
Stepwise backward elimination.

Suppose a group of 12 sales price records has been sorted as follows: Partition them into three bins by each of the following methods.
Propose several methods for median approximation. Analyze their respective complexity under different parameter settings and decide to what extent the real value can be approximated. Moreover, suggest a heuristic strategy to balance between accuracy and complexity and then apply it to all methods you have given.

This question can be dealt with either theoretically or empirically, but doing some experiments to get the result is perhaps more interesting. Given are some data sets sampled from different distributions, e. The former two distributions are symmetric, whereas the latter two are skewed. For example, if using Equation 2. Obviously, the error incurred will decrease as k, the number of intervals, becomes larger; however, the time used in the whole procedure will also increase. The product of the error made and the time used is a good optimality measure. In practice, this parameter value can be chosen to improve system performance.

A combination of forward selection and backward elimination.

There are also other approaches for median approximation. The student may suggest a few, analyze the best trade-off point, and compare the results from the different approaches. A possible such approach is as follows: Hierarchically divide the whole data set into intervals: This iterates until the width of the subregion reaches a predefined threshold, and then the median approximation formula stated above is applied.
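As a rough sketch of interval-based median approximation, the code below counts values in equal-width intervals and interpolates inside the interval that contains the median. It assumes the usual grouped-data formula, median ≈ L + ((N/2 - freq_below) / freq_median) × width, where L is the lower boundary of the median interval; the function name and test data are hypothetical.

```python
def approximate_median(values, width):
    """Approximate the median from equal-width interval counts."""
    low = min(values)
    n_bins = int((max(values) - low) // width) + 1
    counts = [0] * n_bins
    for v in values:                       # one pass to build the interval counts
        counts[int((v - low) // width)] += 1

    half = len(values) / 2
    below = 0                              # number of values in lower intervals
    for i, freq in enumerate(counts):
        if below + freq >= half:           # this interval contains the median
            lower_boundary = low + i * width
            return lower_boundary + (half - below) / freq * width
        below += freq


data = list(range(1, 101))                 # hypothetical data: 1..100, exact median 50.5
estimate = approximate_median(data, width=10)
print(estimate)
```

Narrower intervals tighten the error bound (the estimate always lies within one interval width of the true median's interval) at the cost of more counters, which is the accuracy versus cost trade-off discussed above.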
In this way, we can confine the median to a smaller area without globally partitioning all of the data into shorter intervals, which would be expensive. The cost is proportional to the number of intervals.

However, there is no commonly accepted subjective similarity measure. Using different similarity measures may produce different results. Nonetheless, some apparently different similarity measures may be equivalent after some transformation.

Suppose we have the following two-dimensional data set: A1 A2 x1 1. Use Euclidean distance on the transformed data to rank the data points.

An equiwidth histogram of width 10 for age.

Using these definitions we obtain the distance from each point to the query point. Based on the cosine similarity, the order is x1, x3, x4, x2, x5.
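The relationship between cosine similarity and Euclidean distance on length-normalized vectors can be checked with a short sketch. For unit vectors a and b, the squared Euclidean distance equals 2 - 2·cos(a, b), so both measures rank points in the same order. The points and the query below are hypothetical stand-ins, not the data set of the exercise.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def normalize(v):
    # Scale the vector to unit length.
    length = math.hypot(*v)
    return tuple(x / length for x in v)


# Hypothetical points and query point.
points = {"p1": (1.0, 2.0), "p2": (3.0, 1.0), "p3": (2.0, 2.5), "p4": (0.5, 3.0)}
query = (1.0, 1.5)

# Most similar first: larger cosine, or smaller distance after normalization.
by_cosine = sorted(points, key=lambda k: -cosine_similarity(points[k], query))
by_distance = sorted(
    points, key=lambda k: math.dist(normalize(points[k]), normalize(query))
)
print(by_cosine, by_distance)
```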
After normalizing the data we have: Conceptually, it is the length of the vector.

Examples of sampling:

Based on the Euclidean distance of the normalized points, the order is x1, x3, x4, x2, x5, which is the same as the cosine similarity order.

ChiMerge [Ker92] is a supervised, bottom-up i. Perform data discretization for each of the four numerical attributes using the ChiMerge method. Let the stopping criteria be: You need to write a small program to do this to avoid clumsy numerical computation. Submit your simple analysis and your test results:

The basic algorithm of ChiMerge is:

The final intervals are: Sepal length: Sepal width: Petal length: Petal width:

The split points are:

Propose an algorithm, in pseudocode or in your favorite programming language, for the following:

Also, an alternative binning method could be implemented, such as smoothing by bin modes.
The user can again specify more meaningful names for the concept hierarchy levels generated by reviewing the maximum and minimum values of the bins with respect to background knowledge about the data. Robust data loading poses a challenge in database systems because the input data are often dirty. In many cases, an input record may have several missing values and some records could be contaminated i.
Work out an automated data cleaning and loading algorithm so that the erroneous data will be marked and contaminated data will not be mistakenly inserted into the database during data loading.
We can, for example, use the data in the database to construct a decision tree to induce missing values for a given attribute, and at the same time have human-entered rules on how to correct wrong data types.
An Overview 3.

State why, for the integration of multiple heterogeneous information sources, many companies in industry prefer the update-driven approach (which constructs and uses data warehouses), rather than the query-driven approach (which applies wrappers and integrators). Describe situations where the query-driven approach is preferable over the update-driven approach.

For decision-making queries and frequently asked queries, the update-driven approach is preferable. This is because expensive data integration and aggregate computation are done before query processing time. For the data collected in multiple heterogeneous databases to be used in decision-making processes, any semantic heterogeneity problems among multiple databases must be analyzed and solved so that the data can be integrated and summarized. If the query-driven approach is employed, these queries will be translated into multiple (often complex) queries for each individual database. The translated queries will compete for resources with the activities at the local sites, thus degrading their performance. In addition, these queries will generate a complex answer set, which will require further filtering and integration. Thus, the query-driven approach is, in general, inefficient and expensive. The update-driven approach employed in data warehousing is faster and more efficient since most of the queries needed could be done off-line.

The query-driven approach is preferable if the queries rely on the current data, because data warehouses do not contain the most current information.
Briefly compare the following concepts. You may use an example to explain your point(s).

The snowflake schema and fact constellation are both variants of the star schema model, which consists of a fact table and a set of dimension tables; the snowflake schema contains some normalized dimension tables, whereas the fact constellation contains a set of fact tables that share some common dimension tables. A starnet query model is a query model (not a schema model), which consists of a set of radial lines emanating from a central point. Each step away from the center represents the stepping down of a concept hierarchy of the dimension. The starnet query model, as suggested by its name, is used for querying and provides users with a global view of OLAP operations.

Data transformation is the process of converting the data from heterogeneous sources to a unified data warehouse format or semantics. Refresh is the function propagating the updates from the data sources to the warehouse.

An enterprise warehouse provides corporate-wide data integration, usually from one or more operational systems or external information providers, and is cross-functional in scope, whereas the data mart is confined to specific selected subjects (such as customer, item, and sales for a marketing data mart). An enterprise warehouse typically contains detailed data as well as summarized data, whereas the data in a data mart tend to be summarized. The implementation cycle of an enterprise warehouse may take months or years, whereas that of a data mart is more likely to be measured in weeks.

A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized.
A virtual warehouse is easy to build but requires excess capacity on operational database servers.

Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two measures count and charge, where charge is the fee that a doctor charges a patient for a visit.

Three classes of schemas popularly used for modeling data warehouses are the star schema, the snowflake schema, and the fact constellation schema. A star schema is shown in Figure 3.

The operations to be performed are:

A star schema for data warehouse of Exercise 3.

Suppose that a data warehouse for Big University consists of the following four dimensions: When at the lowest conceptual level e. At higher conceptual levels, avg grade stores the average grade for the given combination.

A snowflake schema is shown in Figure 3.

A snowflake schema for data warehouse of Exercise 3.

The specific OLAP operations to be performed are:

Suppose that a data warehouse consists of the four dimensions, date, spectator, location, and game, and the two measures, count and charge, where charge is the fare that a spectator pays when watching a game on a given date. Spectators may be students, adults, or seniors, with each category having its own charge rate.

Taking this cube as an example, briefly discuss advantages and problems of using a bitmap index structure.

Bitmap indexing is advantageous for low-cardinality domains. For example, in this cube, if dimension location is bitmap indexed, then comparison, join, and aggregation operations over location are reduced to bit arithmetic, which substantially reduces the processing time. For dimensions with high cardinality, such as date in this example, the vector used to represent the bitmap index could be very long. For example, a year collection of data could result in date records, meaning that every tuple in the fact table would require bits or approximately bytes to hold the bitmap index.
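How a bitmap index turns selection and counting into bit arithmetic can be illustrated with a small sketch. Python integers stand in for bit vectors here; the helper `build_bitmap_index` and the column contents are hypothetical.

```python
def build_bitmap_index(column):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index


# Hypothetical fact-table columns, one entry per tuple.
location = ["Chicago", "Toronto", "Chicago", "New York", "Toronto", "Chicago"]
spectator = ["student", "adult", "senior", "student", "student", "adult"]

loc_index = build_bitmap_index(location)
spec_index = build_bitmap_index(spectator)

# location = Chicago AND spectator = student becomes a single bitwise AND.
match = loc_index["Chicago"] & spec_index["student"]
rows = [i for i in range(len(location)) if (match >> i) & 1]
count = bin(match).count("1")
print(rows, count)
```

Each distinct value needs one bit per tuple, so a low-cardinality column such as spectator category stays compact, while a column with thousands of distinct dates needs thousands of bit vectors.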
Briefly describe the similarities and the differences of the two models, and then analyze their advantages and disadvantages with regard to one another. Give your opinion of which might be more empirically useful and state the reasons behind your answer.

They are similar in the sense that they both have a fact table, as well as some dimension tables. The major difference is that some dimension tables in the snowflake schema are normalized, thereby further splitting the data into additional tables. The advantage of the star schema is its simplicity, which enables efficiency, but it requires more space. The snowflake schema reduces some redundancy by sharing common tables. However, it is less efficient, and the saving of space is negligible in comparison with the typical magnitude of the fact table. Therefore, empirically, the star schema is better simply because efficiency typically has higher priority than space, as long as the space requirement is not too large. Another option is to use a snowflake schema to maintain dimensions, and then present users with the same data collapsed into a star.

References for the answer to this question include: Understand the difference between star and snowflake schemas in OLAP.
Snowflake Schemas. Design a data warehouse for a regional weather bureau. The weather bureau has about 1, probes, which are scattered throughout various land and ocean locations in the region to collect basic weather data, including air pressure, temperature, and precipitation at each hour. All data are sent to the central station, which has collected such data for over 10 years.
Your design should facilitate efficient querying and on-line analytical processing, and derive general weather patterns in multidimensional space.
Since the weather bureau has about 1, probes scattered throughout various land and ocean locations, we need to construct a spatial data warehouse so that a user can view weather patterns on a map by month, by region, and by different combinations of temperature and precipitation, and can dynamically drill down or roll up along any dimension to explore desired patterns.
The star schema of this weather spatial data warehouse can be constructed as shown in Figure 3.

A star schema for a weather spatial data warehouse of Exercise 3.

To construct this spatial data warehouse, we may need to integrate spatial data from heterogeneous sources and systems. Fast and flexible on-line analytical processing in spatial data warehouses is an important factor. There are three types of dimensions in a spatial data cube: We distinguish two types of measures in a spatial data cube: A nonspatial data cube contains only nonspatial dimensions and numerical measures. If a spatial data cube contains spatial dimensions but no spatial measures, then its OLAP operations (such as drilling or pivoting) can be implemented in a manner similar to that of nonspatial data cubes. If a user needs to use spatial measures in a spatial data cube, we can selectively precompute some spatial measures in the spatial data cube. Which portion of the cube should be selected for materialization depends on the utility (such as access frequency or access priority), sharability of merged regions, and the balanced overall cost of space and on-line computation.
A popular data warehouse implementation is to construct a multidimensional database, known as a data cube. Unfortunately, this may often generate a huge, yet very sparse multidimensional matrix. Present an example illustrating such a huge and sparse data cube.
For the telephone company, it would be very expensive to keep detailed call records for every customer for longer than three months. Therefore, it would be beneficial to remove that information from the database, keeping only the total number of calls made, the total minutes billed, and the amount billed, for example. The resulting computed data cube for the billing database would have large amounts of missing or removed data, resulting in a huge and sparse data cube. Regarding the computation of measures in a data cube: Describe how to compute it if the cube is partitioned into many chunks.
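Hint: The three categories of measures are distributive, algebraic, and holistic.

Hint: The formula for computing variance is $\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$, where $\bar{x}$ is the average of the $x_i$.

The variance function is algebraic. If the cube is partitioned into many chunks, the variance can be computed as follows: Read in the chunks one by one, keeping track of the accumulated (1) number of tuples, (2) sum of $x_i^2$, and (3) sum of $x_i$. After all chunks have been read, use the formula shown in the hint, in the equivalent form $\frac{1}{N}\sum_{i=1}^{N} x_i^2 - \bar{x}^2$, to obtain the variance.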
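The chunk-by-chunk computation can be sketched as below: each chunk is reduced to a small summary (number of tuples, sum, sum of squares), and the summaries are merged by addition. The function names and the sample chunks are hypothetical, and the result is the population variance, matching the 1/N form used above.

```python
def summarize_chunk(chunk):
    """Reduce one chunk to (number of tuples, sum of x, sum of x squared)."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)


def variance_from_summaries(summaries):
    # The three components are distributive, so summaries merge by addition.
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    return total_sq / n - mean * mean


chunks = [[1, 2, 3], [4, 5], [6]]          # hypothetical chunk contents
summaries = [summarize_chunk(c) for c in chunks]
variance = variance_from_summaries(summaries)
print(variance)
```

Because only three numbers per chunk are kept, the chunks can be read one at a time, in any order.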
For each cuboid, use 10 units to register the top 10 sales found so far. Read the data in each cuboid once. If the sales amount in a tuple is greater than the smallest one in the top-10 list, insert the new sales amount from the new tuple into the list, and discard the smallest one in the list. The computation of a higher-level cuboid can be performed similarly by propagation of the top-10 cells of its corresponding lower-level cuboids.
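One way to keep the running top-10 list is a bounded min-heap, so that the smallest retained amount is always at the root and can be compared and replaced cheaply. The sketch below uses Python's standard `heapq` module; the function names `top_k_sales` and `merge_top_k` and the sample amounts are hypothetical.

```python
import heapq


def top_k_sales(sales, k=10):
    """Single pass that keeps only the k largest amounts seen so far."""
    top = []  # min-heap: top[0] is the smallest amount currently retained
    for amount in sales:
        if len(top) < k:
            heapq.heappush(top, amount)
        elif amount > top[0]:
            heapq.heapreplace(top, amount)  # drop the smallest, insert the new one
    return sorted(top, reverse=True)


def merge_top_k(child_lists, k=10):
    """Top k for a higher-level cell, propagated from its children's top-k lists."""
    return top_k_sales((a for lst in child_lists for a in lst), k)


print(top_k_sales([5, 1, 9, 3, 7], k=3))
print(merge_top_k([[9, 7, 5], [8, 6, 2]], k=3))
```

The merge step works because any amount in the parent's top k must already be in the top k of the child cell it came from.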
Suppose that we need to record three measures in a data cube: min, average, and median. Design an efficient computation and storage method for each measure given that the cube allows data to be deleted incrementally.

For min, keep the ⟨min_val, count⟩ pair for each cuboid to register the smallest value and its count. For each deleted tuple, if its value is greater than min_val, do nothing. Otherwise, decrement the count of the corresponding node. If a count goes down to zero, recalculate the structure.

For average, keep a ⟨sum, count⟩ pair for each cuboid. For each deleted node N, decrement the count and subtract value N from the sum.

For median, keep a small number, p, of centered values e. Each removal may change the count or remove a centered value. If the median no longer falls among these centered values, recalculate the set. Otherwise, the median can easily be calculated from the above set.

i. The generation of a data warehouse (including aggregation)
ii. Roll-up
iii. Drill-down
iv. Incremental updating
Which implementation techniques do you prefer, and why?

A ROLAP technique for implementing a multidimensional view consists of intermediate servers that stand in between a relational back-end server and client front-end tools, thereby using a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces.
A MOLAP implementation technique consists of servers, which support multidimensional views of data through array-based multidimensional storage engines that map multidimensional views directly to data cube array structures.

The generation of a data warehouse

ROLAP: The fact tables can store aggregated data and the data at the abstraction levels indicated by the join keys in the schema for the given data cube.

MOLAP: In generating a data warehouse, the MOLAP technique uses multidimensional array structures to store data and multiway array aggregation to compute the data cubes.

Roll-up

ROLAP: To roll-up on a dimension using the summary fact table, we look for the record in the table that contains a generalization on the desired dimension. For example, to roll-up the date dimension from day to month, select the record for which the day field contains the special value all. The value of the measure field, dollars sold, for example, given in this record will contain the subtotal for the desired roll-up.

MOLAP: To perform a roll-up in a data cube, simply climb up the concept hierarchy for the desired dimension. For example, one could roll-up on the location dimension from city to country, which is more general.

Drill-down

ROLAP: To drill-down on a dimension using the summary fact table, we look for the record in the table that contains a generalization on the desired dimension. For example, to drill-down on the location dimension from country to province or state, select the record for which only the next lowest field in the concept hierarchy for location contains the special value all. In this case, the city field should contain the value all. The value of the measure field, dollars sold, for example, given in this record will contain the subtotal for the desired drill-down.

MOLAP: To perform a drill-down in a data cube, simply step down the concept hierarchy for the desired dimension. For example, one could drill-down on the date dimension from month to day in order to group the data by day rather than by month.
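Climbing a concept hierarchy amounts to regrouping cells under their parent value and re-aggregating the measure. The sketch below rolls the date dimension up from day to month for a sum measure; the function name `roll_up`, the cell layout, and the sample values are assumptions made for illustration, not the storage format of any particular OLAP server.

```python
from collections import defaultdict


def roll_up(cells, to_parent):
    """Aggregate cells keyed by (date_value, city) one level up the date hierarchy."""
    totals = defaultdict(float)
    for (date_value, city), dollars_sold in cells.items():
        totals[(to_parent[date_value], city)] += dollars_sold
    return dict(totals)


# Hypothetical day-level cells: (day, city) -> dollars sold.
day_cells = {
    ("2024-01-05", "Vancouver"): 100.0,
    ("2024-01-20", "Vancouver"): 50.0,
    ("2024-02-03", "Vancouver"): 70.0,
    ("2024-01-05", "Toronto"): 30.0,
}
# Hypothetical concept hierarchy for the date dimension: day -> month.
day_to_month = {
    "2024-01-05": "2024-01",
    "2024-01-20": "2024-01",
    "2024-02-03": "2024-02",
}

month_cells = roll_up(day_cells, day_to_month)
print(month_cells)
```

Drill-down is the reverse view: the month-level totals are replaced by the finer day-level cells they were built from, which therefore have to be kept or recomputed.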
Incremental updating

ROLAP: To perform incremental updating, check whether the corresponding tuple is in the summary fact table. If not, insert it into the summary table and propagate the result up. Otherwise, update the value and propagate the result up.

MOLAP: To perform incremental updating, check whether the corresponding cell is in the cuboid. If not, insert it into the cuboid and propagate the result up. Otherwise, update the value and propagate the result up.

If the data are sparse and the dimensionality is high, there will be too many cells due to exponential growth and, in this case, it is often desirable to compute iceberg cubes instead of materializing the complete cubes.
Suppose that a data warehouse contains 20 dimensions, each with about five levels of granularity. How would you design a data cube structure to efficiently support this preference?
How would you support this feature? An efficient data cube structure to support this preference would be to use partial materialization, or selected computation of cuboids. By computing only the proper subset of the whole set of possible cuboids, the total amount of storage space required would be minimized while maintaining a fast response time and avoiding redundant computation.
Since the user may want to drill through the cube for only one or two dimensions, this feature could be supported by computing the required cuboids on the fly. Since the user may only need this feature infrequently, the time required for computing aggregates on those one or two dimensions on the fly should be acceptable. A data cube, C, has n dimensions, and each dimension has exactly p distinct values in the base cuboid.
Assume that there are no concept hierarchies associated with the dimensions.

The maximum number of cells possible in the base cuboid is $p^n$. This is the maximum number of distinct tuples that you can form with p distinct values per dimension.

The minimum number of cells possible in the base cuboid is p. You need at least p tuples to contain p distinct values per dimension. In this case no tuple shares any value on any dimension.

The minimum number of cells in the data cube is reached when each cuboid contains only p cells, except for the apex, which contains a single cell. Since there are $2^n$ cuboids, this gives $(2^n - 1) \times p + 1$ cells.

What are the differences between the three main types of data warehouse usage: information processing, analytical processing, and data mining?

Information processing involves using queries to find and report useful information using crosstabs, tables, charts, or graphs. Analytical processing uses basic OLAP operations such as slice-and-dice, drill-down, roll-up, and pivoting on historical data in order to provide multidimensional analysis of data warehouse data. Data mining uses knowledge discovery to find hidden patterns and associations, construct analytical models, perform classification and prediction, and present the mining results using visualization tools.

The motivations behind OLAP mining are the following: The high quality of data i. The available information processing infrastructure surrounding data warehouses means that comprehensive information processing and data analysis infrastructures will not need to be constructed from scratch. On-line selection of data mining functions gives users who may not know what kinds of knowledge they would like to mine the flexibility to select desired data mining functions and dynamically swap data mining tasks.

Assume a base cuboid of 10 dimensions contains only three base cells: The measure of the cube is count.
A closed cube is a data cube consisting of only closed cells. How many closed cells are in the full cube? Briefly describe these three methods i. Note that the textbook adopts the application worldview of a data cube as a lattice of cuboids, where a drill-down moves from the apex all cuboid, downward in the lattice. Star-Cubing works better than BUC for highly skewed data sets. The closed-cube and shell-fragment approaches should be explored. Here, we have two cases, which represent two possible extremes, 1.
The k tuples are organized like the following: However, this scheme is not effective if we keep dimension A and instead drop B, because obviously there would still be k tuples remaining, which is not desirable. It seems that case 2 is always better. A heuristic way to think this over is as follows: Obviously, this can generate the most number of cells: We assume that we can always do placement as proposed, disregarding the fact that dimensionality D and the cardinality ci of each dimension i may place some constraints.
The same assumption is kept throughout for this question. If we fail to do so e. The question does not mention how cardinalities of dimensions are set. To answer this question, we have a core observation: Minimum case: The distinct condition no longer holds here, since c tuples have to be in one identical base cell now. Thus, we can put all k tuples in one base cell, which results in 2D cells in all. Maximum case: We will replace k with b kc c and follow the procedure in part b , since we can get at most that many base cells in all.
From the analysis in part (c), we will not consider the threshold, c, as long as k can be replaced by a new value. Considering the number of closed cells, 1 is the minimum if we put all k tuples together in one base cell. How can we reach this bound? We assume that this is the case.
We also assume that cardinalities cannot be increased as in part (b) to satisfy the condition. Suppose that a base cuboid has three dimensions A, B, C, with the following number of cells: Suppose that each dimension is evenly partitioned into 10 portions for chunking. The complete lattice is shown in Figure 4. A complete lattice for the cube of Exercise 4. The total size of the computed cube is as follows. The total amount of main memory space required for computing the 2-D planes is: Often, the aggregate measure value of many cells in a large data cuboid is zero, resulting in a huge, yet sparse, multidimensional matrix.
Note that you need to explain your data structures in detail and discuss the space needed, as well as how to retrieve data from your structures. Give the reasoning behind your new design. A way to overcome the sparse matrix problem is to use multiway array aggregation. The first step consists of partitioning the array-based cube into chunks or subcubes that are small enough to fit into the memory available for cube computation. Each of these chunks is first compressed to remove cells that do not contain any valid data, and is then stored as an object on disk.
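One possible way to picture the compressed chunks is to address each nonempty cell by a chunk identifier plus an offset inside that chunk, and to store nothing for empty cells. The sketch below rests on assumptions: the names `locate`, `compress`, and `lookup`, the row-major addressing, and the 4 x 4 x 4 array are all hypothetical, and in-memory dictionaries stand in for chunk objects on disk.

```python
def locate(coord, shape, chunk_shape):
    """Map a cell coordinate to (chunk_id, offset), both row-major."""
    chunk_id = offset = 0
    for x, size, c in zip(coord, shape, chunk_shape):
        n_chunks = -(-size // c)          # ceiling division: chunks along this dimension
        chunk_id = chunk_id * n_chunks + x // c
        offset = offset * c + x % c
    return chunk_id, offset

def compress(cells, shape, chunk_shape):
    """Group nonempty cells into chunks; empty cells are simply not stored."""
    chunks = {}
    for coord, value in cells.items():
        chunk_id, offset = locate(coord, shape, chunk_shape)
        chunks.setdefault(chunk_id, {})[offset] = value
    return chunks

def lookup(chunks, coord, shape, chunk_shape, empty=0):
    """Retrieve a cell value; a missing chunk or offset means an empty cell."""
    chunk_id, offset = locate(coord, shape, chunk_shape)
    return chunks.get(chunk_id, {}).get(offset, empty)

# Hypothetical 4 x 4 x 4 array split into 2 x 2 x 2 chunks, with three nonempty cells.
shape, chunk_shape = (4, 4, 4), (2, 2, 2)
cells = {(0, 0, 0): 5, (0, 1, 1): 2, (3, 3, 3): 7}
chunks = compress(cells, shape, chunk_shape)
print(chunks)
print(lookup(chunks, (3, 3, 3), shape, chunk_shape))
```

In this toy case only two of the eight chunks are materialized, so space grows with the number of nonempty cells rather than with the full array size. Retrieval costs one address computation and two dictionary lookups.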
The second step involves computing the aggregates by visiting cube cells in an order that minimizes the number of times that each cell must be revisited, thereby reducing memory access and storage costs. By first sorting and computing the planes of the data cube according to their size in ascending order, a smaller plane can be kept in main memory while fetching and computing only one chunk at a time for a larger plane. In order to handle incremental data updates, the data cube is first computed as described in (a).
Subsequently, only the chunk that contains the cells with the new data is recomputed, without needing to recompute the entire cube. This is because, with incremental updates, only one chunk at a time can be affected.
The recomputed value needs to be propagated to its corresponding higher-level cuboids. Thus, incremental data updates can be performed efficiently. When computing a cube of high dimensionality, we encounter the inherent curse of dimensionality problem: Compute the number of nonempty aggregate cells.
Comment on the storage space and time required to compute these cells. If the minimum support count in the iceberg condition is two, how many aggregate cells will there be in the iceberg cube? Show the cells. However, even with iceberg cubes, we could still end up having to compute a large number of trivial uninteresting cells i.
Suppose that a database has 20 tuples that map to or cover the two following base cells in a dimensional base cuboid, each with a cell count of Let the minimum support be How many distinct aggregate cells will there be like the following: What are the cells?
We subtract 1 because, for example, a1 , a2 , a3 ,. These four cells are: They are 4: There are only three distinct cells left: Propose an algorithm that computes closed iceberg cubes efficiently.
We base our answer on the algorithm presented in the paper: Let the cover of a cell be the set of base tuples that are aggregated in the cell. Cells with the same cover can be grouped in the same class if they share the same measure. Each class will have an upper bound, which consists of the most specific cells in the class, and a lower bound, which consists of the most general cells in the class.
The set of closed cells corresponds to the upper bounds of all of the distinct classes that compose the cube. We can compute the classes by following a depth-first search strategy: Let the cells making up this bound be u1, u2, Finding the upper bounds would depend on the measure. Incorporating iceberg conditions is not difficult.
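The grouping of cells by cover can be illustrated with a brute-force sketch. This is not the depth-first algorithm of the cited paper: it enumerates every cell, which is exponential in the number of dimensions, and it assumes the measure is count, in which case each class has a single upper bound. The function names and the three sample tuples are hypothetical.

```python
from itertools import product

def matches(cell, row):
    """True if the base tuple `row` is aggregated in `cell`."""
    return all(c == '*' or c == v for c, v in zip(cell, row))

def closed_iceberg_cells(rows, min_support=1):
    """Return {closed cell: count} for cells whose count >= min_support."""
    # Candidate cells: every generalization of every base tuple.
    cells = set()
    for row in rows:
        for mask in product((False, True), repeat=len(row)):
            cells.add(tuple('*' if star else v for v, star in zip(row, mask)))

    closed = {}
    for cell in cells:
        # Cover: indexes of the base tuples aggregated in this cell.
        cover = [i for i, row in enumerate(rows) if matches(cell, row)]
        if len(cover) < min_support:      # iceberg condition on count
            continue
        # Upper bound of the class: keep a value only where the whole cover agrees.
        columns = zip(*(rows[i] for i in cover))
        upper = tuple(vals[0] if len(set(vals)) == 1 else '*' for vals in columns)
        closed[upper] = len(cover)
    return closed

# Hypothetical base tuples over three dimensions.
rows = [("a1", "b1", "c1"), ("a1", "b1", "c2"), ("a2", "b2", "c2")]
print(closed_iceberg_cells(rows, min_support=2))
```

All cells sharing a cover map to the same upper bound, so the dictionary ends up with one entry per class. Applying the support test before computing the upper bound is how the iceberg condition enters: a class below the threshold is dropped as a whole.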