B.OBULIRAJ
B.E CSE
DATA MINING
INTRODUCTION
Data mining, a branch of computer science and artificial intelligence, is the
process of extracting patterns from data. Modern businesses increasingly see it
as an important tool for transforming data into business intelligence, which
gives them an informational advantage. It is currently used in a wide range of
profiling practices, such as marketing, surveillance, fraud detection, and
scientific discovery.
DEFINITION
Data mining, the extraction of hidden predictive information from large
databases, is a powerful new technology with great potential to help companies
focus on the most important information in their data warehouses.
Data mining commonly involves several classes of tasks, including:
· Clustering - the task of discovering groups and structures in the data that
are in some way "similar", without using known structures in the data.
· Classification - the task of generalizing known structure to apply to new
data. For example, an email program might attempt to classify an email as
legitimate or spam. Common algorithms include decision tree learning, nearest
neighbour, naive Bayesian classification, neural networks, and support vector
machines.
· Association rule learning - the search for relationships between variables.
For example, a supermarket might gather data on customer purchasing habits.
Using association rule learning, the supermarket can determine which products
are frequently bought together and use this information for marketing purposes.
This is sometimes referred to as market basket analysis.
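The supermarket example above can be sketched in a few lines. This is a minimal illustration, not a full association rule miner: it only considers pairs of items, and the function name `pair_rules`, the `min_support` parameter, and the sample baskets are all assumptions made up for this sketch. Support is the share of baskets containing both items; confidence is the share of baskets containing the first item that also contain the second.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for frequent pairs."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore duplicates within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair yields two candidate rules: a -> b and b -> a.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    # Highest confidence first; names break ties so the order is stable.
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))


# Hypothetical baskets, invented for illustration only.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["butter", "milk"],
]
rules = pair_rules(baskets, min_support=0.5)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: "
          f"support={support:.2f}, confidence={confidence:.2f}")
```

With these made-up baskets, the strongest rule is "milk -> butter": every basket that contains milk also contains butter. Production tools typically extend the same idea to larger item sets.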
THE FOUNDATION OF DATA MINING
Data mining techniques are the result of a
long process of research and product development. This evolution began when
business data was first stored on computers, continued with improvements in
data access, and more recently, generated technologies that allow users to
navigate through their data in real time. Data mining takes this evolutionary
process beyond retrospective data access and navigation to prospective and
proactive information delivery. Data mining is ready for application in the
business community because it is supported by three technologies that are now
sufficiently mature:
· Massive data collection
· Powerful multiprocessor computers
· Data mining algorithms
THE SCOPE OF DATA MINING
Data mining derives its name from the
similarities between searching for valuable business information in a large
database — for example, finding linked products in gigabytes of store scanner
data — and mining a mountain for a vein of valuable ore. Both processes require
either sifting through an immense amount of material, or intelligently probing
it to find exactly where the value resides. Given databases of sufficient size
and quality, data mining technology can generate new business opportunities by
providing these capabilities:
· Automated prediction of trends and behaviors. Data mining automates the
process of finding predictive information in large databases. Questions that
traditionally required extensive hands-on analysis can now be answered quickly
and directly from the data. A typical example of a predictive problem is
targeted marketing. Data mining uses data on past promotional mailings to
identify the targets most likely to maximize return on investment in future
mailings. Other predictive problems include forecasting bankruptcy and other
forms of default, and identifying segments of a population likely to respond
similarly to given events.
· Automated discovery of previously unknown patterns. Data mining tools sweep
through databases and identify previously hidden patterns in one step. An
example of pattern discovery is the analysis of retail sales data to identify
seemingly unrelated products that are often purchased together. Other pattern
discovery problems include detecting fraudulent credit card transactions and
identifying anomalous data that could represent data entry keying errors.
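As one possible illustration of the last point, anomalous values such as keying errors can be flagged with a robust outlier score. The text does not name a method; the sketch below assumes the modified z-score based on the median absolute deviation (MAD), with the commonly used cutoff of 3.5. The function name `flag_outliers` and the sample amounts are hypothetical.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against, so flag nothing
    # 0.6745 rescales the MAD so the score is comparable to a standard
    # z-score when the data are roughly normally distributed.
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical transaction amounts; 1250.0 looks like 12.50 keyed without
# its decimal point.
amounts = [12.5, 13.0, 11.8, 12.9, 1250.0, 12.2, 13.4]
suspects = flag_outliers(amounts)
print(suspects)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value can inflate the standard deviation enough to hide itself.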
Data mining techniques can
yield the benefits of automation on existing software and hardware platforms,
and can be implemented on new systems as existing platforms are upgraded and
new products developed. When data mining tools are implemented on high
performance parallel processing systems, they can analyze massive databases in
minutes. Faster processing means that users can automatically experiment with
more models to understand complex data. High speed makes it practical for users
to analyze huge quantities of data. Larger databases, in turn, yield improved
predictions.
Databases can be larger in
both depth and breadth:
· More columns. Analysts must often limit the number of variables they examine
when doing hands-on analysis due to time constraints. Yet variables that are
discarded because they seem unimportant may carry information about unknown
patterns. High performance data mining allows users to explore the full depth
of a database, without preselecting a subset of variables.
· More rows. Larger samples yield lower estimation errors and variance, and
allow users to make inferences about small but important segments of a
population.
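The "more rows" point can be made concrete with the standard error of a sample mean. This is a textbook result rather than something stated in the original, and it assumes n independent observations with a common standard deviation σ:

```latex
\[
  \operatorname{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}},
  \qquad
  \frac{\operatorname{SE}_{4n}}{\operatorname{SE}_{n}}
    = \frac{\sigma/\sqrt{4n}}{\sigma/\sqrt{n}}
    = \frac{1}{2}
\]
```

Under that assumption, quadrupling the number of rows halves the estimation error. This is why a segment that is too small to study in a sample may become analyzable in the full database.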
A recent Gartner Group Advanced Technology Research Note listed data mining
and artificial intelligence at the top of the five key technology areas that
"will clearly have a major impact across a wide range of industries within the
next 3 to 5 years." [2] Gartner also listed parallel architectures and data
mining as two of the top 10 new technologies in which companies will invest
during the next 5 years. According to a recent Gartner HPC Research Note,
"With the rapid advance in data capture, transmission and storage,
large-systems users will increasingly need to implement new and innovative
ways to mine the after-market value of their vast stores of detail data,
employing MPP [massively parallel processing] systems to create new sources of
business advantage (0.9 probability)." [3]
The most commonly used
techniques in data mining are:
· Artificial neural networks: Non-linear predictive models that learn through
training and resemble biological neural networks in structure.
· Decision trees: Tree-shaped structures that represent sets of decisions.
These decisions generate rules for the classification of a dataset. Specific
decision tree methods include Classification and Regression Trees (CART) and
Chi Square Automatic Interaction Detection (CHAID).
· Genetic algorithms: Optimization techniques that use processes such as
genetic combination, mutation, and natural selection in a design based on the
concepts of evolution.
· Nearest neighbor method: A technique that classifies each record in a dataset
based on a combination of the classes of the k record(s) most similar to it in
a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor
technique.
· Rule induction: The extraction of useful if-then rules from data based on
statistical significance.
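Of the techniques listed above, the nearest neighbor method is the simplest to show in code. The following is a minimal sketch, assuming numeric features, Euclidean distance as the similarity measure, and a simple majority vote as the "combination of the classes". The function name `knn_classify` and the sample records are hypothetical.

```python
from collections import Counter
from math import dist  # Euclidean distance, available from Python 3.8


def knn_classify(history, query, k=3):
    """Classify `query` by majority vote among its k nearest labelled records."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Sort the historical records by distance to the query and keep the k closest.
    nearest = sorted(history, key=lambda rec: dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical historical dataset: (features, class label).
history = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((2.0, 1.0), "low"),
    ((8.0, 8.0), "high"),
    ((9.0, 7.5), "high"),
]
label = knn_classify(history, (8.5, 8.0), k=3)
print(label)
```

Sorting the whole history for every query is fine for a sketch. On large datasets, implementations usually rely on spatial indexes or approximate search instead.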
Many of these
technologies have been in use for more than a decade in specialized analysis
tools that work with relatively small volumes of data. These capabilities are
now evolving to integrate directly with industry-standard data warehouse and
OLAP platforms. The appendix to this white paper provides a glossary of data
mining terms.
ARCHITECTURE FOR DATA MINING
These advanced techniques are best applied when they are fully integrated with
a data warehouse and with flexible, interactive business analysis tools. Many
data mining tools currently
operate outside of the warehouse, requiring extra steps for extracting, importing,
and analyzing the data. Furthermore, when new insights require operational
implementation, integration with the warehouse simplifies the application of
results from data mining. The resulting analytic data warehouse can be applied
to improve business processes throughout the organization, in areas such as
promotional campaign management, fraud detection, new product rollout, and so
on.
The ideal starting point is a data
warehouse containing a combination of internal data tracking all customer
contact coupled with external market data about competitor activity. Background
information on potential customers also provides an excellent basis for
prospecting. This warehouse can be implemented in a variety of relational
database systems: Sybase, Oracle, Redbrick, and so on, and should be optimized
for flexible and fast data access.
An OLAP (On-Line Analytical
Processing) server enables a more sophisticated end-user business model to be
applied when navigating the data warehouse. The multidimensional structures
allow the user to analyze the data as they want to view their business –
summarizing by product line, region, and other key perspectives of their
business. The Data Mining Server must be integrated with the data warehouse and
the OLAP server to embed ROI-focused business analysis directly into this
infrastructure. An advanced, process-centric metadata template defines the data
mining objectives for specific business issues like campaign management,
prospecting, and promotion optimization. Integration with the data warehouse
enables operational decisions to be directly implemented and tracked. As the
warehouse grows with new decisions and results, the organization can
continually mine the best practices and apply them to future decisions.
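The kind of summarizing an OLAP server provides, by product line, region, and similar dimensions, can be approximated in a few lines. This is only an illustrative sketch of a roll-up over in-memory rows, not how any particular OLAP product works. The function name `roll_up`, the column names, and the sales figures are assumptions.

```python
from collections import defaultdict


def roll_up(rows, dimensions, measure):
    """Sum `measure` over `rows`, grouped by the given dimension columns."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


# Hypothetical fact rows, invented for illustration.
sales = [
    {"product_line": "snacks", "region": "north", "revenue": 120.0},
    {"product_line": "snacks", "region": "south", "revenue": 80.0},
    {"product_line": "drinks", "region": "north", "revenue": 200.0},
    {"product_line": "drinks", "region": "north", "revenue": 50.0},
]
by_line = roll_up(sales, ["product_line"], "revenue")
by_line_region = roll_up(sales, ["product_line", "region"], "revenue")
print(by_line)
print(by_line_region)
```

Changing the `dimensions` list is the in-code equivalent of viewing the same data from a different business perspective.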
This design represents a fundamental
shift from conventional decision support systems. Rather than simply delivering
data to the end user through query and reporting software, the Advanced
Analysis Server applies users’ business models directly to the warehouse and
returns a proactive analysis of the most relevant information. These results
enhance the metadata in the OLAP Server by providing a dynamic metadata layer
that represents a distilled view of the data. Reporting, visualization, and
other analysis tools can then be applied to plan future actions and confirm the
impact of those plans.
PROFITABLE APPLICATIONS
A wide range of companies have
deployed successful applications of data mining. While early adopters of this
technology have tended to be in information-intensive industries such as
financial services and direct mail marketing, the technology is applicable to any
company looking to leverage a large data warehouse to better manage its
customer relationships. Two critical factors for success with data mining are:
a large, well-integrated data warehouse and a well-defined understanding of the
business process within which data mining is to be applied (such as customer
prospecting, retention, campaign management, and so on).
Some successful application
areas include:
· A pharmaceutical company can analyze its recent sales force activity and its
results to improve targeting of high-value physicians and determine which
marketing activities will have the greatest impact in the next few months.
· A credit card company can leverage its vast warehouse of customer transaction
data to identify customers most likely to be interested in a new credit
product. Using a small test mailing, the attributes of customers with an
affinity for the product can be identified. Recent projects have indicated more
than a 20-fold decrease in costs for targeted mailing campaigns over
conventional approaches.
· A diversified transportation company with a large direct sales force can
apply data mining to identify the best prospects for its services. Using data
mining to analyze its own customer experience, this company can build a unique
segmentation identifying the attributes of high-value prospects. Applying this
segmentation to a general business database such as those provided by Dun &
Bradstreet can yield a prioritized list of prospects by region.
· A large consumer packaged goods company can apply data mining to improve its
sales process to retailers. Data from consumer panels, shipments, and
competitor activity can be applied to understand the reasons for brand and
store switching. Through this analysis, the manufacturer can select promotional
strategies that best reach its target customer segments.
Each of these examples has a
clear common ground. They leverage the knowledge about customers implicit in a
data warehouse to reduce costs and improve the value of customer relationships.
These organizations can now focus their efforts on the most important
(profitable) customers and prospects, and design targeted marketing strategies
to best reach them.
CONCLUSION
Comprehensive data warehouses that
integrate operational data with customer, supplier, and market information have
resulted in an explosion of information. Competition requires timely and
sophisticated analysis on an integrated view of the data. Both relational and
OLAP technologies have tremendous capabilities for navigating massive data
warehouses, but brute force navigation of data is not enough. A new
technological leap is needed to structure and prioritize information for
specific end-user problems. Quantifiable business benefits have been proven
through the integration of data mining with current information systems, and
new products are on the horizon that will bring this integration to an even
wider audience of users.