Data mining is the process of extracting patterns from data. As ever more data are collected, with the amount of data doubling roughly every three years, data mining is becoming an increasingly important tool for converting these data into information. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery. It is a part of information technology.
While data mining can be used to uncover patterns in data samples, it is important to be aware that the use of non-representative samples of data may produce results that are not indicative of the domain. Similarly, data mining will not find patterns that may be present in the domain if those patterns are not present in the sample being “mined”. There is a tendency for insufficiently knowledgeable “consumers” of the results to attribute “magical abilities” to data mining, treating the technique as a sort of all-seeing crystal ball. Like any other tool, it only functions in conjunction with the appropriate raw material: in this case, indicative and representative data that the user must first collect. Further, the discovery of a particular pattern in a particular set of data does not necessarily mean that the pattern is representative of the whole population from which the data were drawn. Hence, an important part of the process is the verification and validation of patterns on other samples of data.
The term data mining has also been used in a related but negative sense, to mean the deliberate searching for apparent but not necessarily representative patterns in large amounts of data. To avoid confusion with the other sense, the terms data dredging and data snooping are often used. Note, however, that dredging and snooping can be (and sometimes are) used as exploratory tools when hypotheses are being developed and clarified.
Background
Humans have been extracting patterns from data by hand for centuries, but the increasing volume of data in modern times has made more automated approaches necessary. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s). The proliferation, ubiquity and increasing power of computer technology have increased data collection and storage. As data sets have grown in size and complexity, direct hands-on analysis of data has increasingly been augmented with indirect, automated data processing. This has been aided by other discoveries in computer science, such as neural networks, clustering, genetic algorithms (1950s), decision trees (1960s) and support vector machines (1980s). Data mining is the process of applying these methods to data with the intention of uncovering hidden patterns. It has been used for many years by businesses, scientists and governments to sift through volumes of data such as airline passenger trip records, census data and supermarket scanner data to produce market research reports. (Note, however, that reporting is not always considered to be data mining.)
A primary reason for using data mining is to assist in the analysis of collections of observations of behaviour. Such data are vulnerable to collinearity because of unknown interrelations. An unavoidable fact of data mining is that the (sub-)set of data being analysed may not be representative of the whole domain, and therefore may not contain examples of certain critical relationships and behaviours that exist across other parts of the domain. To address this sort of issue, the analysis may be augmented with experiment-based and other approaches, such as choice modelling for human-generated data. In these situations, inherent correlations can either be controlled for or removed altogether during the construction of the experimental design.
There have been some efforts to define standards for data mining, for example the 1999 European Cross Industry Standard Process for Data Mining (CRISP-DM 1.0) and the 2004 Java Data Mining standard (JDM 1.0). These are evolving standards; later versions are under development. Independent of these standardisation efforts, freely available open-source software systems such as RapidMiner, Weka, KNIME and the R Project have become an informal standard for defining data mining processes. Most of these systems are able to import and export models in PMML (Predictive Model Markup Language), which provides a standard way to represent data mining models so that they can be shared between different statistical applications. PMML is an XML-based language developed by the Data Mining Group (DMG), an independent group composed of many data mining companies. PMML version 4.0 was released in June 2009.
Research and Development
In addition to industry-driven demand for standards and interoperability, professional and academic activity have also made considerable contributions to the evolution and rigour of the methods and models; an article published in a 2008 issue of the International Journal of Information Technology and Decision Making summarises the results of a literature survey that traces and analyses this evolution.
The premier professional body in the field is the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD). [citation needed] Since 1989 they have hosted an annual international conference and published its proceedings, and since 1999 they have published a biannual academic journal titled “SIGKDD Explorations”. Other computer science conferences on data mining include:
DMIN – International Conference on Data Mining;
DMKD – Research Issues on Data Mining and Knowledge Discovery;
ECML-PKDD – European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases;
ICDM – IEEE International Conference on Data Mining;
MLDM – Machine Learning and Data Mining in Pattern Recognition;
SDM – SIAM International Conference on Data Mining
Knowledge Discovery in Databases (KDD) is the name coined by Gregory Piatetsky-Shapiro in 1989 to describe the process of finding interesting, interpretable, useful and novel data. There are many nuances to this process, but roughly the steps are to pre-process raw data, mine the data, and interpret the results.
Once the purpose of the KDD process is known, a target data set must be assembled. As data mining can only uncover patterns already present in the data, the target data set must be large enough to contain these patterns while remaining concise enough to be mined within an acceptable time frame. A common source of data is a data mart or data warehouse.
The target set is then cleaned. Cleaning removes observations containing noise and those with missing data.
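A minimal cleaning step might look like the following sketch; the record layout and the plausibility range are hypothetical, standing in for whatever validity rules the domain actually imposes:

```python
def clean(records, low=0, high=150):
    """Keep only observations with no missing values and all
    measurements inside a plausible range."""
    return [r for r in records
            if None not in r and all(low <= v <= high for v in r)]

# Hypothetical (age, weight) observations; two are unusable:
# one has a missing value, one contains an out-of-range reading.
raw = [(34, 120), (None, 80), (29, 999), (41, 95)]
print(clean(raw))  # [(34, 120), (41, 95)]
```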
The clean data are then transformed into feature vectors, one vector per observation. A feature vector is a summarised version of the raw data observation. For example, a black and white image of a face which is 100 pixels by 100 pixels would contain 10,000 bits of raw data. This might be turned into a feature vector by locating the eyes and mouth in the image. Doing so would reduce the data for each vector from 10,000 bits to three codes for the locations, dramatically reducing the size of the dataset to be mined and hence reducing the processing effort. The feature(s) selected will depend on what the objective(s) is/are; obviously, selecting the “right” feature(s) is fundamental to successful data mining.
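The reduction described above can be sketched in code. The “detectors” here are deliberately naive, taking the darkest pixel inside each of three assumed face regions; a real system would use proper eye and mouth detection, but the effect on data volume is the same: three coordinates instead of 10,000 pixel values.

```python
def extract_features(image):
    """Reduce a 100x100 grayscale image to a 3-element feature
    vector: assumed positions of left eye, right eye and mouth."""
    def darkest(rows, cols):
        # Position of the minimum-intensity pixel inside a region.
        return min(((r, c) for r in rows for c in cols),
                   key=lambda rc: image[rc[0]][rc[1]])
    left_eye  = darkest(range(20, 50), range(10, 50))
    right_eye = darkest(range(20, 50), range(50, 90))
    mouth     = darkest(range(60, 95), range(25, 75))
    return [left_eye, right_eye, mouth]

# Synthetic 100x100 image: white background with three dark spots.
image = [[255] * 100 for _ in range(100)]
image[35][30] = image[35][70] = image[80][50] = 0

features = extract_features(image)
print(features)  # [(35, 30), (35, 70), (80, 50)]
```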
The feature vectors are divided into two sets, the “training set” and the “test set”. The training set is used to train the data mining algorithm, while the test set is used to verify the accuracy of any patterns found.
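The split can be sketched as a simple shuffled partition; the 25% test fraction and the fixed seed are arbitrary choices for illustration:

```python
import random

def train_test_split(vectors, test_fraction=0.25, seed=0):
    """Shuffle the feature vectors and split them into a training
    set and a test set."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    shuffled = vectors[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# 100 toy feature vectors.
data = [[x, x % 3] for x in range(100)]
train, test = train_test_split(data)
print(len(train), len(test))  # 75 25
```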
Data mining normally involves four classes of tasks:
Classification – arranges the data into predefined groups. For example, an email program might attempt to classify an email as legitimate or spam. Common algorithms include decision tree learning, nearest neighbour, naive Bayesian classification and neural networks.
Clustering – is like classification, but the groups are not predefined, so the algorithm will try to group similar items together.
Regression – attempts to find a function which models the data with the least error. A common method is to use genetic programming.
Association rule learning – searches for relationships between variables. For example, a supermarket can collect data on customers’ shopping habits. Using association rule learning, the supermarket can determine which products are often purchased together and can use this information for marketing purposes. It is sometimes called “Market Basket Analysis”.
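Of the four task classes above, clustering is easy to sketch. Below is a toy two-means procedure on one-dimensional data; the customer spend figures are invented, and real clustering methods handle many dimensions and more than two groups:

```python
def cluster_2_means(points, iterations=10):
    """Toy 2-means clustering for one-dimensional data: assign each
    point to the nearer of two centres, then move each centre to the
    mean of its group, and repeat."""
    c1, c2 = min(points), max(points)  # naive initialisation
    for _ in range(iterations):
        g1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
        g2 = [p for p in points if abs(p - c1) > abs(p - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return sorted(g1), sorted(g2)

# Invented monthly spend figures for six customers.
spend = [12, 14, 15, 90, 95, 99]
low_spenders, high_spenders = cluster_2_means(spend)
print(low_spenders, high_spenders)  # [12, 14, 15] [90, 95, 99]
```

Note that no groups were predefined: the two segments emerge from the data themselves, which is the distinction the list above draws between clustering and classification.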
The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by the data mining algorithms are necessarily valid. It is common for the algorithms to find patterns in the training set which are not present in the general data set; this is called overfitting. To overcome this, the evaluation uses a test set of data on which the data mining algorithm was not trained. The learned patterns are applied to this test set and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish spam from legitimate emails would be trained on a training set of sample emails. Once trained, the learned patterns would be applied to the test set of emails on which it had not been trained; the accuracy of these patterns can then be measured from how many emails they classify correctly. A number of statistical methods may be used to evaluate the algorithm, such as ROC curves.
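Measuring accuracy on a held-out test set can be sketched as follows. The “learned pattern” here is a hypothetical one-word spam rule rather than the output of a real training step, and the test emails are invented:

```python
def accuracy(classifier, test_set):
    """Fraction of test examples the learned pattern labels correctly."""
    correct = sum(1 for features, label in test_set
                  if classifier(features) == label)
    return correct / len(test_set)

# A hypothetical learned pattern: flag an email as spam if it
# contains the word "winner".
spam_rule = lambda words: "spam" if "winner" in words else "legitimate"

test_emails = [({"meeting", "agenda"}, "legitimate"),
               ({"winner", "prize"}, "spam"),
               ({"invoice"}, "legitimate"),
               ({"winner", "free"}, "spam"),
               ({"free", "offer"}, "spam")]  # this one the rule misses

print(accuracy(spam_rule, test_emails))  # 0.8
```

A pattern that scored perfectly on its own training emails but only 0.8 here would be showing exactly the overfitting the paragraph above describes.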
If the learned patterns do not meet the desired standards, then it is necessary to re-evaluate and change the preprocessing and data mining steps. If the learned patterns do meet the desired standards, then the final step is to interpret those patterns and turn them into knowledge.
Since the early 1960s, with the availability of oracles for certain combinatorial games, also called tablebases (e.g. for 3×3-chess with any beginning configuration, small-board dots-and-boxes, small-board hex, and certain endgames in chess, dots-and-boxes, and hex), a new area for data mining has been opened. This is the extraction of human-usable strategies from these oracles. Current pattern recognition approaches do not seem to possess the high level of abstraction required to be applied successfully. Instead, extensive experimentation with the tablebases, combined with an intensive study of tablebase answers to well-designed problems and with knowledge of prior art, i.e. pre-tablebase knowledge, is used to yield insightful patterns. Berlekamp in dots-and-boxes and John Nunn in chess endgames are notable examples of researchers doing this work, though they were not involved in tablebase generation.
In customer relationship management applications, data mining can contribute significantly to the bottom line. [citation needed] Rather than randomly contacting a prospect or customer through a call centre or by sending mail, a company can concentrate its efforts on prospects that are predicted to have a high likelihood of responding to an offer. More sophisticated methods may be used to optimise resources across campaigns so that one may predict which channel and which offer an individual is most likely to respond to, across all potential offers. Additionally, sophisticated applications can be used to automate the mailing: once the results from data mining (potential prospect/customer and channel/offer) are determined, this “sophisticated application” can automatically send an email or regular mail. Finally, in cases where many people will take an action without an offer, uplift modelling can be used to determine which people will show the greatest increase in response if given an offer. Data clustering can also be used to automatically discover the segments or groups within a customer data set.
Businesses employing data mining may see a return on investment, but they also recognise that the number of predictive models can quickly become very large. Rather than one model to predict which customers will respond, a business can build a separate model for each region and customer type. Then, instead of sending an offer to everyone likely to respond, it may want to send offers only to those customers likely to take up the offer. Finally, it may also want to determine which customers are going to be profitable over a period of time, and send offers only to those likely to be profitable. To maintain this volume of models, businesses need to move towards model version management and automated data mining.
Data mining can also be helpful to human resources departments in identifying the characteristics of their most successful employees. Information obtained, such as the universities attended by highly successful employees, can help HR focus recruiting efforts accordingly. Additionally, strategic enterprise management applications help a company translate corporate-level goals, such as profit and margin share targets, into operational decisions, such as production plans and workforce levels.
Another example of data mining, often called market basket analysis, relates to its use in retail sales. If a clothing store records the purchases of customers, a data mining system could identify those customers who favour silk shirts over cotton ones. Although some explanations of such relationships may be difficult, taking advantage of them is easier. This example deals with association rules within transaction-based data. Not all data are transaction-based, however, and logical or inexact rules may also be present within a database. In a manufacturing application, an inexact rule might state that 73% of products which have a specific defect or problem will develop a secondary problem within the next six months.
Market basket analysis has also been used to identify the purchase patterns of the alpha consumer. Alpha consumers are people who play a key role in connecting with the concept behind a product, then adopting that product, and finally validating it for the rest of society. Analysing the data collected on this type of user has allowed companies to predict future buying trends and forecast supply demands.
Data mining is a highly effective tool in the catalogue marketing industry. Cataloguers have a rich history of customer transactions on millions of customers dating back several years. Data mining tools can identify patterns among customers and help identify the customers most likely to respond to upcoming mailing campaigns.
Related to an integrated circuit production line, an example of data mining is described in the paper “Mining IC Test Data to Optimize VLSI Testing.” The paper describes the application of data mining to die-level functional testing.
Science and engineering
In recent years, data mining has been used widely in the areas of science and engineering, such as bioinformatics, genetics, medicine, education and electrical power engineering.
A key goal in the study of human genetics is to understand the relationship between inter-individual variation in human DNA sequences and variability in disease susceptibility. In lay terms, the aim is to find out how changes in an individual's DNA sequence affect the risk of developing common diseases such as cancer. This is very important in helping to improve the diagnosis, prevention and treatment of these diseases. The data mining technique used to perform this task is known as multifactor dimensionality reduction.
In electrical power engineering, data mining techniques have been widely used for condition monitoring of high-voltage electrical equipment. The purpose of condition monitoring is to obtain valuable information on the health status of the equipment's insulation. Data clustering techniques such as the self-organising map (SOM) have been applied to the vibration monitoring and analysis of transformer on-load tap-changers (OLTCs). Using vibration monitoring, it can be observed that each tap-change operation generates a signal that contains information about the condition of the tap-changer contacts and the drive mechanism. Obviously, different tap positions will generate different signals. However, there was considerable variability amongst normal condition signals for exactly the same tap position. SOM has been applied to detect abnormal conditions and to estimate the nature of the abnormalities.
Data mining techniques have also been applied to dissolved gas analysis (DGA) of power transformers. DGA has been available for many years as a diagnostic tool for power transformers. Data mining techniques such as SOM are applied to analyse the data and to determine trends which are not obvious to standard DGA ratio techniques such as the Duval Triangle.
A fourth area of application for data mining in science/engineering is educational research, where data mining has been used to study the factors leading students to choose behaviours which reduce their learning, and to understand the factors influencing student retention at university. A similar example of the social application of data mining is its use in expertise finding systems, whereby descriptors of human expertise are extracted, normalised and classified so as to facilitate the finding of experts, particularly in scientific and technical fields. In this way, data mining can facilitate institutional memory.
Other examples of applying data mining techniques include mining biomedical data facilitated by domain ontologies, mining clinical trial data, and traffic analysis using SOM.
In adverse drug reaction surveillance, the Uppsala Monitoring Centre has, since 1998, used data mining methods to routinely screen for reporting patterns indicative of emerging drug safety issues in the WHO global database of 4.6 million suspected adverse drug reaction incidents. Recently, similar methodology has been developed to mine large collections of electronic health records for temporal patterns associating drug prescriptions with medical diagnoses.
Spatial data mining
Spatial data mining is the application of data mining techniques to spatial data. Spatial data mining follows the same process as data mining, with the end objective of finding patterns in geography. So far, data mining and Geographic Information Systems (GIS) have existed as two separate technologies, each with its own methods, traditions and approaches to visualisation and data analysis. In particular, most contemporary GIS have only very basic spatial analysis functionality. The immense explosion in geographically referenced data occasioned by developments in digital mapping, remote sensing and the global diffusion of GIS emphasises the importance of developing data-driven inductive approaches to geographical analysis and modelling.
Data mining, which is the partially automated search for hidden patterns in large databases, offers great potential benefits for applied GIS-based decision-making. Recently, the task of integrating these two technologies has become critical, especially as various public and private sector organisations possessing huge databases of thematic and geographically referenced data begin to realise the enormous potential of the information hidden there. Among those organisations are:
Offices requiring analysis or dissemination of geo-referenced statistical data
Public health services that seek explanation of the disease groups
Environmental agencies assessing the impact of changing land-use patterns on climate change
Geo-marketing companies segmenting customers based on spatial location
Geospatial data repositories tend to be very large. Moreover, existing GIS datasets are often split into feature and attribute components which are conventionally archived in hybrid data management systems. Algorithmic requirements differ substantially for relational (attribute) data management and for topological (feature) data management. Related to this are the range and diversity of geographic data formats, which also present unique challenges. The digital geographic data revolution is creating new types of data formats beyond the traditional “vector” and “raster” formats. Geographic data repositories increasingly include ill-structured data such as imagery and geo-referenced multimedia.
There are several critical research challenges in geographic knowledge discovery and data mining. Miller and Han offer the following list of emerging research topics in the field:
Developing and supporting geographic data warehouses – spatial properties are often reduced to simple aspatial attributes in mainstream data warehouses. Creating an integrated geographic data warehouse (GDW) requires solving issues of spatial and temporal data interoperability, including differences in semantics, referencing systems, geometry, accuracy and position.
Better spatio-temporal representations in geographic knowledge discovery – current geographic knowledge discovery (GKD) techniques generally use very simple representations of geographic objects and spatial relationships. Geographic data mining techniques should recognise more complex geographic objects (lines and polygons) and relationships (non-Euclidean distances, direction, connectivity, and interaction through attributed geographic space such as terrain). Time needs to be more fully integrated into these geographic representations and relationships.
Geographic knowledge discovery using diverse data types – GKD techniques should be developed that can handle diverse data types beyond the traditional raster and vector models, including imagery and geo-referenced multimedia, as well as dynamic data types (video streams, animation).
Prior data mining efforts to stop terrorist activity under the U.S. government have included the Total Information Awareness (TIA) program, Secure Flight (formerly known as the Computer-Assisted Passenger Prescreening System, CAPPS II), Analysis, Dissemination, Visualization, Insight, Semantic Enhancement (ADVISE), and the Multistate Anti-Terrorism Information Exchange (MATRIX). These programs have been discontinued due to controversy over whether they violate the Fourth Amendment to the United States Constitution, although many programs formed under them continue to be funded by different organisations, or under different names.
Two plausible data mining techniques in the context of counter-terrorism are “pattern mining” and “subject-based data mining”.
“Pattern mining” is a data mining technique that involves finding existing patterns in data. In this context, patterns often means association rules. The original motivation for searching for association rules came from the desire to analyse supermarket transaction data, that is, to examine customer behaviour in terms of the purchased products. For example, an association rule “beer => crisps (80%)” states that four out of five customers who bought beer also bought crisps.
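The 80% figure is the rule's confidence, which is straightforward to compute; the baskets below are invented to reproduce the four-out-of-five example:

```python
def confidence(transactions, antecedent, consequent):
    """confidence(A => B): among transactions containing A, the
    fraction that also contain B."""
    with_a = [t for t in transactions if antecedent in t]
    with_both = [t for t in with_a if consequent in t]
    return len(with_both) / len(with_a)

# Five invented baskets: four beer buyers also bought crisps.
baskets = [{"beer", "crisps"}, {"beer", "crisps"}, {"beer", "crisps"},
           {"beer", "crisps"}, {"beer", "wine"}]
print(confidence(baskets, "beer", "crisps"))  # 0.8
```

In practice a support threshold is also applied, so that only rules backed by enough transactions are reported.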
In the context of pattern mining as a tool to identify terrorist activity, the National Research Council provides the following definition: “Pattern-based data mining looks for patterns (including anomalous data patterns) that might be associated with terrorist activity; these patterns might be regarded as small signals in a large ocean of noise.” Pattern mining includes new areas such as music information retrieval (MIR), where patterns seen in both the temporal and non-temporal domains are imported into classical knowledge discovery search techniques.
Subject-based data mining
“Subject-based data mining” is a data mining technique involving the search for associations between individuals in data. In the context of combating terrorism, the National Research Council provides the following definition: “Subject-based data mining uses an initiating individual or other datum that is considered, based on other information, to be of high interest, and the goal is to determine what other persons or financial transactions or movements, etc., are related to that initiating datum.”
Privacy concerns and principles
Some people believe that data mining itself is ethically neutral. However, the ways in which data mining can be used can raise questions regarding privacy, legality and ethics. In particular, the mining of government or commercial data sets for national security or law enforcement purposes, such as in the Total Information Awareness Program or in ADVISE, has raised privacy concerns.
Data mining requires data preparation which can uncover information or patterns that may compromise confidentiality and privacy obligations. A common way for this to occur is through data aggregation. Data aggregation is when data are accrued, possibly from various sources, and put together so that they can be analysed; this is not data mining per se, but a result of preparing the data before and for the purposes of the analysis. The threat to an individual's privacy comes into play when the data, once compiled, enable the data miner, or anyone who has access to the newly compiled data set, to identify specific individuals, especially when the data were originally anonymous.
It is recommended that an individual is made aware of the following before data are collected:
the purpose of the data collection and any data mining projects,
how the data will be used,
who will be able to mine the data and use them,
the security surrounding access to the data, and
how collected data can be updated.
Data may additionally be modified so as to become anonymous, so that individuals may not be readily identified. However, even de-identified data sets can contain enough information to identify individuals, as occurred when journalists were able to find several individuals based on a set of search histories that was inadvertently released.
Situation in the United States
In the United States, privacy concerns have been addressed by the US Congress via the passage of regulatory controls such as the Health Insurance Portability and Accountability Act (HIPAA). HIPAA requires individuals to give their “informed consent” regarding the information they provide and its intended present and future uses. According to an article in Biotech Business Week, “In practice, HIPAA may not offer any greater protection than the longstanding regulations in the research arena. More importantly, the rule's goal of protection through informed consent is undermined by the complexity of the consent forms required of patients and participants, which approach a level of incomprehensibility for average individuals.” This underscores the necessity for data anonymity in data aggregation and mining practices.
US information privacy legislation such as HIPAA and the Family Educational Rights and Privacy Act (FERPA) applies only to the specific areas that each such law addresses. The use of data mining by the majority of businesses in the US is not controlled by any legislation.