Mining sequential patterns

Publication year

2001

Authors

Ahola, Jussi

Abstract

Discovering associations is one of the fundamental tasks of data mining. Its aim is to automatically search for dependencies in vast amounts of data. The task results in so-called association rules, which are of the form: if A occurs in the data, then B also occurs. Only those rules that occur in the data frequently enough are generated. However, various information sources generate data with an inherently sequential nature, i.e., data composed of discrete events that have a temporal or spatial ordering. This kind of data can be obtained from, e.g., telecommunications networks, electronic commerce, Internet web servers, and various scientific sources, such as gene databases. The sequential nature of the data is ignored entirely in the generation of association rules, so part of the useful information in the data is discarded. Consequently, since the mid-1990s, interest in also discovering sequential associations in the data has arisen in the data mining community. Sequential associations, or sequential patterns, can be presented in the form: when A occurs, B occurs within a certain time. The difference from traditional association rules is that the time information is included both in the rule itself and in the mining process, in the form of timing constraints. Several highly efficient methods now exist for mining this kind of pattern. Their problem is that they assume the input data to be sequences of discrete events that carry only ordering information, usually the time. Often, however, the events are associated with additional attributes. The existing methods cannot take this multi-dimensionality of the data into account, and so they lose the additional information it carries. Furthermore, the methods are designed for specific problems and are not, as such, applicable to different types of sequential data.
In this report, a general formulation of sequential patterns is introduced as presented in [1]. With this approach, the last problem of the existing algorithms can be tackled. A survey of the existing algorithms is then given. Three algorithms are presented in detail: WINEPI [2] and GSP [4], as they form the basis of the algorithms, and cSPADE [6], since it appears to be the most promising method proposed for the problem so far. The other relevant approaches are also briefly introduced. Lastly, the extension of the patterns to the multi-dimensional case is considered. Some ideas for handling the problem are given, and the features of the existing algorithms that support multi-dimensionality are studied.
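
As a rough illustration of the rule form "when A occurs, B occurs within a certain time", the following minimal sketch counts how often an event of type A is followed by an event of type B inside a time limit, and keeps the rule only if it occurs frequently enough. It is not WINEPI, GSP, or cSPADE; the function names, the `(timestamp, type)` event representation, and the `min_support` and `min_confidence` thresholds are assumptions made for this example only.

```python
from typing import List, Tuple

# Assumed representation: an event is a (timestamp, event_type) pair.
Event = Tuple[float, str]


def count_within(events: List[Event], a: str, b: str, max_gap: float) -> Tuple[int, int]:
    """Return (number of a-events, number of a-events followed by b within max_gap)."""
    ordered = sorted(events)  # order by timestamp
    total_a = 0
    followed = 0
    for i, (t, kind) in enumerate(ordered):
        if kind != a:
            continue
        total_a += 1
        # Scan forward until the timing constraint is violated.
        for t2, kind2 in ordered[i + 1:]:
            if t2 - t > max_gap:
                break
            if kind2 == b and t2 > t:
                followed += 1
                break
    return total_a, followed


def rule_holds(events: List[Event], a: str, b: str, max_gap: float,
               min_support: int, min_confidence: float) -> bool:
    """Hypothetical acceptance test: frequent enough and reliable enough."""
    total_a, followed = count_within(events, a, b, max_gap)
    if total_a == 0:
        return False
    return followed >= min_support and followed / total_a >= min_confidence


if __name__ == "__main__":
    log = [(1, "A"), (2, "B"), (5, "A"), (9, "B"), (10, "A"), (11, "C"), (12, "B")]
    print(count_within(log, "A", "B", max_gap=3))  # (3, 2)
    print(rule_holds(log, "A", "B", max_gap=3, min_support=2, min_confidence=0.6))  # True
```

In the sample log, the A at time 5 is not counted because the next B arrives 4 time units later, which exceeds the assumed limit of 3. An ordinary association rule would ignore both the order and the gap.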

Organizations and authors

Teknologiska forskningscentralen VTT Ab

Ahola Jussi

Publication type

Publication format

Separate work

Target audience

Professional

Ministry of Education and Culture publication type

D4 Published development or research report or study

Publication channel information

Journal/Series

VTT Research Report

Publisher

VTT Technical Research Centre of Finland

Number

TTE1-2001-10

Open access

Open access in the publisher's service

License of the publisher's version

Other license

Self-archived

No

Other information

Fields of science

Computer and information sciences

Language

English

International co-publication

No

Co-publication with a company

No

The publication is included in the Ministry of Education and Culture's data collection

No