ResearchHub | Open Science Community

Searching and mining trillions of time series subsequences under dynamic time warping

Thanawin Rakthanmanon et al.Aug 12, 2012

Most time series data mining algorithms use similarity search as a core subroutine, and thus the time taken for similarity search is the bottleneck for virtually all time series data mining algorithms. The difficulty of scaling search to large datasets largely explains why most academic work on time series data mining has plateaued at considering a few millions of time series objects, while much of industry and science sits on billions of time series objects waiting to be explored. In this work we show that by using a combination of four novel ideas we can search and mine truly massive time series for the first time. We demonstrate the following extremely unintuitive fact; in large datasets we can exactly search under DTW much more quickly than the current state-of-the-art Euclidean distance search algorithms. We demonstrate our work on the largest set of time series experiments ever attempted. In particular, the largest dataset we consider is larger than the combined size of all of the time series datasets considered in all data mining papers ever published. We show that our ideas allow us to solve higher-level time series data mining problem such as motif discovery and clustering at scales that would otherwise be untenable. In addition to mining massive datasets, we will show that our ideas also have implications for real-time monitoring of data streams, allowing us to handle much faster arrival rates and/or use cheaper and lower powered devices than are currently possible.

Artificial Intelligence

Paleontology

0

Paper

Artificial Intelligence

975

0

Save

0

Experimental comparison of representation methods and distance measures for time series data

Xiaoyue Wang et al.Feb 9, 2012

The previous decade has brought a remarkable increase of the interest in applications that deal with querying and mining of time series data. Many of the research efforts in this context have focused on introducing new representation methods for dimensionality reduction or novel similarity measures for the underlying data. In the vast majority of cases, each individual work introducing a particular method has made specific claims and, aside from the occasional theoretical justifications, provided quantitative experimental observations. However, for the most part, the comparative aspects of these experiments were too narrowly focused on demonstrating the benefits of the proposed methods over some of the previously introduced ones. In order to provide a comprehensive validation, we conducted an extensive experimental study re-implementing eight different time series representations and nine similarity measures and their variants, and testing their effectiveness on 38 time series data sets from a wide variety of application domains. In this article, we give an overview of these different techniques and present our comparative experimental findings regarding their effectiveness. In addition to providing a unified validation of some of the existing achievements, our experiments also indicate that, in some cases, certain claims in the literature may be unduly optimistic.

Artificial Intelligence

Paleontology

0

Paper

Artificial Intelligence

748

0

Save

0

Matrix Profile I: All Pairs Similarity Joins for Time Series: A Unifying View That Includes Motifs, Discords and Shapelets

Chin‐Chia Yeh et al.Dec 1, 2016

The all-pairs-similarity-search (or similarity join) problem has been extensively studied for text and a handful of other datatypes. However, surprisingly little progress has been made on similarity joins for time series subsequences. The lack of progress probably stems from the daunting nature of the problem. For even modest sized datasets the obvious nested-loop algorithm can take months, and the typical speed-up techniques in this domain (i.e., indexing, lower-bounding, triangular-inequality pruning and early abandoning) at best produce one or two orders of magnitude speedup. In this work we introduce a novel scalable algorithm for time series subsequence all-pairs-similarity-search. For exceptionally large datasets, the algorithm can be trivially cast as an anytime algorithm and produce high-quality approximate solutions in reasonable time. The exact similarity join algorithm computes the answer to the time series motif and time series discord problem as a side-effect, and our algorithm incidentally provides the fastest known algorithm for both these extensively-studied problems. We demonstrate the utility of our ideas for two time series data mining problems, including motif discovery and novelty discovery.

Artificial Intelligence

Paleontology

0

Paper

Artificial Intelligence

472

0

Save

0

Exact Discovery of Time Series Motifs

Abdullah Mueen et al.Apr 30, 2009

Time series motifs are pairs of individual time series, or subsequences of a longer time series, which are very similar to each other. As with their discrete analogues in computational biology, this similarity hints at structure which has been conserved for some reason and may therefore be of interest. Since the formalism of time series motifs in 2002, dozens of researchers have used them for diverse applications in many different domains. Because the obvious algorithm for computing motifs is quadratic in the number of items, more than a dozen approximate algorithms to discover motifs have been proposed in the literature. In this work, for the first time, we show a tractable exact algorithm to find time series motifs. As we shall show through extensive experiments, our algorithm is up to three orders of magnitude faster than brute-force search in large datasets. We further show that our algorithm is fast enough to be used as a subroutine in higher level data mining algorithms for anytime classification, near-duplicate detection and summarization, and we consider detailed case studies in domains as diverse as electroencephalograph interpretation and entomological telemetry data mining.

Artificial Intelligence

Paleontology

0

Paper

Artificial Intelligence

406

0

Save

0

Logical-shapelets

Abdullah Mueen et al.Aug 21, 2011

Time series shapelets are small, local patterns in a time series that are highly predictive of a class and are thus very useful features for building classifiers and for certain visualization and summarization tasks. While shapelets were introduced only recently, they have already seen significant adoption and extension in the community. Despite their immense potential as a data mining primitive, there are two important limitations of shapelets. First, their expressiveness is limited to simple binary presence/absence questions. Second, even though shapelets are computed offline, the time taken to compute them is significant. In this work, we address the latter problem by introducing a novel algorithm that finds shapelets in less time than current methods by an order of magnitude. Our algorithm is based on intelligent caching and reuse of computations, and the admissible pruning of the search space. Because our algorithm is so fast, it creates an opportunity to consider more expressive shapelet queries. In particular, we show for the first time an augmented shapelet representation that distinguishes the data based on conjunctions or disjunctions of shapelets. We call our novel representation Logical-Shapelets. We demonstrate the efficiency of our approach on the classic benchmark datasets used for these problems, and show several case studies where logical shapelets significantly outperform the original shapelet representation and other time series classification techniques. We demonstrate the utility of our ideas in domains as diverse as gesture recognition, robotics, and biometrics.

Artificial Intelligence

Law

0

Paper

Artificial Intelligence

266

0

Save

0

DeBot: Twitter Bot Detection via Warped Correlation

Nikan Chavoshi et al.Dec 1, 2016

We develop a warped correlation finder to identify correlated user accounts in social media websites such as Twitter. The key observation is that humans cannot be highly synchronous for a long duration, thus, highly synchronous user accounts are most likely bots. Existing bot detection methods are mostly supervised, which requires a large amount of labeled data to train, and do not consider cross-user features. In contrast, our bot detection system works on activity correlation without requiring labeled data. We develop a novel lag-sensitive hashing technique to cluster user accounts into correlated sets in near real-time. Our method, named DeBot, detects thousands of bots per day with a 94% precision and generates reports online everyday. In September 2016, DeBot has accumulated about 544,868 unique bots in the previous one year. We compare our detection technique with per-user techniques and with Twitter's suspension system. We observe that some bots can avoid Twitter's suspension mechanism and remain active for months, and, more alarmingly, we show that DeBot detects bots at a rate higher than the rate Twitter is suspending them.

Artificial Intelligence

Signal Processing

0

Paper

Artificial Intelligence

244

0

Save

0

Addressing Big Data Time Series

Thanawin Rakthanmanon et al.Sep 1, 2013

Most time series data mining algorithms use similarity search as a core subroutine, and thus the time taken for similarity search is the bottleneck for virtually all time series data mining algorithms, including classification, clustering, motif discovery, anomaly detection, and so on. The difficulty of scaling a search to large datasets explains to a great extent why most academic work on time series data mining has plateaued at considering a few millions of time series objects, while much of industry and science sits on billions of time series objects waiting to be explored. In this work we show that by using a combination of four novel ideas we can search and mine massive time series for the first time. We demonstrate the following unintuitive fact: in large datasets we can exactly search under Dynamic Time Warping (DTW) much more quickly than the current state-of-the-art Euclidean distance search algorithms. We demonstrate our work on the largest set of time series experiments ever attempted. In particular, the largest dataset we consider is larger than the combined size of all of the time series datasets considered in all data mining papers ever published. We explain how our ideas allow us to solve higher-level time series data mining problems such as motif discovery and clustering at scales that would otherwise be untenable. Moreover, we show how our ideas allow us to efficiently support the uniform scaling distance measure, a measure whose utility seems to be underappreciated, but which we demonstrate here. In addition to mining massive datasets with up to one trillion datapoints, we will show that our ideas also have implications for real-time monitoring of data streams, allowing us to handle much faster arrival rates and/or use cheaper and lower powered devices than are currently possible.

Artificial Intelligence

Paleontology

0

Paper

Artificial Intelligence

200

0

Save

0

Matrix Profile II: Exploiting a Novel Algorithm and GPUs to Break the One Hundred Million Barrier for Time Series Motifs and Joins

Yan Zhu et al.Dec 1, 2016

Time series motifs have been in the literature for about fifteen years, but have only recently begun to receive significant attention in the research community. This is perhaps due to the growing realization that they implicitly offer solutions to a host of time series problems, including rule discovery, anomaly detection, density estimation, semantic segmentation, etc. Recent work has improved the scalability to the point where exact motifs can be computed on datasets with up to a million data points in tenable time. However, in some domains, for example seismology, there is an insatiable need to address even larger datasets. In this work we show that a combination of a novel algorithm and a high-performance GPU allows us to significantly improve the scalability of motif discovery. We demonstrate the scalability of our ideas by finding the full set of exact motifs on a dataset with one hundred million subsequences, by far the largest dataset ever mined for time series motifs. Furthermore, we demonstrate that our algorithm can produce actionable insights in seismology and other domains.

Artificial Intelligence

Paleontology

0

Paper

Artificial Intelligence

193

0

Save

0

BitLINK: Temporal Linkage of Address Clusters in Bitcoin Blockchain

Sheng Zhong et al.Aug 24, 2024

In the Bitcoin blockchain, an entity (e.g., a gambling service) may control multiple distinct address clusters. Links (i.e., trust relationships) between these disjoint address clusters can be established when one cluster is abandoned, and a new one is formed shortly thereafter. To link the clusters across time, we have developed a deep neural network model that exploits these synchronous actions derived from unlabeled data in a self-supervised manner. This model assesses whether two clusters exhibit synchronous temporal signatures indicative of a shared entity ownership.

Genetics

Artificial Intelligence

0

Paper

Genetics

Artificial Intelligence

0

Save