Time Series discrete note

Anh-Thi Dinh
draft

Terminologies & fields of research

  • Burst detection: An unexpectedly large number of events occurring within a certain temporal or spatial region is called a burst; it suggests unusual behavior or activity.
  • Time Series Regression: (ref) Time series regression is a statistical method for predicting a future response based on the response history (known as autoregressive dynamics) and the transfer of dynamics from relevant predictors. Time series regression can help you understand and predict the behavior of dynamic systems from experimental or observational data. Time series regression is commonly used for modeling and forecasting of economic, financial, and biological systems.
  • Time Series Classification: (ref) Time series classification deals with classifying data points over time based on their behavior. Some data sets behave abnormally compared with other data sets. Identifying unusual and anomalous time series is becoming increasingly common for organizations.
    • (ref) Time series classification data differs from a regular classification problem since the attributes have an ordered sequence.
  • Anomaly Detection:
    • An abnormal segment within a single time series.
    • One or more time series that differ from the others.
    • Individual abnormal points within a single time series.
    • Applies to both univariate and multivariate time series.
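
As an illustration of the "abnormal points within a single time series" case, here is a minimal sketch based on a trailing rolling z-score. This is only one of many possible approaches; the helper name `flag_point_anomalies`, the window size, the threshold, and the toy data are all assumptions made for this example.

```python
import numpy as np
import pandas as pd

def flag_point_anomalies(series: pd.Series, window: int = 20, threshold: float = 3.0) -> pd.Series:
    """Hypothetical helper: mark points far from the mean of the previous `window` points."""
    # shift(1) keeps the current point out of its own reference window
    rolling = series.shift(1).rolling(window, min_periods=window)
    z = (series - rolling.mean()) / rolling.std()
    # NaN z-scores (not enough history yet) compare as False
    return z.abs() > threshold

# Toy data (assumed): noise around a constant level with one injected spike
rng = np.random.default_rng(0)
idx = pd.date_range("2021-01-01", periods=200, freq="h")
values = 10 + rng.normal(0, 1, 200)
values[120] += 15.0  # the injected anomaly
ts = pd.Series(values, index=idx)

anomalies = flag_point_anomalies(ts)
print(ts[anomalies])  # includes the spike; a few ordinary noise points may also appear
```

A trailing window is used so that a point is only compared with its own past. Data with a trend or seasonality would need to be detrended first, otherwise ordinary points can be flagged.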

Read_CSV

More here.
import pandas as pd

df_13 = pd.read_csv(path_file,
                    index_col='timestamp',
                    parse_dates=True, # index contains dates
                    infer_datetime_format=True, # auto-recognize the format (deprecated since pandas 2.0, where it has no effect)
                    cache_dates=True) # faster
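
To check what these options give you, here is a small self-contained sketch that reads a CSV from an in-memory string instead of a file. The column names and values are made up for the example.

```python
import io
import pandas as pd

# Assumed toy content standing in for a real file
csv_text = """timestamp,value
2021-01-01 00:00:00,1.0
2021-01-01 00:10:00,1.5
2021-01-02 08:00:00,0.7
"""

df = pd.read_csv(io.StringIO(csv_text), index_col="timestamp", parse_dates=True)

print(type(df.index).__name__)  # DatetimeIndex
print(df.loc["2021-01-01"])     # partial-string indexing: the rows of that day
```

With a `DatetimeIndex` in place, date-based selection and resampling work directly on the dataframe.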

Find the windows of time series

Suppose the data comes in groups of points separated by gaps, and we want to find the common interval length of all groups.
# find the biggest gap
df['date'].diff().max()

# 4 biggest gaps (drop the leading NaT first)
df['date'].diff().dropna().sort_values().iloc[-4:]

# start of each window (a gap of at least '1D' separates windows)
# .to_numpy() makes the mask positional, so it also works when df has a non-default index
w_starts = df.reset_index()[~(df['date'].diff() < pd.to_timedelta('1D')).to_numpy()].index

# end of each window
w_ends = (w_starts[1:] - 1).append(pd.Index([df.shape[0]-1]))

# count the number of windows
len(w_starts)

# the biggest/average window size (in points)
(w_ends - w_starts + 1).max()
(w_ends - w_starts + 1).values.mean()

# the biggest window size (in time range)
# subtract positional arrays; subtracting the two Series would align them on the index
pd.Timedelta((df['date'].iloc[w_ends].to_numpy() - df['date'].iloc[w_starts].to_numpy()).max())
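
To make the window boundaries concrete, here is a small worked example on toy data. The dataframe and its values are assumptions for illustration; the logic follows the snippet above with a 1-day gap.

```python
import pandas as pd

# Toy data (assumed): three bursts of readings separated by gaps longer than one day
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2021-01-01 00:00", "2021-01-01 01:00", "2021-01-01 02:00",
        "2021-01-03 00:00", "2021-01-03 01:00",
        "2021-01-06 00:00",
    ]),
    "value": [1, 2, 3, 4, 5, 6],
})

gap = pd.to_timedelta("1D")
# The first diff is NaT, and NaT < gap is False, so the first row counts as a start
is_start = ~(df["date"].diff() < gap).to_numpy()
w_starts = df.reset_index(drop=True).index[is_start]
w_ends = (w_starts[1:] - 1).append(pd.Index([len(df) - 1]))

print(w_starts.tolist())                 # [0, 3, 5]
print(w_ends.tolist())                   # [2, 4, 5]
print((w_ends - w_starts + 1).tolist())  # points per window: [3, 2, 1]
```

The last window contains a single point, which is the situation to watch for when the gap threshold is chosen too small.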
To add a window column to the original dataframe:
df_tmp = df.copy()
for i in range(w_starts.shape[0]):
    # .loc slices include the end label (assumes the default 0..n-1 index)
    df_tmp.loc[w_starts[i]:w_ends[i], 'window'] = i
df_tmp['window'] = df_tmp['window'].astype(int) # convert float to int
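
The loop above can also be replaced by a cumulative sum over the "new window starts here" flags. The following is a sketch on assumed toy data, not a drop-in replacement tested on the original dataset.

```python
import pandas as pd

# Toy data (assumed): three groups separated by gaps longer than one day
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2021-01-01 00:00", "2021-01-01 01:00", "2021-01-01 02:00",
        "2021-01-03 00:00", "2021-01-03 01:00",
        "2021-01-06 00:00",
    ]),
})

gap = pd.to_timedelta("1D")
# True wherever the distance to the previous point is at least `gap`;
# the first row is False because comparing NaT yields False
new_window = df["date"].diff() >= gap
df["window"] = new_window.cumsum()  # 0, 0, 0, 1, 1, 2

# Per-window summary: number of points, first and last timestamp, duration
summary = df.groupby("window")["date"].agg(["size", "min", "max"])
summary["duration"] = summary["max"] - summary["min"]
print(summary)
```

Because the labels come from `cumsum`, the column is already an integer and no dtype conversion is needed afterwards.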
There are other cases to consider:
  • The gaps are not regular.
  • If the gap threshold (used to determine the windows) is too small, some windows contain only one point.
To find the gap threshold automatically:
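import pandas as pd
from sklearn.cluster import MeanShift

def find_gap_auto(df):
    # unique gap sizes in seconds, sorted ascending, as a column vector
    gaps = df['date'].diff().dropna().drop_duplicates().sort_values()
    X = gaps.dt.total_seconds().to_numpy().reshape(-1, 1)

    clustering = MeanShift().fit(X)
    labels = clustering.labels_
    cluster_min = labels[0] # cluster that contains the smallest gap

    # threshold = midpoint between the largest "small" gap and the smallest "large" gap
    # (raises ValueError if MeanShift finds a single cluster)
    gap = pd.to_timedelta((X[labels!=cluster_min].min() + X[labels==cluster_min].max())/2, unit='s')

    return gap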
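
If scikit-learn is not available, a simpler heuristic can serve as a rough stand-in: sort the unique gap sizes and put the threshold in the middle of the largest jump between two consecutive sizes. This is a different technique from MeanShift, and the helper name `find_gap_largest_jump` and the toy data are assumptions for this sketch.

```python
import numpy as np
import pandas as pd

def find_gap_largest_jump(dates: pd.Series) -> pd.Timedelta:
    """Hypothetical helper: threshold at the midpoint of the largest jump between sorted unique gaps."""
    gaps = dates.diff().dropna().drop_duplicates().sort_values()
    secs = gaps.dt.total_seconds().to_numpy()
    if secs.size < 2:
        raise ValueError("need at least two distinct gap sizes")
    k = int(np.diff(secs).argmax())  # position of the largest jump
    return pd.to_timedelta((secs[k] + secs[k + 1]) / 2, unit="s")

# Toy data (assumed): gaps of 1h, 1h, 46h, 1h and 71h
dates = pd.Series(pd.to_datetime([
    "2021-01-01 00:00", "2021-01-01 01:00", "2021-01-01 02:00",
    "2021-01-03 00:00", "2021-01-03 01:00",
    "2021-01-06 00:00",
]))

gap = find_gap_largest_jump(dates)
print(gap)  # 0 days 23:30:00
```

The absolute jump favors splits among the largest gaps when the gap sizes span several orders of magnitude; comparing the gaps on a log scale may work better in that case.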