Feature Extraction Bug: FFT Data Leakage causing Fake Result #363

Open
nova-land opened this issue May 12, 2023 · 0 comments


nova-land commented May 12, 2023

The FFT, as computed in this project, should not be considered a valid feature.

The FFT is computed over the whole dataset, so the values at earlier time steps are affected by future data.
For example, if you remove the last row of the price data, all of the FFT values change.
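
To illustrate the point about removing the last row, here is a minimal sketch on a synthetic random walk (not the project's data). The helper `lowpass_reconstruct` and the choice of `keep=3` are assumptions made for this example; the masking simply mirrors the `[p:-p] = 0` pattern used later in this issue.

```python
import numpy as np

def lowpass_reconstruct(series, keep):
    """Zero every FFT coefficient except the `keep` outermost ones on each side, then invert."""
    coeffs = np.fft.fft(series)
    coeffs[keep:-keep] = 0
    return np.fft.ifft(coeffs)

rng = np.random.default_rng(0)
prices = 100 + np.cumsum(rng.normal(size=200))  # synthetic prices, for illustration only

full = lowpass_reconstruct(prices, keep=3)
truncated = lowpass_reconstruct(prices[:-1], keep=3)

# Compare the reconstructed values at the time steps both series share.
max_change = np.max(np.abs(full[:-1] - truncated))
print(f"Largest change in earlier values after dropping one row: {max_change:.6f}")
```

A non-zero change means the value assigned to an early time step depends on rows that come after it.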

If the author achieves good accuracy, it is mainly due to data leakage.

The following code contains data leakage.

The original section: Link

close_fft = np.fft.fft(np.asarray(data_FT['GS'].tolist()))
fft_df = pd.DataFrame({'fft':close_fft})
fft_df['absolute'] = fft_df['fft'].apply(lambda x: np.abs(x))
fft_df['angle'] = fft_df['fft'].apply(lambda x: np.angle(x))

The whole project can become invalid because of this data leakage. Every step after it is garbage in, garbage out (GIGO).
Even a simple MLP model can produce good results with such data leakage.


A possible solution:

import numpy as np
import pandas as pd

df = ...  # The price data, with a 'close' column

periods = [3, 6, 9]
index_data = []
data = {}  # was undefined in the first version of this snippet
for p in periods:
    data[f'abs_{p}'] = []
    data[f'angle_{p}'] = []

# Calculate the FFT using only the rows before the current one
# Caution: In practice, start the range later than 1. The earliest windows are too short for the FFT value to be useful.
for i in range(1, len(df)):
    window = df['close'].iloc[:i]
    index_data.append(df.index[i])
    fft_close = np.fft.fft(window.values)

    for p in periods:
        fft_list = np.copy(fft_close)
        fft_list[p:-p] = 0  # keep only the p outermost coefficients on each side

        final_fft = np.fft.ifft(fft_list)
        absolute = np.abs(final_fft)[-1]
        angle = np.angle(final_fft)[-1]

        data[f'abs_{p}'].append(absolute)
        data[f'angle_{p}'].append(angle)

# Collect the features, indexed by the row each one belongs to
features = pd.DataFrame(data, index=index_data)

In that case, you will notice a huge difference: the features WILL NOT capture the same movement shown in the author's figure. This proves that the project's reported performance is based on data leakage.

Caution: Separating the training and testing data before applying the author's original FFT feature will still cause data leakage. The FFT may only be calculated on data that has already been 'seen' at each time step. Otherwise, every value is calculated from the whole dataset it is given.
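
One way to check the 'seen data only' requirement is to compute the features twice, once on the full series and once on a truncated copy, and confirm that the shared rows are identical. The sketch below does this on synthetic data. The function name `expanding_fft_features` and the `min_window=32` default are assumptions for this example, not part of the project.

```python
import numpy as np

def expanding_fft_features(close, periods=(3, 6, 9), min_window=32):
    """For each row i >= min_window, build FFT features from close[:i] only."""
    close = np.asarray(close, dtype=float)
    features = {f"abs_{p}": [] for p in periods}
    features.update({f"angle_{p}": [] for p in periods})
    for i in range(min_window, len(close)):
        coeffs = np.fft.fft(close[:i])  # only rows strictly before row i
        for p in periods:
            filtered = coeffs.copy()
            filtered[p:-p] = 0
            last = np.fft.ifft(filtered)[-1]
            features[f"abs_{p}"].append(np.abs(last))
            features[f"angle_{p}"].append(np.angle(last))
    return {name: np.array(values) for name, values in features.items()}

rng = np.random.default_rng(1)
prices = 100 + np.cumsum(rng.normal(size=120))  # synthetic prices, for illustration only

full = expanding_fft_features(prices)
partial = expanding_fft_features(prices[:80])  # pretend the last 40 rows do not exist yet

# A leak-free feature must give the same values for the rows both runs share.
is_causal = all(
    np.allclose(full[name][:len(partial[name])], partial[name]) for name in full
)
print("Features unchanged when future rows are removed:", is_causal)
```

The same comparison applied to a feature built from a single FFT over the whole series would fail, which makes it a quick test for this kind of leakage.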

@nova-land nova-land changed the title Feature Extraction Bug: FFT Data Leakage Feature Extraction Bug: FFT Data Leakage causing Fake Result May 12, 2023