The "Time-reversed" Problem of Some Crawled Data #13

MiracleXYZ · 2018-02-07T10:01:54Z

When I downloaded historical S&P 500 data (^GSPC) from Yahoo Finance, I found that the data was time-reversed, i.e. the latest entries were at the top of the DataFrame. The same ordering appears in nearly all files in the provided data archive (stock-data-lilianweng.tar.gz), except SP500.csv and _SP500.csv.
Here is the problem: we never sort the data by time, although the LSTM model needs its input in chronological order. In data_model.py, lines 25 to 35:

        # Read csv file
        raw_df = pd.read_csv(os.path.join("data", "%s.csv" % stock_sym))

        # Merge into one sequence
        if close_price_only:
            self.raw_seq = raw_df['Close'].tolist()
        else:
            self.raw_seq = [price for tup in raw_df[['Open', 'Close']].values for price in tup]

        self.raw_seq = np.array(self.raw_seq)
        self.train_X, self.train_y, self.test_X, self.test_y = self._prepare_data(self.raw_seq)

We simply extract the close prices from the DataFrame without checking the time order. For the reversed files, this means we test on the earliest 10% of the data instead of the latest 10%, which is not a reasonable evaluation setup.
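To make the effect concrete, here is a small sketch with toy data. It assumes that the split inside `_prepare_data` takes the final 10% of the sequence as the test set, which is what the behaviour described above implies; the `Date` column name, the `test_ratio` variable, and the values are illustrative assumptions, not taken from the repository.

```python
import numpy as np
import pandas as pd

# Toy frame in newest-first order, mimicking the downloaded CSV layout.
# The "Date" column name and all values are made up for illustration.
dates = pd.date_range("2018-01-01", periods=10, freq="D")[::-1]
raw_df = pd.DataFrame({"Date": dates, "Close": np.arange(10, 0, -1, dtype=float)})

seq = raw_df["Close"].to_numpy()

# Assumed split: the last 10% of the sequence becomes the test set.
test_ratio = 0.1
n_test = max(1, int(len(seq) * test_ratio))
split = len(seq) - n_test

test_dates = raw_df["Date"].iloc[split:]

# With newest-first data, the tail of the sequence holds the oldest rows.
print("test set covers:", test_dates.min().date(), "to", test_dates.max().date())
print("oldest date in file:", raw_df["Date"].min().date())
```

With the rows in newest-first order, the tail slice lands on the oldest dates in the file, so the "test" period precedes the training period.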

Maybe we should sort the data by time before extracting the closing prices, or otherwise make sure the data is read in the right order and in a consistent format.
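One possible way to do the sorting is sketched below. It is only a suggestion, not the repository's code: `load_sorted_prices` is a hypothetical helper, and it assumes the CSV has a `Date` column that pandas can parse (as the Yahoo Finance export does). The in-memory CSV uses toy values.

```python
import io

import pandas as pd


def load_sorted_prices(csv_source, date_col="Date"):
    """Hypothetical helper: read a price CSV and return it oldest-first.

    Assumes the file has a parseable date column named `date_col`.
    """
    raw_df = pd.read_csv(csv_source, parse_dates=[date_col])
    # Sort chronologically so that the tail of the sequence is the latest data.
    return raw_df.sort_values(date_col).reset_index(drop=True)


# Toy CSV in newest-first order (values are made up for illustration).
sample_csv = io.StringIO(
    "Date,Open,Close\n"
    "2018-02-06,3.5,3.0\n"
    "2018-02-05,2.5,2.0\n"
    "2018-02-02,1.5,1.0\n"
)

sorted_df = load_sorted_prices(sample_csv)
raw_seq = sorted_df["Close"].to_numpy()  # now in chronological order
print(sorted_df)
```

In data_model.py, the same idea would amount to sorting `raw_df` by its date column right after `pd.read_csv` and before building `self.raw_seq`. Sorting is a no-op for files that are already chronological, so it should be safe to apply to SP500.csv and _SP500.csv as well.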
