When I downloaded historical S&P 500 data (^GSPC) from Yahoo Finance, I found that the data was in reverse chronological order, i.e. the latest entries were at the top of the DataFrame. The same is true of nearly all of the files in the provided data archive (stock-data-lilianweng.tar.gz), except SP500.csv and _SP500.csv.
Here is the problem: we never sort the data by time, even though the LSTM model needs its input sequence in chronological order. In data_model.py, lines 25 to 35:
    # Read csv file
    raw_df = pd.read_csv(os.path.join("data", "%s.csv" % stock_sym))
    # Merge into one sequence
    if close_price_only:
        self.raw_seq = raw_df['Close'].tolist()
    else:
        self.raw_seq = [price for tup in raw_df[['Open', 'Close']].values for price in tup]
    self.raw_seq = np.array(self.raw_seq)
    self.train_X, self.train_y, self.test_X, self.test_y = self._prepare_data(self.raw_seq)
We simply extract the close prices from the DataFrame without checking the time order. For the reversed files, this means we test on the earliest 10% of the data instead of the latest 10%, which is unreasonable.
Maybe we should sort the data by time before extracting the closing prices, or otherwise make sure the data is read in the right order and in a consistent format.
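A minimal sketch of the first option, sorting before extraction. It assumes the CSV has a `Date` column (as Yahoo Finance exports do); the helper name `load_sorted_prices` and its `date_col` parameter are hypothetical and not part of data_model.py.

```python
import io

import numpy as np
import pandas as pd


def load_sorted_prices(csv_source, close_price_only=True, date_col="Date"):
    """Read a price CSV and return the prices oldest-first.

    Hypothetical helper: assumes a date column named `date_col` exists.
    """
    raw_df = pd.read_csv(csv_source, parse_dates=[date_col])
    # Sort ascending by date so the tail of the sequence is the most recent data.
    raw_df = raw_df.sort_values(date_col).reset_index(drop=True)
    if close_price_only:
        seq = raw_df["Close"].tolist()
    else:
        # Interleave Open and Close per day, as in the original snippet.
        seq = [price for tup in raw_df[["Open", "Close"]].values for price in tup]
    return np.array(seq)


# Small newest-first sample, mimicking the reversed files described above.
sample_csv = io.StringIO(
    "Date,Open,Close\n"
    "2017-01-05,3.0,3.5\n"
    "2017-01-04,2.0,2.5\n"
    "2017-01-03,1.0,1.5\n"
)
prices = load_sorted_prices(sample_csv)
```

Sorting on a parsed date column makes the result independent of the order in the file, so files that are already chronological (such as SP500.csv) pass through unchanged.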