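This project performs data preprocessing and data analysis on two datasets: it prints a data summary for each dataset, produces visualizations, and handles missing values using four different methods.
Attributes of the GitHub Dataset: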
- name - the name of the repository
- stars_count - stars count of the repository
- forks_count - forks count of the repository
- watchers - watchers in the repository
- pull_requests - pull requests made in the repository
- primary_language - the primary language of the repository
- languages_used - list of all the languages used in the repository
- commit_count - commits made in the repository
- created_at - time and date when the repository was created
- license - license assigned to the repository
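
The data summary mentioned above can be illustrated with a minimal sketch. It assumes a five-number summary for numeric attributes and frequency counts for nominal attributes, which is a common convention but not confirmed to be what github.py does. The sample rows, the helper names `five_number_summary` and `nominal_summary`, and all values are hypothetical; only the attribute names come from the list above. The standard library is used so the sketch runs without extra dependencies.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical sample rows using attribute names from the list above.
# The values are made up for illustration and are not taken from the dataset.
repos = [
    {"name": "repo-a", "stars_count": 10, "primary_language": "Python"},
    {"name": "repo-b", "stars_count": 50, "primary_language": "Python"},
    {"name": "repo-c", "stars_count": 0, "primary_language": None},
    {"name": "repo-d", "stars_count": 200, "primary_language": "Go"},
    {"name": "repo-e", "stars_count": 30, "primary_language": "Rust"},
]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    ordered = sorted(values)
    # method="inclusive" treats the data as the full population.
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


def nominal_summary(rows, column):
    """Return (value frequencies, number of missing entries) for a nominal column."""
    present = [row[column] for row in rows if row[column] is not None]
    missing = sum(1 for row in rows if row[column] is None)
    return Counter(present), missing


stars = [row["stars_count"] for row in repos]
print("stars_count summary:", five_number_summary(stars))

frequencies, missing = nominal_summary(repos, "primary_language")
print("primary_language frequencies:", dict(frequencies))
print("primary_language missing:", missing)
```

If pandas is available, `DataFrame.describe()` and `Series.value_counts()` report similar information with less code.
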
Attributes of the Tweet Sentiment's Impact on Stock Returns dataset:
- TWEET: Text of the tweet. (String)
- STOCK: Company's stock mentioned in the tweet. (String)
- DATE: Date the tweet was posted. (Date)
- LAST_PRICE: Company's last price at the time of tweeting. (Float)
- 1_DAY_RETURN: Amount the stock returned or lost over the course of the next day after being tweeted about. (Float)
- 2_DAY_RETURN: Amount the stock returned or lost over the course of the two days after being tweeted about. (Float)
- 3_DAY_RETURN: Amount the stock returned or lost over the course of the three days after being tweeted about. (Float)
- 7_DAY_RETURN: Amount the stock returned or lost over the course of the seven days after being tweeted about. (Float)
- PX_VOLUME: Volume traded at the time of tweeting. (Integer)
- VOLATILITY_10D: Volatility measure across 10 day window. (Float)
- VOLATILITY_30D: Volatility measure across 30 day window. (Float)
- LSTM_POLARITY: Labeled sentiment from LSTM. (Float)
- TEXTBLOB_POLARITY: Labeled sentiment from TextBlob. (Float)
- MENTION: Number of times the stock was mentioned in the tweet. (Integer)
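
This README does not name the four missing-value methods. As a sketch only, the code below shows four commonly used strategies: dropping incomplete rows, filling with the most frequent value, filling from a correlated attribute with a least-squares line, and filling from the most similar row. Treating these as the four methods is an assumption, and tweet-sentiment.py may do something different. The function names and sample values are hypothetical; only the column names come from the list above.

```python
from collections import Counter

# Hypothetical rows; the values are made up. None marks a missing value.
rows = [
    {"STOCK": "AAA", "VOLATILITY_10D": 10.0, "VOLATILITY_30D": 20.0},
    {"STOCK": "BBB", "VOLATILITY_10D": 20.0, "VOLATILITY_30D": 40.0},
    {"STOCK": "CCC", "VOLATILITY_10D": None, "VOLATILITY_30D": 36.0},
    {"STOCK": "DDD", "VOLATILITY_10D": 10.0, "VOLATILITY_30D": 20.0},
]


def drop_missing(rows, column):
    """Strategy 1: discard rows where `column` is missing."""
    return [dict(r) for r in rows if r[column] is not None]


def fill_with_mode(rows, column):
    """Strategy 2: replace missing values with the most frequent value."""
    counts = Counter(r[column] for r in rows if r[column] is not None)
    mode = counts.most_common(1)[0][0]
    return [{**r, column: mode} if r[column] is None else dict(r) for r in rows]


def fill_from_correlated(rows, column, predictor):
    """Strategy 3: predict missing values from `predictor` with a least-squares line."""
    complete = [r for r in rows if r[column] is not None]
    xs = [r[predictor] for r in complete]
    ys = [r[column] for r in complete]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return [{**r, column: intercept + slope * r[predictor]}
            if r[column] is None else dict(r) for r in rows]


def fill_from_nearest(rows, column, feature):
    """Strategy 4: copy the value from the complete row closest in `feature`."""
    complete = [r for r in rows if r[column] is not None]
    filled = []
    for r in rows:
        if r[column] is None:
            nearest = min(complete, key=lambda c: abs(c[feature] - r[feature]))
            filled.append({**r, column: nearest[column]})
        else:
            filled.append(dict(r))
    return filled


dropped = drop_missing(rows, "VOLATILITY_10D")
by_mode = fill_with_mode(rows, "VOLATILITY_10D")
by_line = fill_from_correlated(rows, "VOLATILITY_10D", "VOLATILITY_30D")
by_neighbor = fill_from_nearest(rows, "VOLATILITY_10D", "VOLATILITY_30D")
print(len(dropped), by_mode[2], by_line[2], by_neighbor[2], sep="\n")
```

Each function returns new rows and leaves the input untouched, so the four results can be compared side by side against the original data.
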
1. Run github.py to view the processing results for the GitHub Dataset.
2. Run tweet-sentiment.py to view the processing results for the Tweet Sentiment's Impact on Stock Returns dataset.