
# Literature Review


## Papers

  1. Automated Hate Speech Detection and the Problem of Offensive Language
  2. Automatic Identification and Classification of Misogynistic Language on Twitter
  3. “Real men don’t hate women”: Twitter rape threats and group identity
  4. A Quality Type-aware Annotated Corpus and Lexicon for Harassment Research
  5. Analyzing and learning the language for different types of harassment
  6. #NotOkay: Understanding Gender-Based Violence in Social Media
  7. Gendered Conversation in a Social Game-Streaming Platform
  8. Gender-Based Violence in 140 Characters or Fewer: A #BigData Case Study of Twitter
  9. Text Analysis in Adversarial Settings: Does Deception Leave a Stylistic Trace?
  10. The Problem of Identifying Misogynist Language on Twitter (and other online social spaces)
  11. Classifying Misogynistic Tweets Using a Blended Model: The AMI Shared Task in IBEREVAL 2018
  12. Identification and Classification of Misogynous Tweets Using Multi-classifier Fusion
  13. When a Tweet is Actually Sexist. A more Comprehensive Classification of Different Online Harassment Categories and The Challenges in NLP
  14. Harassment Detection on Twitter using Conversations
  15. Detecting Hate Speech Against Women in English Tweets
  16. Hateminers: Detecting Hate speech against Women
  17. Abusive Language Detection in Online User Content
  18. All You Need is “Love”: Evading Hate Speech Detection
  19. Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter
  20. Detecting Misogynous Tweets
  21. Predictive Embeddings for Hate Speech Detection on Twitter
Each paper is summarised below (by number) in terms of its data set, goal, method, conclusion, and remarks.

**1.**
- **Data set:** Compiled a list of terms, searched Twitter for those terms, then retrieved the timelines of the users found in the initial search. Tweets were crowd-source labelled.
- **Goal:** Split tweets into hate speech, offensive, and neither.
- **Method:** Logistic regression, naive Bayes, SVM, decision trees (a minimal pipeline is sketched below).
- **Remarks:** Sexist tweets tend to end up labelled offensive rather than hate speech.

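A minimal sketch of the kind of three-way pipeline described in 1., assuming scikit-learn with TF-IDF features and logistic regression; the tweets, labels, and parameters are illustrative placeholders, not the paper's actual setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy labelled sample: 0 = hate speech, 1 = offensive, 2 = neither
tweets = [
    "toy hateful tweet",
    "toy offensive tweet",
    "toy neutral tweet",
    "another toy neutral tweet",
]
labels = [0, 1, 2, 2]

# TF-IDF uni/bigrams feeding a logistic regression over three classes
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(tweets, labels)
print(pipeline.predict(["a new unseen tweet"]))
```
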
**2.**
- **Data set:** Tweets, human labelled.
- **Goal:** Build a dataset and study NLP features for classifying misogynistic language.
- **Method:** Broke misogyny into five groups: discredit, stereotype & objectification, sexual harassment & threats of violence, dominance, and derailing. Used SVM, RF, NB, and MPNN, with guidelines from 17.
- **Conclusion:** Misogynistic language can be identified.

**3.**
- **Data set:** Tweets from the attack on Caroline Criado-Perez.
- **Goal:** Analyze the language surrounding sexual aggression on Twitter to detect emerging discourse communities and how they identify.
- **Method:** Corpus linguistics (CL) and discourse analysis (DA).
- **Conclusion:** When talked about in relation to threats and abuse, women occurred as the grammatical target of the abuse/threats; gender collocates with aggression; the grammatical actor is invisible or implied. The risk of a user becoming aggressive can also be identified from their profile.
- **Remarks:** Collocation: a sequence of words that co-occur more often than expected by chance (see the sketch below).

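A toy illustration of collocation extraction as defined in the remark above, assuming NLTK and a placeholder token stream; PMI scores word pairs that co-occur more often than chance would predict.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Placeholder token stream standing in for the tweet corpus
tokens = (
    "the threats target her and the threats target her account again"
).split()

# Rank bigrams by pointwise mutual information (PMI): pairs that
# co-occur more often than independence predicts score highest
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # drop one-off pairs
print(finder.nbest(bigram_measures.pmi, 5))
```
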
**4.**
- **Data set:** Tweets collected using offensive keywords (lists in the article), split over five areas of harassment: (i) sexual, (ii) racial, (iii) appearance, (iv) political, (v) intellectual. 10,000 tweets per term.
- **Goal:** Develop a content-specific corpus for cyberbullying.
- **Method:** Three native English-speaking annotators determined whether or not a given tweet is harassing with respect to the type of harassment content, assigning one of three labels: "yes", "no", or "other".
- **Conclusion:** NA.
- **Remarks:** Harassment lexicon available on GitHub.

**5.**
- **Data set:** Same data set as 4.
- **Goal:** Compare multiclass and binary type-specific classifiers ("type" referring to the five types in 4.).
- **Method:** Compare SVM, GBM, and KNN classifiers with different vector representations (e.g., TF-IDF, word2vec, etc.).
- **Conclusion:** For sexual harassment tweets, a GBM classifier combined with a TF-IDF/LIWC vector combination was highly accurate, with precision, recall, and F scores > 95% (the TF-IDF/GBM combination is sketched below).

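A minimal sketch of the TF-IDF plus gradient boosting combination from 5., assuming scikit-learn; LIWC is a commercial lexicon, so its features are omitted here and the data are placeholders.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Placeholder data: 1 = sexual-harassment tweet, 0 = not
tweets = [
    "toy harassing example one",
    "toy harassing example two",
    "toy benign example one",
    "toy benign example two",
]
labels = [1, 1, 0, 0]

# The paper combines TF-IDF with LIWC category scores; here TF-IDF
# alone stands in for that combination
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("gbm", GradientBoostingClassifier(n_estimators=100)),
])
pipe.fit(tweets, labels)
print(pipe.predict(["another toy tweet"]))
```
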
**6.**
- **Data set:** Used Twitter's streaming API to gather 300,000 tweets, which were then filtered by keyword.
- **Goal:** Provide empirical insights into social media discourse on the sensitive topic of gender-based violence (GBV).
- **Method:** Mine conversations discussing sexual harassment cases (rather than finding abusive tweets).
- **Conclusion:** The analysis shows more engagement with GBV tweets than with generic tweets; the engagement is not uniform across ages and genders.
- **Remarks:** NA.

**8.**
- **Data set:** Collected from the Twitter Streaming API (Twitter, 2014) using its "filter/track" method with a set of keywords pertaining to physical violence, sexual violence, and harmful practices (see Table 2 of the paper for the keywords selected; the streaming call is sketched below). 14 million tweets collected over 10 months.
- **Goal:** Analyze public opinion regarding GBV, highlighting the nature of tweeting practices by geographical location and gender.
- **Method:** Mixed methods to reveal patterns in the data. Quantitative: examine GBV content by geography, time, and gender. Qualitative: reveal attitudes and behaviors across different countries and between genders.
- **Conclusion:** (i) Spikes in GBV content reflect the influence of transient events, particularly those involving celebrities; (ii) gender, language, technology penetration, and education influence participation, with implications for the interpretation of quantitative measures; (iii) GBV content includes humor and metaphor (e.g., in sports) that reflect both attitude and behavior; (iv) content highlights the role of government, law enforcement, and business in the tolerance of GBV.
- **Remarks:** NA.

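A minimal sketch of consuming the Streaming API's filter/track endpoint, assuming the tweepy 3.x interface that matches this wiki's era; the credentials and keywords are placeholders, and Twitter's current API no longer exposes this interface.

```python
import tweepy

# Placeholder credentials from a registered Twitter app
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

class KeywordListener(tweepy.StreamListener):
    def on_status(self, status):
        # Each tweet matching the tracked keywords arrives here live
        print(status.text)

# 'track' is the filter/track keyword list the paper describes
stream = tweepy.Stream(auth=auth, listener=KeywordListener())
stream.filter(track=["placeholder keyword one", "placeholder keyword two"])
```
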
**9.**
- **Data set:** NA.
- **Goal:** Literature review of existing empirical work on whether deceptiveness leaves a stylistic trace.
- **Method:** NA.
- **Conclusion:** Deceptiveness as such leaves no content-invariant stylistic trace, and textual similarity measures provide a superior means of classifying texts as potentially deceptive.
- **Remarks:** While trolls and cyberbullies are not exclusively dishonest, there is major overlap between the purposes of a deceiver and a troll: both write content with a purpose other than its truthful communication.

**10.**
- **Data set:** 5,500 tweets searched using three terms.
- **Goal:** Identify sentiment.
- **Method:** Sentiment analysis (one common tweet-oriented tool is sketched below).
- **Conclusion:** 68.22% positive, 9.34% negative.
- **Remarks:** Table 1 of the paper contains a list of keywords drawn from a review of research papers.

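The row does not say which sentiment tool the paper used; as one hedged example, VADER (shipped with NLTK) is a rule-based scorer built for social media text.

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-off lexicon fetch

sia = SentimentIntensityAnalyzer()
# 'compound' is a normalised score in [-1, 1]; a common convention
# treats >= 0.05 as positive and <= -0.05 as negative
print(sia.polarity_scores("what a great and supportive response"))
```
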
**11.**
- **Data set:** 3,251 tweets in English.
- **Goal:** Build three different classifiers that allow the identification of misogynistic behaviour.
- **Method:** Logistic regression vs. naive Bayes vs. SVM vs. a blended model (NB & SVM; one way to blend the two is sketched below).
- **Conclusion:** The blended model with TF-IDF features achieved the highest F scores.
- **Remarks:** NA.

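The row does not specify how NB and SVM are blended; this sketch uses scikit-learn's soft-voting ensemble as one plausible reading, with placeholder data rather than the paper's corpus.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Placeholder data: 1 = misogynistic, 0 = not
tweets = [
    "toy misogynistic example one",
    "toy misogynistic example two",
    "toy neutral example one",
    "toy neutral example two",
]
labels = [1, 1, 0, 0]

# Soft voting averages the two models' predicted class probabilities
blend = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("vote", VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("svm", SVC(kernel="linear", probability=True)),
        ],
        voting="soft",
    )),
])
blend.fit(tweets, labels)
print(blend.predict(["another toy tweet"]))
```
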
**14.**
- **Data set:** 2,500 tweets.
- **Goal:** Study the user profiles and content of tweets in the context of online harassment.
- **Method:** RF and SVM with user profile, conversation, and content features.
- **Conclusion:** RF performed best.
- **Remarks:** Affect score for sentiment (Warriner resource); SMOTE for balanced data (sketched below).

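A minimal sketch of the SMOTE rebalancing mentioned in 14.'s remarks, assuming the imbalanced-learn package and synthetic placeholder data in place of the tweet features.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic 90/10 split standing in for the harassment class imbalance
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)
print("before:", Counter(y))

# SMOTE synthesises new minority-class points by interpolating
# between a minority sample and its nearest minority neighbours
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after:", Counter(y_res))
```
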
**15.**
- **Data set:** 4,000 labelled tweets.
- **Goal:** Develop ML models for the detection of misogyny.
- **Method:** Feature extraction: lexical (presence of hashtags, presence of URLs, swear word count, sexist slur presence, swear word presence, women-word presence; a toy extractor is sketched below), sentiment, and bag-of-words, fed to an ensemble of classifiers (EoC).
- **Remarks:** The EoC contains a logistic regression model, an SVM, a random forest, a gradient boosting model, and a stochastic gradient descent model.

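A toy version of the lexical feature extraction listed in 15.'s method; the word lists here are placeholders, where the paper draws on dedicated lexicons.

```python
import re

# Placeholder lists: the paper uses dedicated swear/slur lexicons
SWEAR_WORDS = {"swearword1", "swearword2"}
WOMEN_WORDS = {"woman", "women", "girl", "she", "her"}

def lexical_features(tweet: str) -> dict:
    """Flags and counts mirroring the lexical features listed in 15."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    swear_count = sum(token in SWEAR_WORDS for token in tokens)
    return {
        "has_hashtag": "#" in tweet,
        "has_url": bool(re.search(r"https?://\S+", tweet)),
        "swear_count": swear_count,
        "swear_present": swear_count > 0,
        "women_word_present": any(token in WOMEN_WORDS for token in tokens),
    }

print(lexical_features("toy tweet about women #tag https://t.co/x"))
```
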
**16.** An extension of 15. Imbalance will be an issue.

**17.**
- **Data set:** Comments found on Yahoo! Finance and News, labelled by employees.
- **Goal:** Identify abuse by trialing new features, and build a labelled dataset.
- **Method:** Supervised learning; features are n-grams, linguistic, syntactic, and embeddings.
- **Conclusion:** It can be done, but noise and the temporal evolution of language are issues.
- **Remarks:** Justifications for choosing these features: n-grams, to not miss words in noisy data; linguistic, other features such as word lengthening, containing URLs, etc.; syntactic, features that are essentially different types of tuples making use of the words, POS tags, and dependency relations; embeddings, temporal/distributional aspects.

**18.**
- **Goal:** Reproduce seven state-of-the-art hate speech detection models and show their limitations.
- **Method:** Reproduce research.
- **Conclusion:** The proposed detection techniques are brittle against adversaries. Adversarial training does not mitigate the attacks. Using character-level features makes the models systematically more attack-resistant than using word-level features.
- **Remarks:** Adversaries evade detection by making the text noisy (a toy perturbation is sketched below).

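A toy illustration of the kind of text-noising evasion 18. refers to; the paper's actual attacks (e.g., typos, word-boundary changes, appending innocuous words) are more systematic than this sketch.

```python
import random

def add_noise(text: str, seed: int = 0) -> str:
    """Perturb text so word-level features stop matching while a
    human reader can still recover the message."""
    rng = random.Random(seed)
    # Leetspeak-style look-alike substitutions
    subs = {"a": "4", "e": "3", "i": "1", "o": "0"}
    noisy = "".join(
        subs[c] if c in subs and rng.random() < 0.3 else c for c in text
    )
    # Randomly delete spaces to break word boundaries
    return "".join(c for c in noisy if c != " " or rng.random() >= 0.5)

print(add_noise("a toy example of an abusive message"))
```
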
**19.**
- **Data set:** Criteria for hate speech drawn from critical race theory, used to label 16k tweets.
- **Goal:** Analyze the impact of extra-linguistic features, as well as n-grams, for identifying hate speech.
- **Conclusion:** Hate speech comes predominantly from men; character n-grams of length up to 4 beat word n-grams (sketched below); adding gender information improves the F1 score; gender-based slurs are a feature.
- **Remarks:** Hate speech is a precursor to hate crime.

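A minimal sketch of the character n-gram (lengths 1–4) feature set that 19. found superior to word n-grams, assuming scikit-learn and placeholder data; 'char_wb' keeps n-grams within word boundaries, which is one common choice rather than necessarily the paper's.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Placeholder data: 1 = hate speech, 0 = not
tweets = [
    "toy hateful example one",
    "toy hateful example two",
    "toy neutral example one",
    "toy neutral example two",
]
labels = [1, 1, 0, 0]

# Character n-grams up to length 4, mirroring the paper's finding
model = Pipeline([
    ("chars", CountVectorizer(analyzer="char_wb", ngram_range=(1, 4))),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(tweets, labels)
print(model.predict(["another toy tweet"]))
```
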