Having worked with the wordCount application on Cloudera, I think this project could be wrapped in Python if we later want to approach an expanded version of this dataset. I will work year by year, because I think the consensus around DEI built up by accumulation.
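Since the project mentions a Cloudera-style wordCount, that part can indeed be wrapped in Python via Hadoop Streaming, whose contract is just (word, 1) pairs from a mapper and per-key sums from a reducer. A rough local sketch of that contract (the function names here are mine, not part of this project):

```python
from itertools import groupby

def mapper(lines):
    # Emit (word, 1) for every whitespace-separated token, like a streaming mapper.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Hadoop delivers mapper output sorted by key; sum the 1s per word.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

# Local check without a cluster:
counts = dict(reducer(mapper(["tax reform now", "tax cuts"])))
# counts["tax"] == 2
```

On a real cluster the two functions would run as separate streaming steps; locally the chained call simulates the same pipeline.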
I started with Lobbyists4America: it has a convenient file format (JSON) and is small enough for my internet connection.

First, I tried to upload it to one of the platforms we had leaned toward, such as Databricks or Mode Analytics, but I quickly got stuck on upload speed, so I took a smaller step with local analysis. Loading the file in R failed, so I went back to Python to clean and filter the data. I started from my own experience with the "text" field, and I am interested in word clouds, so I decided to go further and build a sub-dataset with created_at, text, word, and count.

I keep the text column for future analysis and backtracking; in this capstone I work only with the date and the word counts.
import pandas as pd
import numpy as np
import jinja2
from pandasql import sqldf

# Load the tweets (one JSON object per line).
file = pd.read_json(r".\Capstone-Data-science-UCDavis\Lobbyists4America\tweets.json", lines=True)
file.keys()

# Columns used below.
dateCreated = file['created_at']
text = file['text']
textCountList = []
for i in range(len(text)):
    # Count occurrences of each word in this tweet.
    wordCount = {}
    for j in text[i].split(" "):
        if j not in wordCount:
            wordCount[j] = 0
        wordCount[j] = wordCount[j] + 1
    # One output row per distinct word in the tweet.
    for word in wordCount:
        textDict = {}
        textDict['date_created'] = dateCreated[i]
        textDict['text'] = text[i]
        textDict['word'] = word
        textDict['count'] = wordCount[word]
        textCountList.append(textDict)
import csv

header = ['date_created', 'text', 'word', 'count']
with open(r".\Capstone-Data-science-UCDavis\Lobbyists4America\textCount.csv", "w", encoding='UTF8', newline="") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=header)
    writer.writeheader()
    for row in textCountList:
        writer.writerow(row)
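The loop above can also be expressed more compactly with pandas; a sketch, under the assumption that `file` has the `created_at` and `text` columns used earlier:

```python
import pandas as pd

def word_counts(frame):
    # One row per (tweet, word): split the text, explode to one token
    # per row, then count duplicate tokens within each tweet.
    exploded = (frame.assign(word=frame["text"].str.split(" "))
                     .explode("word"))
    counts = (exploded.groupby(["created_at", "text", "word"], sort=False)
                      .size()
                      .reset_index(name="count"))
    return counts.rename(columns={"created_at": "date_created"})

# counts = word_counts(file)
# counts.to_csv(r".\Capstone-Data-science-UCDavis\Lobbyists4America\textCount.csv", index=False)
```

This avoids the explicit dictionaries and list of rows, at the cost of holding the exploded frame in memory.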
Next, import the data into R and take a first pass over some keywords.
install.packages("tidyverse")
install.packages("skimr")
install.packages("janitor")

library(tidyverse)
library(skimr)
library(janitor)

# Load the CSV produced in Python.
textCount <- read_csv("textCount.csv")

# Total count per word across all tweets.
textCount2 <- textCount %>%
  group_by(word) %>%
  summarise(count = sum(count))

# Patterns for the t.co short links (to filter them out later).
pattern <- "^.*t.co.*$"
pattern2 <- "^.*https://t.co.*$"
As you will see, it returned no results for tylenol, paracetamol, or acetaminophen.
tylenol <- textCount2 %>%
  filter(word == "tylenol")
head(tylenol)

paracetamol <- textCount2 %>%
  filter(grepl("paracetamol", word))
head(paracetamol)

acetaminophen <- textCount2 %>%
  filter(grepl("acetaminophen", word))
head(acetaminophen)
I checked this result against Google Trends, and it is consistent. I care about this topic at the moment because I had seen some messages about overdoses during the pandemic. Beyond that, I got a count of only 1 for apple watch, versus 60 for 'iPhone'. It seems I need to think of other keywords to approach the lobbyists.
So I also tried another version that makes the data smaller.
import re

# Global word counts across all tweets.
textCountDict = {}
for i in range(len(text)):
    for j in text[i].split(" "):
        if j not in textCountDict:
            textCountDict[j] = 0
        textCountDict[j] = textCountDict[j] + 1

# Drop the t.co short links. Escape the dot, and copy the keys to a
# list so we can delete entries while iterating.
pattern = re.compile(r"t\.co")
for key in list(textCountDict.keys()):
    if pattern.search(key):
        del textCountDict[key]
There are a lot of shortened URLs generated by Twitter (as it was named at the time), and I had no good idea for handling them. We are in an economic depression, and I think it is a good time to analyze an old dataset.
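One way to handle the short links is to drop t.co tokens before counting at all; a regex sketch (note the escaped dot, so the pattern does not also match words like "taco"):

```python
import re

# Matches full t.co short links, with or without the scheme.
TCO = re.compile(r"https?://t\.co/\S+|^t\.co/")

def strip_short_urls(tokens):
    # Keep only tokens that are not Twitter-shortened URLs.
    return [t for t in tokens if not TCO.search(t)]

# strip_short_urls(["tax", "https://t.co/abc123", "reform"]) -> ["tax", "reform"]
```

Applied per tweet before the counting loop, this keeps the URL noise out of the word counts instead of deleting keys afterwards.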
checkWord <- function(string){
  result <- textCount2 %>%
    filter(word == string)
  head(result)
}
Affirmative action was already banned in some states prior to the 2023 U.S. Supreme Court ruling. Do you wonder how much DEI shows up in this dataset?
tylenol <- textCount2 %>%
  filter(word %in% c("equity", "Equity",
                     "Density", "density",
                     "Inclusion", "inclusion",
                     "affirmative")) %>%
  arrange(count)
head(tylenol)
checkWord("gas")
checkWord("price")
tylenol <- textCount2 %>%
  filter(word %in% c("Afghanistan", "Pakistan", "Yemen",
                     "Libya", "Somalia", "Iraq", "Syria",
                     "Israel", "Russian", "Ukraine")) %>%
  arrange(count)
head(tylenol, 10)
Awesome! We found new traffic. Okay, let's summarize this week.
file = pd.read_json(r".\Capstone-Data-science-UCDavis\Lobbyists4America\tweets.json", lines=True)
date = file['created_at']
text = file['text']
retweeted = file['retweeted']
# date.index("2010-04-05 15:42:09")  # did not work: a Series' .index is its RangeIndex attribute, not a search method
I'm trying to get the index and .loc it from file, then wrap it in a function, but I haven't finished. Index-and-match is a better trick than building a lot of intermediate datasets just to take one insight.
file.query("created_at == '2010-04-05 15:42:09'")['retweeted'] #false
file.query("created_at == '2010-04-05 15:42:09'")['retweet_count'] #4
file.query("created_at == '2010-04-05 15:42:09'")['user_id'] #19418459
user_id 19418459 has some great tweets; I will keep this ID for further analysis.
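The index-and-match idea can be wrapped in a small helper; a sketch (the function name `lookup_tweet` is mine, and it assumes the `file` DataFrame above):

```python
def lookup_tweet(frame, created_at, column):
    # Return the value(s) of `column` for rows matching a timestamp,
    # without building intermediate datasets.
    match = frame.loc[frame["created_at"] == created_at, column]
    return match.iloc[0] if len(match) == 1 else match

# lookup_tweet(file, "2010-04-05 15:42:09", "retweet_count")  # 4 in the run above
```

This is equivalent to the three `query` calls, but parameterized so each lookup is one line.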
tylenol <- textCount %>%
  filter(word %in% c("inequity", "unfair", "unequal",
                     "unbalanced", "unjust")) %>%
  filter(date_created > "2013-01-01") %>%
  summarise(count = sum(count))
head(tylenol, 10)

tylenol <- textCount %>%
  filter(word %in% c("inequity", "unfair", "unequal",
                     "unbalanced", "unjust")) %>%
  filter(date_created < "2013-01-01") %>%
  summarise(count = sum(count))
head(tylenol, 10)
The total is 764; splitting at 2013-01-01 gives 62 before and 702 after.
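The same before/after split can be double-checked from the Python side; a sketch assuming the textCount.csv produced earlier, with `date_created` stored as ISO-format strings so lexicographic comparison against a cutoff works:

```python
import pandas as pd

def split_count(counts, words, cutoff="2013-01-01"):
    # Sum `count` for the given words before and after a cutoff date.
    subset = counts[counts["word"].isin(words)]
    before = subset.loc[subset["date_created"] < cutoff, "count"].sum()
    after = subset.loc[subset["date_created"] > cutoff, "count"].sum()
    return before, after

# counts = pd.read_csv(r".\Capstone-Data-science-UCDavis\Lobbyists4America\textCount.csv")
# split_count(counts, ["inequity", "unfair", "unequal", "unbalanced", "unjust"])
```

Parsing the dates with `pd.to_datetime` would be the more robust choice if the stored format ever varies.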
tylenol <- textCount %>%
  filter(word %in% c("equity", "Equity", "Density", "density",
                     "Inclusion", "inclusion", "affirmative", "fair")) %>%
  filter(date_created < "2013-01-01") %>%
  summarise(count = sum(count))
head(tylenol, 10)

tylenol <- textCount %>%
  filter(word %in% c("equity", "Equity", "Density", "density",
                     "Inclusion", "inclusion", "affirmative", "fair")) %>%
  filter(date_created > "2013-01-01") %>%
  summarise(count = sum(count))
head(tylenol, 10)
For this group the split is 345 before 2013 to 3307 after.
tylenol <- textCount %>%
  filter(date_created > "2013-01-01") %>%
  filter(word == "unfair") %>%
  summarise(count = sum(count))
head(tylenol, 10)
fair plus unfair: 368 before 2013 vs. 3332 after; fair alone: 310 vs. 2297; unfair alone: 58 vs. 535.
# Year boundaries 2007-01-01 .. 2018-01-01 to iterate over.
yearList <- paste0(2007:2018, "-1-1")
Use this function to count; I will wrap it so it only takes the cutoff year as an argument.
unfairGroupFunction <- function(yearPoint){
  textCount %>%
    filter(word %in% c("inequity", "unfair", "unequal",
                       "unbalanced", "unjust")) %>%
    filter(date_created < yearPoint) %>%
    summarise(count = sum(count))
}
Let's try them.
unfairGroupCount <- c()
for (i in yearList){
  tylenol <- unfairGroupFunction(i)
  unfairGroupCount <- append(unfairGroupCount, as.integer(tylenol$count))
}
I will check the fair group by changing the filter.
fairGroupFunction <- function(yearPoint){
  textCount %>%
    filter(word %in% c("equity", "Equity", "Density", "density",
                       "Inclusion", "inclusion", "affirmative", "fair")) %>%
    filter(date_created < yearPoint) %>%
    summarise(count = sum(count))
}
fairGroupCount <- c()
for (i in yearList){
  tylenol <- fairGroupFunction(i)
  fairGroupCount <- append(fairGroupCount, as.integer(tylenol$count))
}
GroupTable <- list(year=yearList, count1 = fairGroupCount, count2 = unfairGroupCount)
GroupTable <- as.data.frame(GroupTable)
I combined the series in Excel and made a plot there. There is a gap between them, but both increase strongly together.
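Instead of the Excel round-trip, the two series can be plotted directly; a matplotlib sketch, assuming `yearList`, `fairGroupCount`, and `unfairGroupCount` are exported to Python lists (the function name and output file are mine):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so this also runs in scripts
import matplotlib.pyplot as plt

def plot_groups(years, fair, unfair, out="fair_vs_unfair.png"):
    # Two cumulative word-count curves on a shared year axis.
    fig, ax = plt.subplots()
    ax.plot(years, fair, marker="o", label="fair group")
    ax.plot(years, unfair, marker="o", label="unfair group")
    ax.set_xlabel("year (cumulative through Jan 1)")
    ax.set_ylabel("word count")
    ax.legend()
    fig.savefig(out)
    plt.close(fig)
    return out
```

The same figure could of course be drawn in R with ggplot2 from `GroupTable` without leaving the notebook.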
Let’s try with just fair or unfair.
unfairFunction <- function(yearPoint){
  textCount %>%
    filter(word == "unfair") %>%
    filter(date_created < yearPoint) %>%
    summarise(count = sum(count))
}

unfairCount <- c()
for (i in yearList){
  tylenol <- unfairFunction(i)
  unfairCount <- append(unfairCount, as.integer(tylenol$count))
}

fairFunction <- function(yearPoint){
  textCount %>%
    filter(word == "fair") %>%
    filter(date_created < yearPoint) %>%
    summarise(count = sum(count))
}

fairCount <- c()
for (i in yearList){
  tylenol <- fairFunction(i)
  fairCount <- append(fairCount, as.integer(tylenol$count))
}
GroupTable <- list(year=yearList, fairCount = fairCount, unfairCount = unfairCount)
GroupTable <- as.data.frame(GroupTable)
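Because each entry counts everything before Jan 1 of that year, the series are cumulative; the activity within a single year is the difference between consecutive entries, which is what would localize a jump like the one around 2013. A small sketch of that step (the numbers in the comment are toy values, not from the dataset):

```python
def yearly_increments(cumulative):
    # Per-year counts from a cumulative (running-total) series.
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]

# yearly_increments([62, 150, 702]) -> [88, 552]
```

Applying this to `fairGroupCount` and `unfairGroupCount` before plotting would show per-year spikes instead of an ever-rising curve.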
Looking at the result, I wonder: what happened in 2013 and 2017? Will we see the same in 2019 and 2022?