R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

Update: 1-Aug-2024

While working with the wordCount application on Cloudera, I think this project could be wrapped in Python if we want to approach an expanded version of this dataset.

Week 4:

Year by year, because I think that DEI consensus accumulates over time.

WEEK 1:

  1. Which client/dataset did you select and why?

I started with Lobbyists4America: it has a convenient file format (JSON) and is small enough for my internet connection.

  1. Describe the steps you took to import and clean the data.

First, I tried to upload it to one of the platforms we learned, like Databricks or Mode Analytics, but I quickly got stuck on upload speed, so I took a smaller step with local analysis. I failed to load the file in R, so I went back to Python to clean and filter the data. I started from my experience: the "text" field. I am interested in word clouds, so I decided to go further and build a sub-dataset with created_at, text, word, and count.

I keep the text column for future analysis and backtracking. In this capstone, I only work with the date and word counts.

  1. Perform initial exploration of data and provide some screenshots or display some stats of the data you are looking at.

I access the data with Python:
import pandas as pd

# Load the tweets (one JSON object per line)
file = pd.read_json(r".\Capstone-Data-science-UCDavis\Lobbyists4America\tweets.json", lines=True)

file.keys()

text = file['text']
dateCreated = file['created_at']

# One row per (tweet, word) pair: date, full text, word, and its count
textCountList = []
for i in range(len(text)):
    wordCount = {}
    for j in text[i].split(" "):
        if j not in wordCount:
            wordCount[j] = 0
        wordCount[j] = wordCount[j] + 1
    for word in wordCount:
        textDict = {}
        textDict["date_created"] = dateCreated[i]
        textDict['text'] = text[i]
        textDict['word'] = word
        textDict['count'] = wordCount[word]
        textCountList.append(textDict)

import csv

header = ['date_created','text', 'word', 'count']

with open(r".\Capstone-Data-science-UCDavis\Lobbyists4America\textCount.csv", "w",encoding='UTF8', newline="") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=header)
    writer.writeheader()
    for i in textCountList:
        writer.writerow(i)
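For reference, the per-tweet counting loop above is essentially what `collections.Counter` does. A minimal sketch, with toy tweets standing in for the real `text` and `dateCreated` columns pulled from tweets.json:

```python
from collections import Counter

# Toy data standing in for the real columns from tweets.json
dateCreated = ["2010-04-05 15:42:09", "2010-04-06 10:00:00"]
text = ["fair wages now", "fair fair unfair"]

textCountList = []
for i in range(len(text)):
    # Counter does the per-tweet word tally in one call
    wordCount = Counter(text[i].split(" "))
    for word, count in wordCount.items():
        textCountList.append({
            "date_created": dateCreated[i],
            "text": text[i],
            "word": word,
            "count": count,
        })
```

This produces the same long-format rows as the explicit loop, just more compactly.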

Import the data into R and take a first pass over some keywords:

install.packages("tidyverse")
install.packages("skimr")
install.packages("janitor")

library(tidyverse)
library(skimr)
library(janitor)

# Read the word-count CSV produced in Python above
textCount <- read_csv("./Capstone-Data-science-UCDavis/Lobbyists4America/textCount.csv")

# Total count per word across all tweets
textCount2 <- textCount %>% 
    group_by(word) %>% 
    summarise(count = sum(count))

# Patterns for Twitter's t.co shortened URLs (to filter out later)
pattern <- "^.*t.co.*$"

pattern2 <- "^.*https://t.co.*$"

As you will see, it returned no results for tylenol, paracetamol, or acetaminophen:

tylenol <- textCount2 %>% 
  filter(word == "tylenol")
head(tylenol)
paracetamol <- textCount2 %>% 
  filter(grepl("paracetamol", word)) 
head(paracetamol)
acetaminophen <- textCount2 %>% 
  filter(grepl("acetaminophen", word)) 
head(acetaminophen)

I checked this result against Google Trends, and it is consistent. I care about this because I saw some messages about overdoses during the pandemic. Even so, I got a count of only 1 for 'apple watch' versus 60 for 'iPhone'. It seems I need to think of other keywords to approach lobbyists.

So I also tried another version to make the data smaller:

import re

# Overall word counts across all tweets
textCountDict = {}
for i in range(len(text)):
    for j in text[i].split(" "):
        if j not in textCountDict:
            textCountDict[j] = 0
        textCountDict[j] = textCountDict[j] + 1

# Drop Twitter's t.co shortened URLs; copy the keys first, because
# deleting from a dict while iterating over its live key view raises an error
pattern = re.compile("t.co")
testKey = list(textCountDict.keys())
for key in testKey:
    if pattern.search(key):
        del textCountDict[key]

The tweets contain a lot of shortened URLs made by Twitter (its name at that time), and I have no idea yet how to approach them. We are in an economic depression, and I think it is a good time to analyze an old one.
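An alternative way to drop the t.co short links is a dict comprehension, which avoids mutating the dict during iteration entirely. A sketch, with toy counts standing in for the real `textCountDict` (note the escaped dot, so the pattern does not also match words like "taco"):

```python
import re

# Toy counts standing in for the real textCountDict
textCountDict = {"fair": 3, "https://t.co/abc123": 1, "unfair": 2, "t.co/xyz": 1}

pattern = re.compile(r"t\.co")  # escape the dot so it is not a wildcard
textCountDict = {
    word: count
    for word, count in textCountDict.items()
    if not pattern.search(word)
}
```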

  1. Create an ERD or proposed ERD to show the relationships of the data you are exploring.

Step 2: Develop Project Proposal

Description

# Helper: look up the total count for a single word
checkWord <- function(string){
  result <- textCount2 %>% 
    filter(word == string)
  head(result)
}
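The same helper idea can be sketched in Python against the long-format rows built earlier; `check_word` below is a hypothetical name, not code from this project, and the toy rows stand in for the real `textCountList`:

```python
# Toy long-format rows standing in for the real textCountList
textCountList = [
    {"word": "gas", "count": 3},
    {"word": "price", "count": 2},
    {"word": "gas", "count": 1},
]

def check_word(word):
    # Total count for a single word across all rows (hypothetical helper)
    return sum(r["count"] for r in textCountList if r["word"] == word)
```

A word that never appears simply returns 0 instead of an empty result.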

In some states, affirmative action was already banned prior to the 2023 U.S. Supreme Court ruling. Do you wonder how much DEI shows up in this dataset?

tylenol <- textCount2 %>% 
  filter(word %in% c("equity", "Equity", "Density", "density",
                     "Inclusion", "inclusion", "affirmative")) %>% 
  arrange(count)

head(tylenol)
checkWord("gas")
checkWord("price")
tylenol <- textCount2 %>% 
  filter(word %in% c("Afghanistan", "Pakistan", "Yemen", "Libya",
                     "Somalia", "Iraq", "Syria", "Israel",
                     "Russian", "Ukraine")) %>% 
  arrange(count)

head(tylenol, 10)

Approach

Here is a definition of a meme:
“A meme is a unit of cultural information that is spread by imitation. It can be an idea, behavior, style, or usage that spreads from person to person within a culture and often carries symbolic meaning representing a particular phenomenon or theme. A meme acts as a unit for carrying cultural ideas, symbols, or practices, that can be transmitted from one mind to another through writing, speech, gestures, rituals, or other imitable phenomena with a mimicked theme. Supporters of the concept regard memes as cultural analogues to genes in that they self-replicate, mutate, and respond to selective pressures.” source: en.wikipedia.org/wiki/Meme

WEEK 2:

Awesome! We found new traffic. Okay, let's summarize this week.

  1. I tried to go further with antonyms and synonyms. I am not very familiar with their catchphrases, but I got new statistics. Fair, wages, job, and worker go together. WHY? Because these tweets are from lobbyists, arguing about WHAT kind of fair is better, pushing affirmative action.
  2. From that, I can suggest 3 key points:
    • Get the index with df(date), then pandas .loc[] to see an example.
    • If a tweet was retweeted, go further into the profile or the tweeting user.
    • Look for other relationships.
  3. file = pd.read_json(r".\Capstone-Data-science-UCDavis\Lobbyists4America\tweets.json", lines=True)
    date = file['created_at']
    text = file['text']
    retweeted = file['retweeted']
    date.index("2010-04-05 15:42:09") # fails: a pandas Series has no list-style .index() method

    I am trying to get the index and .loc it from the file, then wrap that in a function, but I have not finished. Index-and-match is a better trick than building a lot of sub-datasets just to extract one insight.

  4. After this session, I need to pick up more idioms and expressions.
  5. Donald Trump has a great chance at this time; that is what I read from my current news sources, and it supports this approach.
  6. So, a question: will work come back to the USA, and what does that mean? I found this code to more easily get results for the previous question. As I said, I could not import the JSON into RStudio or Mode Analytics, so I combine Python and R Markdown.

    file.query("created_at == '2010-04-05 15:42:09'")['retweeted'] #false
    file.query("created_at == '2010-04-05 15:42:09'")['retweet_count'] #4
    file.query("created_at == '2010-04-05 15:42:09'")['user_id'] #19418459

    user_id 19418459 has some great tweets; I will keep it for further analysis.
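The lookup trick from point 3 above can be sketched without the failing `.index()` call: the pandas way is a boolean mask, e.g. `file.loc[file['created_at'] == '2010-04-05 15:42:09']`, and the same idea works in plain Python. Toy lists below stand in for the real `created_at` and `retweet_count` columns (using the retweet count of 4 from the query above):

```python
# Toy columns standing in for file['created_at'] and file['retweet_count']
date = ["2010-04-04 09:00:00", "2010-04-05 15:42:09", "2010-04-06 10:00:00"]
retweet_count = [0, 4, 1]

# Boolean-mask style lookup: positions where the date matches
matches = [i for i, d in enumerate(date) if d == "2010-04-05 15:42:09"]
counts = [retweet_count[i] for i in matches]
```

This finds every matching position rather than just the first, which matters if a timestamp repeats.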

WEEK 3:

tylenol <- textCount %>% 
  filter(word %in% c("inequity", "unfair", "unequal", "unbalanced", "unjust")) %>%
  filter(date_created > "2013-01-01") %>% 
  summarise(count = sum(count))

head(tylenol, 10)

tylenol <- textCount %>% 
  filter(word %in% c("inequity", "unfair", "unequal", "unbalanced", "unjust")) %>%
  filter(date_created < "2013-01-01") %>% 
  summarise(count = sum(count))

head(tylenol, 10)

Total 764; split at the 2013 boundary: 62 before -> 702 after.

tylenol <- textCount %>% 
  filter(word %in% c("equity", "Equity", "Density", "density",
                     "Inclusion", "inclusion", "affirmative", "fair")) %>%
  filter(date_created < "2013-01-01") %>% 
  summarise(count = sum(count))

head(tylenol, 10)

tylenol <- textCount %>% 
  filter(word %in% c("equity", "Equity", "Density", "density",
                     "Inclusion", "inclusion", "affirmative", "fair")) %>%
  filter(date_created > "2013-01-01") %>% 
  summarise(count = sum(count))

head(tylenol, 10)

345 before 2013 -> 3307 after.

tylenol <- textCount %>% 
  filter(date_created > "2013-01-01") %>% 
  filter(word == "unfair") %>% 
  summarise(count = sum(count))

head(tylenol, 10)

Before 2013 -> after: fair and unfair group 368 -> 3332; fair 310 -> 2297; unfair 58 -> 535.

(Figure: fair-unfair group counts, split at the 2013 point.)

WEEK 4:

# Create yearList of year boundaries to iterate over (2007-2018)
yearList <- paste0(2007:2018, "-1-1")
unfairGroupCount <- c()

I use this function to count; it takes the year boundary as an argument:

unfairGroupFunction <- function(yearPoint){
  textCount %>% 
    filter(word %in% c("inequity", "unfair", "unequal", "unbalanced", "unjust")) %>%
    filter(date_created < yearPoint) %>% 
    summarise(count = sum(count))
}

Let's try it:

unfairGroupCount <- c()
for (i in yearList){
  tylenol <- unfairGroupFunction(i)
  unfairGroupCount <- append(unfairGroupCount, as.integer(tylenol$count))
}

I check the fair group by changing the filter:

fairGroupFunction <- function(yearPoint){
  textCount %>% 
    filter(word %in% c("equity", "Equity", "Density", "density",
                       "Inclusion", "inclusion", "affirmative", "fair")) %>%
    filter(date_created < yearPoint) %>% 
    summarise(count = sum(count))
}
fairGroupCount <- c()
for (i in yearList){
  tylenol <- fairGroupFunction(i)
  fairGroupCount <- append(fairGroupCount, as.integer(tylenol$count))
}
GroupTable <- list(year=yearList, count1 = fairGroupCount, count2 = unfairGroupCount)
GroupTable <- as.data.frame(GroupTable)
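The year-boundary loops above amount to a cumulative filter-and-sum: for each boundary, total the group's counts strictly before that date. The same logic as a plain-Python sketch, with toy rows standing in for the real textCount data (note the zero-padded dates, which keep plain string comparison chronological):

```python
# Toy (date_created, word, count) rows standing in for the real textCount
rows = [
    ("2008-06-01", "unfair", 2),
    ("2012-03-15", "unjust", 1),
    ("2014-07-20", "unfair", 5),
]
unfairGroup = {"inequity", "unfair", "unequal", "unbalanced", "unjust"}

# Zero-padded boundaries so string comparison matches date order
yearList = [f"{y}-01-01" for y in range(2007, 2019)]

# For each boundary, cumulative count of group words before that date
unfairGroupCount = [
    sum(c for d, w, c in rows if w in unfairGroup and d < boundary)
    for boundary in yearList
]
```

Because each boundary re-sums everything before it, the series is monotone non-decreasing, which is what makes the year-over-year "accumulation" comparison meaningful.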

I combined them in Excel and made a plot there. There is a gap between the two series, but they show a strong increase together.

(Figures: group counts side by side; group counts together.)

Let’s try with just fair or unfair.

unfairFunction <- function(yearPoint){
  textCount %>% 
    filter(word == "unfair") %>%
    filter(date_created < yearPoint) %>% 
    summarise(count = sum(count))
}

unfairCount <- c()
for (i in yearList){
  tylenol <- unfairFunction(i)
  unfairCount <- append(unfairCount, as.integer(tylenol$count))
}


fairFunction <- function(yearPoint){
  textCount %>% 
    filter(word == "fair") %>%
    filter(date_created < yearPoint) %>% 
    summarise(count = sum(count))
}
fairCount <- c()
for (i in yearList){
  tylenol <- fairFunction(i)
  fairCount <- append(fairCount, as.integer(tylenol$count))
}


GroupTable <- list(year=yearList, fairCount = fairCount, unfairCount = unfairCount)
GroupTable <- as.data.frame(GroupTable)
(Figures: fair vs unfair side by side; fair vs unfair together.)

Looking at these plots, I wonder: what happened in 2013 and 2017? Will we see the same in 2019 and 2022?