Having worked with the wordCount application on Cloudera, I think this project could be wrapped in Python if we later want to approach an expanded version of this dataset. I will work year by year, because I think the consensus around DEI built up by accumulation.
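Since the project mentions a Cloudera-style wordCount, that part can indeed be wrapped in Python via Hadoop Streaming, whose contract is just (word, 1) pairs from a mapper and per-key sums from a reducer. A rough local sketch of that contract (the function names here are mine, not part of this project):

```python
from itertools import groupby

def mapper(lines):
    # Emit (word, 1) for every whitespace-separated token, like a streaming mapper.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Hadoop delivers mapper output sorted by key; sum the 1s per word.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

# Local check without a cluster:
counts = dict(reducer(mapper(["tax reform now", "tax cuts"])))
# counts["tax"] == 2
```

On a real cluster the two functions would run as separate streaming steps; locally the chained call simulates the same pipeline.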
I started with Lobbyists4America: it has a convenient file format (JSON) and is small enough for my internet connection.

First, I tried to upload it to one of the platforms we had leaned toward, such as Databricks or Mode Analytics, but I quickly got stuck on upload speed, so I took a smaller step with local analysis. Loading the file in R failed, so I went back to Python to clean and filter the data. I started from my own experience with the "text" field, and I am interested in word clouds, so I decided to go further and build a sub-dataset with created_at, text, word, and count.

I keep the text column for future analysis and backtracking; in this capstone I work only with the date and the word counts.
import pandas as pd
import numpy as np
import jinja2
from pandasql import sqldf

# Load the tweets (one JSON object per line).
file = pd.read_json(r".\Capstone-Data-science-UCDavis\Lobbyists4America\tweets.json", lines=True)
file.keys()

# Columns used below.
dateCreated = file['created_at']
text = file['text']
textCountList = []
for i in range(len(text)):
    # Count occurrences of each word in this tweet.
    wordCount = {}
    for j in text[i].split(" "):
        if j not in wordCount:
            wordCount[j] = 0
        wordCount[j] = wordCount[j] + 1
    # One output row per distinct word in the tweet.
    for word in wordCount:
        textDict = {}
        textDict['date_created'] = dateCreated[i]
        textDict['text'] = text[i]
        textDict['word'] = word
        textDict['count'] = wordCount[word]
        textCountList.append(textDict)
import csv

header = ['date_created', 'text', 'word', 'count']
with open(r".\Capstone-Data-science-UCDavis\Lobbyists4America\textCount.csv", "w", encoding='UTF8', newline="") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=header)
    writer.writeheader()
    for row in textCountList:
        writer.writerow(row)
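The loop above can also be expressed more compactly with pandas; a sketch, under the assumption that `file` has the `created_at` and `text` columns used earlier:

```python
import pandas as pd

def word_counts(frame):
    # One row per (tweet, word): split the text, explode to one token
    # per row, then count duplicate tokens within each tweet.
    exploded = (frame.assign(word=frame["text"].str.split(" "))
                     .explode("word"))
    counts = (exploded.groupby(["created_at", "text", "word"], sort=False)
                      .size()
                      .reset_index(name="count"))
    return counts.rename(columns={"created_at": "date_created"})

# counts = word_counts(file)
# counts.to_csv(r".\Capstone-Data-science-UCDavis\Lobbyists4America\textCount.csv", index=False)
```

This avoids the explicit dictionaries and list of rows, at the cost of holding the exploded frame in memory.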
Next, import the data into R and take a first pass over some keywords.
install.packages("tidyverse")
install.packages("skimr")
install.packages("janitor")

library(tidyverse)
library(skimr)
library(janitor)

# Load the CSV produced in Python.
textCount <- read_csv("textCount.csv")

# Total count per word across all tweets.
textCount2 <- textCount %>%
  group_by(word) %>%
  summarise(count = sum(count))

# Patterns for the t.co short links (to filter them out later).
pattern <- "^.*t.co.*$"
pattern2 <- "^.*https://t.co.*$"
As you will see, it returned no results for tylenol, paracetamol, or acetaminophen.
tylenol <- textCount2 %>%
  filter(word == "tylenol")
head(tylenol)

paracetamol <- textCount2 %>%
  filter(grepl("paracetamol", word))
head(paracetamol)

acetaminophen <- textCount2 %>%
  filter(grepl("acetaminophen", word))
head(acetaminophen)
I checked this result against Google Trends, and it is consistent. I care about this topic at the moment because I had seen some messages about overdoses during the pandemic. Beyond that, I got a count of only 1 for apple watch, versus 60 for 'iPhone'. It seems I need to think of other keywords to approach the lobbyists.
So I also tried another version that makes the data smaller.
import re

# Global word counts across all tweets.
textCountDict = {}
for i in range(len(text)):
    for j in text[i].split(" "):
        if j not in textCountDict:
            textCountDict[j] = 0
        textCountDict[j] = textCountDict[j] + 1

# Drop the t.co short links. Escape the dot, and copy the keys to a
# list so we can delete entries while iterating.
pattern = re.compile(r"t\.co")
for key in list(textCountDict.keys()):
    if pattern.search(key):
        del textCountDict[key]
There are a lot of shortened URLs generated by Twitter (as it was named at the time), and I had no good idea for handling them. We are in an economic depression, and I think it is a good time to analyze an old dataset.
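One way to handle the short links is to drop t.co tokens before counting at all; a regex sketch (note the escaped dot, so the pattern does not also match words like "taco"):

```python
import re

# Matches full t.co short links, with or without the scheme.
TCO = re.compile(r"https?://t\.co/\S+|^t\.co/")

def strip_short_urls(tokens):
    # Keep only tokens that are not Twitter-shortened URLs.
    return [t for t in tokens if not TCO.search(t)]

# strip_short_urls(["tax", "https://t.co/abc123", "reform"]) -> ["tax", "reform"]
```

Applied per tweet before the counting loop, this keeps the URL noise out of the word counts instead of deleting keys afterwards.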
checkWord <- function(string){
  result <- textCount2 %>%
    filter(word == string)
  head(result)
}
Affirmative action was already banned in some states prior to the 2023 U.S. Supreme Court ruling. Do you wonder how much DEI shows up in this dataset?
tylenol <- textCount2 %>%
  filter(word %in% c("equity", "Equity",
                     "Density", "density",
                     "Inclusion", "inclusion",
                     "affirmative")) %>%
  arrange(count)
head(tylenol)
checkWord("gas")
checkWord("price")
tylenol <- textCount2 %>%
  filter(word %in% c("Afghanistan", "Pakistan", "Yemen",
                     "Libya", "Somalia", "Iraq", "Syria",
                     "Israel", "Russian", "Ukraine")) %>%
  arrange(count)
head(tylenol, 10)
Awesome! We found new traffic. Okay, let's summarize this week.
file = pd.read_json(r".\Capstone-Data-science-UCDavis\Lobbyists4America\tweets.json", lines=True)
date = file['created_at']
text = file['text']
retweeted = file['retweeted']
# date.index("2010-04-05 15:42:09")  # did not work: a Series' .index is its RangeIndex attribute, not a search method
I'm trying to get the index and .loc it from file, then wrap it in a function, but I haven't finished. Index-and-match is a better trick than building a lot of intermediate datasets just to take one insight.
file.query("created_at == '2010-04-05 15:42:09'")['retweeted'] #false
file.query("created_at == '2010-04-05 15:42:09'")['retweet_count'] #4
file.query("created_at == '2010-04-05 15:42:09'")['user_id'] #19418459
user_id 19418459 has some great tweets; I will keep this ID for further analysis.
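The index-and-match idea can be wrapped in a small helper; a sketch (the function name `lookup_tweet` is mine, and it assumes the `file` DataFrame above):

```python
def lookup_tweet(frame, created_at, column):
    # Return the value(s) of `column` for rows matching a timestamp,
    # without building intermediate datasets.
    match = frame.loc[frame["created_at"] == created_at, column]
    return match.iloc[0] if len(match) == 1 else match

# lookup_tweet(file, "2010-04-05 15:42:09", "retweet_count")  # 4 in the run above
```

This is equivalent to the three `query` calls, but parameterized so each lookup is one line.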
tylenol <- textCount %>%
  filter(word %in% c("inequity", "unfair", "unequal",
                     "unbalanced", "unjust")) %>%
  filter(date_created > "2013-01-01") %>%
  summarise(count = sum(count))
head(tylenol, 10)

tylenol <- textCount %>%
  filter(word %in% c("inequity", "unfair", "unequal",
                     "unbalanced", "unjust")) %>%
  filter(date_created < "2013-01-01") %>%
  summarise(count = sum(count))
head(tylenol, 10)
The total is 764; splitting at 2013-01-01 gives 62 before and 702 after.
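The same before/after split can be double-checked from the Python side; a sketch assuming the textCount.csv produced earlier, with `date_created` stored as ISO-format strings so lexicographic comparison against a cutoff works:

```python
import pandas as pd

def split_count(counts, words, cutoff="2013-01-01"):
    # Sum `count` for the given words before and after a cutoff date.
    subset = counts[counts["word"].isin(words)]
    before = subset.loc[subset["date_created"] < cutoff, "count"].sum()
    after = subset.loc[subset["date_created"] > cutoff, "count"].sum()
    return before, after

# counts = pd.read_csv(r".\Capstone-Data-science-UCDavis\Lobbyists4America\textCount.csv")
# split_count(counts, ["inequity", "unfair", "unequal", "unbalanced", "unjust"])
```

Parsing the dates with `pd.to_datetime` would be the more robust choice if the stored format ever varies.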
tylenol <- textCount %>%
  filter(word %in% c("equity", "Equity", "Density", "density",
                     "Inclusion", "inclusion", "affirmative", "fair")) %>%
  filter(date_created < "2013-01-01") %>%
  summarise(count = sum(count))
head(tylenol, 10)

tylenol <- textCount %>%
  filter(word %in% c("equity", "Equity", "Density", "density",
                     "Inclusion", "inclusion", "affirmative", "fair")) %>%
  filter(date_created > "2013-01-01") %>%
  summarise(count = sum(count))
head(tylenol, 10)
For this group the split is 345 before 2013 to 3307 after.
tylenol <- textCount %>%
  filter(date_created > "2013-01-01") %>%
  filter(word == "unfair") %>%
  summarise(count = sum(count))
head(tylenol, 10)
fair plus unfair: 368 before 2013 vs. 3332 after; fair alone: 310 vs. 2297; unfair alone: 58 vs. 535.
# Year boundaries 2007-01-01 .. 2018-01-01 to iterate over.
yearList <- paste0(2007:2018, "-1-1")
Use this function to count; I will wrap it so it only takes the cutoff year as an argument.
unfairGroupFunction <- function(yearPoint){
  textCount %>%
    filter(word %in% c("inequity", "unfair", "unequal",
                       "unbalanced", "unjust")) %>%
    filter(date_created < yearPoint) %>%
    summarise(count = sum(count))
}
Let's try them.
unfairGroupCount <- c()
for (i in yearList){
  tylenol <- unfairGroupFunction(i)
  unfairGroupCount <- append(unfairGroupCount, as.integer(tylenol$count))
}
I will check the fair group by changing the filter.
fairGroupFunction <- function(yearPoint){
  textCount %>%
    filter(word %in% c("equity", "Equity", "Density", "density",
                       "Inclusion", "inclusion", "affirmative", "fair")) %>%
    filter(date_created < yearPoint) %>%
    summarise(count = sum(count))
}
fairGroupCount <- c()
for (i in yearList){
  tylenol <- fairGroupFunction(i)
  fairGroupCount <- append(fairGroupCount, as.integer(tylenol$count))
}
GroupTable <- list(year=yearList, count1 = fairGroupCount, count2 = unfairGroupCount)
GroupTable <- as.data.frame(GroupTable)
I combined the series in Excel and made a plot there. There is a gap between them, but both increase strongly together.
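Instead of the Excel round-trip, the two series can be plotted directly; a matplotlib sketch, assuming `yearList`, `fairGroupCount`, and `unfairGroupCount` are exported to Python lists (the function name and output file are mine):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so this also runs in scripts
import matplotlib.pyplot as plt

def plot_groups(years, fair, unfair, out="fair_vs_unfair.png"):
    # Two cumulative word-count curves on a shared year axis.
    fig, ax = plt.subplots()
    ax.plot(years, fair, marker="o", label="fair group")
    ax.plot(years, unfair, marker="o", label="unfair group")
    ax.set_xlabel("year (cumulative through Jan 1)")
    ax.set_ylabel("word count")
    ax.legend()
    fig.savefig(out)
    plt.close(fig)
    return out
```

The same figure could of course be drawn in R with ggplot2 from `GroupTable` without leaving the notebook.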
Let’s try with just fair or unfair.
unfairFunction <- function(yearPoint){
  textCount %>%
    filter(word == "unfair") %>%
    filter(date_created < yearPoint) %>%
    summarise(count = sum(count))
}

unfairCount <- c()
for (i in yearList){
  tylenol <- unfairFunction(i)
  unfairCount <- append(unfairCount, as.integer(tylenol$count))
}

fairFunction <- function(yearPoint){
  textCount %>%
    filter(word == "fair") %>%
    filter(date_created < yearPoint) %>%
    summarise(count = sum(count))
}

fairCount <- c()
for (i in yearList){
  tylenol <- fairFunction(i)
  fairCount <- append(fairCount, as.integer(tylenol$count))
}
GroupTable <- list(year=yearList, fairCount = fairCount, unfairCount = unfairCount)
GroupTable <- as.data.frame(GroupTable)
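Because each entry counts everything before Jan 1 of that year, the series are cumulative; the activity within a single year is the difference between consecutive entries, which is what would localize a jump like the one around 2013. A small sketch of that step (the numbers in the comment are toy values, not from the dataset):

```python
def yearly_increments(cumulative):
    # Per-year counts from a cumulative (running-total) series.
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]

# yearly_increments([62, 150, 702]) -> [88, 552]
```

Applying this to `fairGroupCount` and `unfairGroupCount` before plotting would show per-year spikes instead of an ever-rising curve.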
Looking at the result, I wonder: what happened in 2013 and 2017? Will we see the same in 2019 and 2022?