Live streaming tweets

Installing and loading package

Prior to streaming, make sure to install and load rtweet. This vignette assumes users have already setup app access tokens (see: the “auth” vignette, vignette("auth", package = "rtweet")).

## Load rtweet
library(rtweet)
client_as("my_app")

Overview

rtweet makes it possible to capture live streams of Twitter data¹.

There are two ways of having a stream:

A stream collecting data from a set of rules, which can be collected via filtered_stream().
A stream of a 1% of tweets published, which can be collected via sample_stream().

In either case we need to choose how long should the streaming connection hold, and in which file it should be saved to.

## Stream time in seconds so for one minute set timeout = 60
## For larger chunks of time, I recommend multiplying 60 by the number
## of desired minutes. This method scales up to hours as well
## (x * 60 = x mins, x * 60 * 60 = x hours)
## Stream for 5 seconds
streamtime <- 5
## Filename to save json data (backup)
filename <- "rstats.json"

Filtered stream

The filtered stream collects tweets for all rules that are currently active, not just one rule or query.

Creating rules

Streaming rules in rtweet need a value and a tag. The value is the query to be performed, and the tag is the name to identify tweets that match a query. You can use multiple words and hashtags as value, please read the official documentation. Multiple rules can match to a single tweet.

## Stream rules used to filter tweets
new_rule <- stream_add_rule(list(value = "#rstats", tag = "rstats"))

Listing rules

To know current rules you can use stream_add_rule() to know if any rule is currently active:

rules <- stream_add_rule(NULL)
rules
#>   result_count                sent
#> 1            1 2023-03-19 22:04:29
rules(rules)
#>                    id   value    tag
#> 1 1637575790693842952 #rstats rstats

With the help of rules() the id, value and tag of each rule is provided.

Removing rules

To remove rules use stream_rm_rule()

# Not evaluated now
stream_rm_rule(ids(new_rule))

Note, if the rules are not used for some time, Twitter warns you that they will be removed. But given that filtered_stream() collects tweets for all rules, it is advisable to keep the rules list short and clean.

filtered_stream()

Once these parameters are specified, initiate the stream. Note: Barring any disconnection or disruption of the API, streaming will occupy your current instance of R until the specified time has elapsed. It is possible to start a new instance or R —streaming itself usually isn’t very memory intensive— but operations may drag a bit during the parsing process which takes place immediately after streaming ends.

## Stream election tweets
stream_rstats <- filtered_stream(timeout = streamtime, file = filename, parse = FALSE)
#> Warning: No matching tweets with streaming rules were found in the time provided.

If no tweet matching the rules is detected a warning will be issued.

Parsing larger streams can take quite a bit of time (in addition to time spent streaming) due to a somewhat time-consuming simplifying process used to convert a json file into an R object.

Don’t forget to clean the streaming rules:

stream_rm_rule(ids(new_rule))
#>                  sent deleted not_deleted
#> 1 2023-03-19 22:04:51       1           0

Sample stream

The sample_stream() function doesn’t need rules or anything.

stream_random <- sample_stream(timeout = streamtime, file = filename, parse = FALSE)
#> 
 Found 316 records...
 Imported 316 records. Simplifying...
length(stream_random)
#> [1] 316

Saving files

Users may want to stream tweets into json files upfront and parse those files later on. To do this, simply add parse = FALSE and make sure you provide a path (file name) to a location you can find later.

You can also use append = TRUE to continue recording a stream into an already existing file.

Currently parsing the streaming data file with parse_stream() is not functional. However, you can read it back in with jsonlite::stream_in(file).

Returned data object

The parsed object should be the same whether a user parses up-front or from a json file in a later session.

Currently the returned object is a raw conversion of the feed into a nested list depending on the fields and extensions requested.

rtweet: Collecting Twitter Data