Have you ever found yourself using the Twitter API for your research and needing to run many different searches to answer your question? Maybe you have a set of politics-related accounts and you want to see who’s interacting with them via @mentions, or you have a set of #hashtags used for different aspects of a discussion and want to capture all of the matching tweets.
While it’s possible to run many searches manually, it’s very easy to make a mistake and miss one, especially when you are dealing with hundreds of politicians. You could also work around this by writing a larger boolean query to try to capture everything (for example by searching for "@politician1 OR @politician2 ..."), but you quickly run into both the Twitter API’s limit of 1024 characters per search and the struggle on your side to make sure your complicated search is complete and correct. Many people end up writing a custom script to do this, but that often means we’re either writing the same code again, or avoiding experimenting because we don’t want to write another script for a small experiment.
We realised this is a common research use-case, so we decided to extend Twarc* to handle it for you in a flexible way. Using the searches command, new in Twarc as of version 2.6.0, you can specify a set of search queries in a CSV input file, and Twarc will take care of running each search one by one. This works with both the recent search endpoint available to everyone and the full archive search available if you have academic access. It also lets you check how many tweets match each search, so you can double-check that you’re not going to use up all your quota.
* Thanks to the DocNow team for their ongoing maintenance of Twarc.
Let’s look at a concrete example! Let’s say you have a file with one line per search query you want to run, saved as auspol_test.csv.
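If you would rather generate the query file than type it by hand, a minimal Python sketch follows. It assumes the input is plain text with one query per line and no header row, and it uses the three queries named in the combined-query example at the end of this post; the variable names are purely illustrative.

```python
from pathlib import Path

# Queries taken from the combined-query example at the end of this post.
queries = ["@ScottMorrisonMP", "@AlboMP", "#auspol"]

# Assumption: one query per line, with no header row.
query_file = Path("auspol_test.csv")
query_file.write_text("\n".join(queries) + "\n", encoding="utf-8")
```

Generating the file from a list makes it easier to keep a long set of accounts or hashtags in sync with the rest of your analysis code.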
Assuming you have already set up Twarc with twarc2 configure, you can do a quick test run to count the number of tweets matching each search over the last seven days, like below.
twarc2 searches --counts-only auspol_test.csv auspol_test_counts.csv
We always recommend checking with --counts-only first, so you don’t accidentally waste your quota on a misspelled query. After checking the counts matching each search in auspol_test_counts.csv, you can run a similar command without --counts-only to collect the matching tweets (we’ve also changed the output filename to avoid confusing ourselves later).
twarc2 searches auspol_test.csv auspol_test_tweets.json
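Once the collection has finished, a quick sanity check is to count how many tweets ended up in the output file. The sketch below assumes the file holds one API response page per line, with the tweets under a "data" key. That is the usual twarc2 layout, but treat it as an assumption for the searches command. The function name and the synthetic sample file are hypothetical.

```python
import json
from pathlib import Path


def count_tweets(path):
    """Count tweets in a line-delimited JSON file of API response pages."""
    total = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            page = json.loads(line)
            # Assumption: each page keeps its tweets in a "data" list;
            # pages without results may have no "data" key at all.
            total += len(page.get("data", []))
    return total


# Tiny synthetic file standing in for real collected output.
sample = Path("sample_pages.json")
sample.write_text(
    json.dumps({"data": [{"id": "1"}, {"id": "2"}]}) + "\n"
    + json.dumps({"data": [{"id": "3"}]}) + "\n",
    encoding="utf-8",
)
print(count_tweets(sample))
```

Comparing this number with the totals from the --counts-only run is a cheap way to spot an interrupted collection.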
We can also do the same thing, but this time specifying --archive so we can use the academic access track in the new V2 API to search earlier than the last seven days, and also --start-time and --end-time to only search in January of 2020.
twarc2 searches --archive --start-time 2020-01-01 --end-time 2020-02-01 auspol_test.csv auspol_test.json
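When you search month by month, it helps to compute the time boundaries rather than type them. The sketch below assumes the end boundary is exclusive, as it is for the end_time parameter of the Twitter API v2 search endpoints, so covering all of January 2020 means ending on the first day of February. The helper name month_bounds is hypothetical.

```python
from datetime import date


def month_bounds(year, month):
    """Return (start, end) ISO dates covering one calendar month.

    The end date is the first day of the following month, on the
    assumption that the end boundary is exclusive.
    """
    start = date(year, month, 1)
    if month == 12:
        end = date(year + 1, 1, 1)
    else:
        end = date(year, month + 1, 1)
    return start.isoformat(), end.isoformat()


start, end = month_bounds(2020, 1)
print(f"--start-time {start} --end-time {end}")
```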
As one last example, you can also specify --combine-queries. If you’re collecting a lot of tweets, you might find the number of calls to the Twitter API slowing you down. The --combine-queries option lets Twarc combine queries together to limit the number of API calls and potentially use your quota more efficiently.
twarc2 searches --combine-queries auspol_test.csv auspol_test_tweets.json
For our example data above, this will only issue one search query, (@ScottMorrisonMP) OR (@AlboMP) OR (#auspol), instead of three.
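To make the idea concrete, here is a rough Python sketch of how queries could be grouped into OR-joined strings while staying under the 1024-character limit mentioned earlier. This is an illustration of the general technique only, not Twarc's actual implementation. The function name and the greedy grouping strategy are assumptions.

```python
def combine_queries(queries, max_length=1024):
    """Group queries into '(q1) OR (q2)' strings no longer than max_length."""
    combined = []
    current = ""
    for query in queries:
        clause = f"({query})"
        candidate = f"{current} OR {clause}" if current else clause
        # Keep extending the current group while it still fits. A single
        # clause longer than max_length is passed through unchanged.
        if len(candidate) <= max_length or not current:
            current = candidate
        else:
            combined.append(current)
            current = clause
    if current:
        combined.append(current)
    return combined


print(combine_queries(["@ScottMorrisonMP", "@AlboMP", "#auspol"]))
```

With the three example queries everything fits into a single combined search. With hundreds of accounts, the same loop would produce several combined searches, each within the length limit.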