This repository contains scripts for collecting and analyzing data about open source activity associated with a university or institution.
All issues and pull requests have been closed. Please refer to the README for next steps.
The vision for this repository is to build out a set of Python scripts that can effectively be utilized by open source program offices to gain a better understanding of their specific institution. See details below about specific tools.
This script gathers data about GitHub accounts that mention the specified institution in the account's "bio" statement and saves that information in a CSV (simple-github-account-url-list-[year]-[month]-[day]-[institutionname].csv). In addition to gathering summary information about each account, it was developed with a specific focus on the open source activity of university research communities, and it uses the "bio" information associated with each account to predict the type of affiliation with the defined university. The script can also gather information about the individual GitHub repositories under each account and save it to a separate CSV file (simple-github-repo-url-list-[year]-[month]-[day]-[institutionname].csv). For the script to run successfully, important parameter information must be defined in the repository's .env file. See the section below for more details about preparing the .env file.
This script creates visualizations of the CSV-formatted data collected using github-activity-metrics-tool.py. The paths to the GitHub account information CSV and the GitHub repo CSV need to be defined in the repository's .env file for the script to run successfully.
Once this repo is cloned locally, the template.env file should be renamed to just .env and the contents of the file should be edited to replace the example values that are provided in the file by default with the correct values based on the institution for which the script will be run. Take care to preserve the JSON formatting of the .env file to ensure proper functioning of the Python scripts in the repository which depend on the parameters defined in the .env file.
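As a rough illustration of why the JSON formatting matters, the sketch below parses a JSON-formatted .env file and checks a few of the parameters documented in this README. It does not show how the repository's scripts actually read the file: the `load_config` helper and the subset of keys it checks are assumptions made for illustration.

```python
import json
from pathlib import Path

# Hypothetical subset of the parameters documented in this README.
REQUIRED_KEYS = ["githubtoken", "test", "institutionname", "institutionnamepermutations"]


def load_config(path=".env"):
    """Parse a JSON-formatted .env file and check that a few expected keys exist."""
    text = Path(path).read_text(encoding="utf-8")
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        # A stray comma or quoted Boolean elsewhere will not raise here,
        # but malformed JSON will.
        raise ValueError(f"{path} is not valid JSON: {err}") from err
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"Missing parameters in {path}: {', '.join(missing)}")
    # Booleans must be bare true/false in JSON, not quoted strings.
    if not isinstance(config["test"], bool):
        raise TypeError("'test' must be true or false without quotation marks")
    return config
```

A check like this fails early with a readable message if the file is malformed or if `"test": "true"` was written with quotation marks.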
- githubtoken: The personalized GitHub token generated by the user of the script should be supplied here
- test: A Boolean value (true or false, do not enclose in quotation marks) to enable (true) or disable (false) the test environment. This is used for small sample size runs to ensure functionality of the workflow after updates. For UT Austin, the test mode restricts the API call to searching only for specific subsidiary schools or institutes (e.g., Dell Medical) under the UT Austin umbrella that are known to return only a few dozen results
- ratelimiting: A Boolean value (true or false) for implementing manual rate limiting from the start of the process; see the next section for some remarks on rate limiting
- contents: A Boolean value (true or false) for querying individual repositories' endpoints in order to get information on their contents. Note that this can significantly increase the number of requests and likely will require manual rate limiting. Runtime could be on the order of hours if enabled
- onlyaffiliated: A Boolean value (true or false) for only exporting accounts that record a string prescribed in institutionnamepermutations (see below) in the name, bio, company, or location fields (i.e. removes false positives). If enabled, the script will filter all accounts and then only query/export repositories associated with affiliated accounts. It can be useful to run an initial search without this filter enabled in order to identify possible unexpected permutations, as an account must have at least one listed permutation in order to be retained by this filter (e.g., 'University of Austin at Texas')
- institutionname: The standard institution name - this will be used to name files and manage other script processes
- institutionnamepermutations: A list of all of the variations of the institution name that GitHub users might mention in their account bio statement, company, location, or account name. This is not used in the search process but is used in the affiliation validation process, so you should include nearly identical permutations if you want to make an exact affiliation match (e.g., UT Austin vs. UT, Austin)
- institutioncity: The city that the institution is located in
- institutionemaildomain: The email address extension for the institution (e.g. "utexas.edu")
- githubaccountdetailscsvpath: Path to CSV file containing GitHub account results
- githubaccountdetailsfilteredcsvpath: Path to CSV file containing filtered GitHub account results. Results are filtered based on whether at least one string listed in institutionnamepermutations is found in the bio, company, location, or username
- githubrepodetailscsvpath: Path to CSV file containing GitHub repository results
- githubrepodetailsfilteredcsvpath: Path to CSV file containing filtered GitHub repository results. Results are filtered based on whether at least one string listed in institutionnamepermutations is found in the bio, company, location, or username
- resultsperpage: Number of GitHub results per page to return
- pagelimit: Number of pages of GitHub results to return
- minimumfollowers: Minimum number of followers that a GitHub account must have for it to be included in results
- minimumrepos: Minimum number of repositories that a GitHub account must have for it to be included in results
- githubrepolastupdatethresholdinmonths: An integer used to restrict the repository data gathered so that only repositories updated within the specified number of months are included in the saved CSV. This can be helpful for filtering out projects that are no longer active.
- detaillevel: "fulldetail" or "limiteddetail" - this controls whether the email, company, and bio fields are populated in the results CSV
- plotformat: The image format extension to export graphs from the github-data-visualizer.py script (e.g., "png", "tiff", "jpeg")
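The affiliation check behind onlyaffiliated can be pictured as a substring search across an account's profile fields. The following is only a sketch under stated assumptions: the `is_affiliated` function, the dictionary keys, the case-insensitive comparison, and the sample accounts are all made up for illustration, and the script's actual matching logic may differ.

```python
def is_affiliated(account, permutations):
    """Return True if any permutation appears in the name, bio, company, or location."""
    fields = ("name", "bio", "company", "location")
    # Profile fields may be empty (None), so treat missing values as "".
    haystack = [(account.get(field) or "").lower() for field in fields]
    return any(p.lower() in text for p in permutations for text in haystack)


# Made-up sample accounts, not real data.
accounts = [
    {"name": "A. Researcher", "bio": "PhD student at UT Austin", "company": None, "location": "Austin, TX"},
    {"name": "B. Developer", "bio": "Austin-based web developer", "company": "Acme", "location": "Austin, TX"},
]
permutations = ["UT Austin", "University of Texas at Austin"]

affiliated = [a for a in accounts if is_affiliated(a, permutations)]
print([a["name"] for a in affiliated])  # prints ['A. Researcher']
```

The second account is dropped because no listed permutation appears in any of its fields, which is why unexpected permutations need to be added to the list before filtering.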
The GitHub API limits authenticated requests to the Search API to 30 per minute and to most other endpoints to 5,000 per hour. The number of remaining requests and the time at which the cap resets are included in the API response headers. Whether you hit these limits will depend on the institution, but it is reasonable to expect that any R1 university of a size comparable to UT Austin will hit them. The function that makes API requests has built-in functionality for conditional delays and/or conditional, manually prescribed rate limiting. You can also implement rate limiting from the start with the ratelimiting parameter. Some of this functionality is still being optimized, but the underlying functionality is not affected.
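To make the header-based delay concrete, here is a minimal sketch that computes how long to pause from GitHub's `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers. It is not the repository's request function: the `seconds_until_reset` name, the one-second buffer, and the fallback values for missing headers are assumptions.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how many seconds to pause before the next request."""
    now = time.time() if now is None else now
    # Fall back to "not limited" if the headers are absent (an assumption).
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    reset_epoch = int(headers.get("X-RateLimit-Reset", 0))
    if remaining > 0:
        return 0.0
    # Quota exhausted: wait until the reset time (UTC epoch seconds),
    # plus a one-second buffer.
    return max(0.0, reset_epoch - now) + 1.0


# Example with hand-written header values.
headers = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1000"}
print(seconds_until_reset(headers, now=900))  # prints 101.0
```

In a request loop, the return value could be passed to `time.sleep` after each response; if the `requests` library is used, `response.headers` can be passed in directly because its header lookup is case-insensitive.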
`pip install rsconnect-python`
Visit http://www.shinyapps.io/ and log in to your existing account or create a new account
Retrieve your token from the shinyapps.io dashboard by selecting the Tokens option in the menu at the top right of the dashboard, then run the following command in the command prompt or terminal.
`rsconnect add --account <ACCOUNT> --name <NAME> --token <TOKEN> --secret <SECRET>`
Note: if you click the "Show" button on the "Token" page, you can see the `rsconnect add` command pre-populated with your specific account information instead of placeholder values. A more detailed overview of this process can be found in the shinyapps.io documentation at https://docs.posit.co/shinyapps.io/guide/getting_started/.
COMMAND FORMULA: `rsconnect deploy shiny "/path/to/app" --name <NAME> --title affiliated-os-project-data`
EXAMPLE: `rsconnect deploy shiny "C:/Users/exampleuser/Documents/scripts/institutional-innovation-grapher/shiny-app" --name yourshinyaccountname --title yourappname`
For any questions about this repository, please contact the UT Austin Open Source Program Office at ospo@utlists.utexas.edu.
2025-12-16