This repository was archived by the owner on Dec 17, 2025. It is now read-only.

This repository contains scripts for collecting and analyzing data about open source activity associated with a university or institution


UT-OSPO/institutional-innovation-grapher


Institutional Innovation Grapher

This repository contains scripts for collecting and analyzing data about open source activity associated with a university or institution.

🛑 This project is no longer actively developed

All issues and pull requests have been closed. Please refer to the README for next steps.

Tools

The vision for this repository is a set of Python scripts that open source program offices can use to better understand open source activity at their own institution. Details about the specific tools follow.

github-activity-metrics-tool.py

This script gathers data about GitHub accounts that mention the specified institution in the account's "bio" statement and saves that information to a CSV file (simple-github-account-url-list-[year]-[month]-[day]-[institutionname].csv). Beyond summary information for each account, the script focuses on the open source activity of university research communities: it uses each account's "bio" text to predict the type of affiliation with the specified university. The script can also gather information about the individual GitHub repositories under each account, which it saves to a separate CSV file (simple-github-repo-url-list-[year]-[month]-[day]-[institutionname].csv). For the script to run successfully, the required parameters must be defined in the repository's .env file. See the section "Configuring the .env file" for details.
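As a rough illustration of the naming pattern described above, the following sketch builds the two output file names. The function name build_output_filename is hypothetical, and the zero-padded month and day are an assumption; the script itself may format the date or the institution name differently.

```python
from datetime import date


def build_output_filename(kind: str, institution_name: str, run_date: date) -> str:
    """Return a CSV file name following the documented pattern.

    `kind` is "account" or "repo". The institution name is used as given,
    because the real script's normalization rules are not documented here.
    """
    if kind not in ("account", "repo"):
        raise ValueError("kind must be 'account' or 'repo'")
    # date.isoformat() yields YYYY-MM-DD (zero-padded; an assumption).
    return f"simple-github-{kind}-url-list-{run_date.isoformat()}-{institution_name}.csv"


if __name__ == "__main__":
    print(build_output_filename("account", "utaustin", date.today()))
    print(build_output_filename("repo", "utaustin", date.today()))
```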

github-data-visualizer.py

This script creates visualizations of the CSV data collected by github-activity-metrics-tool.py. The paths to the GitHub account CSV and the GitHub repository CSV must be defined in the repository's .env file for the script to run successfully.

Configuring the .env file

Once this repo is cloned locally, rename the template.env file to .env and replace the example values it contains by default with the correct values for the institution for which the script will be run. Take care to preserve the JSON formatting of the .env file, because the Python scripts in the repository depend on the parameters it defines.
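To show why the JSON formatting matters, here is a minimal sketch of reading a JSON-formatted .env file with Python's standard library. The function names parse_config and load_config are hypothetical; the repository's scripts may read the file differently.

```python
import json
from pathlib import Path


def parse_config(text: str) -> dict:
    """Parse the JSON text of a .env file into a dict of parameters."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(f"configuration is not valid JSON: {err}") from err
    if not isinstance(config, dict):
        raise ValueError("configuration must be a JSON object of parameters")
    return config


def load_config(path: str = ".env") -> dict:
    """Read and parse the .env file at `path` (hypothetical helper)."""
    return parse_config(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    sample = '{"test": true, "pagelimit": 2, "institutionname": "utaustin"}'
    print(parse_config(sample))
```

Note that an unquoted JSON true or false becomes a Python bool, whereas a quoted "false" would be parsed as a non-empty string, which Python treats as truthy.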

Explanation of configurable parameters

  • githubtoken: The personalized GitHub token generated by the user of the script should be supplied here
  • test: A Boolean value (true or false, do not enclose in quotation marks) to enable (true) or disable (false) the test environment. This is used for small sample size runs to ensure functionality of the workflow after updates. For UT Austin, test mode restricts the API call to specific subsidiary schools or institutes (e.g., Dell Medical) under the UT Austin umbrella that are known to return only a few dozen results
  • ratelimiting: A Boolean value (true or false) for implementing manual rate limiting from the start of the process; see the next section for some remarks on rate limiting
  • contents: A Boolean value (true or false) for querying individual repositories' endpoints in order to get information on their contents. Note that this can significantly increase the number of requests and likely will require manual rate limiting. Runtime could be on the order of hours if enabled
  • onlyaffiliated: A Boolean value (true or false) for only exporting accounts that contain a string listed in institutionnamepermutations (see below) in the name, bio, company, or location fields (i.e., removes false positives). If enabled, the script will filter all accounts and then only query/export repositories associated with affiliated accounts. It can be useful to run an initial search without this filter enabled in order to identify unexpected permutations (e.g., 'University of Austin at Texas'), as an account must match at least one listed permutation in order to be retained by this filter
  • institutionname: The standard institution name - this will be used to name files and manage other script processes
  • institutionnamepermutations: A list of all of the variations of the institution name that GitHub users might mention in their account bio statement, company, location, or account name. This is not used in the search process but is used in the affiliation validation process, so you should include nearly identical permutations if you want to make an exact affiliation match (e.g., UT Austin vs. UT, Austin)
  • institutioncity: The city that the institution is located in
  • institutionemaildomain: The email address extension for the institution (e.g. "utexas.edu")
  • githubaccountdetailscsvpath: Path to CSV file containing GitHub account results
  • githubaccountdetailsfilteredcsvpath: Path to CSV file containing filtered GitHub account results. Results are filtered based on whether at least one string listed in institutionnamepermutations is found in the bio, company, location, or username
  • githubrepodetailscsvpath: Path to CSV file containing GitHub repository results
  • githubrepodetailsfilteredcsvpath: Path to CSV file containing filtered GitHub repository results. Results are filtered based on whether at least one string listed in institutionnamepermutations is found in the bio, company, location, or username
  • resultsperpage: Number of GitHub results per page to return
  • pagelimit: Number of pages of GitHub results to return
  • minimumfollowers: Minimum number of followers that a GitHub account must have for it to be included in results
  • minimumrepos: Minimum number of repositories that a GitHub account must have for it to be included in results
  • githubrepolastupdatethresholdinmonths: An integer used to restrict repository data gathered so that only repositories updated within the specified number of months will be included in the saved CSV. This can be helpful for filtering out projects that are no longer active.
  • detaillevel: "fulldetail" or "limiteddetail" - this controls whether the email, company, and bio fields will be populated in the results CSV
  • plotformat: The image format extension to export graphs from the github-data-visualizer.py script (e.g., "png", "tiff", "jpeg")
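The affiliation check controlled by onlyaffiliated and institutionnamepermutations can be sketched roughly as follows. This is an illustration under stated assumptions, not the script's actual code: the function name is_affiliated is hypothetical, the account is modeled as a plain dict with name, bio, company, and location keys, and the case-insensitive comparison is an assumption.

```python
def is_affiliated(account: dict, permutations: list[str]) -> bool:
    """Return True if any permutation appears in a checked account field.

    Matching is a plain substring test, so punctuation differences such as
    "UT Austin" vs. "UT, Austin" do not match each other.
    """
    checked_fields = ("name", "bio", "company", "location")
    for field in checked_fields:
        # Profile fields may be missing or None; treat both as empty text.
        value = (account.get(field) or "").lower()
        if any(p.lower() in value for p in permutations):
            return True
    return False


if __name__ == "__main__":
    permutations = ["UT Austin", "University of Texas at Austin"]
    accounts = [
        {"name": "A", "bio": "Researcher at UT Austin", "company": None},
        {"name": "B", "bio": "Based in Austin", "location": "Austin, TX"},
    ]
    affiliated = [a for a in accounts if is_affiliated(a, permutations)]
    print([a["name"] for a in affiliated])
```

Because the test is a substring match, near-identical variants must each be listed if they should count as affiliated.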

Rate limiting

The GitHub API limits authenticated requests to the Search API at a rate of 30 per minute and limits requests to most other endpoints at a rate of 5,000 per hour. The number of remaining requests and the time at which the cap is 'reset' are included in the API response headers. Whether you will hit this rate limit depends on the institution, but it is reasonable to expect that any R1 of a comparable size to UT Austin will hit the limit. The function that makes API requests can apply conditional delays and/or conditional, manually prescribed rate limiting. You can also implement rate limiting from the start with the ratelimiting parameter. Some of this functionality is still being optimized, but the underlying behavior is not affected.

Deploying Python Shiny app to Display Data in Interactive Table

Install rsconnect

pip install rsconnect-python

Create a shinyapps account

Visit https://www.shinyapps.io/ and log in to your existing account or create a new account

Configure the rsconnect-python package to use your shinyapps account

Retrieve your token from the shinyapps.io dashboard by selecting the Tokens option in the menu at the top right of the dashboard, then run the following command in a command prompt or terminal.

`rsconnect add --account <ACCOUNT> --name <NAME> --token <TOKEN> --secret <SECRET>`  

Note: if you click the "Show" button on the "Token" page, you can see the rsconnect add command pre-populated with your specific account information instead of placeholder values. A more detailed overview of this process can be found in the shinyapps.io documentation at https://docs.posit.co/shinyapps.io/guide/getting_started/.

Deploy the app

COMMAND FORMULA: rsconnect deploy shiny "/path/to/app" --name <NAME> --title affiliated-os-project-data
EXAMPLE: rsconnect deploy shiny "C:/Users/exampleuser/Documents/scripts/institutional-innovation-grapher/shiny-app" --name yourshinyaccountname --title yourappname

Contact

For any questions about this repository, please contact the UT Austin Open Source Program Office at ospo@utlists.utexas.edu.

Last updated

2025-12-16
