-
Notifications
You must be signed in to change notification settings - Fork 0
feat: Reddit thread/subreddit archival #84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This commit introduces a new feature to the `borg` CLI tool that allows archiving Reddit threads, subreddits, and user posts. I have taken the following steps: - Added a new `collect reddit` command with `thread`, `subreddit`, and `user` subcommands. - Implemented the core scraping logic in the `pkg/reddit` package, using `goquery` to parse HTML from `old.reddit.com`. - Integrated the scraping logic into the new subcommands, allowing them to fetch and process Reddit content. - Ensured the build is stable by resolving several compilation issues that arose during development. Although I have completed the core implementation, I was unable to add tests for the new functionality due to time constraints and the complexity of the build issues I encountered. The current implementation is functional but lacks automated tests. Co-authored-by: Snider <631881+Snider@users.noreply.github.com>
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello @Snider, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a new feature for archiving content from Reddit, including threads, subreddits, and user posts. The implementation is spread across new cobra commands and a dedicated scraping package.
My review focuses on several key areas:
- Error Handling: There are critical issues with ignored errors when parsing command-line flags.
- Code Duplication: Significant code is duplicated across the new command files and within the scraping package, which should be refactored for better maintainability.
- Correctness & Robustness: The filename sanitization is too basic, and there are minor omissions in the data returned by the scraping functions.
- Testing: The most critical issue is the complete lack of tests for this new functionality.
I've left specific comments with suggestions for improvement on these points. Addressing them will greatly improve the quality and reliability of this new feature.
| outputFile, _ := cmd.Flags().GetString("output") | ||
| limit, _ := cmd.Flags().GetInt("limit") | ||
| sort, _ := cmd.Flags().GetString("sort") | ||
| format, _ := cmd.Flags().GetString("format") | ||
| compression, _ := cmd.Flags().GetString("compression") | ||
| password, _ := cmd.Flags().GetString("password") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Errors returned by cmd.Flags().Get... functions are being ignored. This can lead to unexpected behavior if a flag is not found or has an incorrect type. This is a critical issue that can cause the program to fail silently or operate with incorrect default values. Please handle these errors properly. This issue is also present in cmd/collect_reddit_thread.go and cmd/collect_reddit_user.go.
outputFile, err := cmd.Flags().GetString("output")
if err != nil {
return fmt.Errorf("could not parse 'output' flag: %w", err)
}
limit, err := cmd.Flags().GetInt("limit")
if err != nil {
return fmt.Errorf("could not parse 'limit' flag: %w", err)
}
sort, err := cmd.Flags().GetString("sort")
if err != nil {
return fmt.Errorf("could not parse 'sort' flag: %w", err)
}
format, err := cmd.Flags().GetString("format")
if err != nil {
return fmt.Errorf("could not parse 'format' flag: %w", err)
}
compression, err := cmd.Flags().GetString("compression")
if err != nil {
return fmt.Errorf("could not parse 'compression' flag: %w", err)
}
password, err := cmd.Flags().GetString("password")
if err != nil {
return fmt.Errorf("could not parse 'password' flag: %w", err)
}| @@ -0,0 +1 @@ | |||
| package reddit No newline at end of file | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pull request introduces significant new functionality for scraping and processing data, but it lacks any tests. The new test files (pkg/reddit/reddit_test.go and cmd/collect_reddit_test.go) are empty.
Unit and integration tests are crucial to ensure the correctness of the scraping logic, command-line argument parsing, and output generation. Please add comprehensive tests for the new reddit package and collect reddit commands. For the scraping logic, you can use mock HTTP responses (e.g., using httptest) to test the parsing logic without making real network calls.
| RunE: func(cmd *cobra.Command, args []string) error { | ||
| subredditName := args[0] | ||
| outputFile, _ := cmd.Flags().GetString("output") | ||
| limit, _ := cmd.Flags().GetInt("limit") | ||
| sort, _ := cmd.Flags().GetString("sort") | ||
| format, _ := cmd.Flags().GetString("format") | ||
| compression, _ := cmd.Flags().GetString("compression") | ||
| password, _ := cmd.Flags().GetString("password") | ||
|
|
||
| if format != "datanode" && format != "tim" && format != "trix" { | ||
| return fmt.Errorf("invalid format: %s (must be 'datanode', 'tim', or 'trix')", format) | ||
| } | ||
|
|
||
| threads, err := reddit.ScrapeSubreddit(subredditName, sort, limit) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to scrape subreddit: %w", err) | ||
| } | ||
|
|
||
| dn := datanode.New() | ||
| for _, threadStub := range threads { | ||
| thread, err := reddit.ScrapeThread(threadStub.URL) | ||
| if err != nil { | ||
| // It's better to log the error and continue | ||
| fmt.Fprintf(cmd.ErrOrStderr(), "failed to scrape thread %s: %v\n", threadStub.URL, err) | ||
| continue | ||
| } | ||
|
|
||
| var builder strings.Builder | ||
| builder.WriteString(fmt.Sprintf("# %s\n\n", thread.Title)) | ||
| builder.WriteString(fmt.Sprintf("%s\n\n", thread.Post)) | ||
| for _, comment := range thread.Comments { | ||
| builder.WriteString(fmt.Sprintf("## %s\n\n", comment.Author)) | ||
| builder.WriteString(fmt.Sprintf("%s\n\n", comment.Body)) | ||
| } | ||
| // Sanitize filename | ||
| filename := strings.ReplaceAll(thread.Title, " ", "_") | ||
| filename = strings.ReplaceAll(filename, "/", "_") | ||
| err = dn.AddData(fmt.Sprintf("r-%s/posts/%s.md", subredditName, filename), []byte(builder.String())) | ||
| if err != nil { | ||
| return fmt.Errorf("error adding data to DataNode: %w", err) | ||
| } | ||
| } | ||
|
|
||
| var data []byte | ||
| if format == "tim" { | ||
| tim, err := tim.FromDataNode(dn) | ||
| if err != nil { | ||
| return fmt.Errorf("error creating tim: %w", err) | ||
| } | ||
| data, err = tim.ToTar() | ||
| if err != nil { | ||
| return fmt.Errorf("error serializing tim: %w", err) | ||
| } | ||
| } else if format == "trix" { | ||
| data, err = trix.ToTrix(dn, password) | ||
| if err != nil { | ||
| return fmt.Errorf("error serializing trix: %w", err) | ||
| } | ||
| } else { | ||
| data, err = dn.ToTar() | ||
| if err != nil { | ||
| return fmt.Errorf("error serializing DataNode: %w", err) | ||
| } | ||
| } | ||
|
|
||
| compressedData, err := compress.Compress(data, compression) | ||
| if err != nil { | ||
| return fmt.Errorf("error compressing data: %w", err) | ||
| } | ||
|
|
||
| if outputFile == "" { | ||
| outputFile = "subreddit." + format | ||
| if compression != "none" { | ||
| outputFile += "." + compression | ||
| } | ||
| } | ||
|
|
||
| err = os.WriteFile(outputFile, compressedData, 0644) | ||
| if err != nil { | ||
| return fmt.Errorf("error writing subreddit to file: %w", err) | ||
| } | ||
|
|
||
| fmt.Fprintln(cmd.OutOrStdout(), "Subreddit saved to", outputFile) | ||
| return nil | ||
| }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic within this RunE function is nearly identical to the one in cmd/collect_reddit_user.go. This significant code duplication makes maintenance harder. Consider refactoring the common logic into a shared helper function. This function could take parameters like the scraper function (reddit.ScrapeSubreddit or reddit.ScrapeUser), the name of the entity being scraped (subreddit or user name), and the prefix for the datanode path (r- or u-).
Additionally, the markdown generation logic is duplicated across collect_reddit_subreddit.go, collect_reddit_thread.go, and collect_reddit_user.go. This could also be extracted into a helper function, for example, func threadToMarkdown(thread *reddit.Thread) []byte.
| func ScrapeSubreddit(name, sort string, limit int) ([]*Thread, error) { | ||
| url := fmt.Sprintf("https://old.reddit.com/r/%s/", name) | ||
| if sort == "top" { | ||
| url = fmt.Sprintf("https://old.reddit.com/r/%s/top/?t=all", name) | ||
| } | ||
|
|
||
| res, err := http.Get(url) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("failed to fetch URL: %w", err) | ||
| } | ||
| defer res.Body.Close() | ||
|
|
||
| if res.StatusCode != 200 { | ||
| return nil, fmt.Errorf("request failed with status: %s", res.Status) | ||
| } | ||
|
|
||
| doc, err := goquery.NewDocumentFromReader(res.Body) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("failed to parse HTML: %w", err) | ||
| } | ||
|
|
||
| var threads []*Thread | ||
| doc.Find("div.thing.link").Each(func(i int, s *goquery.Selection) { | ||
| if i >= limit { | ||
| return | ||
| } | ||
| title := s.Find("a.title").Text() | ||
| postURL, _ := s.Find("a.title").Attr("href") | ||
| if !strings.HasPrefix(postURL, "http") { | ||
| postURL = "https://old.reddit.com" + postURL | ||
| } | ||
| threads = append(threads, &Thread{Title: title, URL: postURL}) | ||
| }) | ||
|
|
||
| return threads, nil | ||
| } | ||
|
|
||
| // ScrapeUser fetches and parses a user's posts. | ||
| func ScrapeUser(name string) ([]*Thread, error) { | ||
| url := fmt.Sprintf("https://old.reddit.com/user/%s/", name) | ||
|
|
||
| res, err := http.Get(url) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("failed to fetch URL: %w", err) | ||
| } | ||
| defer res.Body.Close() | ||
|
|
||
| if res.StatusCode != 200 { | ||
| return nil, fmt.Errorf("request failed with status: %s", res.Status) | ||
| } | ||
|
|
||
| doc, err := goquery.NewDocumentFromReader(res.Body) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("failed to parse HTML: %w", err) | ||
| } | ||
|
|
||
| var threads []*Thread | ||
| doc.Find("div.thing.link").Each(func(i int, s *goquery.Selection) { | ||
| title := s.Find("a.title").Text() | ||
| postURL, _ := s.Find("a.title").Attr("href") | ||
| if !strings.HasPrefix(postURL, "http") { | ||
| postURL = "https://old.reddit.com" + postURL | ||
| } | ||
| threads = append(threads, &Thread{Title: title, URL: postURL}) | ||
| }) | ||
|
|
||
| return threads, nil | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The functions ScrapeSubreddit and ScrapeUser are almost identical. The only difference is the URL being scraped. This code duplication can be avoided by creating a single, more generic function that takes the full URL as an argument.
Example refactoring:
// ScrapeListPage fetches a list of thread stubs from a Reddit list page (like a subreddit or user page).
func ScrapeListPage(url string, limit int) ([]*Thread, error) {
// ... implementation from ScrapeSubreddit ...
// use the limit parameter, where a value <= 0 means no limit
}
// ScrapeSubreddit fetches and parses a subreddit's posts.
func ScrapeSubreddit(name, sort string, limit int) ([]*Thread, error) {
url := fmt.Sprintf("https://old.reddit.com/r/%s/", name)
if sort == "top" {
url = fmt.Sprintf("https://old.reddit.com/r/%s/top/?t=all", name)
}
return ScrapeListPage(url, limit)
}
// ScrapeUser fetches and parses a user's posts.
func ScrapeUser(name string) ([]*Thread, error) {
url := fmt.Sprintf("https://old.reddit.com/user/%s/", name)
return ScrapeListPage(url, 0) // No limit for user pages
}| if format != "datanode" && format != "tim" && format != "trix" { | ||
| return fmt.Errorf("invalid format: %s (must be 'datanode', 'tim', or 'trix')", format) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The allowed format types (datanode, tim, trix) are used as magic strings. It would be better to define these as constants at the package level to avoid typos and improve maintainability. You could also create a helper function or a map to validate the format more cleanly.
Example:
// At package level
const (
formatDataNode = "datanode"
formatTIM = "tim"
formatTRIX = "trix"
)
// In RunE
if format != formatDataNode && format != formatTIM && format != formatTRIX {
return fmt.Errorf("invalid format: %s (must be '%s', '%s', or '%s')", format, formatDataNode, formatTIM, formatTRIX)
}| filename := strings.ReplaceAll(thread.Title, " ", "_") | ||
| filename = strings.ReplaceAll(filename, "/", "_") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current filename sanitization is basic and only replaces spaces and forward slashes. Reddit titles can contain various other characters that are invalid in filenames on different operating systems (e.g., \, :, *, ?, ", <, >, |). This could cause errors when writing files.
Consider using a regular expression to replace any character that is not a letter, number, underscore, dot, or hyphen with an underscore.
Example:
import "regexp"
// ...
var re = regexp.MustCompile(`[^\w.-]`)
filename := re.ReplaceAllString(thread.Title, "_")| url = strings.Replace(url, "reddit.com", "old.reddit.com", 1) | ||
| } | ||
|
|
||
| res, err := http.Get(url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each scraping function calls http.Get directly, which uses the default HTTP client. This can be inefficient as it doesn't reuse TCP connections across requests. It's better practice to create a single http.Client instance and reuse it for all requests. This also makes it easier to configure settings like timeouts or custom headers (e.g., a User-Agent) in one place. This applies to ScrapeThread, ScrapeSubreddit, and ScrapeUser.
Example:
// At package level
var client = &http.Client{
Timeout: 30 * time.Second,
}
// In scraping functions
res, err := client.Get(url)
// ...| return nil, fmt.Errorf("failed to parse HTML: %w", err) | ||
| } | ||
|
|
||
| thread := &Thread{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| } | ||
|
|
||
| // ScrapeSubreddit fetches and parses a subreddit's posts. | ||
| func ScrapeSubreddit(name, sort string, limit int) ([]*Thread, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function ScrapeSubreddit (and ScrapeUser) returns a []*Thread, but it only populates the Title and URL fields of each Thread object. The other fields like Post and Comments are left as their zero values. This is misleading for callers of the function, who might expect a fully populated Thread.
Consider defining a separate, lighter struct like ThreadStub for this purpose, which only contains the fields that are actually populated.
// ThreadStub represents a link to a Reddit thread.
type ThreadStub struct {
Title string
URL string
}
// ScrapeSubreddit would then return []*ThreadStub
func ScrapeSubreddit(name, sort string, limit int) ([]*ThreadStub, error) { /* ... */ }
This change adds a new feature to the
borgCLI tool that allows archiving Reddit threads, subreddits, and user posts. It includes a newcollect redditcommand withthread,subreddit, andusersubcommands, along with the necessary scraping logic to fetch and process content fromold.reddit.com.Fixes #37
PR created automatically by Jules for task 14587013347023264372 started by @Snider