Scraping Paginated APIs with Queues

Inspired by Using Generators for Pagination, this is a look at how you could handle consuming paginated records from an API using Laravel’s queues.

(new GetInvoicesRequest($httpClient))->dispatch();

Why Queues?

Queues scale easily across multiple workers. Jobs can be rate limited. Failures can be retried.

How?

There are two ways you could go with this. Both of them would have a job that looks something like this:

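use Illuminate\Bus\Batchable;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;

class GetInvoicesJob implements ShouldQueue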
{
    use Queueable;
    use Batchable;
    use Dispatchable;

    public function __construct(
        public int $page = 1
    ) {
    }

    public function handle(GetInvoicesRequest $request)
    {
        $response = $request->page($this->page)->send();

        // do something with $response
        // like update the database
    }
}

Laravel resolves the request from the service container when it calls the handle method. The job then sets the page, sends the request, and does something useful with the response, such as updating the database.
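Since rate limiting and retries are two of the reasons to reach for queues here, this is a rough sketch of how they might be declared. It uses Laravel's `$tries` property, `backoff()` method, and `RateLimited` job middleware. The trait name `ThrottlesAndRetries`, the limiter name `invoices-api`, and all of the numbers are placeholders, not something from the original project.

```php
<?php

use Illuminate\Queue\Middleware\RateLimited;

/**
 * Hypothetical trait that a job such as GetInvoicesJob could `use`.
 * The limiter name and the numbers below are placeholders.
 */
trait ThrottlesAndRetries
{
    /** Maximum number of attempts before the job is marked as failed. */
    public $tries = 5;

    /** Seconds to wait before each successive retry. */
    public function backoff(): array
    {
        return [10, 30, 60];
    }

    /** Job middleware: pass the job through a named rate limiter. */
    public function middleware(): array
    {
        return [new RateLimited('invoices-api')];
    }
}
```

The limiter itself would be registered once, typically in a service provider's `boot` method, with something like `RateLimiter::for('invoices-api', fn (object $job) => Limit::perMinute(60));`. Releasing a rate-limited job back onto the queue still counts as an attempt, so `$tries` may need to be more generous than the number of real failures you want to tolerate.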

Batch All Jobs Up Front

If you know how many total records there are, you can work out the number of pages and create all the jobs up front.

use Illuminate\Bus\Batch;
use Illuminate\Support\Facades\Bus;
use Throwable;

class GetInvoicesRequest
{
    public function dispatch(): Batch
    {
        return Bus::batch([
            new GetInvoicesJob(page: 1),
            new GetInvoicesJob(page: 2),
            new GetInvoicesJob(page: 3),
            // and so on...
        ])->then(function (Batch $batch) {
            // completed successfully...
        })->catch(function (Batch $batch, Throwable $e) {
            // failure detected...
        })->finally(function (Batch $batch) {
            // finished executing...
        })->name(self::class)->dispatch();
    }
}
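The `// and so on...` placeholder could be filled in from the record count rather than typed out by hand. Here is a minimal sketch, assuming the API reports a total and uses a fixed page size. The function name `pageNumbers` and both parameters are hypothetical.

```php
<?php

/**
 * Return the page numbers needed to cover $totalRecords
 * when the API serves $perPage records per page.
 */
function pageNumbers(int $totalRecords, int $perPage): array
{
    if ($totalRecords <= 0 || $perPage <= 0) {
        return [];
    }

    // Integer ceiling division: 250 records at 100 per page is 3 pages.
    $lastPage = intdiv($totalRecords + $perPage - 1, $perPage);

    return range(1, $lastPage);
}

// Inside the Laravel app, the page numbers would then be mapped to jobs:
//
// $jobs = array_map(
//     fn (int $page) => new GetInvoicesJob(page: $page),
//     pageNumbers($totalRecords, $perPage),
// );
//
// Bus::batch($jobs)->name(self::class)->dispatch();
```

Where the total comes from depends on the API. Often it is in the first page's metadata, which means one request has to happen before the batch can be built.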

Each Job Dispatches the Next

If the total count is unknown, or you don’t want the overhead of fetching it, each job could check whether there is another page and, if so, dispatch a job for it.

use Illuminate\Foundation\Bus\PendingDispatch;

class GetInvoicesRequest
{
    public function dispatch(): PendingDispatch
    {
        return GetInvoicesJob::dispatch(page: 1);
    }
}

Then inside the job handler you’d have something like this:

class GetInvoicesJob implements ShouldQueue
{
    use Queueable;
    use Batchable;
    use Dispatchable;

    public function __construct(
        public int $page = 1
    ) {
    }

    public function handle(GetInvoicesRequest $request)
    {
        $response = $request->page($this->page)->send();

        // do something with $response
        // like update the database

        if ($response->hasMorePages()) {
            static::dispatch(page: $this->page + 1);
        }
    }
}
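How `hasMorePages()` works depends on what the API returns. As a minimal sketch, assuming a JSON payload with `meta.current_page` and `meta.last_page`, the response wrapper might look something like this. The `InvoicesResponse` class and the field names are hypothetical and would need adjusting to the real payload.

```php
<?php

/**
 * Hypothetical wrapper around one decoded page of the API response.
 * Assumes records live under `data` and paging info under `meta`.
 */
final class InvoicesResponse
{
    public function __construct(private array $payload)
    {
    }

    /** The invoice records on this page. */
    public function invoices(): array
    {
        return $this->payload['data'] ?? [];
    }

    /** True when the API reports at least one page after this one. */
    public function hasMorePages(): bool
    {
        $current = (int) ($this->payload['meta']['current_page'] ?? 1);
        $last = (int) ($this->payload['meta']['last_page'] ?? $current);

        return $current < $last;
    }
}
```

If the paging metadata is missing, this version reports no more pages, so the chain stops rather than dispatching jobs indefinitely.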

Conclusion

I tend to like the batching method better because the pages can be processed in parallel, so the whole run finishes faster, and you can interrogate the batch to determine progress. But I’ve used both of these methods on real projects with success.

So, what do you think? Have you used queues for something like this?