Retrieving Raw Data

Using the correct query

There are two ways to retrieve raw data (URLs, links, sitemaps, etc.) from Deepcrawl:

  • This page describes how to download all data from a datasource in a single request. The result cannot be filtered or sorted, but this is the most efficient way to access all data.
  • The Get URL Data guide describes how to retrieve defined metrics for URLs in the crawl. That query can be filtered, sorted, etc. but requires you to paginate URLs 100 at a time. It is perfect for getting a sample of the available data, but is not well suited to getting all data for a crawl.

Downloading all raw data

During a typical crawl, Deepcrawl may produce millions of rows worth of URL data, and hundreds of millions of rows about links. While our REST and Graph APIs allow you to access a hundred of these rows per page, paginating through hundreds of thousands of requests is not an efficient way to download data if you need information about all URLs or links.

In the background, Deepcrawl stores crawl data in parquet-formatted files. Parquet is a compressed, columnar format that is widely supported by datalake and query systems. To allow clients easy access to their full datasets, we make these parquet files available to be directly downloaded.

The sample query below returns three properties (expiresAt, datasourceName, files) for the crawl_urls datasource (URLs in the crawl). Remove the datasourceName filter to see the other available datasources (links, sitemaps, etc.).

query getParquet {
  getCrawl(id: 1234) {
    parquetFiles(datasourceName: "crawl_urls") {
      expiresAt
      datasourceName
      files
    }
  }
}
The response contains authenticated links to download the requested files. The links are valid for 7 days from the time they are generated, so data should be downloaded promptly.
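As a hedged sketch, the snippet below shows one way to pull the download links out of the GraphQL response in Python. The response shape mirrors the query above; the sample values and the `extract_file_urls` helper are illustrative, not part of the Deepcrawl API.

```python
def extract_file_urls(response: dict) -> list:
    """Collect the authenticated file URLs from a getParquet GraphQL response."""
    urls = []
    for entry in response["data"]["getCrawl"]["parquetFiles"]:
        urls.extend(entry["files"])
    return urls


# Hypothetical response in the shape returned by the query above
# (field values are made up for illustration):
sample_response = {
    "data": {
        "getCrawl": {
            "parquetFiles": [
                {
                    "expiresAt": "2024-01-08T00:00:00Z",
                    "datasourceName": "crawl_urls",
                    "files": [
                        "https://example.com/crawl_urls-0.parquet?sig=abc",
                    ],
                }
            ]
        }
    }
}

for url in extract_file_urls(sample_response):
    print(url)  # download each link promptly with your HTTP client of choice
```

Because the links expire after 7 days, a script like this is typically paired with an immediate download step rather than storing the URLs for later.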

Using a Parquet file

Parquet is an industry-standard format for big data storage and analysis. A single file will typically contain all data from a given datasource. You can access and analyse the data in any standard parquet-compatible system.

Some of our favourite readers are:

  • NodeJS: node-duckdb - a Deepcrawl-maintained node wrapper for DuckDB - this will allow you to run SQL queries over a parquet file without first loading it into a database.
  • NodeJS: parquetjs-lite - a parquet reader that allows extraction of records from the file
  • Python: Pandas can natively read parquet files into a DataFrame
  • Python: parquet-python is a native parquet reader for Python
  • AWS: Parquet is widely supported in AWS’s ecosystem - S3 Select, Athena, EMR, and other analysis services
  • Datalakes: Parquet can be natively ingested/read into BigQuery, Azure, Snowflake