Find all archives of a given type within a vector of HTML text.

get_archive(
  origin,
  extension = c("zip", "7z"),
  name = NULL,
  date = NULL,
  version = NULL,
  directory = FALSE,
  html = c("simple", "data.gouv")
)

Arguments

origin

character, the URL where archives are to be found.

extension

character, vector of acceptable types of archives to be downloaded.

name

character, vector of acceptable names of archives to be downloaded. See details.

date

character, something like a date that should be used as a filter. See details.

version

character, something like a version that should be used as a filter.

directory

logical, should directories be found instead of archives? See details.

html

character, indicates whether the page is "simple" or comes from "data.gouv". This affects how links are searched for.

Value

A character vector of all archives or directories found in origin that match the given constraints.

Details

First, a regex search is made in the HTML text to find names enclosed in href="name" or href='name'.
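
As a rough illustration of this first step, the sketch below pulls href targets out of a small HTML snippet with base R. The helper name extract_href, the pattern and the sample page are assumptions made for the example, not the package's actual implementation.

```r
# Hypothetical helper: extract the targets of href="..." or href='...'
extract_href = function(x) {
  pattern = "href=(\"[^\"]*\"|'[^']*')"
  hits = unlist(regmatches(x, gregexpr(pattern, x)))
  # Drop the leading href= and the surrounding quotes
  gsub("^href=[\"']|[\"']$", "", hits)
}

# Made-up page content, one element per line of HTML
page = c(
  "<a href=\"RPG_R76_2010.7z\">R76</a>",
  "<a href='notes.txt'>notes</a> <a href=\"2011/\">2011</a>"
)
extract_href(page)
```

All links are returned at this stage, whatever their type; the filters described next narrow the result down.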

extension may contain different possibilities. It will be matched at the end of archives' names. This may lead to an empty character as result.

name may contain different possibilities. It will be matched at the beginning of archives' names. This may lead to an empty character as result.

date may be "last", in which case anything that can be considered a date in archives' names is tested and only the archives carrying the maximum date are kept. If nothing matches, all archives' names are kept. date may also contain anything admissible for create_date. In that case, only archives whose names contain a date that pertains to create_date(date) are kept, possibly none.
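
The "last" behaviour can be sketched as follows. This is only an illustration: the helper keep_last is hypothetical, the archive names are made up, and the sketch assumes that dates appear as YYYY-MM-DD, which may differ from the patterns the function really uses.

```r
# Hypothetical sketch: keep the archive(s) carrying the most recent date.
# Assumption: dates are written YYYY-MM-DD inside archives' names.
keep_last = function(x) {
  pattern = "[0-9]{4}-[0-9]{2}-[0-9]{2}"
  has_date = grepl(pattern, x)
  # If nothing looks like a date, keep everything
  if (!any(has_date)) return(x)
  # regmatches() only returns the matches, so it aligns with x[has_date]
  found = regmatches(x, regexpr(pattern, x))
  # Dates in this format sort correctly as plain text
  x[has_date][found == max(found)]
}

archives = c("COG_FRA_2021-05-19.7z", "COG_FRA_2022-04-15.7z", "notes.7z")
keep_last(archives)
keep_last(c("a.zip", "b.zip"))
```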

version may contain different possibilities. It will be matched anywhere in archives' names. This may lead to an empty character as result.
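
The extension, name and version filters differ only in where the pattern is anchored. A minimal sketch, assuming each argument is collapsed into a single regular expression (the helper filter_archive and the file names are hypothetical, not the package's code):

```r
# Hypothetical sketch of the three pattern filters
filter_archive = function(x, extension = NULL, name = NULL, version = NULL) {
  # Join alternatives into one regular expression: "(a|b|c)"
  any_of = function(choice) paste0("(", paste(choice, collapse = "|"), ")")
  # extension: anchored at the end of the name
  if (!is.null(extension)) x = x[grepl(paste0(any_of(extension), "$"), x)]
  # name: anchored at the beginning of the name
  if (!is.null(name)) x = x[grepl(paste0("^", any_of(name)), x)]
  # version: matched anywhere in the name
  if (!is.null(version)) x = x[grepl(any_of(version), x)]
  x
}

found = c("geo_siret_34.csv.gz", "geo_siret_83.csv.gz",
          "geo_sirene_34.csv.gz", "readme.txt")
filter_archive(found, "gz", "geo_siret", version = c("34", "83"))
filter_archive(found, "zip")
```

The last call shows how a filter that matches nothing yields an empty character vector.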

If directory is set to TRUE, extension is not used. Instead, links ending with "/" are looked for.
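
A directory search can be pictured as below, assuming directory links end with a slash, as is usual in web directory listings. The links are made up and the one-line filter is an illustration, not the function's code.

```r
# Hypothetical sketch: keep links that look like directories (trailing "/")
links = c("2010/", "2011/", "README.txt", "RPG_2010.7z")
directories = links[grepl("/$", links)]
directories
```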

Examples


if (FALSE) {
# RPG archive for year 2010 in data.cquest.org
origin = "https://data.cquest.org/registre_parcellaire_graphique/2010"
file_list = get_archive(origin)
get_archive(origin)
get_archive(origin, version = "34")
get_archive(origin, version = 30:35)

# All RPG archives for any year for region "Occitanie" in data.cquest.org
origin = get_archive(
 "https://data.cquest.org/registre_parcellaire_graphique",
 directory = TRUE
)
get_archive(origin, version = "R76")

# "geo_siret" archives in data.cquest.org
origin = "https://data.cquest.org/geo_sirene/v2019/last/dep"
get_archive(origin, "gz", c("geo_siret_34", "geo_siret_83"))
get_archive(origin, "gz", c("geo_siret"), version = c("34", "83"))

# "ADMIN EXPRESS" archives in ign
origin = "https://geoservices.ign.fr/adminexpress"
get_archive(origin, "7z", "ADMIN-EXPRESS-COG", date = "last")
get_archive(origin, "7z", "ADMIN-EXPRESS-COG", version = "FRA", date = "last")
get_archive(origin, "7z", "ADMIN-EXPRESS", date = 2021:2022)
}