/*
 * Copyright 2021-2024 the original author or authors.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package com.metaeffekt.mirror.download.advisor;

import com.metaeffekt.artifact.analysis.utils.ArchiveUtils;
import com.metaeffekt.artifact.analysis.utils.FileUtils;
import com.metaeffekt.artifact.analysis.utils.StringUtils;
import com.metaeffekt.mirror.download.documentation.MirrorMetadata;
import com.metaeffekt.mirror.concurrency.ScheduledDelayedThreadPoolExecutor;
import com.metaeffekt.mirror.contents.advisory.CertFrAdvisorEntry;
import com.metaeffekt.mirror.download.Download;
import com.metaeffekt.mirror.download.ResourceLocation;
import com.metaeffekt.mirror.download.WebAccess;
import com.metaeffekt.mirror.download.documentation.DocRelevantMethods;
import lombok.Data;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.File;
import java.io.FileFilter;
import java.io.IOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Calendar;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

import static com.metaeffekt.mirror.download.advisor.CertFrDownload.ResourceLocationCertFr.CERT_FR_ARCHIVE;
import static com.metaeffekt.mirror.download.advisor.CertFrDownload.ResourceLocationCertFr.CERT_FR_RSS_FEED;

/**
 * <p>References:</p>
 * <ul>
 *     <li>Data provider: CERT-FR</li>
 *     <li>{metaeffekt} mirror: https://metaeffekt.com/mirror/cert-fr/txt/%d.tar</li>
 * </ul>
 * <p>
 * Up until the end of 2024-05, CERT-FR provided yearly archives as TAR files that contain all published security
 * advisory notes in both a PDF and a TXT variant.
 * All files are extracted into a directory named after the year.
 * All PDF files are discarded, as they are not machine-readable and therefore unusable for the indexing process.
 * The remaining TXT files are stored under the filename <code>CERTA-[YEAR]-[TYPE]-[ID].txt</code>.
 * The advisories start in 2000, so the TAR files for every year from 2000 to the current year are downloaded and
 * extracted.
 * </p>
 * <p>
 * Since then, CERT-FR has changed its approach without public notice. The transition took place over two weeks,
 * during which the files were not available at all for some time.
 * After that, the format changed completely: it is now based on an API, where each file has to be downloaded
 * individually.
 * This results in over 18,000 requests to obtain a full mirror.
 * This is not feasible on a client device, especially since there is no way to download only the updated notes:
 * the listing API does not provide "modified" information.
 * For this reason, {metaeffekt} provides a service at <code>https://metaeffekt.com/mirror/cert-fr/txt/%d.tar</code>,
 * where the old-style TAR files are rebuilt every night (CET) for you to download.
 * This is stored in the <code>CERT_FR_ARCHIVE</code> resource location.
 * As an alternative, JSON files are also built and uploaded to a separate directory,
 * <code>https://metaeffekt.com/mirror/cert-fr/json/%d.tar</code>, which can be used as the download URL as well.
 * </p>
 * <p>
 * If you want to build this pre-stage and host it yourself, you may use the <code>internal-data-mirror</code> goal
 * of the <code>ae-mirror-plugin</code> as shown here and upload the result via FTP (or other means) to your server.
 * Then change the <code>CERT_FR_ARCHIVE</code> value to your domain.
 * </p>
 * <pre>{@code
 * <plugin>
 *     <groupId>com.metaeffekt.artifact.analysis</groupId>
 *     <artifactId>ae-mirror-plugin</artifactId>
 *     <version>${ae.artifact.analysis.version}</version>
 *     <executions>
 *         <execution>
 *             <id>internal-data-mirror</id>
 *             <goals>
 *                 <goal>internal-data-mirror</goal>
 *             </goals>
 *             <configuration>
 *                 <mirrorDirectory>${database.path}</mirrorDirectory>
 *                 <certFrDownload></certFrDownload>
 *                 <moveTargetFilesTo>upload-to-server</moveTargetFilesTo>
 *             </configuration>
 *         </execution>
 *     </executions>
 * </plugin>
 * }</pre>
 * <p>
 * Another difficulty with the TXT files is that they come in a largely unstructured format, being simple
 * transcripts of the PDF files.
 * This makes parsing them later challenging. For more information, see the index process for the CERT-FR
 * advisories.
 * </p>
 * <pre>
 * .
 * ├── 2000
 * │   ├── CERTA-2000-ALE-001.txt
 * │   ├── CERTA-2000-ALE-002.txt
 * ...
 * └── 2022
 *     ├── CERTFR-2022-ACT-001.txt
 * ...
 * </pre>
 */
@MirrorMetadata(directoryName = "certfr", mavenPropertyName = "certFrDownload")
public class CertFrDownload extends Download {

    private final static Logger LOG = LoggerFactory.getLogger(CertFrDownload.class);

    private List<Integer> cachedAvailableArchiveYears = null;
    private List<Integer> cachedUnavailableArchiveYears = null;

    public CertFrDownload(File baseMirrorDirectory) {
        super(baseMirrorDirectory, CertFrDownload.class);
    }

    @Override
    @DocRelevantMethods({"CertFrDownload#downloadRequiredArchiveYears"})
    protected void performDownload() {
        this.downloadRequiredArchiveYears();

        super.propertyFiles.set(super.downloadIntoDirectory, "info",
                InfoFileAttributes.CERT_FR_PREFIX.getKey() + "last-feed-size", getCurrentFeedSize());
    }

    private void downloadRequiredArchiveYears() {
        final List<Integer> updateYears = determineWhatYearsRequireUpdate();
        LOG.info("Updating year archives {}", updateYears);

        if (updateYears.isEmpty()) {
            LOG.info("--> No archive years require update, skipping archive download");
            return;
        }

        for (Integer year : updateYears) {
            final File downloadIntoDirectory = super.downloadIntoDirectory;
            final WebAccess downloader = super.downloader;

            final ScheduledDelayedThreadPoolExecutor.ThrowingRunnable downloadThread = () -> {
                LOG.info("Updating year [{}]", year);

                final URL requestUrl = getRemoteResourceLocationUrl(CERT_FR_ARCHIVE, year);
                final File downloadToFile = new File(downloadIntoDirectory, year + ".tar");
                final File unpackedDirectory = new File(downloadIntoDirectory, year.toString());

                if (unpackedDirectory.exists()) {
                    FileUtils.deleteDir(unpackedDirectory);
                }

                downloader.fetchResponseBodyFromUrlToFile(requestUrl, downloadToFile);

                if (!downloadToFile.exists()) {
                    throw new RuntimeException("Download successful, but file does not exist: " + downloadToFile.getAbsolutePath() + " from: " + requestUrl);
                }

                // a very small response that looks like JSON/HTML and mentions 404 is an error page, not an archive
                if (downloadToFile.length() < 2000) {
                    final String content = FileUtils.readFileToString(downloadToFile, StandardCharsets.UTF_8);
                    if (content.contains("404") && (content.startsWith("{") || content.startsWith("[") || content.startsWith("<"))) {
                        LOG.warn("Downloaded file contains 404, skipping extraction: {}", downloadToFile.getAbsolutePath());
                        if (!downloadToFile.delete()) {
                            LOG.warn("Unable to delete downloaded archive file: {}", downloadToFile.getAbsolutePath());
                        }
                        return;
                    }
                }

                try {
                    ArchiveUtils.untar(downloadToFile, unpackedDirectory);
                } catch (Throwable e) {
                    throw new RuntimeException("Unable to untar from " + downloadToFile + " to " + unpackedDirectory, e);
                }

                // ArchiveUtils#untar only throws exceptions if the untar failed and the following deletion of temporary
                // files fails, which is why the existence of the result has to be checked separately
                if (!unpackedDirectory.exists()) {
                    if (!downloadToFile.delete()) {
                        LOG.warn("Unable to delete downloaded archive file: {}", downloadToFile.getAbsolutePath());
                    }
                    if (year == 2000) {
                        LOG.warn("Unable to untar from {} to {}, this is known for the year 2000, skipping year", downloadToFile, unpackedDirectory);
                        return;
                    } else {
                        throw new RuntimeException("Unable to untar from " + downloadToFile + " to " + unpackedDirectory);
                    }
                }

                if (!downloadToFile.delete()) {
                    LOG.warn("Unable to delete downloaded archive file {}", downloadToFile.getAbsolutePath());
                }

                cleanFilesOfTypes(unpackedDirectory, pathname -> pathname.getName().endsWith(".pdf") || pathname.isDirectory() || pathname.getName().contains("XXX"));

                File[] extractedFiles = unpackedDirectory.listFiles();
                if (extractedFiles != null) {
                    LOG.info("Extracted [{}] entries for year [{}]", extractedFiles.length, year);
                }
            };

            super.executor.submit(downloadThread);
        }

        super.executor.setDelay(3 * 1000);
        super.executor.setSize(4);
        super.executor.start();
        try {
            super.executor.join();
        } catch (InterruptedException e) {
            throw new RuntimeException("Failed to wait for download threads to finish.", e);
        }
    }

    private static void cleanFilesOfTypes(File directory, FileFilter fileFilter) {
        final File[] filesForDeletion = directory.listFiles(fileFilter);
        if (filesForDeletion != null) {
            for (File file : filesForDeletion) {
                if (!file.delete()) {
                    LOG.warn("Unable to delete file {} whilst deleting files in {}", file.getAbsolutePath(), directory.getAbsolutePath());
                }
            }
        }
    }

    private List<Integer> determineWhatYearsRequireUpdate() {
        final List<Integer> availableArchiveYears = this.findAvailableArchiveYears();
        final List<Integer> years = new ArrayList<>(availableArchiveYears);

        // remove the years already present in the download directory
        if (this.downloadIntoDirectory.exists()) {
            Arrays.stream(this.downloadIntoDirectory.listFiles(File::isDirectory))
                    .map(File::getName)
                    .filter(f -> f.matches("\\d+"))
                    .map(Integer::parseInt)
                    .forEach(years::remove);
        }

        // always refresh the two most recent available years
        years.add(availableArchiveYears.get(availableArchiveYears.size() - 1));
        years.add(availableArchiveYears.get(availableArchiveYears.size() - 2));

        // add years from RSS feed
        final URL requestUrl = this.getRemoteResourceLocationUrl(CERT_FR_RSS_FEED);
        final List<String> rssFeedLines = super.downloader.fetchResponseBodyFromUrlAsList(requestUrl);
        final Pattern certFrIdYearPattern = Pattern.compile(".*CERTFR-(\\d{4}).+");
        for (String rssFeedLine : rssFeedLines) {
            final Matcher matcher = certFrIdYearPattern.matcher(rssFeedLine);
            if (matcher.matches()) years.add(Integer.parseInt(matcher.group(1)));
        }

        return years.stream().distinct().sorted(Integer::compareTo).collect(Collectors.toList());
    }

    @Override
    public void performInternalDownload() {
        this.downloadJsonForAll("json");

        // convert all to legacy text based format to support old versions to parse data
        LOG.info("Converting CERT-FR JSON entries to legacy text format");
        for (File jsonFile : FileUtils.listFiles(new File(this.downloadIntoDirectory, "json"), new String[]{"json"}, true)) {
            final String textFileName = jsonFile.getName().replace(".json", ".txt");
            final File textFile = new File(new File(new File(this.downloadIntoDirectory, "txt"), jsonFile.getParentFile().getName()), textFileName);

            final CertFrAdvisorEntry entry;
            try {
                entry = CertFrAdvisorEntry.fromApiJson(jsonFile);
            } catch (Exception e) {
                LOG.warn("Unable to parse JSON file, skipping conversion to text: {}", jsonFile.getAbsolutePath(), e);
                continue;
            }

            final String textFileContent;
            try {
                textFileContent = entry.toCertFrArchiveLegacyTextFormat();
            } catch (Exception e) {
                LOG.warn("Unable to convert JSON file to text format, skipping conversion to text: {}", jsonFile.getAbsolutePath(), e);
                continue;
            }

            try {
                FileUtils.writeStringToFile(textFile, textFileContent, StandardCharsets.UTF_8);
            } catch (IOException e) {
                LOG.warn("Unable to write text file, skipping conversion to text: {}", textFile.getAbsolutePath(), e);
            }
        }

        LOG.info("Packing CERT-FR mirror directories into tar files");
        for (String fileType : Arrays.asList("txt", "json")) {
            for (File file : new File(this.downloadIntoDirectory, fileType).listFiles()) {
                if (file.isDirectory()) {
                    try {
                        ArchiveUtils.tarDirectory(file, new File(this.downloadIntoDirectory, "internal-mirror/" + fileType + "/" + file.getName() + ".tar"));
                    } catch (IOException e) {
                        throw new RuntimeException("Failed to pack directory " + file.getAbsolutePath() + " into tar file.", e);
                    }
                }
            }

            try {
                FileUtils.forceDelete(new File(this.downloadIntoDirectory, fileType));
            } catch (IOException e) {
                LOG.warn("Unable to delete CERT-FR mirror directory {}", new File(this.downloadIntoDirectory, fileType).getAbsolutePath());
            }
        }
    }

    public void downloadJsonForAll(String subdir) {
        // actualite does not support /json endpoint and can therefore not be downloaded
        final List<HtmlScrapingEntry> alerteEntries = scrapeHtmlPagesForEntryMetadata("alerte", 0);
        final List<HtmlScrapingEntry> avisEntries = scrapeHtmlPagesForEntryMetadata("avis", 0);
        // final List<HtmlScrapingEntry> actualiteEntries = scrapeHtmlPagesForEntryMetadata("actualite", 0);

        LOG.info("");
        LOG.info("Found the following amounts of entries for all time:");
        LOG.info(" - alerte: {}", alerteEntries.size());
        LOG.info(" - avis: {}", avisEntries.size());
        // LOG.info(" - actualite: {}", actualiteEntries.size());
        LOG.info("");

        downloadAllJsonEntries("alerte", alerteEntries, subdir, true);
        downloadAllJsonEntries("avis", avisEntries, subdir, true);
        // downloadAllJsonEntries("actualite", actualiteEntries, subdir, true);
    }

    private void downloadAllJsonEntries(String type, List<HtmlScrapingEntry> entries, String subdir, boolean forceUpdate) {
        LOG.info("Downloading JSON entries for [{}] with [{}] entries", type, entries.size());
        int i = 0;
        for (HtmlScrapingEntry entry : entries) {
            if (i % 500 == 0) {
                LOG.info("Downloading JSON entries for [{}] with [{}] entries, currently at [{}]", type, entries.size(), i);
            }
            i++;
            downloadJsonEntry(type, entry.getId(), subdir, forceUpdate);
        }
    }

    private void downloadJsonEntry(String type, String entryId, String subdir, boolean forceUpdate) {
        // extract the year; example: CERTFR-2021-ALE-020, CERTA-2021-AVI-001-1, CERTA-2021-AVI-001
        final String year = entryId.replaceAll(".*-(\\d{4,5})-.*", "$1");
        // replaceAll returns the input unchanged if the pattern does not match, so the result must be validated
        if (year.isEmpty() || !year.matches("\\d{4,5}")) {
            LOG.warn("Unable to extract year from entry ID [{}], skipping download", entryId);
            return;
        }

        final URL requestUrl = getRemoteResourceLocationUrl(ResourceLocationCertFr.CERT_FR_SINGLE_JSON_ENTRY, type, entryId);
        final File downloadToFile = new File(super.downloadIntoDirectory, (StringUtils.hasText(subdir) ? subdir + "/" : "") + year + "/" + entryId + ".json");

        // compare file size first
        final boolean requiresUpdate;
        if (forceUpdate) {
            requiresUpdate = true;
        } else {
            if (downloadToFile.exists()) {
                final long remoteFileSize = super.downloader.fetchFileSizeFromUrl(requestUrl);
                final long localFileSize = downloadToFile.length();
                requiresUpdate = remoteFileSize != localFileSize;
                if (requiresUpdate) {
                    LOG.info("Entry [{}] has changed [{}] -> [{}], updating", downloadToFile.getName().replace(".json", ""), localFileSize, remoteFileSize);
                }
            } else {
                requiresUpdate = true;
            }
        }

        if (requiresUpdate) {
            super.downloader.fetchResponseBodyFromUrlToFile(requestUrl, downloadToFile);
        }
    }

    private final static Pattern EXTRACT_HTML_LISTING_DATE_PATTERN = Pattern.compile(".*le (\\d{1,2}) ([a-zéû]+) (\\d{4}).*");

    public List<HtmlScrapingEntry> scrapeHtmlPagesForEntryMetadata(String type, int lowestIncludeYear) {
        LOG.info("Scraping HTML pages for type [{}] up to (including) year [{}]", type, lowestIncludeYear);

        final List<HtmlScrapingEntry> entries = new ArrayList<>();

        int maxPage = 1;
        for (int page = 1; page <= maxPage; page++) {
            final URL requestUrl = getRemoteResourceLocationUrl(ResourceLocationCertFr.CERT_FR_ENTRY_LISTING_HTML, type, page);
            final Document document = super.downloader.fetchResponseBodyFromUrlAsDocument(requestUrl);

            /*
            Text content of one "item-meta" listing element (item-ref, item-date, item-status):
            CERTFR-2024-AVI-0443 Publié le 27 mai 2024 Cloturée le 11 septembre 2023
            */
            final Elements itemMetaElements = document.getElementsByClass("item-meta");
            for (int i = 0; i < itemMetaElements.size(); i++) {
                final Elements itemRefElements = itemMetaElements.get(i).getElementsByClass("item-ref");
                final Elements itemDateElements = itemMetaElements.get(i).getElementsByClass("item-date");
                final Elements itemStatusElements = itemMetaElements.get(i).getElementsByClass("item-status");

                if (itemRefElements.size() != 1 || itemDateElements.size() != 1) {
                    LOG.warn("Skipping entry [{}] due to missing or duplicate ref or date elements", i);
                    continue;
                }

                final String id = itemRefElements.get(0).text().trim();
                final String date = itemDateElements.get(0).text().trim();

                final Matcher dateMatcher = EXTRACT_HTML_LISTING_DATE_PATTERN.matcher(date);
                if (!dateMatcher.matches()) {
                    LOG.warn("Skipping entry [{}] due to invalid date format [{}]", i, date);
                    continue;
                }

                final int publishDay = Integer.parseInt(dateMatcher.group(1));
                final int publishMonth = Arrays.asList("janvier", "février", "mars", "avril", "mai", "juin", "juillet", "août", "septembre", "octobre", "novembre", "décembre")
                        .indexOf(dateMatcher.group(2)) + 1;
                final int publishYear = Integer.parseInt(dateMatcher.group(3));
                final HtmlScrapingDate publishDate = new HtmlScrapingDate(publishYear, publishMonth, publishDay);

                final HtmlScrapingDate closedDate;
                if (itemStatusElements.size() == 1) {
                    final String status = itemStatusElements.get(0).text().trim();
                    if (status.equals("Alerte en cours")) {
                        closedDate = null;
                    } else {
                        final Matcher statusMatcher = EXTRACT_HTML_LISTING_DATE_PATTERN.matcher(status);
                        if (statusMatcher.matches()) {
                            final int closedDay = Integer.parseInt(statusMatcher.group(1));
                            final int closedMonth = Arrays.asList("janvier", "février", "mars", "avril", "mai", "juin", "juillet", "août", "septembre", "octobre", "novembre", "décembre")
                                    .indexOf(statusMatcher.group(2)) + 1;
                            final int closedYear = Integer.parseInt(statusMatcher.group(3));
                            closedDate = new HtmlScrapingDate(closedYear, closedMonth, closedDay);
                        } else {
                            LOG.warn("Skipping entry [{}] due to invalid status format [{}]", i, status);
                            continue;
                        }
                    }
                } else {
                    closedDate = null;
                }

                final int latestYear = Math.max(publishYear, closedDate != null ? closedDate.getYear() : 0);
                if (latestYear < lowestIncludeYear) {
                    LOG.info("Last include year [{} < {}] passed, stopping scraping for type [{}] on page [{}] with [{}] entries", latestYear, lowestIncludeYear, type, page, entries.size());
                    return entries;
                }

                entries.add(new HtmlScrapingEntry(id, publishDate, closedDate));
            }

            // the pagination list is used to discover how many listing pages exist
            final Elements pageNumbersElements = document.select("ul.page-numbers > li");
            if (!pageNumbersElements.isEmpty()) {
                int maxFoundPage = Integer.MIN_VALUE;
                for (Element pageNumbersElement : pageNumbersElements) {
                    for (Element textElements : pageNumbersElement.children()) {
                        final String text = textElements.text().trim();
                        if (text.matches("\\d+")) {
                            int pageNumber = Integer.parseInt(text);
                            if (pageNumber > maxFoundPage) {
                                maxFoundPage = pageNumber;
                            }
                        }
                    }
                }
                if (maxFoundPage > maxPage) {
                    LOG.info("Found new max page [{}]", maxFoundPage);
                    maxPage = maxFoundPage;
                }
            } else {
                LOG.warn("Unable to find page numbers element to extract max page number");
            }
        }

        LOG.info("Scraped [{}] entries for type [{}]", entries.size(), type);

        return entries;
    }

    public List<Integer> findAvailableArchiveYears() {
        if (this.cachedAvailableArchiveYears != null) {
            return this.cachedAvailableArchiveYears;
        }

        this.cachedAvailableArchiveYears = new ArrayList<>();

        final URL requestUrl = getRemoteResourceLocationUrl(ResourceLocationCertFr.CERT_FR_BASE_URL_FOR_ARCHIVE_LISTING);
        final Document document = super.downloader.fetchResponseBodyFromUrlAsDocument(requestUrl);

        final Elements archiveYearElements = document.select("ul.dropdown-menu > li > a");
        for (Element archiveYearElement : archiveYearElements) {
            final String href = archiveYearElement.attr("href");
            final String text = archiveYearElement.text();
            if (href.matches(".*/\\d{4,5}\\.tar") && text.matches("\\d{4,5}")) {
                this.cachedAvailableArchiveYears.add(Integer.parseInt(text));
            }
        }

        if (this.cachedAvailableArchiveYears.isEmpty()) {
            LOG.warn("Unable to find any archive years on the CERT-FR website, website might have removed element. Assuming 2000 to (current year - 1) as available years.");
            final int currentYear = getCurrentYear();
            for (int year = 2000; year < currentYear; year++) {
                this.cachedAvailableArchiveYears.add(year);
            }
        }

        LOG.info("Found available CERT-FR archive years: {}", this.cachedAvailableArchiveYears);

        return this.cachedAvailableArchiveYears;
    }

    public List<Integer> findUnavailableArchiveYears() {
        if (this.cachedUnavailableArchiveYears != null) {
            return this.cachedUnavailableArchiveYears;
        }

        final List<Integer> availableArchiveYears = this.findAvailableArchiveYears();
        if (availableArchiveYears.isEmpty()) {
            this.cachedUnavailableArchiveYears = new ArrayList<>();
            LOG.warn("No available archive years found, unable to determine unavailable years");
            return this.cachedUnavailableArchiveYears;
        }

        // 2000 - current year, find all years that are not available
        final int currentYear = getCurrentYear();
        this.cachedUnavailableArchiveYears = new ArrayList<>();
        for (int year = 2000; year <= currentYear; year++) {
            if (!availableArchiveYears.contains(year)) {
                this.cachedUnavailableArchiveYears.add(year);
            }
        }

        LOG.info("Found unavailable CERT-FR archive years: {}", this.cachedUnavailableArchiveYears);

        return this.cachedUnavailableArchiveYears;
    }

    private int getCurrentYear() {
        return Calendar.getInstance().get(Calendar.YEAR);
    }

    @Override
    protected boolean additionalIsDownloadRequired() {
        final long previousFeedSize = super.propertyFiles.getLong(super.downloadIntoDirectory, "info", InfoFileAttributes.CERT_FR_PREFIX.getKey() + "last-feed-size")
                .orElse(0L);
        if (previousFeedSize == 0) {
            return true;
        }
        return previousFeedSize != getCurrentFeedSize();
    }

    private long getCurrentFeedSize() {
        final URL requestUrl = getRemoteResourceLocationUrl(CERT_FR_RSS_FEED);
        return super.downloader.fetchFileSizeFromUrl(requestUrl);
    }

    @Override
    public void setRemoteResourceLocation(String location, String url) {
        super.setRemoteResourceLocation(ResourceLocationCertFr.valueOf(location), url);
    }

    public enum ResourceLocationCertFr implements ResourceLocation {
        /**
         * Yearly archive download URL. Parameters:
         * <ol>
         *     <li><code>%d</code> Archive year (example: <code>2020</code>)</li>
         * </ol>
         */
        // CERT_FR_ARCHIVE("https://www.cert.ssi.gouv.fr/uploads/%d.tar"),
        CERT_FR_ARCHIVE("https://metaeffekt.com/mirror/cert-fr/json/%d.tar"),
        /**
         * RSS feed that contains the latest changes to the mirror.<br>
         * Used to check if update is required.
         */
        CERT_FR_RSS_FEED("https://www.cert.ssi.gouv.fr/feed/"),
        /**
         * Paged listing of CERT-FR entries. Parameters:
         * <ol>
         *     <li><code>%s</code> Type of entry (example: <code>alerte</code>)</li>
         *     <li><code>%d</code> Page number</li>
         * </ol>
         */
        CERT_FR_ENTRY_LISTING_HTML("https://www.cert.ssi.gouv.fr/%s/page/%d"),
        /**
         * Base URL for the CERT-FR. Used to extract the archive year list.
         */
        CERT_FR_BASE_URL_FOR_ARCHIVE_LISTING("https://www.cert.ssi.gouv.fr"),
        /**
         * Single JSON entry for a CERT-FR entry. Parameters:
         * <ol>
         *     <li><code>%s</code> Type of entry (example: <code>alerte</code>)</li>
         *     <li><code>%s</code> Entry ID (example: <code>CERTFR-2021-ALE-020</code>)</li>
         * </ol>
         */
        CERT_FR_SINGLE_JSON_ENTRY("https://www.cert.ssi.gouv.fr/%s/%s/json");

        private final String defaultValue;

        ResourceLocationCertFr(String defaultValue) {
            this.defaultValue = defaultValue;
        }

        @Override
        public String getDefault() {
            return this.defaultValue;
        }
    }

    @Data
    public static class HtmlScrapingEntry {
        private final String id;
        private final HtmlScrapingDate publishDate;
        private final HtmlScrapingDate closedDate;
    }

    @Data
    public static class HtmlScrapingDate {
        private final int year;
        private final int month;
        private final int day;
    }
}
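
The year extraction that `downloadJsonEntry` applies to entry IDs can be tried in isolation. The following is a minimal sketch and not part of the original class: the class name `EntryIdYearSketch` and the helper `extractYear` are hypothetical, while the regular expression and the sample IDs are taken from the source. It illustrates that `String#replaceAll` returns the input unchanged when the pattern does not match, so the result is validated with `matches` rather than checked for emptiness.

```java
import java.util.Optional;

public class EntryIdYearSketch {

    // Hypothetical helper: returns the 4-5 digit year segment of an entry ID, if one is present.
    public static Optional<String> extractYear(String entryId) {
        // On a successful match the whole ID is replaced by the captured year;
        // on a failed match replaceAll returns the ID unchanged.
        final String year = entryId.replaceAll(".*-(\\d{4,5})-.*", "$1");
        return year.matches("\\d{4,5}") ? Optional.of(year) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extractYear("CERTFR-2021-ALE-020"));  // Optional[2021]
        System.out.println(extractYear("CERTA-2021-AVI-001-1")); // Optional[2021]
        System.out.println(extractYear("not-an-entry-id"));      // Optional.empty
    }
}
```

Returning an `Optional` is one possible design choice here; it makes the "no year found" case explicit for the caller, which can then log a warning and skip the download.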



