package g0601_0700.s0609_find_duplicate_file_in_system;

// #Medium #Array #String #Hash_Table #2022_03_21_Time_20_ms_(97.68%)_Space_51.3_MB_(87.10%)

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * 609 - Find Duplicate File in System.
 *
 * Medium
 *
 * Given a list `paths` of directory info, including the directory path, and all the files with contents in this directory, return _all the duplicate files in the file system in terms of their paths_. You may return the answer in **any order**.
 *
 * A group of duplicate files consists of at least two files that have the same content.
 *
 * A single directory info string in the input list has the following format:
 *
 * *   `"root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"`
 *
 * It means there are `n` files `(f1.txt, f2.txt ... fn.txt)` with content `(f1_content, f2_content ... fn_content)` respectively in the directory `"root/d1/d2/.../dm"`. Note that `n >= 1` and `m >= 0`. If `m = 0`, it means the directory is just the root directory.
 *
 * The output is a list of groups of duplicate file paths. For each group, it contains all the file paths of the files that have the same content. A file path is a string that has the following format:
 *
 * *   `"directory_path/file_name.txt"`
 *
 * **Example 1:**
 *
 * **Input:** paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)","root 4.txt(efgh)"]
 *
 * **Output:** [["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]
 *
 * **Example 2:**
 *
 * **Input:** paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)"]
 *
 * **Output:** [["root/a/2.txt","root/c/d/4.txt"],["root/a/1.txt","root/c/3.txt"]]
 *
 * **Constraints:**
 *
 *   `1 <= paths.length <= 2 * 10^4`
 * *   `1 <= paths[i].length <= 3000`
 *   `1 <= sum(paths[i].length) <= 5 * 10^5`
 *   `paths[i]` consists of English letters, digits, `'/'`, `'.'`, `'('`, `')'`, and `' '`.
 * *   You may assume no files or directories share the same name in the same directory.
 * *   You may assume each given directory info represents a unique directory. A single blank space separates the directory path and file info.
 *
 * **Follow up:**
 *
 * *   Imagine you are given a real file system, how will you search files? DFS or BFS?
 * *   If the file content is very large (GB level), how will you modify your solution?
 * *   If you can only read the file by 1kb each time, how will you modify your solution?
 * *   What is the time complexity of your modified solution? What is the most time-consuming part and memory-consuming part of it? How to optimize?
 * *   How to make sure the duplicated files you find are not false positive?
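 *
 * (A chunked-hashing sketch addressing these follow-ups appears after the class below.)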
 */
public class Solution {
    public List<List<String>> findDuplicate(String[] paths) {
        // Maps file content to the full paths of every file with that content.
        Map<String, List<String>> map = new HashMap<>();
        for (String path : paths) {
            String[] pathComponents = path.split(" ");
            String root = pathComponents[0];
            for (int i = 1; i < pathComponents.length; i++) {
                // Each entry after the root looks like "name.txt(content)".
                int startIndex = pathComponents[i].indexOf("(");
                int endIndex = pathComponents[i].lastIndexOf(")");
                // The key keeps the leading '(' and drops ')'; since every key is
                // built the same way, equal contents still map to the same key.
                String content = pathComponents[i].substring(startIndex, endIndex);

                map.putIfAbsent(content, new ArrayList<>());
                map.get(content).add(root + "/" + pathComponents[i].substring(0, startIndex));
            }
        }

        // Only contents shared by at least two files form a duplicate group.
        List<List<String>> result = new ArrayList<>();
        for (List<String> list : map.values()) {
            if (list.size() > 1) {
                result.add(list);
            }
        }
        return result;
    }
}
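
/*
 * Usage sketch (not part of the original file): running the solution on
 * Example 1 from the Javadoc above. Group order may vary because it follows
 * HashMap iteration order.
 */
class SolutionExample {
    public static void main(String[] args) {
        String[] paths = {
            "root/a 1.txt(abcd) 2.txt(efgh)",
            "root/c 3.txt(abcd)",
            "root/c/d 4.txt(efgh)",
            "root 4.txt(efgh)"
        };
        // Expected (in some order):
        // [[root/a/2.txt, root/c/d/4.txt, root/4.txt], [root/a/1.txt, root/c/3.txt]]
        System.out.println(new Solution().findDuplicate(paths));
    }
}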
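
/*
 * Follow-up sketch (an assumption-laden illustration, not part of the original
 * solution): if file contents are GB-sized and can only be read ~1 KB at a
 * time, the in-memory content key above can be replaced by an incremental
 * digest. The class and method names here are hypothetical; the digest API is
 * the standard java.security.MessageDigest. Files whose digests match should
 * still be compared byte-by-byte to rule out false positives from collisions.
 */
class LargeFileFingerprint {
    /** Hashes a file in 1 KB chunks and returns the hex SHA-256 of its content. */
    static String digest(java.nio.file.Path file) throws java.io.IOException {
        try {
            java.security.MessageDigest md = java.security.MessageDigest.getInstance("SHA-256");
            try (java.io.InputStream in = java.nio.file.Files.newInputStream(file)) {
                byte[] buffer = new byte[1024]; // read 1 KB per iteration
                int read;
                while ((read = in.read(buffer)) != -1) {
                    md.update(buffer, 0, read); // feed the digest one chunk at a time
                }
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is available on every JVM
        }
    }
}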