1242. Web Crawler Multithreaded

You are given a start URL startUrl and an HTML parser interface HtmlParser. Implement a multi-threaded web crawler that crawls all links sharing the same hostname as startUrl.

Return all the URLs the crawler has visited, in any order.

The crawler should:

  1. Start from the page startUrl.
  2. Call HtmlParser.getUrls(url) to get all the URLs on a page.
  3. Never crawl the same link twice.
  4. Explore only links that share the same hostname as startUrl.

The hostname is the part of a URL between http:// and the first / that follows it; for example, the hostname of http://example.org/foo is example.org. For simplicity, you may assume all URLs use the http protocol with no port specified. The links http://leetcode.com/problems and http://leetcode.com/contest belong to the same hostname, while http://example.org/test and http://example.com/abc do not.
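Since every URL uses http:// with no port, extracting the hostname is just a matter of slicing between the scheme and the first slash. A minimal Java helper in this spirit (the name hostOf is illustrative, not part of the problem):

// Hostname = text between "http://" and the first '/'; the problem guarantees
// the http scheme and no port, so no further parsing is needed.
static String hostOf(String url) {
    String rest = url.substring("http://".length());
    int slash = rest.indexOf('/');
    return slash == -1 ? rest : rest.substring(0, slash);
}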

The HtmlParser interface is defined as follows:

interface HtmlParser {
  // Returns a list of all URLs found on the webpage at the given url.
  // This is a blocking call: it performs an HTTP request and returns when that request finishes.
  public List<String> getUrls(String url);
}

Note that getUrls(String url) simulates performing an HTTP request. You can treat it as a blocking call that returns only once the request finishes. getUrls(String url) is guaranteed to return all URLs within 15 ms. A single-threaded solution will exceed the time limit; can you do better with multiple threads?
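One common shape for a multi-threaded solution, sketched below under the problem's assumptions (it relies on the HtmlParser interface declared above; the class and helper names are illustrative): submit each blocking getUrls call as a task on a thread pool, deduplicate with a concurrent set, and count in-flight tasks so the caller knows when the crawl has finished.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

class CrawlerSketch {
    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        String host = hostOf(startUrl);
        Set<String> seen = ConcurrentHashMap.newKeySet(); // thread-safe visited set
        ExecutorService pool = Executors.newFixedThreadPool(8);
        AtomicInteger pending = new AtomicInteger(1);     // tasks scheduled but not yet finished
        CountDownLatch done = new CountDownLatch(1);
        seen.add(startUrl);
        pool.execute(() -> visit(startUrl, host, htmlParser, seen, pool, pending, done));
        try { done.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        pool.shutdown();
        return new ArrayList<>(seen);
    }

    private void visit(String url, String host, HtmlParser parser, Set<String> seen,
                       ExecutorService pool, AtomicInteger pending, CountDownLatch done) {
        for (String next : parser.getUrls(url)) {         // blocking call, runs on a pool thread
            // Set.add is atomic here, so each URL is scheduled at most once.
            if (hostOf(next).equals(host) && seen.add(next)) {
                pending.incrementAndGet();                // count the task before scheduling it
                pool.execute(() -> visit(next, host, parser, seen, pool, pending, done));
            }
        }
        if (pending.decrementAndGet() == 0) done.countDown(); // last in-flight task just finished
    }

    private static String hostOf(String url) {
        String rest = url.substring("http://".length());
        int slash = rest.indexOf('/');
        return slash == -1 ? rest : rest.substring(0, slash);
    }
}

Because pending is incremented before a task is submitted and decremented only when it completes, it reaches zero exactly once, when no task is queued or running; the solutions below solve the same termination problem in different ways.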

Two examples of the required behavior are provided below. For custom testing, you may declare three variables: urls, edges, and startUrl. Note, however, that your code can only access startUrl; urls and edges are not directly accessible.

 

Follow-up questions:

  1. Assume we need to crawl 10,000 nodes and 1 billion URLs, with the same crawler software deployed on every node and each node able to discover all the others. We must minimize communication between machines and keep the load balanced across nodes. How would you design this web crawler? (One common approach is sketched after this list.)
  2. What if one node fails or stops working?
  3. How do you know when the crawl has finished?
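For follow-up 1, a standard design is to shard URLs by hostname with consistent hashing: every node can compute, without coordination, which machine owns a given host, and when a node joins or fails (follow-up 2) only the hosts on its arc of the ring are reassigned. A sketch under these assumptions (the node list and replica count are hypothetical):

import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Each hostname hashes to a point on a ring; its owner is the first node
// entry at or after that point, wrapping around the ring.
class ConsistentHashRing {
    private final SortedMap<Integer, String> ring = new TreeMap<>();

    ConsistentHashRing(List<String> nodes, int virtualReplicas) {
        for (String node : nodes)
            for (int i = 0; i < virtualReplicas; i++)
                ring.put((node + "#" + i).hashCode(), node); // virtual nodes even out the load
    }

    String ownerOf(String hostname) {
        SortedMap<Integer, String> tail = ring.tailMap(hostname.hashCode());
        int key = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(key);
    }
}

Each node then crawls only the hostnames it owns and forwards any cross-host link to its owner, so inter-machine traffic is at most one message per discovered link. For follow-up 3, the per-process termination techniques used by the solutions below (counting outstanding requests) generalize to a distributed check that no node still holds pending work.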

 

Example 1:

Input:
urls = [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.google.com",
  "http://news.yahoo.com/us"
]
edges = [[2,0],[2,1],[3,2],[3,1],[0,4]]
startUrl = "http://news.yahoo.com/news/topics/"
Output: [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.yahoo.com/us"
]

Example 2:

Input:
urls = [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.google.com"
]
edges = [[0,2],[2,1],[3,2],[3,1],[3,0]]
startUrl = "http://news.google.com"
Output: ["http://news.google.com"]
Explanation: startUrl does not share a hostname with any other page, so only startUrl itself is returned.

 


Java solution, runtime: 11 ms, memory: 48.1 MB, submitted: 2023-10-15 13:15:21

/**
 * // This is the HtmlParser's API interface.
 * // You should not implement it, or speculate about its implementation
 * interface HtmlParser {
 *     public List<String> getUrls(String url) {}
 * }
 */
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.*;

class Solution {
    // URLs waiting to be crawled.
    BlockingQueue<String> urlQueue = new LinkedBlockingDeque<>();
    // Worker reports: a crawled URL, or "--" when a worker exits.
    BlockingQueue<String> statusQueue = new LinkedBlockingDeque<>();
    // URLs already crawled (the value is unused; the map serves as a concurrent set).
    Map<String, Integer> urlStatus = new ConcurrentHashMap<>();
    // Hostname of startUrl; only links on this host are crawled.
    volatile String mainHost;

    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        int n = Runtime.getRuntime().availableProcessors() * 2 + 2;
        urlQueue.add(startUrl);
        mainHost = getHost(startUrl);
        // Start n workers; each offers "--" on statusQueue exactly once when it exits.
        for (int i = 0; i < n; ++i) {
            new Thread(new Crawler(htmlParser)).start();
        }
        String msg = null;
        // Collect crawled URLs until all n workers have reported "--".
        while (true) {
            try {
                msg = statusQueue.take();
            } catch (InterruptedException e) {
                e.printStackTrace();
                break;
            }
            if (msg != null) {
                if (msg.equals("--")) {
                    n--;
                    if (n == 0) {
                        break;
                    }
                } else {
                    urlStatus.put(msg, 1);
                }
            }
        }
        return new ArrayList<String>(urlStatus.keySet());
    }

    private String getHost(String startUrl) {
        if(startUrl.startsWith("http://")){
            return startUrl.replace("http://","").split("/")[0];
        }
        throw new IllegalArgumentException("url illegal");
    }

    private class Crawler implements Runnable {

        private final HtmlParser htmlParser;

        public Crawler(HtmlParser htmlParser) {
            this.htmlParser = htmlParser;
        }

        @Override
        public void run() {
            String url;
            int emptyTimes = 0;
            try {
                while (true) {
                    // Poll with a 17 ms timeout, just above the 15 ms getUrls
                    // guarantee, so an empty poll strongly suggests no work remains.
                    url = urlQueue.poll(17, TimeUnit.MILLISECONDS);
                    if (url == null) {
                        emptyTimes++;
                        if (emptyTimes > 2) {
                            statusQueue.offer("--"); // tell the main thread this worker exited
                            break;
                        }
                    } else {
                        if (!urlStatus.containsKey(url)) {
                            List<String> urls = htmlParser.getUrls(url);
                            statusQueue.offer(url); // report this URL as crawled
                            boolean hasNew = false;
                            if (urls != null && !urls.isEmpty()) {
                                for (String u : urls) {
                                    // Enqueue only unseen links on the same host.
                                    if (!urlStatus.containsKey(u) && getHost(u).equals(mainHost)) {
                                        hasNew = true;
                                        urlQueue.offer(u);
                                    }
                                }
                            }
                            // Nothing new found and queue drained: this worker can exit.
                            if (!hasNew && urlQueue.size() == 0) {
                                statusQueue.offer("--");
                                break;
                            }
                        }
                        if (emptyTimes > 0) emptyTimes--;
                    }
                }
            } catch (InterruptedException ex) {
                statusQueue.offer("--");
                ex.printStackTrace();
            }
        }
    }
}
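A note on this solution's design: termination is heuristic. Each worker polls the shared queue with a 17 ms timeout, just above the 15 ms bound guaranteed for getUrls, and exits after three consecutive empty polls (or immediately, once it finds no new links and the queue is empty); the main thread simply waits for all n "--" exit reports.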

Java solution, runtime: 37 ms, memory: 49.6 MB, submitted: 2023-10-15 13:15:03

/**
 * // This is the HtmlParser's API interface.
 * // You should not implement it, or speculate about its implementation
 * interface HtmlParser {
 *     public List<String> getUrls(String url) {}
 * }
 */
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;
import java.util.concurrent.locks.ReentrantLock;

class Solution {
    // Every URL seen so far; the map doubles as a concurrent set.
    private ConcurrentHashMap<String, Boolean> totalUrls = new ConcurrentHashMap<>();

    // Result URL list and its lock.
    private ReentrantLock resultLock = new ReentrantLock();
    private LinkedList<String> resultUrls = new LinkedList<>();

    // List of URLs waiting to be crawled, and its lock.
    private ReentrantLock crawlLock = new ReentrantLock();
    private LinkedList<String> urlsToCrawl = new LinkedList<>();

    // Number of worker threads (chores) currently running.
    private AtomicInteger choreCount = new AtomicInteger(0);
    
    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        String hostName = extractHostName(startUrl);

        this.totalUrls.put(startUrl, true);

        addUrlToResult(startUrl); 
        addUrlToCrawl(startUrl);

        while (true) {
            String urlToCrawl = fetchUrlToCrawl();
            if (urlToCrawl != null) {
                // One short-lived worker thread per URL; count it before it starts.
                incrChore();
                Chore chore = new Chore(this, hostName, htmlParser, urlToCrawl);
                (new Thread(chore)).start();
            } else {
                // Queue is empty: the crawl is done only once no chore is still running.
                if (this.choreCount.get() == 0) {
                    break;
                }
                LockSupport.parkNanos(1L); // brief pause before polling again
            }
        }

        return fetchResultUrls();
    }

    private String extractHostName(String url) {
        // HTTP protocol only.
        String processedUrl = url.substring(7);
        
        int index = processedUrl.indexOf("/");
        if (index == -1) {
            return processedUrl;
        } else {
            return processedUrl.substring(0, index);
        }
    }

    private class Chore implements Runnable {
        private Solution solution;
        private String hostName;
        private HtmlParser htmlParser; 
        private String urlToCrawl;

        public Chore(Solution solution, String hostName, HtmlParser htmlParser, String urlToCrawl) {
            this.solution = solution;
            this.hostName = hostName;
            this.htmlParser = htmlParser;
            this.urlToCrawl = urlToCrawl;
        }

        @Override
        public void run() {
            try {
                filterUrls(this.htmlParser.getUrls(urlToCrawl));
            } finally {
                this.solution.decrChore(); 
            }
        }

        private void filterUrls(List<String> crawledUrls) {
            if (crawledUrls == null || crawledUrls.isEmpty()) {
                return;
            }

            for (String url : crawledUrls) {
                // putIfAbsent is atomic, unlike a separate containsKey check
                // followed by put, so two chores can never both claim the same
                // URL (which would duplicate it in the result list).
                if (this.solution.totalUrls.putIfAbsent(url, true) != null) {
                    continue;
                }
            
                String crawlHostName = this.solution.extractHostName(url);
                if (!crawlHostName.equals(this.hostName)) {
                    // Different hostname from the start URL: discard this URL.
                    continue;
                }

                // Record the URL in the result list.
                this.solution.addUrlToResult(url);
                // Queue the URL so the next hop gets crawled.
                this.solution.addUrlToCrawl(url);
            }
        }
    }

    private void addUrlToResult(String url) {
        this.resultLock.lock();
        try {
            this.resultUrls.add(url);
        } finally {
            this.resultLock.unlock();
        }
    }

    private List<String> fetchResultUrls() {
        this.resultLock.lock();
        try {
            return this.resultUrls;
        } finally {
            this.resultLock.unlock();
        }
    }

    private void addUrlToCrawl(String url) {
        this.crawlLock.lock();
        try {
            this.urlsToCrawl.add(url);
        } finally {
            this.crawlLock.unlock();
        }
    }
    
    private String fetchUrlToCrawl() {
        this.crawlLock.lock();
        try {
            return this.urlsToCrawl.poll();
        } finally {
            this.crawlLock.unlock();
        }
    }

    private void incrChore() {
        this.choreCount.incrementAndGet();
    }

    private void decrChore() {
        this.choreCount.decrementAndGet();
    }
}
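A note on this solution's design: rather than a fixed worker pool, it spawns one short-lived thread per URL and tracks them with choreCount; the main loop busy-waits via LockSupport.parkNanos(1L) until the crawl queue is empty and no chore remains. This keeps termination trivial at the cost of thread-creation overhead on every page.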

C++ solution, runtime: 128 ms, memory: 39.7 MB, submitted: 2023-10-15 13:13:35

/**
 * // This is the HtmlParser's API interface.
 * // You should not implement it, or speculate about its implementation
 * class HtmlParser {
 *   public:
 *     vector<string> getUrls(string url);
 * };
 */
class Solution {
private:
    std::mutex mu;
    // cv_req wakes workers when new requests arrive; cv_res wakes the
    // coordinator when a result lands or a request completes.
    std::condition_variable cv_req, cv_res;
    int requestCounter = 0; // requests enqueued but not yet fully processed
    int exited_flag = 0;    // set by the coordinator to shut the workers down
public:
    // Worker: take a URL from requestQueue, fetch its links, and push the
    // same-host ones onto resultQueue for the coordinator to deduplicate.
    void threadFunc(HtmlParser& htmlParser, string& host_name, queue<string>& requestQueue, queue<string>& resultQueue) {
        while (true) {
            unique_lock<mutex> lk(mu);
            while (requestQueue.empty() && exited_flag == 0)
                cv_req.wait(lk);
            if (exited_flag) return;
            string url = requestQueue.front();
            requestQueue.pop();
            lk.unlock();
            // Blocking HTTP call, performed outside the lock.
            vector<string> urls = htmlParser.getUrls(url);
            for (auto& gurl : urls) {
                // Hostname is the text between "http://" and the next '/'.
                int pos = gurl.find('/', 7);
                string host = gurl.substr(7, pos - 7);
                if (host_name == host) {
                    lk.lock();
                    resultQueue.push(gurl);
                    lk.unlock();
                }
            }
            lk.lock();
            this->requestCounter--;  // this request is now fully processed
            lk.unlock();
            cv_res.notify_one();
        }
    }
    vector<string> crawl(string startUrl, HtmlParser htmlParser) {
        queue<string> requestQueue;   // URLs waiting to be fetched by workers
        requestQueue.push(startUrl);
        queue<string> resultQueue;    // same-host URLs found by workers
        set<string> urlset;           // URLs already seen (deduplication)
        urlset.insert(startUrl);
        auto pos = startUrl.find('/', 7);
        string hostname = startUrl.substr(7, pos - 7);
        this->requestCounter = 1;
        // Three detached workers; they exit once exited_flag is set.
        for (int i = 1; i <= 3; i++) {
            std::thread th(&Solution::threadFunc, this, std::ref(htmlParser), std::ref(hostname), std::ref(requestQueue), std::ref(resultQueue));
            th.detach();
        }
        while (true) {
            unique_lock<mutex> lk(mu);
            // Sleep until a result arrives or all outstanding requests finish.
            while (resultQueue.empty() && this->requestCounter != 0) {
                cv_res.wait(lk);
            }
            // Deduplicate results; unseen URLs become new requests.
            while (!resultQueue.empty()) {
                string url = resultQueue.front();
                resultQueue.pop();
                if (urlset.find(url) != urlset.end()) continue;
                requestQueue.push(url);
                urlset.insert(url);
                requestCounter++;
            }
            cv_req.notify_all();
            if (requestCounter == 0) {
                // Workers woken above will observe the flag once the lock is released.
                exited_flag = 1;
                break;
            }
        }
        vector<string> res;
        for(auto resurl:urlset){
            res.push_back(resurl);
        }
        return res;
    }
};
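A note on this solution's design: one mutex guards both queues, with the wake-ups split across two condition variables (cv_req for workers, cv_res for the coordinator). Only the coordinator touches urlset, so deduplication needs no extra locking, and because cv_req.notify_all() is issued while the lock is still held, workers cannot re-check the queue before exited_flag is set on the final iteration.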

Python3 solution, runtime: 216 ms, memory: 24.3 MB, submitted: 2023-10-15 13:11:46

# """
# This is HtmlParser's API interface.
# You should not implement it, or speculate about its implementation
# """
#class HtmlParser(object):
#    def getUrls(self, url):
#        """
#        :type url: str
#        :rtype List[str]
#        """

class Solution:
    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        from collections import deque
        from threading import Lock, Thread
        def get_hostname(url: str) -> str:
            # Text between "http://" and the first following '/'.
            return url.split('//', 1)[1].split('/', 1)[0]

        def fetch(url: str) -> None:
            # Crawl one page, then spawn a thread for every new same-host link.
            for next_url in htmlParser.getUrls(url):
                if get_hostname(next_url) == hostname:
                    with lock:
                        if next_url in visited:
                            continue
                        visited.add(next_url)
                    thread = Thread(target=fetch, args=(next_url,))
                    thread.start()
                    queue.append(thread)

        hostname = get_hostname(startUrl)
        lock = Lock()  # guards the visited set
        visited = {startUrl}
        main_thread = Thread(target=fetch, args=(startUrl,))
        main_thread.start()
        queue = deque([main_thread])
        # A thread is popped only after it finishes, so all of its children are
        # already in the queue by the time it is removed.
        while queue:
            queue.popleft().join()
        return list(visited)
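A note on this solution's design: it spawns one thread per page and joins them in FIFO order; since a thread is only popped after it has finished, its child threads are guaranteed to be in the queue already. CPython's GIL is not a bottleneck here because getUrls models a blocking HTTP request, and threads blocked on I/O allow others to run.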
