Nutch Development (Part 4)

Published: 2024/9/19

Contents

    • Nutch Development (Part 4)
      • Development environment
      • 1. Introduction to the Nutch plugin design
      • 2. Plugin directory structure
      • 3. build.xml
      • 4. ivy.xml
      • 5. plugin.xml
      • 6. The parse-html plugin
        • HtmlParser
          • setConf(Configuration conf)
          • parse(InputSource input)
          • getParse(Content content)
      • 7. The parse-metatags plugin
        • MetaTagsParser
          • The filter method
          • The addIndexedMetatags method
          • Configuring the metadata plugin

Development environment

  • Linux, Ubuntu 20.04 LTS
  • IDEA
  • Nutch 1.18
  • Solr 8.11

Please credit the source when reposting. By 鸭梨的药丸哥

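1. Introduction to the Nutch plugin design

Nutch is highly extensible. Its plugin system is based on the Eclipse 2.x plugin system.

Nutch exposes several extension points. Each extension point is an interface, and a plugin is written by implementing that interface. The extension points are: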

  • IndexWriter – Writes crawled data to a specific indexing backend (Solr, ElasticSearch, a CSV file, etc.).
  • IndexingFilter – Permits one to add metadata to the indexed fields. All plugins found which implement this extension point are run sequentially on the parse (from javadoc).
  • Parser – Parser implementations read through fetched documents in order to extract data to be indexed. This is what you need to implement if you want Nutch to be able to parse a new type of content, or extract more data from currently parseable content.
  • HtmlParseFilter – Permits one to add additional metadata to HTML parses (from javadoc).
  • Protocol – Protocol implementations allow Nutch to use different protocols (ftp, http, etc.) to fetch documents.
  • URLFilter – URLFilter implementations limit the URLs that Nutch attempts to fetch. The RegexURLFilter distributed with Nutch provides a great deal of control over what URLs Nutch crawls, however if you have very complicated rules about what URLs you want to crawl, you can write your own implementation.
  • URLNormalizer – Interface used to convert URLs to normal form and optionally perform substitutions.
  • ScoringFilter – A contract defining behavior of scoring plugins. A scoring filter will manipulate scoring variables in CrawlDatum and in resulting search indexes. Filters can be chained in a specific order, to provide multi-stage scoring adjustments.
  • SegmentMergeFilter – Interface used to filter segments during segment merge. It allows filtering on more sophisticated criteria than just URLs. In particular it allows filtering based on metadata collected while parsing page.
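
To make the "extension point = interface" idea concrete, here is a minimal sketch that uses only the JDK. `UrlFilterSketch` and `ExtensionPointSketch` are hypothetical stand-ins, not Nutch classes. The convention of returning the URL to accept it and `null` to reject it is assumed here to match how Nutch's URLFilter is usually described, so check the javadoc before relying on it.

```java
import java.util.List;

/** Hypothetical stand-in for an extension point: return the URL to keep it, null to drop it. */
interface UrlFilterSketch {
    String filter(String url);
}

final class ExtensionPointSketch {

    /** Runs every registered implementation in order and stops once one rejects the URL. */
    static String applyAll(List<UrlFilterSketch> filters, String url) {
        String current = url;
        for (UrlFilterSketch f : filters) {
            if (current == null) {
                return null; // an earlier filter already rejected the URL
            }
            current = f.filter(current);
        }
        return current;
    }

    public static void main(String[] args) {
        // Two independent "plugins" implementing the same interface.
        UrlFilterSketch httpOnly = u -> u.startsWith("http") ? u : null;
        UrlFilterSketch noImages = u -> u.endsWith(".png") ? null : u;
        List<UrlFilterSketch> chain = List.of(httpOnly, noImages);

        System.out.println(applyAll(chain, "https://example.org/index.html"));
        System.out.println(applyAll(chain, "https://example.org/logo.png"));
    }
}
```

The host code only knows the interface, so new behaviour can be added by registering another implementation without touching the loop.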

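2. Plugin directory structure

Nutch plugin directories all look alike, so the parse-html directory serves as the example here.

/src          # source directory
build.xml     # tells ant how to build this plugin (where the compiled jar goes, etc.)
ivy.xml       # the plugin's ivy configuration (dependency management, the counterpart of Maven's pom.xml)
plugin.xml    # Nutch's description of the plugin (which extension points it implements, the names of the implementing classes, etc.)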

3. build.xml

build.xml tells ant how to build this plugin.

<project name="parse-html" default="jar-core">

  <import file="../build-plugin.xml"/>

  <!-- Build compilation dependencies -->
  <target name="deps-jar">
    <!-- the build depends on another plugin -->
    <ant target="jar" inheritall="false" dir="../lib-nekohtml"/>
  </target>

  <!-- Add compilation dependencies to classpath -->
  <path id="plugin.deps">
    <fileset dir="${nutch.root}/build">
      <include name="**/lib-nekohtml/*.jar" />
    </fileset>
  </path>

  <!-- Deploy Unit test dependencies -->
  <target name="deps-test">
    <!-- plugins needed when running the tests -->
    <ant target="deploy" inheritall="false" dir="../lib-nekohtml"/>
    <ant target="deploy" inheritall="false" dir="../nutch-extensionpoints"/>
  </target>

</project>

4. ivy.xml

ivy.xml plays the same role as Maven's pom.xml. External dependencies can be declared here.

<ivy-module version="1.0">
  <info organisation="org.apache.nutch" module="${ant.project.name}">
    <license name="Apache 2.0"/>
    <ivyauthor name="Apache Nutch Team" url="https://nutch.apache.org/"/>
    <description>Apache Nutch</description>
  </info>

  <configurations>
    <include file="../../../ivy/ivy-configurations.xml"/>
  </configurations>

  <publications>
    <!--get the artifact from our module name-->
    <artifact conf="master"/>
  </publications>

  <!-- add external dependencies here -->
  <dependencies>
    <dependency org="org.ccil.cowan.tagsoup" name="tagsoup" rev="1.2.1"/>
  </dependencies>
</ivy-module>

5. plugin.xml

<!-- plugin description -->
<plugin
   id="parse-html"
   name="Html Parse Plug-in"
   version="1.0.0"
   provider-name="nutch.org">

   <runtime>
      <library name="parse-html.jar">
         <export name="*"/>
      </library>
      <library name="tagsoup-1.2.1.jar"/>
   </runtime>

   <!-- plugins this one depends on -->
   <requires>
      <import plugin="nutch-extensionpoints"/>
      <import plugin="lib-nekohtml"/>
   </requires>

   <!-- extension declaration -->
   <extension id="org.apache.nutch.parse.html"
              name="HtmlParse"
              point="org.apache.nutch.parse.Parser">
      <!-- id is the unique identifier; class is the implementing class -->
      <implementation id="org.apache.nutch.parse.html.HtmlParser"
                      class="org.apache.nutch.parse.html.HtmlParser">
         <!-- parameters -->
         <parameter name="contentType" value="text/html|application/xhtml+xml"/>
         <parameter name="pathSuffix" value=""/>
      </implementation>
   </extension>
</plugin>
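
The descriptor above boils down to one statement: "class X implements extension point Y". The sketch below pulls exactly those two attributes out of a trimmed copy of the descriptor with the JDK's DOM parser. It only illustrates what the file declares. How Nutch itself loads plugin.xml is not shown here, and `PluginDescriptorSketch` is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class PluginDescriptorSketch {

    // Trimmed copy of the parse-html descriptor (runtime and requires sections omitted).
    static final String PLUGIN_XML =
        "<plugin id=\"parse-html\" name=\"Html Parse Plug-in\" version=\"1.0.0\" provider-name=\"nutch.org\">"
      + "<extension id=\"org.apache.nutch.parse.html\" name=\"HtmlParse\" point=\"org.apache.nutch.parse.Parser\">"
      + "<implementation id=\"org.apache.nutch.parse.html.HtmlParser\" class=\"org.apache.nutch.parse.html.HtmlParser\">"
      + "<parameter name=\"contentType\" value=\"text/html|application/xhtml+xml\"/>"
      + "</implementation></extension></plugin>";

    /** Returns { extension point, implementation class } of the first extension in the descriptor. */
    static String[] describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element extension = (Element) doc.getElementsByTagName("extension").item(0);
            Element impl = (Element) extension.getElementsByTagName("implementation").item(0);
            return new String[] { extension.getAttribute("point"), impl.getAttribute("class") };
        } catch (Exception e) {
            throw new IllegalStateException("descriptor could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        String[] d = describe(PLUGIN_XML);
        System.out.println(d[1] + " implements extension point " + d[0]);
    }
}
```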

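6. The parse-html plugin

HtmlParser

HtmlParser implements the Parser extension point.

public class HtmlParser implements Parser

Methods of the Parser interface:

  • public ParseResult getParse(Content c) // parses the fetched content
  • public void setConf(Configuration configuration) // receives the Nutch configuration
  • public Configuration getConf()

setConf(Configuration conf)

This method reads settings from the Nutch configuration (nutch-default.xml, overridden by nutch-site.xml). Nutch calls setConf(Configuration conf) on the plugin to hand it the configuration.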

@Override
public void setConf(Configuration conf) {
  this.conf = conf;
  // HtmlParseFilters holds the HtmlParseFilter plugins in an array (HtmlParseFilter[] htmlParseFilters)
  this.htmlParseFilters = new HtmlParseFilters(getConf());
  // name of the parser implementation; falls back to NekoHTML when unset
  this.parserImpl = getConf().get("parser.html.impl", "neko");
  // default character encoding
  this.defaultCharEncoding = getConf().get("parser.character.encoding.default", "windows-1252");
  // DOM helper
  this.utils = new DOMContentUtils(conf);
  // caching policy
  this.cachingPolicy = getConf().get("parser.caching.forbidden.policy",
      Nutch.CACHING_FORBIDDEN_CONTENT);
}

nutch-default.xml does define the parser.html.impl property. Even if it were missing, the default passed to get() means NekoHTML would still be used to parse HTML pages.

  • The build.xml shown earlier pulls in the lib-nekohtml plugin, which is NekoHTML.
  • ivy.xml declares the tagsoup dependency, which is TagSoup. Both libraries can parse HTML pages.

<property>
  <name>parser.html.impl</name>
  <value>neko</value>
  <description>HTML Parser implementation. Currently the following keywords
  are recognized: "neko" uses NekoHTML, "tagsoup" uses TagSoup.
  </description>
</property>

parse(InputSource input)

Now look at the parse method:

private DocumentFragment parse(InputSource input) throws Exception {
  // use TagSoup only when it is explicitly configured
  if ("tagsoup".equalsIgnoreCase(parserImpl))
    return parseTagSoup(input);
  else
    return parseNeko(input);
}
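
The lookup in setConf and the branch in parse can be combined into a small runnable sketch. `java.util.Properties` stands in for Hadoop's `Configuration` here (an assumption made so the example runs without Nutch on the classpath), and `ParserChoiceSketch` is a hypothetical name.

```java
import java.util.Properties;

final class ParserChoiceSketch {

    /** Mirrors get("parser.html.impl", "neko") followed by the "tagsoup" check in parse(). */
    static String chooseParser(Properties conf) {
        String impl = conf.getProperty("parser.html.impl", "neko");
        return "tagsoup".equalsIgnoreCase(impl) ? "TagSoup" : "NekoHTML";
    }

    public static void main(String[] args) {
        Properties unset = new Properties();

        Properties tagsoup = new Properties();
        tagsoup.setProperty("parser.html.impl", "TagSoup"); // matching ignores case

        Properties typo = new Properties();
        typo.setProperty("parser.html.impl", "nekko"); // anything that is not "tagsoup"

        System.out.println(chooseParser(unset));   // NekoHTML
        System.out.println(chooseParser(tagsoup)); // TagSoup
        System.out.println(chooseParser(typo));    // NekoHTML
    }
}
```

One consequence of the else branch is that a misspelled value is not reported as an error and quietly selects NekoHTML.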
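
getParse(Content content)

Note: the call ParseResult filteredParse = this.htmlParseFilters.filter(content, parseResult, metaTags, root); runs every plugin that implements the HtmlParseFilter extension point. So when data from additional HTML tags needs to be extracted, implement HtmlParseFilter and put the custom tag parsing there.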

public ParseResult getParse(Content content) {
  // HTML meta tags
  HTMLMetaTags metaTags = new HTMLMetaTags();

  // base URL of the page
  URL base;
  try {
    base = new URL(content.getBaseUrl());
  } catch (MalformedURLException e) {
    return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
  }

  // extracted text
  String text = "";
  // page title
  String title = "";
  // outlinks found in the page
  Outlink[] outlinks = new Outlink[0];
  // parse metadata
  Metadata metadata = new Metadata();

  // parse the content into a DOM tree
  DocumentFragment root;
  try {
    // wrap the raw content in a stream
    byte[] contentInOctets = content.getContent();
    InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));

    // detect the character encoding
    EncodingDetector detector = new EncodingDetector(conf);
    detector.autoDetectClues(content, true);
    detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
    String encoding = detector.guessEncoding(content, defaultCharEncoding);

    metadata.set(Metadata.ORIGINAL_CHAR_ENCODING, encoding);
    metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, encoding);

    input.setEncoding(encoding);
    if (LOG.isTraceEnabled()) {
      LOG.trace("Parsing...");
    }
    root = parse(input);
  } catch (IOException e) {
    return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
  } catch (DOMException e) {
    return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
  } catch (SAXException e) {
    return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
  } catch (Exception e) {
    LOG.error("Error: ", e);
    return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
  }

  // get meta directives
  HTMLMetaProcessor.getMetaTags(metaTags, root, base);

  // populate Nutch metadata with HTML meta directives
  metadata.addAll(metaTags.getGeneralTags());

  if (LOG.isTraceEnabled()) {
    LOG.trace("Meta tags for " + base + ": " + metaTags.toString());
  }
  // check meta directives
  if (!metaTags.getNoIndex()) { // okay to index
    StringBuffer sb = new StringBuffer();
    if (LOG.isTraceEnabled()) {
      LOG.trace("Getting text...");
    }
    // collect the text contained in the tags
    utils.getText(sb, root); // extract text
    text = sb.toString();
    sb.setLength(0);
    if (LOG.isTraceEnabled()) {
      LOG.trace("Getting title...");
    }
    // collect the text of the title tag
    utils.getTitle(sb, root); // extract title
    title = sb.toString().trim();
  }

  if (!metaTags.getNoFollow()) { // okay to follow links
    ArrayList<Outlink> l = new ArrayList<Outlink>(); // extract outlinks
    URL baseTag = base;
    String baseTagHref = utils.getBase(root);
    if (baseTagHref != null) {
      try {
        baseTag = new URL(base, baseTagHref);
      } catch (MalformedURLException e) {
        baseTag = base;
      }
    }
    if (LOG.isTraceEnabled()) {
      LOG.trace("Getting links...");
    }
    // extract the outlinks
    utils.getOutlinks(baseTag, l, root);
    outlinks = l.toArray(new Outlink[l.size()]);
    if (LOG.isTraceEnabled()) {
      LOG.trace("found " + outlinks.length + " outlinks in "
          + content.getUrl());
    }
  }

  // build the parse status
  ParseStatus status = new ParseStatus(ParseStatus.SUCCESS);
  if (metaTags.getRefresh()) {
    status.setMinorCode(ParseStatus.SUCCESS_REDIRECT);
    status.setArgs(new String[] { metaTags.getRefreshHref().toString(),
        Integer.toString(metaTags.getRefreshTime()) });
  }
  // wrap the parsed data
  ParseData parseData = new ParseData(status, title, outlinks,
      content.getMetadata(), metadata);
  // parse result
  ParseResult parseResult = ParseResult.createParseResult(content.getUrl(),
      new ParseImpl(text, parseData));

  // run filters on parse (the HtmlParseFilter plugins, e.g. parse-metatags, enabled through configuration)
  ParseResult filteredParse = this.htmlParseFilters.filter(content,
      parseResult, metaTags, root);
  if (metaTags.getNoCache()) { // not okay to cache
    for (Map.Entry<org.apache.hadoop.io.Text, Parse> entry : filteredParse)
      entry.getValue().getData().getParseMeta()
          .set(Nutch.CACHING_FORBIDDEN_KEY, cachingPolicy);
  }
  return filteredParse;
}
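
A custom HtmlParseFilter receives the same DocumentFragment root that getParse builds, so its core work is usually a DOM walk. The sketch below shows only that walk, using the JDK's org.w3c.dom types, and it builds a tiny fragment in memory instead of parsing HTML. `DomFilterSketch`, `collectText` and `sampleFragment` are hypothetical names. In a real filter the collected values would presumably be written into the parse metadata, the way MetaTagsParser does in section 7.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class DomFilterSketch {

    /** Depth-first walk that collects the trimmed text of every element named tagName. */
    static void collectText(Node node, String tagName, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && tagName.equalsIgnoreCase(node.getNodeName())) {
            out.add(node.getTextContent().trim());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, tagName, out);
        }
    }

    /** Builds body > (h1, p) inside a DocumentFragment so the sketch needs no HTML parser. */
    static DocumentFragment sampleFragment() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            DocumentFragment root = doc.createDocumentFragment();
            Element body = doc.createElement("body");
            Element h1 = doc.createElement("h1");
            h1.setTextContent(" Nutch plugins ");
            Element p = doc.createElement("p");
            p.setTextContent("Some paragraph text.");
            body.appendChild(h1);
            body.appendChild(p);
            root.appendChild(body);
            return root;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<String> headings = new ArrayList<>();
        collectText(sampleFragment(), "h1", headings);
        System.out.println(headings);
    }
}
```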

7. The parse-metatags plugin

MetaTagsParser

MetaTagsParser implements the HtmlParseFilter extension point.

public class MetaTagsParser implements HtmlParseFilter

The filter method

public ParseResult filter(Content content, ParseResult parseResult,
    HTMLMetaTags metaTags, DocumentFragment doc) {
  // the parse for this URL
  Parse parse = parseResult.get(content.getUrl());
  // the parse metadata
  Metadata metadata = parse.getData().getParseMeta();

  /*
   * NUTCH-1559: do not extract meta values from ParseData's metadata to avoid
   * duplicate metatag values
   */

  // general meta tags as (key, value) pairs
  Metadata generalMetaTags = metaTags.getGeneralTags();
  for (String tagName : generalMetaTags.names()) {
    // added to the parse result according to the configuration
    addIndexedMetatags(metadata, tagName, generalMetaTags.getValues(tagName));
  }

  Properties httpequiv = metaTags.getHttpEquivTags();
  for (Enumeration<?> tagNames = httpequiv.propertyNames(); tagNames.hasMoreElements();) {
    String name = (String) tagNames.nextElement();
    String value = httpequiv.getProperty(name);
    // these are added to the parse result as well
    addIndexedMetatags(metadata, name, value);
  }

  return parseResult;
}

The addIndexedMetatags method

This method shows why, when the metadata plugin is used together with index-metadata, the field names configured for indexing must carry the metatag. prefix.

private void addIndexedMetatags(Metadata metadata, String metatag,
    String value) {
  String lcMetatag = metatag.toLowerCase(Locale.ROOT);
  if (metatagset.contains("*") || metatagset.contains(lcMetatag)) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Found meta tag: {}\t{}", lcMetatag, value);
    }
    metadata.add("metatag." + lcMetatag, value);
  }
}
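
The effect of the method is easy to see in isolation. In the sketch below a plain `Map` stands in for Nutch's `Metadata`. That is a simplification, because `Metadata.add` can hold several values per key while `Map.put` overwrites. `MetatagPrefixSketch` is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class MetatagPrefixSketch {

    /** Same decision as addIndexedMetatags: lowercase, check the allow-list, store under "metatag." + name. */
    static void addIndexedMetatag(Map<String, String> metadata, Set<String> metatagset,
                                  String metatag, String value) {
        String lcMetatag = metatag.toLowerCase(Locale.ROOT);
        if (metatagset.contains("*") || metatagset.contains(lcMetatag)) {
            metadata.put("metatag." + lcMetatag, value); // single-valued stand-in for Metadata.add
        }
    }

    public static void main(String[] args) {
        Set<String> configured = Set.of("description", "keywords"); // as in metatags.names
        Map<String, String> metadata = new LinkedHashMap<>();

        addIndexedMetatag(metadata, configured, "Description", "A crawler tutorial");
        addIndexedMetatag(metadata, configured, "Author", "someone"); // not configured, so skipped

        System.out.println(metadata); // {metatag.description=A crawler tutorial}
    }
}
```

The key stored in the parse metadata is therefore never the bare tag name. It is always the lowercased name with the metatag. prefix.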

Configuring the metadata plugin

Compare the configuration with addIndexedMetatags and it becomes clear why the values of index.parse.md need the metatag. prefix.

<property>
  <name>metatags.names</name>
  <value>description,keywords</value>
  <description>Names of the metatags to extract, separated by ','.
  Use '*' to extract all metatags. Prefixes the names with 'metatag.'
  in the parse-metadata. For instance to index description and keywords,
  you need to activate the plugin index-metadata and set the value of the
  parameter 'index.parse.md' to 'metatag.description,metatag.keywords'.
  </description>
</property>

<property>
  <name>index.parse.md</name>
  <!-- the metadata keys produced by addIndexedMetatags carry the metatag. prefix -->
  <value>metatag.description,metatag.keywords</value>
  <description>Comma-separated list of keys to be taken from the parse metadata to generate fields.
  Can be used e.g. for 'description' or 'keywords' provided that these values are generated
  by a parser (see parse-metatags plugin)
  </description>
</property>
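
The consequence for index.parse.md can be checked with a small lookup sketch. It assumes that the indexing side simply looks up each configured key in the parse metadata, which is what the property description suggests. The actual index-metadata implementation is not shown in this article, and `IndexParseMdSketch` and `selectFields` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class IndexParseMdSketch {

    /** Picks the parse-metadata entries whose keys are listed in a comma-separated setting. */
    static Map<String, String> selectFields(String indexParseMd, Map<String, String> parseMeta) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String key : indexParseMd.split(",")) {
            String k = key.trim();
            String value = parseMeta.get(k);
            if (value != null) {
                fields.put(k, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // What parse-metatags leaves in the parse metadata.
        Map<String, String> parseMeta = new LinkedHashMap<>();
        parseMeta.put("metatag.description", "A crawler tutorial");
        parseMeta.put("metatag.keywords", "nutch,plugin");

        // Without the prefix nothing matches.
        System.out.println(selectFields("description,keywords", parseMeta).keySet());
        // With the prefix both keys are found.
        System.out.println(selectFields("metatag.description,metatag.keywords", parseMeta).keySet());
    }
}
```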
