I used 谋数台 (MetaStudio) in the GooSeeker browser to build an extraction rule and a crawler route for http://www.cnenergy.org/yw/zc/:
Data rule:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >
<xsl:template match="/">
<A2011>
<xsl:apply-templates select="//*[@class='main4_left_m1' and count(.//*[@class='biaoleft']/a/text())>0 and count(.//*[@class='b1 mykw'])>0 and count(.//*[@class='main4_left_m1_t']/a)>0 and count(.//*[@class='main4_left_m1_r']/p)>0 and count(./following-sibling::div[position()=1]/div[position()=3]/div[position()=2]/span[position()=2]/text())>0]" mode="A2011"/>
</A2011>
</xsl:template>
<xsl:template match="//*[@class='main4_left_m1' and count(.//*[@class='biaoleft']/a/text())>0 and count(.//*[@class='b1 mykw'])>0 and count(.//*[@class='main4_left_m1_t']/a)>0 and count(.//*[@class='main4_left_m1_r']/p)>0 and count(./following-sibling::div[position()=1]/div[position()=3]/div[position()=2]/span[position()=2]/text())>0]" mode="A2011">
<item>
<类别>
<xsl:value-of select="*//*[@class='biaoleft']/a/text()"/>
<xsl:value-of select="*[@class='biaoleft']/a/text()"/>
<xsl:if test="@class='biaoleft'">
<xsl:value-of select="a/text()"/>
</xsl:if>
</类别>
<链接>
<xsl:value-of select="*//*[@class='main4_left_m1_t']/a/@href"/>
<xsl:value-of select="*[@class='main4_left_m1_t']/a/@href"/>
<xsl:if test="@class='main4_left_m1_t'">
<xsl:value-of select="a/@href"/>
</xsl:if>
</链接>
<标签>
<xsl:value-of select="*//*[@class='b1 mykw']"/>
<xsl:value-of select="*[@class='b1 mykw']"/>
</标签>
<标题>
<xsl:value-of select="*//*[@class='main4_left_m1_t']/a"/>
<xsl:value-of select="*[@class='main4_left_m1_t']/a"/>
<xsl:if test="@class='main4_left_m1_t'">
<xsl:value-of select="a"/>
</xsl:if>
</标题>
<主要内容>
<xsl:value-of select="*//*[@class='main4_left_m1_r']/p"/>
<xsl:value-of select="*[@class='main4_left_m1_r']/p"/>
<xsl:if test="@class='main4_left_m1_r'">
<xsl:value-of select="p"/>
</xsl:if>
</主要内容>
<日期>
<xsl:value-of select="following-sibling::div[position()=1]/div[position()=3]/div[position()=2]/span[position()=2]/text()"/>
</日期>
</item>
</xsl:template>
</xsl:stylesheet>
Clue rule:
<?xml version="1.0" encoding="UTF-8"?>
<spider-clue-extraction>
<theme>能源政策007</theme>
<scope>
<from>HTML</from>
<path-type>a</path-type>
<path>
//*[@class='main4_b1 main4_b1_3']//a[.//text()="加载更多"]
</path>
<clue-type>inthread</clue-type>
<target-theme>
<name>能源政策007</name>
<prefix-position>hostname+pathname</prefix-position>
</target-theme>
</scope>
</spider-clue-extraction>
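When DS 打数机 (DataScraper) crawls the data, the page in the browser does grow as the configured '加载更多' (Load More) link is followed, but every subsequent extraction returns the same data as the first one; nothing changes.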
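One way to confirm that the later extractions really add nothing new is to compare two result files by their 链接 values. Below is a minimal Python sketch, assuming each result has the `<A2011>`/`<item>`/`<链接>` shape that the data rule above produces. The function names and the sample strings are hypothetical and are not part of GooSeeker.

```python
import xml.etree.ElementTree as ET

LINK_TAG = "链接"  # field name taken from the data rule's output


def extract_links(result_xml):
    """Return the text of <链接> for every <item> in one result document."""
    root = ET.fromstring(result_xml)
    return [(item.findtext(LINK_TAG) or "").strip() for item in root.iter("item")]


def new_links(previous_xml, current_xml):
    """Return links found in the current result but not in the previous one."""
    seen = set(extract_links(previous_xml))
    return [link for link in extract_links(current_xml) if link not in seen]


# Hypothetical sample results, made up for illustration (not real crawl output).
first = "<A2011><item><链接>/a.html</链接></item></A2011>"
second = (
    "<A2011><item><链接>/a.html</链接></item>"
    "<item><链接>/b.html</链接></item></A2011>"
)

print(new_links(first, first))   # identical results -> []
print(new_links(first, second))  # result that grew -> ['/b.html']
```

If the comparison of two real result files returns an empty list, the rule is matching the same blocks again after each '加载更多' click, which is the symptom described here.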
Any guidance from the experts here would be much appreciated!
Thanks!
6 replies to this post in total; last reply on 2016-10-13 09:10