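Based on Python 3.
A new algorithm for extracting the main text of a web page; it is very effective at filtering out ads.
Some of the ideas in the code come from the paper 《基于行块分布函数的通用网页正文抽取》 ("General Web Page Main-Text Extraction Based on a Line-Block Distribution Function").
For testing, see onlytest.py; for running it as a server, see webarticle.py.
From now on, only webarticle.py will be maintained.
If you find it useful, please give it a star. Thanks!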
Important for English pages
Many of the rules were written for Chinese text. If your pages are in English, check the results carefully.
In short: a new algorithm for extracting the main text of a web page, very effective at filtering out ads. Based on Python 3.
For details, see webarticle.py.
If you find it useful, please give it a star. Thanks!
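
For readers unfamiliar with the line-block idea from the paper 《基于行块分布函数的通用网页正文抽取》, here is a minimal sketch of the general technique. It is not the code in webarticle.py: the function name extract_main_text and the parameters block_size and threshold are hypothetical, the default values are guesses, and the region selection is simplified (it keeps the contiguous run of non-empty blocks with the most text, provided its densest block reaches the threshold). The idea is that article text forms a dense run of long lines, while navigation links and ads are short and scattered.

```python
import re
from html import unescape


def extract_main_text(html, block_size=3, threshold=80):
    """Illustrative line-block extraction (hypothetical helper, not webarticle.py)."""
    # Drop scripts, styles and comments, then strip the remaining tags
    # while keeping the original line breaks.
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    html = re.sub(r"(?s)<!--.*?-->", "", html)
    text = unescape(re.sub(r"(?s)<[^>]*>", "", html))

    lines = [line.strip() for line in text.split("\n")]
    # Length of each line, ignoring all whitespace.
    lengths = [len(re.sub(r"\s+", "", line)) for line in lines]
    # Block i covers lines i .. i + block_size - 1.
    blocks = [sum(lengths[i:i + block_size]) for i in range(len(lines))]

    best_span, best_total = None, 0
    start = None
    # The trailing 0 is a sentinel that closes a region reaching the last line.
    for i, size in enumerate(blocks + [0]):
        if size > 0 and start is None:
            start = i
        elif size == 0 and start is not None:
            total = sum(lengths[start:i])
            # Keep the region only if it is dense enough and the largest so far.
            if max(blocks[start:i]) >= threshold and total > best_total:
                best_span, best_total = (start, i), total
            start = None

    if best_span is None:
        return ""
    first, last = best_span
    return "\n".join(line for line in lines[first:last] if line)
```

With these assumed defaults, a short link line separated from the article by a few blank lines falls into its own low-density region and is dropped. As the note above says, thresholds tuned for Chinese text may need adjusting for English pages, because English lines tend to contain more characters for the same amount of content.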