Smart content parser.

Hello. I need help with a parser algorithm. Keep in mind that the structure of the site is not known in advance. We only know the link to the page, and we have to pull the content out of it. In other words, the script needs to think like (or almost like) a person.

I guess it won’t work without some common pattern, so I propose that we discuss and develop such a “template”, for example by finding what sites have in common. In general, how do we make it clear to the script that the content is located exactly “here”, and not where the banner is?

For example, you can take the title of the document (it usually matches the headline of the article by about 90%), find it in the document body, and parse what follows it.
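The title idea above can be sketched roughly as follows. This is only an illustration using the Python standard library: the names `TitleAndHeadings` and `best_heading` and the 0.6 similarity threshold are assumptions, not something proposed in the thread. It collects the `<title>` and all h1-h3 headings, then picks the heading most similar to the title using a fuzzy comparison, since an exact match is unlikely.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TitleAndHeadings(HTMLParser):
    """Collects the <title> text and the text of every h1-h3 heading."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self._current = None  # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1", "h2", "h3"):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title = text
            elif text:
                self.headings.append(text)
            self._current = None


def best_heading(html, threshold=0.6):
    """Return the heading most similar to <title>, or None if none is close."""
    parser = TitleAndHeadings()
    parser.feed(html)
    parser.close()
    scored = [
        (SequenceMatcher(None, parser.title.lower(), h.lower()).ratio(), h)
        for h in parser.headings
    ]
    if not scored:
        return None
    score, heading = max(scored)
    return heading if score >= threshold else None
```

Once the matching heading is found, the text that follows it in the document is the candidate for the article body. The threshold would need tuning on real pages.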

We can limit the scope to news sites.

Answer 1, authority 100%

Serious question.

In my opinion, such a parser will work in 15-20% of cases at best.

Each site is built with its own HTML architectural tricks.

There is no guarantee that the content will be in tags with an ID or class of “content”.

It’s faster to write a parser for each site than to get clever with the structure and “teach” the script to work with hundreds of other sites.

P.S. Is this for a specific task, or just something to try for yourself?

News sites usually publish an RSS feed that you can safely take the content from.
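As a minimal sketch of the RSS route, assuming the feed is plain RSS 2.0 (no namespaces) and has already been downloaded into a string, the items can be read with the standard library. The function name `rss_items` is hypothetical.

```python
import xml.etree.ElementTree as ET


def rss_items(xml_text):
    """Return title, link and description for every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
        })
    return items
```

Note that many feeds carry only a summary in `description`, so the full article may still have to be extracted from the linked page.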

Answer 2, authority 100%

I would look at the weight of each tag, i.e. how much clean text each tag contains, and somehow choose the golden mean. For example, if a tag holds 60% of all the text on the page, it is the desired article.
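One possible reading of this "tag weight" idea, sketched with the standard library: count the non-whitespace characters inside every element, then take the deepest element that still holds at least 60% of the page text. The names `TextWeight` and `main_block`, and the "deepest element" tie-break, are assumptions made for the illustration.

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
SKIP = {"script", "style"}  # their contents are not visible text


class TextWeight(HTMLParser):
    """Records, for every element, how many text characters it contains."""

    def __init__(self):
        super().__init__()
        self.nodes = []  # one dict per element: tag, attrs, depth, chars
        self.stack = []  # indexes into self.nodes for currently open elements
        self.total = 0   # text characters on the whole page

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return  # void elements never get an end tag
        self.nodes.append({"tag": tag, "attrs": dict(attrs),
                           "depth": len(self.stack), "chars": 0})
        self.stack.append(len(self.nodes) - 1)

    def handle_endtag(self, tag):
        # Close up to the nearest matching open element; ignore stray end tags.
        for pos in range(len(self.stack) - 1, -1, -1):
            if self.nodes[self.stack[pos]]["tag"] == tag:
                del self.stack[pos:]
                break

    def handle_data(self, data):
        if any(self.nodes[i]["tag"] in SKIP for i in self.stack):
            return
        n = len("".join(data.split()))
        self.total += n
        for i in self.stack:  # text counts toward every enclosing element
            self.nodes[i]["chars"] += n


def main_block(html, share=0.6):
    """Return the deepest element holding at least `share` of the page text."""
    parser = TextWeight()
    parser.feed(html)
    parser.close()
    if parser.total == 0:
        return None
    candidates = [n for n in parser.nodes if n["chars"] / parser.total >= share]
    return max(candidates, key=lambda n: n["depth"], default=None)
```

Picking the deepest qualifying element matters because `html` and `body` always contain 100% of the text; without it the heuristic would return the whole page.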

Or, option 2:

If it is possible to download several different articles from the site, compare them. Immediately delete everything that is repeated on 3 or more pages, and try to identify the rest using algorithm 1.
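A rough sketch of option 2, assuming each page has already been split into text blocks (for example, one string per DIV). The function name `strip_repeated_blocks` and the whitespace/case normalization are assumptions; only the "3 or more pages" rule comes from the answer.

```python
from collections import Counter


def normalize(block):
    """Collapse whitespace and case so trivially different copies compare equal."""
    return " ".join(block.split()).lower()


def strip_repeated_blocks(pages, min_pages=3):
    """Drop every block that appears on at least `min_pages` of the pages.

    `pages` is a list of pages, each page being a list of text blocks.
    """
    seen_on = Counter()
    for blocks in pages:
        # A set ensures a block is counted once per page, not once per occurrence.
        for key in {normalize(b) for b in blocks}:
            seen_on[key] += 1
    return [
        [b for b in blocks if seen_on[normalize(b)] < min_pages]
        for blocks in pages
    ]
```

Whatever survives the filtering (ideally the article itself plus some page-specific noise) can then be passed to the text-weight heuristic from option 1.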

That's the idea, in the abstract.

Answer 3, authority 33%

Honestly, I’ve seen layouts that even a human couldn’t make sense of))

If you need guarantees, then only an individual per-site template will do, or at least a template for a typical layout.

I would try to look for DIV blocks that contain a lot of text and headings such as H2, H3 …

Secondly, I would look for words like main or content in the ID and CLASS attributes ))

To do this, I would split everything into DIVs and, for each one, count:

  • number of links;
  • number of li’s;
  • number of pictures;
  • number of words.

If the number of links (and possibly li’s) is comparable to the number of words, then this is most likely a menu.
If there are only 1-2 pictures and a couple of links, then this is a banner.
If there are a lot of words (noticeably more than anything else) along with headings, a couple of pictures, and a couple of links, then this is the content.
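The counting and the three rules above could be sketched like this. Everything numeric here (at least 3 links for a menu, under 20 words for a banner, 50 or 100 words for content) is an assumed threshold picked for illustration, and the names `DivStats`, `classify` and `label_divs` are hypothetical. Each element is attributed to the innermost DIV that contains it.

```python
from html.parser import HTMLParser

COUNTED = {"a": "links", "li": "items", "img": "images",
           "h1": "headings", "h2": "headings", "h3": "headings"}


class DivStats(HTMLParser):
    """Counts links, list items, images, headings and words per DIV."""

    def __init__(self):
        super().__init__()
        self.divs = []   # stats for every div, in document order
        self.open = []   # indexes of currently open divs
        self.hidden = 0  # nesting depth inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.hidden += 1
        elif tag == "div":
            self.divs.append({"id": dict(attrs).get("id"), "links": 0,
                              "items": 0, "images": 0, "headings": 0, "words": 0})
            self.open.append(len(self.divs) - 1)
        elif tag in COUNTED and self.open:
            self.divs[self.open[-1]][COUNTED[tag]] += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.hidden:
            self.hidden -= 1
        elif tag == "div" and self.open:
            self.open.pop()

    def handle_data(self, data):
        if self.open and not self.hidden:
            self.divs[self.open[-1]]["words"] += len(data.split())


def classify(stats):
    """Apply the menu / banner / content rules with assumed thresholds."""
    links, words = stats["links"], stats["words"]
    # Menu: about as many links (or list items) as words.
    if links >= 3 and words <= 3 * max(links, stats["items"]):
        return "menu"
    # Banner: one or two pictures, a couple of links, almost no text.
    if 1 <= stats["images"] <= 2 and links <= 2 and words < 20:
        return "banner"
    # Content: far more words than links; headings lower the bar.
    min_words = 50 if stats["headings"] else 100
    if words >= min_words and words > 10 * max(links, 1):
        return "content"
    return "unknown"


def label_divs(html):
    """Return (div id, label) pairs for every DIV in the page."""
    parser = DivStats()
    parser.feed(html)
    parser.close()
    return [(d["id"], classify(d)) for d in parser.divs]
```

The order of the checks matters: a link-heavy block is ruled out as a menu before the banner and content rules are tried. Real pages would likely need the thresholds tuned per site family.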

The idea of searching for the title is good, but there is a nuance: the title shown to the user and the one written for SEO (which goes into the title tag and H1) very often differ, at least in my case. They do not differ radically, but enough that they no longer work for machine analysis.
