PHP, regular expressions. Extract lines between tags.

Hello! My task is to extract one table (from <table ...> to </table>) from a big HTML mess. This table differs from all the others in that its opening tag contains the following entry:

class="table01"

Based on this, I wrote the corresponding pattern:

'/<table .* class="table01" .*>[\S\s]*<\/table>/Uix'

But in the end I get zero matches, both with that pattern and with this one:

$pregTable = '/<table .*? class="table01" .*?>[\S\s]*?<\/table>/ix';

Here is the code:

$file = file_get_contents('test.html');
$pregTable = '/<table .* class="table01" .*>[\S\s]*<\/table>/Uix';
$arrTable = array();
preg_match_all($pregTable, $file, $arrTable, PREG_SET_ORDER);

I have tried a bunch of variants and struggled all day, but nothing works. I get either the text from the start of the desired table to the closing tag of the last table (if I don't use ? or the U modifier), or zero matches (if I use them). What am I doing wrong here?

Answer 1, authority 100%

The easiest and most efficient way in this case is to parse the HTML using DOM and get the table via XPath:

$text = <<<EOS
<table class="table01">
    <tr><th>First table</th></tr>
    <tr><td><table><tr><th>Inner <table><tr><th></th></tr></table> table</th></tr></table></td></tr>
    <tr><td><table><tr><th>Second inner table</th></tr></table></td></tr>
</table>
<table>
    <tr><td>Second outer table</td></tr>
</table>
EOS;
$dom = new DOMDocument();
// Parse the markup before querying it
$dom->loadHTML($text);
$xpath = new DOMXPath($dom);
// Select every <table> whose class attribute is exactly "table01"
$nodes = $xpath->evaluate('//table[@class="table01"]');
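If the class attribute may hold several classes (the question does not say so, this is only an assumption), an exact @class comparison would miss the table. A minimal sketch of the usual XPath workaround follows; the sample markup and the $found array are made up for illustration:

```php
<?php
// Hypothetical markup: the target table carries two classes,
// and a second table has a similar but different class name.
$html = '<table class="grid table01"><tr><td>target</td></tr></table>'
      . '<table class="table010"><tr><td>other</td></tr></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Pad the class list with spaces so "table01" matches only as a whole token.
$query = '//table[contains(concat(" ", normalize-space(@class), " "), " table01 ")]';
$nodes = $xpath->query($query);

// Serialize each matching table back to an HTML string.
$found = [];
foreach ($nodes as $node) {
    $found[] = $dom->saveHTML($node);
}

echo count($found), " table(s) found\n";
```

Only the first table should end up in $found, since "table010" is a different token.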

However, if you wish, you can solve this problem with a regular expression. Nested tables are handled in this case with a recursive subpattern:

$class = 'table01';
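// any character that does not start an opening or closing <table> tag
$any = "(?: [^<] | <(?!/?table\b) )";
// a nested <table> element made of $any characters or, recursively,
// of group #2 itself (group #1 is the quote character, see below)
$inner = "(<table[^>]*> (?> $any | (?2) )+? </table>)";
// the full pattern, built on $inner, that finds the specific <table> tag
// the 's' flag is not needed, since the pattern contains no '.'
// the 'u' flag is not needed either, since the pattern is pure ASCII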
$pattern = "~<table\b[^>]*\bclass=(\"|')?$class\\1[^>]*> (?> $any | $inner )+ </table>~xi";
preg_match($pattern, $text, $m);
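To see what the recursive pattern actually captures, here is a minimal sketch that runs it on a small made-up sample. The $target and $text values are assumptions for illustration only; the pattern itself is the one given above:

```php
<?php
// Hypothetical sample: the target table contains a nested table
// and is followed by an unrelated sibling table.
$target = '<table class="table01"><tr><td>'
        . '<table><tr><td>inner</td></tr></table>'
        . '</td></tr></table>';
$text = $target . '<table><tr><td>other</td></tr></table>';

$class = 'table01';
// one character that does not start an opening or closing <table> tag
$any = "(?: [^<] | <(?!/?table\b) )";
// a nested table; (?2) recurses into this same group
$inner = "(<table[^>]*> (?> $any | (?2) )+? </table>)";
$pattern = "~<table\b[^>]*\bclass=(\"|')?$class\\1[^>]*> (?> $any | $inner )+ </table>~xi";

if (preg_match($pattern, $text, $m)) {
    // $m[0] is the whole outer table, nested table included
    echo $m[0], "\n";
}
```

Here $m[0] should be the complete table01 element, with the nested table inside it and without the sibling table that follows.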

Answer 2, authority 30%

How to put it… There are two options:

preg_match('~<table.*?>(.*?)</table>~is', $content, $m );

and second:

preg_match('~<table.*?>(.*)</table>~is', $content, $m );

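The only difference is the “?” inside the capture group: “(.*?)” is lazy, while “(.*)” is greedy.

The first option fails if another table is nested inside the target table: the lazy match stops at the first </table>, which closes the inner table.

The second option fails when there are two sibling tables on the page: the greedy match runs to the last </table> and captures both.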
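Both failure modes are easy to reproduce. A small sketch, using two made-up inputs ($nested and $siblings are hypothetical, not from the question):

```php
<?php
// Hypothetical sample inputs, made up for this illustration.
$nested   = '<table><tr><td><table><tr><td>inner</td></tr></table></td></tr></table>';
$siblings = '<table><tr><td>A</td></tr></table><p>text</p><table><tr><td>B</td></tr></table>';

$lazy   = '~<table.*?>(.*?)</table>~is';
$greedy = '~<table.*?>(.*)</table>~is';

// Lazy on nested tables: stops at the FIRST </table>,
// so the outer table is cut off after the inner one closes.
preg_match($lazy, $nested, $m);
$lazyNested = $m[0];

// Greedy on sibling tables: runs to the LAST </table>,
// so both tables and the text between them are captured.
preg_match($greedy, $siblings, $m);
$greedySiblings = $m[0];

echo $lazyNested, "\n";
echo $greedySiblings, "\n";
```

The first echo prints a truncated outer table, and the second prints the entire $siblings string, which is why neither pattern is reliable on arbitrary HTML.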