php+XML - escaped ampersands not escaping correctly
hello
I'm writing a script that parses an XML file. Right now I have the PHP code echoing whatever is inside a segment tag. Sometimes the tag content will contain ampersands (which are escaped in the XML file like &amp;), but when the parser encounters them, something strange happens.
The parser seems to split the segment data into 3 parts and treats each part as if it had just found a new segment. I know this because it's adding a <br> to the end of each part, which should only happen when it finds a new segment tag. Why would it be doing this? BTW - every other piece of data between <segment> tags displays correctly if it doesn't have an ampersand.
Code:
if (! ($xmlparser = xml_parser_create()) )
{
    die ("Error: cannot create XML parser");
}
$current = "";
// called when an open tag is found
function open_tag($parser, $tagname, $attribs)
{
    global $current;
    $current = $tagname;
}
// the function called when a closing tag is found
function close_tag($parser, $tagname)
{
    // I dont need it to do anything right now
}
// setting up the tag handler
xml_set_element_handler($xmlparser, "open_tag", "close_tag");
// function called when data between tags is found
function tag_contents($parser, $tagdata)
{
    global $current;
    if($current == "SEGMENT") //if it detects a segment tag, echo whats inside
    {
        echo $tagdata."<br>"."\n";
    }
}
// setting up the handler for tag contents
xml_set_character_data_handler($xmlparser, "tag_contents");
// set the filename var
$filename = $_FILES['nzbfile']['tmp_name'];
//open the file for reading
if(!($fp=fopen($filename, "r")))
{
    die("Cannot open the nzb entitled ".$filename);
}
// read the file, then send the data to the parser
while($data = fread($fp, $_FILES['nzbfile']['size']))
{
    if (!xml_parse($xmlparser, $data, feof($fp)))
    {
        $reason = xml_error_string(xml_get_error_code($xmlparser));
        $reason .= xml_get_current_line_number($xmlparser);
        die($reason);
    }
}
// destroy parser
xml_parser_free($xmlparser);
fclose($fp);
The XML parser that you are using is calling the tag_contents() function multiple times for those lines. You might want to consider using the SimpleXML functions. Insert the following code; it should get you going.
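The code from that reply is not shown in the thread, so the following is only a rough illustration of the SimpleXML route, not the poster's actual suggestion. The sample document, its element layout, and the segment_texts() helper are assumptions made up for the sketch; a real NZB file may declare an XML namespace, in which case the XPath query would need adjusting.

```php
<?php
// Hypothetical sample data; the real file's layout is not shown in the thread.
$xml = <<<'XML'
<file>
  <segments>
    <segment number="1">part1 &amp; more</segment>
    <segment number="2">part2</segment>
  </segments>
</file>
XML;

// Hypothetical helper: returns the text of every <segment> element.
function segment_texts(string $xml): array
{
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        throw new RuntimeException("Error: cannot parse XML");
    }

    $texts = [];
    foreach ($doc->xpath('//segment') as $segment) {
        // Casting to string yields the element's text with entities decoded.
        $texts[] = (string) $segment;
    }
    return $texts;
}

foreach (segment_texts($xml) as $text) {
    // Re-escape for HTML output so a literal & is sent as &amp;.
    echo htmlspecialchars($text) . "<br>\n";
}
```

SimpleXML loads the whole document into memory, so each segment arrives as one complete string and the splitting problem does not come up. The trade-off is memory use on very large files.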
Hi
"tag_contents" can be called many times. The way the parser is designed, you don't need to read the whole XML in memory at once, even if you do in your example. The parser works on a stream and buffers can run empty any time.
Change it to:
PHP Code:
function close_tag($parser, $tagname)
{
    global $current;
    if($current == "SEGMENT")
        echo "<br>\n";
}
function tag_contents($parser, $tagdata)
{
    global $current;
    if($current == "SEGMENT") //if it detects a segment tag, echo whats inside
    {
        echo $tagdata;
    }
}
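A variation on the same idea, as a minimal sketch: instead of echoing each fragment as it arrives, collect the fragments in a buffer and emit the complete text once the closing tag is seen. The collect_segments() helper, the $buffer/$calls variables, and the sample XML are hypothetical names made up for illustration; only the xml_* functions come from the original script.

```php
<?php
// Sketch: gather each segment's text across handler calls and hand it
// over once, when the closing tag arrives. collect_segments() is a
// hypothetical helper, not part of the original script.
function collect_segments(string $xml): array
{
    $segments = [];   // finished segment texts
    $buffer = null;   // null while we are outside a <segment>
    $calls = 0;       // how often the data handler fired inside segments

    $parser = xml_parser_create();

    xml_set_element_handler(
        $parser,
        // Tag names arrive upper-cased because case folding is on by default.
        function ($parser, $tagname, $attribs) use (&$buffer) {
            if ($tagname === "SEGMENT") {
                $buffer = "";
            }
        },
        function ($parser, $tagname) use (&$buffer, &$segments) {
            if ($tagname === "SEGMENT" && $buffer !== null) {
                $segments[] = $buffer;
                $buffer = null;
            }
        }
    );

    xml_set_character_data_handler(
        $parser,
        function ($parser, $tagdata) use (&$buffer, &$calls) {
            if ($buffer !== null) {
                $buffer .= $tagdata;   // append, never assume one call per tag
                $calls++;
            }
        }
    );

    if (!xml_parse($parser, $xml, true)) {
        throw new RuntimeException((string) xml_error_string(xml_get_error_code($parser)));
    }

    return [$segments, $calls];
}

[$segments, $calls] = collect_segments(
    "<file><segment>one &amp; two</segment><segment>three</segment></file>"
);
foreach ($segments as $text) {
    echo htmlspecialchars($text) . "<br>\n";
}
echo "data handler calls: " . $calls . "\n";
```

Resetting the buffer to null in the close handler also means that whitespace between a closing segment tag and the next opening tag is ignored, rather than being treated as segment content.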
Guttorm -- thank you, that fixed the glitch. Could you explain why placing the breaks in close_tag() changed it? To me it seems either way would do the same thing, but obviously that isn't so.
Well I tried to say, the SAX XML parser is made to operate on streams - not just memory.
PHP Code:
// read the file, then send the data to the parser
while($data = fread($fp, $_FILES['nzbfile']['size']))
{
    if (!xml_parse($xmlparser, $data, feof($fp)))
    {
        $reason = xml_error_string(xml_get_error_code($xmlparser));
        $reason .= xml_get_current_line_number($xmlparser);
        die($reason);
    }
}
In your while loop, you read the entire file at once - passing $_FILES['nzbfile']['size'] as how much you want to read. You could have used a smaller buffer, like 100 bytes, and it would still work - even if the segment tag contained more than 100 bytes.
The disadvantage is that tag_contents can be called many times for a single tag, but that's not a problem, since open_tag and close_tag are called too, so you know when the tag starts and finishes.
The advantage is that the XML file can be gigabytes long, and you can parse it without gigabytes of memory.
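To make the streaming point concrete, here is a minimal sketch that feeds the parser small chunks instead of the whole file. The parse_stream() helper, the 16-byte default chunk size, and the in-memory sample stream are assumptions chosen for illustration; a real script would pass the handle from fopen() on the uploaded file and use a larger buffer.

```php
<?php
// Sketch: parse an XML stream a few bytes at a time.
// parse_stream() is a hypothetical helper; $chunkSize is deliberately tiny
// so that a segment's text is certain to span several reads.
function parse_stream($fp, int $chunkSize = 16): array
{
    $segments = [];
    $buffer = null;   // null while we are outside a <segment>

    $parser = xml_parser_create();
    xml_set_element_handler(
        $parser,
        function ($parser, $tagname, $attribs) use (&$buffer) {
            if ($tagname === "SEGMENT") {
                $buffer = "";
            }
        },
        function ($parser, $tagname) use (&$buffer, &$segments) {
            if ($tagname === "SEGMENT" && $buffer !== null) {
                $segments[] = $buffer;
                $buffer = null;
            }
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($parser, $tagdata) use (&$buffer) {
            if ($buffer !== null) {
                $buffer .= $tagdata;
            }
        }
    );

    do {
        $data = fread($fp, $chunkSize);
        if ($data === false) {
            throw new RuntimeException("Error: cannot read from stream");
        }
        // Tell the parser when the last chunk has been handed over.
        $final = feof($fp);
        if (!xml_parse($parser, $data, $final)) {
            $reason = xml_error_string(xml_get_error_code($parser));
            $reason .= " at line " . xml_get_current_line_number($parser);
            throw new RuntimeException($reason);
        }
    } while (!$final);

    return $segments;
}

// Usage with an in-memory stream standing in for the uploaded file.
$fp = fopen("php://memory", "w+");
fwrite($fp, "<file><segment>a fairly long segment &amp; then some more text</segment></file>");
rewind($fp);
print_r(parse_stream($fp));
fclose($fp);
```

Only one chunk plus the current segment's text is held in memory at a time, which is what lets this approach scale to very large files.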