Database dump in easy to parse/grep format?

Skaperen · 06-14-2012, 02:58 PM

Quote:

Originally Posted by schneidz

can you use something like tr "\|" "" on mysqlshow's output ?

It would need to be more than that. It would need to remove the tops and bottoms, and all the space added, without removing it from content.
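A rough sketch of what that stripping could look like, as a hypothetical `strip_table` shell function (the name and the tab-separated output are my assumptions, not anything from the tools themselves). It skips the `+----+` border lines, removes the frame bars, and trims only the padding around each cell, so spaces inside a value survive. It cannot tell padding from genuine leading or trailing spaces in a value, and a `|` inside the content would be mistaken for a column separator.

```shell
# Hypothetical helper: turn a bordered ASCII table on stdin into
# tab-separated lines on stdout. Lines that are not part of the table
# (anything not starting with "+" or "|") are dropped.
strip_table() {
    awk '
        /^\+/ { next }                      # border lines such as +-----+
        /^\|/ {
            sub(/^\| ?/, "")                # leading frame bar
            sub(/ ?\|[ \t]*$/, "")          # trailing frame bar
            n = split($0, cell, /\|/)       # remaining bars separate columns
            line = ""
            for (i = 1; i <= n; i++) {
                gsub(/^[ \t]+|[ \t]+$/, "", cell[i])   # trim padding only
                line = line (i > 1 ? "\t" : "") cell[i]
            }
            print line
        }
    '
}
```

Usage would be something like `mysqlshow | strip_table`. The column header row is still in the output and would have to be dropped separately.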

David the H. · 06-14-2012, 07:55 PM

Quote:

Originally Posted by Skaperen

I ran a test. I took an XML dump of a Drupal website database, and converted it to pyx format, then back to xml ... and repeated this 499 times. It did NOT converge. In fact, it collapsed to about 1/3 the size of the original. It appears to be lossy. Program bug?

Actually, I don't really know all that much about it. I just recently discovered the option and saw some potential usefulness to it. It doesn't appear to be particularly designed for round-tripping though, and is more there for avoiding having to struggle with the structure of xml when doing content parsing.

Following the link at the bottom of the xmlstarlet documentation, the more detailed description here points out that it certainly isn't completely lossless:

Quote:

You should notice that the transformation loses the DOCTYPE declaration and the comment in the original XML document. For many purposes, this is not important (parsers often discard this information as well). The PYX format, in contrast to the XML format, allows one to easily pose a variety of ad hoc questions about a document. For example: What are all the attribute values in the sample document?

It all comes down to what your ultimate purpose is, I guess. If you need to round-trip it with fidelity, it's probably not the format for you.

schneidz · 06-14-2012, 09:00 PM

Quote:

Originally Posted by Skaperen

It would need to be more than that. It would need to remove the tops and bottoms, and all the space added, without removing it from content.

i don't have mysqld running so this is untested (i'm sure this can be done way more efficiently in a single awk or sed):

Code:

for db in `mysqlshow | cut -b 3- | awk 'NR>3 {print $1}' | grep -v ^---`
do
 for tab in `mysqlshow $db | cut -b 3- | awk 'NR>3 {print $1}' | grep -v ^---`
 do
  for col in `mysqlshow $db $tab | cut -b 3- | awk 'NR>3 {print $1}' | grep -v ^---`
  do
   #echo $db-$tab-$col >> columns.lst
   # when piped, mysql prints plain tab-separated rows (no borders); -N drops the column-name header, -B makes batch mode explicit
   mysql -N -B $db -e "select $col from $tab" | sed "s/^/$db-$tab-$col-/" >> crazy-dump-format.lst
  done
 done
done

Skaperen · 06-16-2012, 06:15 PM

Quote:

Originally Posted by David the H.

Actually, I don't really know all that much about it. I just recently discovered the option and saw some potential usefulness to it. It doesn't appear to be particularly designed for round-tripping though, and is more there for avoiding having to struggle with the structure of xml when doing content parsing.

Agreed. It looks very simple.

IMHO, that it can even be done shows that XML is a design that should never have been repurposed for raw data. It was, and is, a format suitable for documents. Calling a database table a document, however, is just wrong. It's as wrong as calling a document a table.

Quote:

Originally Posted by David the H.

Following the link at the bottom of the xmlstarlet documentation, the more detailed description here points out that it certainly isn't completely lossless:

Quote:

You should notice that the transformation loses the DOCTYPE declaration and the comment in the original XML document. For many purposes, this is not important (parsers often discard this information as well). The PYX format, in contrast to the XML format, allows one to easily pose a variety of ad hoc questions about a document. For example: What are all the attribute values in the sample document?

That would lead me to believe that going back from PYX to XML would create a lesser XML because the lost data is not there. But this should be a specific, one-time loss. XML->PYX->XML->PYX should lose no more than XML->PYX alone. But MORE is lost the 2nd time. Still MORE is lost the 3rd time. More was lost the 499th time. Also, the slope was not even. There was one point where it lost about 50% in one pass. That hints at something very defective. The concept looks fine. The specs might have an issue. But I suspect the implementation might have a bug.
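To make the "it should converge" expectation concrete, here is a minimal sketch of the kind of check I mean. `passes_to_fixpoint` is a hypothetical helper (not part of xmlstarlet or any other tool): it feeds a file through a filter command repeatedly and reports how many passes it took before the output stopped changing. A conversion that loses a fixed set of things once should report 2 (one pass that loses them, one that changes nothing).

```shell
# Hypothetical helper: passes_to_fixpoint FILE MAX_PASSES CMD [ARGS...]
# Runs CMD as a stdin-to-stdout filter over FILE again and again.
# Prints the pass number at which output equals input and returns 0,
# or returns 1 if that has not happened after MAX_PASSES passes.
passes_to_fixpoint() {
    local input=$1 max=$2 pass=0 prev next
    shift 2
    prev=$(mktemp) && next=$(mktemp) || return 1
    cp "$input" "$prev"
    while [ "$pass" -lt "$max" ]; do
        "$@" < "$prev" > "$next"
        pass=$((pass + 1))
        if cmp -s "$prev" "$next"; then     # nothing changed: fixpoint
            rm -f "$prev" "$next"
            echo "$pass"
            return 0
        fi
        mv "$next" "$prev"                  # output becomes the next input
    done
    rm -f "$prev" "$next"
    return 1
}
```

For the XML case the filter would be a small wrapper function that pipes the PYX conversion into the reverse conversion (for xmlstarlet, presumably its `pyx` and `p2x` subcommands, assuming both read standard input when no file is given), called as `passes_to_fixpoint dump.xml 10 roundtrip`.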

Quote:

Originally Posted by David the H.

It all comes down to what your ultimate purpose is, I guess. If you need to round-trip it with fidelity, it's probably not the format for you.

Agreed.

Maybe I need to just design my own format somewhat like PYX, but focusing on database/table/row/column/value encoding rather than trying to convert XML. If it weren't for the fact that mysqldump is itself very complex, I might try to add an output format to it, or extract the code pieces that "recurse" through all the databases, tables, rows, and columns, and make a tool for that. The issue I see is figuring out the right way to encode various database column types. Numbers and strings are obvious. I'd have to consult how they do that in SQL and hope there is some commonality I can use for all database types.

Such a format might look like:

Code:

Bdatabasename
Ttablename
R
scolumn1string
scolumn2string
scolumn3string
ncolumn4number
ncolumn5number
ncolumn6number
R
scolumn1string
scolumn2string
scolumn3string
ncolumn4number
ncolumn5number
ncolumn6number
E
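
As a rough illustration only, a hypothetical `emit_table` function could produce that layout from tab-separated rows on stdin (for example, the output of the mysql client in batch mode without column names). The function name, the numeric test, and the choice to write `E` once after the last row are my assumptions. It guesses `n` versus `s` from how the value looks, whereas a real tool would ask the server for the column types.

```shell
# Hypothetical sketch: emit_table DBNAME TABLENAME < rows.tsv
# Reads tab-separated rows and writes one tagged line per value:
#   B database, T table, R row start, s string, n number, E end.
emit_table() {
    awk -F '\t' -v db="$1" -v tab="$2" '
        BEGIN { print "B" db; print "T" tab }
        {
            print "R"
            for (i = 1; i <= NF; i++) {
                # Assumption: plain integers and decimals count as numbers
                tag = ($i ~ /^-?[0-9]+(\.[0-9]+)?$/) ? "n" : "s"
                print tag $i
            }
        }
        END { print "E" }
    '
}
```

Because every value sits on its own line behind a one-letter tag, something like `grep '^s'` pulls out all string values. Values containing newlines or tabs would need escaping before they reach this function.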