Monday, November 2, 2009

10.10 Don't Assume LOAD DATA Knows More than It Does




I l@ve RuBoard










10.10 Don't Assume LOAD DATA Knows More than It Does




10.10.1 Problem



You think LOAD
DATA is smarter than it really is.





10.10.2 Solution



Don't assume that LOAD
DATA knows anything at all about the format of
your datafile. And make sure you yourself know what its format is. If
the file has been transferred from one machine to another, its
contents may have been changed in subtle ways of which
you're not aware.





10.10.3 Discussion



Many LOAD DATA frustrations
occur because people expect MySQL to know things that it cannot
possibly know. LOAD DATA makes
certain assumptions about the structure of input files, represented
as the default settings for the line and field terminators, and for
the quote and escape character settings. If your input
doesn't match those assumptions, you need to tell
MySQL about it.



When in doubt, check the contents of
your datafile using a hex dump program or other utility that displays
a visible representation of whitespace characters like tab, carriage
return, and linefeed. Under
Unix, the
od program can display file contents in a
variety of formats. If you don't have
od or some comparable utility, the
transfer directory of the
recipes distribution contains hex dumpers written
in Perl and Python (hexdump.pl and
hexdump.py), as well as a couple of programs
that display printable representations of all characters of a file
(see.pl and see.py). You
may find them useful for examining files to see what they really
contain. In some cases, you may be surprised to discover that a
file's contents are different than you think. This
is in fact quite likely if the file has been transferred from one
machine to another:




  • An FTP transfer between machines
    running different operating systems typically translates line endings
    to those that are appropriate for the destination machine if the
    transfer is performed in text mode rather than in binary (image)
    mode. Suppose you have tab-delimited linefeed-terminated records in a
    datafile that load into MySQL on a Unix system just fine using the
    default LOAD DATA settings. If
    you copy the file to a Windows machine with FTP using a text transfer
    mode, the linefeeds probably will be converted to carriage
    return/linefeed pairs. On that machine, the file will not load
    properly with the same LOAD
    DATA statement, because its contents will have
    been changed. Does MySQL have any way of knowing that? No. So
    it's up to you to tell it, by adding a
    LINES TERMINATED
    BY '\r\n' clause to the
    statement. Transfers between any two systems with dissimilar default
    line endings can cause these changes. For example, a Macintosh file
    containing carriage returns may contain linefeeds after transfer to a
    Unix system. You should either account for such changes with a
    LINES TERMINATED
    BY clause that reflects the modified line-ending
    sequence, or transfer the file in binary mode so that its contents do
    not change.


  • Datafiles pasted into email messages often do not survive
    intact. Mail software may wrap (break) long lines or convert
    line-ending sequences. If you must transfer a datafile by email,
    it's best sent as an attachment.










    I l@ve RuBoard



    No comments: