How to extract information from an MHT file

Asked By 240 points N/A Posted on -

I have a set of files saved as MHT. I need to programmatically process them and extract the information. How do I go about this? 

Best Answer by Stella
Answered By 0 points N/A #97233

An MHT file is a web archive of an HTML page. It is actually a plain-text (ASCII) file that you can open in Notepad. First, find a unique string that marks the point where you want to cut. Then use string manipulation to cut out the portion of data you need. It's fairly heavy work!

Answered By 240 points N/A #97235

Why do you say it is heavy work? Can't I use the XMLDom document object and load the data into it? All of the web pages are W3C XHTML compliant!

Answered By 0 points N/A #97237

When you save a page as an MHT, the file you get is a "Web Archive": all supporting material is embedded inside the one file. As a result, the saved MHT file as a whole is not well-formed XML, so you cannot load it into the XMLDom object as-is. Furthermore, Internet Explorer alters the HTML so that it points to resources within the file, i.e. references to images and the like must point to a location inside the file.

Answered By 240 points N/A #97239

For my information, how do the images and text get stored in a single file? I thought you could not mix the two! Text is ASCII, but aren't images binary?

Answered By 0 points N/A #97241

Lukas,

You are correct that raw binary data cannot simply be mixed into an ASCII text file. What actually happens is that the binary images and other resources are encoded as text using Base64. The Web Archive format (MHT) is based on MIME, the same model used for email. For example, when you send an email with attachments, the attachment's binary data is encoded as a Base64 string and appended to the text portion of the email.
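
To make the encoding step concrete, here is a minimal Python sketch. The eight bytes are only a stand-in for real image data (they happen to be the signature at the start of a PNG file); everything else is the standard `base64` module.

```python
import base64

# Stand-in for binary image data: the 8-byte signature that starts a PNG file.
raw = b"\x89PNG\r\n\x1a\n"

# Encode the binary data as printable ASCII text, as is done for a MIME part.
encoded = base64.b64encode(raw).decode("ascii")
print(encoded)  # iVBORw0KGgo=

# Decoding gives back exactly the original bytes.
decoded = base64.b64decode(encoded)
print(decoded == raw)  # True
```

This is why a long run of letters and digits beginning with "iVBORw0KGgo" inside an MHT file usually means an embedded PNG image.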

Base64 uses only printable ASCII characters, which makes it safe to embed in a text file. If you open the MHT file you will notice these sections and the encoded strings. Notice the lines beginning with "------=_NextPart": they delimit the sections. The first part is the HTML, and the next part is the Base64-encoded image.
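
If you want to see the layout without opening a real archive, the following Python sketch builds a tiny MHT-like message with the standard `email` package and prints it. The boundary string, the URLs and the page content are made up for illustration; a real file saved by a browser will have its own values and more headers.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Hypothetical boundary, chosen only to resemble the "NextPart" style.
archive = MIMEMultipart("related", boundary="----=_NextPart_example")

# First part: the HTML page (made-up content and URL).
html = MIMEText("<html><body><img src='logo.png'/></body></html>", "html", "us-ascii")
html["Content-Location"] = "http://example.com/index.html"
archive.attach(html)

# Second part: a stand-in "image" (just the PNG signature bytes).
# MIMEImage Base64-encodes the payload by default.
image = MIMEImage(b"\x89PNG\r\n\x1a\n", "png")
image["Content-Location"] = "http://example.com/logo.png"
archive.attach(image)

text = archive.as_string()
print(text)
```

In the printed text, each section starts with a line made of two hyphens followed by the boundary string, and the image section contains the Base64 text rather than raw bytes.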

Answered By 240 points N/A #97243

Thank you Stella! This is great! Does this mean that I can safely "chop" the MHT file into sections by this "------=_NextPart" delimiter and then process the sections individually?

Best Answer
Answered By 0 points N/A #97245

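You can use the delimiter, but the exact string differs from browser to browser (and may differ from file to file), so do not hard-code it. The MIME Content-Type header at the top of the file declares it in its boundary parameter. Split the file on that delimiter, then work on the HTML part, keeping in mind that the browser will have altered it.

You will need some trial and error to make sure that the XMLDom object can read the HTML.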
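
As an alternative to splitting the text by hand, here is a rough Python sketch that lets the standard `email` package find the boundary and decode each part. It is not the only way to do this. The function name `split_mht`, the sample archive, its URLs and the `price` class are all hypothetical, and the last step assumes the page really is well-formed XHTML, as in the question.

```python
import email
from email import policy
from xml.dom import minidom


def split_mht(raw_bytes):
    """Return (html_text, resources) parsed from the bytes of an MHT file."""
    message = email.message_from_bytes(raw_bytes, policy=policy.default)
    html_text = None
    resources = {}  # Content-Location -> decoded bytes
    for part in message.walk():
        if part.is_multipart():
            continue  # container only; the real content is in its children
        if part.get_content_type() == "text/html" and html_text is None:
            # get_content() undoes quoted-printable/Base64 and the charset.
            html_text = part.get_content()
        else:
            location = str(part.get("Content-Location", ""))
            resources[location] = part.get_payload(decode=True)
    return html_text, resources


# Made-up sample archive; a real one would be read with open(path, "rb").
sample = (
    b'MIME-Version: 1.0\r\n'
    b'Content-Type: multipart/related; type="text/html";'
    b' boundary="----=_NextPart_000"\r\n'
    b'\r\n'
    b'------=_NextPart_000\r\n'
    b'Content-Type: text/html; charset="utf-8"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'Content-Location: http://example.com/index.html\r\n'
    b'\r\n'
    b'<html><body><p class=3D"price">42</p></body></html>\r\n'
    b'------=_NextPart_000\r\n'
    b'Content-Type: image/png\r\n'
    b'Content-Transfer-Encoding: base64\r\n'
    b'Content-Location: http://example.com/logo.png\r\n'
    b'\r\n'
    b'iVBORw0KGgo=\r\n'
    b'------=_NextPart_000--\r\n'
)

html_text, resources = split_mht(sample)

# Works only if the extracted page is well-formed XML (XHTML).
doc = minidom.parseString(html_text)
price = doc.getElementsByTagName("p")[0].firstChild.data
print(price)
```

Note that the HTML part is often stored as quoted-printable (for example `=3D` in place of `=`), so it has to be decoded before any XML parser will accept it. Real pages may also use entities such as `&nbsp;` that a plain XML parser rejects, which is where the trial and error comes in.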
 
Answered By 240 points N/A #97246

Thank you Stella for your advice! I see that I will still have to use string manipulation! Your information helped me to shorten the development time! Thank you again!

Answered By 0 points N/A #97247

Glad to be of help! Have a nice day!
