{"type":"doc","content":[{"type":"panel","attrs":{"panelType":"info"},"content":[{"type":"paragraph","content":[{"text":"Starting from Datafari 5.1","type":"text","marks":[{"type":"strong"}]}]}]},{"type":"paragraph","content":[{"text":"The goal of this MCF transformation connector is to simply choose an encompassing tag in a HTML document for text extracting. And inside this tag, this connector allows you to remove subparts that you do no want : all the tags corresponding to declared types or specific attribute tag names for example.","type":"text"}]},{"type":"paragraph","content":[{"text":"This way, instead of extracting the entire text on the page, you can select the subpart of the document that you want.","type":"text"}]},{"type":"paragraph","content":[{"text":"The code is here : ","type":"text"},{"text":"https://github.com/francelabs/manifoldcf/tree/trunk/connectors/html-extractor","type":"text","marks":[{"type":"link","attrs":{"href":"https://github.com/francelabs/manifoldcf/tree/trunk/connectors/html-extractor"}}]}]},{"type":"paragraph","content":[{"text":"1.INSTALLATION DOCUMENTATION","type":"text","marks":[{"type":"strong"}]}]},{"type":"paragraph","content":[{"text":"Note : the $MCF variable indicates the installation folder of ManifoldCF for example ","type":"text"},{"text":"/opt/datafari/mcf","type":"text","marks":[{"type":"code"}]}]},{"type":"paragraph","content":[{"text":"To install this connector, you need to follow these steps :","type":"text"}]},{"type":"bulletList","content":[{"type":"listItem","content":[{"type":"paragraph","content":[{"text":"Add the JAR ","type":"text"},{"text":"datafari-html-extractor-connector-4.0.2-SNAPSHOT","type":"text","marks":[{"type":"code"}]},{"text":" in the ","type":"text"},{"text":"connector-lib","type":"text","marks":[{"type":"code"}]},{"text":" folder of MCF ie ","type":"text"},{"text":"$MCF/connector-lib","type":"text","marks":[{"type":"code"}]}]}]},{"type":"listItem","content":[{"type":"paragraph","content":[{"text":"Edit the file ","type":"text","marks":[{"type":"textColor","attrs":{"color":"#333333"}}]},{"text":"$MC/mcf_home/connectors.xml","type":"text","marks":[{"type":"code"}]},{"text":" and add the line :","type":"text","marks":[{"type":"textColor","attrs":{"color":"#333333"}}]}]}]}]},{"type":"codeBlock","content":[{"text":"","type":"text"}]},{"type":"bulletList","content":[{"type":"listItem","content":[{"type":"paragraph","content":[{"text":"If you just downloaded MCF and never launched it, you are done ! ","type":"text","marks":[{"type":"strong"}]}]}]},{"type":"listItem","content":[{"type":"paragraph","content":[{"text":"Additional step if you have an existing MCF application running like in Datafari :","type":"text"},{"type":"hardBreak"},{"text":"You need to initialize the connectors in MCF. To do so, go to ","type":"text"},{"text":"$MCF/mcf_home","type":"text","marks":[{"type":"code"}]},{"text":" and then launch ","type":"text"},{"text":"initialize.sh","type":"text","marks":[{"type":"code"}]},{"text":" (be sure that your JAVA HOME variable is correctly setted and the user you use has sufficient permissions (you need to have your Datafari up and running when doing that):","type":"text"}]},{"type":"codeBlock","content":[{"text":"cd $MCF/mcf_home\nbash initialize.sh","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"You will see that the connector is now listed amongh the available MCF connectors :","type":"text"}]},{"type":"paragraph","content":[{"text":"You can now use it like the other transformation connectors.","type":"text"}]},{"type":"paragraph","content":[{"text":"2.TECHNICAL DOCUMENTATION","type":"text","marks":[{"type":"strong"}]}]},{"type":"paragraph","content":[{"text":"It is a Maven project. It contains Java code, HTML and Javascript code and finally i18n files.","type":"text"}]},{"type":"mediaSingle","attrs":{"layout":"align-start","width":30.0},"content":[{"type":"media","attrs":{"width":476,"id":"a54bf4d4-a1ec-4c98-97f9-7f9cbd3fa815","collection":"contentId-237240321","type":"file","height":562}}]},{"type":"paragraph","content":[{"text":"It is based on the \"Null\" transformation connector in the source code of ManifoldCF. See ","type":"text"},{"text":"https://github.com/apache/manifoldcf/tree/trunk/connectors/nulltransformation","type":"text","marks":[{"type":"link","attrs":{"href":"https://github.com/apache/manifoldcf/tree/trunk/connectors/nulltransformation"}}]}]},{"type":"paragraph","content":[{"text":"The main Java class is ","type":"text"},{"text":"HTMLExtractor.java","type":"text","marks":[{"type":"code"}]},{"text":".","type":"text"}]},{"type":"paragraph","content":[{"text":"Inside it, the main method is ","type":"text"},{"text":"public int addOrReplaceDocumentWithException(String documentURI, VersionContext pipelineDescription, RepositoryDocument document, String authorityNameString, IOutputAddActivity activities)","type":"text","marks":[{"type":"code"}]},{"text":". ","type":"text"}]},{"type":"extension","attrs":{"layout":"default","extensionType":"com.atlassian.confluence.macro.core","extensionKey":"gliffy","parameters":{"macroParams":{"imageAttachmentId":{"value":"att237273122"},"baseUrl":{"value":"https://datafari.atlassian.net/wiki"},"name":{"value":"MCF HTML Extractor Connector"},"diagramAttachmentId":{"value":"att237273117"},"containerId":{"value":"237240321"},"timestamp":{"value":"1518715892779"}},"macroMetadata":{"macroId":{"value":"b68411a5-8318-4d93-9542-ef846a093725"},"schemaVersion":{"value":"1"},"placeholder":[{"type":"image","data":{"url":"https://confluence-connect.gliffy.net/diagram/image/placeHolder.png?imageAttachmentId=att237273122&baseUrl=https%3A%2F%2Fdatafari.atlassian.net%2Fwiki&name=MCF+HTML+Extractor+Connector&diagramAttachmentId=att237273117&containerId=237240321×tamp=1518715892779"}}],"title":"Gliffy Diagram"}},"localId":"dd789a3b-8d87-492c-8e78-b8d91e8373d3"}},{"type":"paragraph","content":[{"text":"3.USER DOCUMENTATION ","type":"text","marks":[{"type":"strong"}]}]},{"type":"paragraph","content":[{"text":"To use this transformation connector, you first have to create it in the Transformation connector menu then to add it in a job.","type":"text"}]},{"type":"bulletList","content":[{"type":"listItem","content":[{"type":"paragraph","content":[{"text":"Add the transformation connector","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Go to outputs → List transformation connectors","type":"text"}]},{"type":"paragraph","content":[{"text":"Click on the button Add a new transformation connector","type":"text"}]},{"type":"mediaSingle","attrs":{"layout":"wide","width":100.0},"content":[{"type":"media","attrs":{"width":1358,"id":"3d795ba7-8dbd-4b05-9dab-83a78f3bd45b","collection":"contentId-237240321","type":"file","height":483}}]},{"type":"paragraph","content":[{"text":"On the first tab choose a name for your transformation connector, for example : ","type":"text"},{"text":"HtmlExtractorConnector","type":"text","marks":[{"type":"code"}]},{"text":".","type":"text"},{"type":"hardBreak"},{"text":"On the second tab, choose the type of the transformation connector : ","type":"text"}]},{"type":"mediaSingle","attrs":{"layout":"align-start","width":56.0},"content":[{"type":"media","attrs":{"width":1056,"id":"181808c3-af0d-44a7-9015-3a7404021a8a","collection":"contentId-237240321","type":"file","height":676}}]},{"type":"paragraph","content":[{"text":"Click on Continue then Save.","type":"text"}]},{"type":"paragraph","content":[{"text":"The message Connection working is displayed.","type":"text"}]},{"type":"mediaSingle","attrs":{"layout":"wide","width":100.0},"content":[{"type":"media","attrs":{"width":1770,"id":"008c0e35-99c0-4c09-b762-10b1f456c02c","collection":"contentId-237240321","type":"file","height":568}}]},{"type":"bulletList","content":[{"type":"listItem","content":[{"type":"paragraph","content":[{"text":"Create your job (in this example we have a Web job) as usual","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Note that the Html Extractor connector will be applied only on documents with HTML mimetype.","type":"text"}]},{"type":"paragraph","content":[{"text":"-In the tab Connection, add the Html Extractor transformation connector :","type":"text"}]},{"type":"mediaSingle","attrs":{"layout":"align-start","width":95.0},"content":[{"type":"media","attrs":{"width":1806,"id":"f41b860e-11f4-4dbe-9351-3387037ad355","collection":"contentId-237240321","type":"file","height":686}}]},{"type":"paragraph","content":[{"text":"Be careful about the order of the transformation connector in your job pipeline.","type":"text"}]},{"type":"paragraph","content":[{"text":"If you use Tika, the Html Extractor connector must be before it.","type":"text"},{"type":"hardBreak"},{"text":"So if you use Tika in MCF, the pipeline is : Html Extractor Connector / Tika Extractor connector / Output. If you use Tika in Solr the pipeline is : Html Extractor Connector / Output connector (Tika in Solr : SolrCell)","type":"text"}]},{"type":"paragraph","content":[{"text":"-A new tab appears now : HTML Extractor. Click on it.","type":"text"}]},{"type":"paragraph","content":[{"text":"HTML strip tags : it means that you can choose to strip HTML tags from the extracted text or keep raw text (ie with HTML tags).","type":"text"}]},{"type":"paragraph","content":[{"text":"In the section Englobing tag, add the tag in which the text will be extracted. In the example we choose body but it can be whatever CSS selector you want. If you want to extract the text from ","type":"text"},{"text":"div id=\"example\"","type":"text","marks":[{"type":"code"}]},{"text":", in the text box, enter : ","type":"text"},{"text":"div#example","type":"text","marks":[{"type":"code"}]},{"text":".","type":"text"}]},{"type":"paragraph","content":[{"text":"You can only choose one tag. By default it will be the body tag.","type":"text"}]},{"type":"paragraph","content":[{"text":"In the section blacklist, you can choose all the elements that you do not want in the previous englobing tag. In the example, we choose to remove all of the and

Browser not supported