Poor performance with simple table extraction task
There is a lot of hype around multimodal models such as Qwen 2.5 VL or Omni, and I would like to know whether others have had a similar experience in practice: while they can do impressive things, they still struggle with table extraction in cases that are straightforward for humans.
Attached is a simple example. All I need is a reconstruction of the table as a flat CSV, preserving all empty cells correctly. Which open source model is able to do that?
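For reference, this is roughly how I'm querying the model; a minimal sketch using Hugging Face transformers with Qwen2.5-VL (the checkpoint, file name, and prompt wording are just what I happened to try, not a recommendation):

```python
# Minimal sketch: ask Qwen2.5-VL to reconstruct a table image as flat CSV.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "table.png"},  # placeholder path to the attached table image
        {"type": "text", "text": "Reconstruct this table as a flat CSV. "
                                 "Keep every empty cell as an empty field; do not merge or drop columns."},
    ],
}]

# Standard Qwen2.5-VL inference flow: chat template + vision preprocessing + generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=1024)
out = out[:, inputs.input_ids.shape[1]:]  # strip the prompt tokens
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```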
I don't have much experience with visual models, but if they behave anything like classic text-only LLMs, you may need to scale the model size with the task difficulty. In other words, if a certain model size isn't enough for a task, you may need a bigger model. In standard text generation tasks, 7B models are barely enough to understand the context, let alone generate an adequate response to the user's input; they are entry-level models. They can handle simple tasks, but for more complex ones it's probably better to use something bigger.
I had to build a table extraction model from scratch. My two cents: no model will consistently handle table extraction out of the box. Take an existing model and fine-tune it on your corpus. When dealing with text, image resolution matters; pick a model with a patch size of 14px and an input resolution of at least 1200px by 1200px. My training set was 120K table images mapped to HTML tables; if you want good results, that should be your target. Finally, outputting tables as CSV is not a good idea unless the table is very simple; HTML code for the table is the better output format. Here is a screenshot of your tables fully extracted and converted to HTML.
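If you do go the HTML route but still need a flat CSV at the end, the conversion afterwards is trivial; a minimal sketch with pandas (the file names are placeholders, and `read_html` needs lxml or html5lib installed):

```python
# Minimal sketch: flatten a predicted HTML table to CSV with pandas.
import pandas as pd

# read_html parses <table> elements (including rowspan/colspan) into DataFrames;
# "predicted_table.html" stands in for whatever the model emits.
tables = pd.read_html("predicted_table.html")
df = tables[0]

# Write a flat CSV; empty cells remain empty fields.
df.to_csv("table.csv", index=False)
```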
And even your trained model couldn't do it. The FROM and TO columns should sit under the XYZ header, so the FROM values are missing entirely. The dates also lost their dots: 11 2020 instead of 1.1.2020. And finally, MAT should contain MAT1 and MAT2; in your example they moved to BATCH A and BATCH B...