How to successfully convert math papers to plain text











Goals:



1. Develop a canonical way to represent STEM papers in general, and math papers in particular, uniquely as plain text.

2. Develop software that can convert existing typeset STEM papers into that canonical form with 100% accuracy. I can't tolerate any inaccuracy, because as a single individual I can't proofread millions of papers to correct conversion errors, even at an average rate of 0.001 errors per paper.


Problems:




1. None of the PDF-to-text or TeX-to-text programs I have seen on Stack Overflow and elsewhere, such as PyMuPDF, really work, because they cannot handle math symbols.

2. PDF is really hard to process.

3. TeX is really hard to process, because the numerous macros that authors of STEM papers add to their source files tend to break LaTeXML and other converters. My own papers are easy to process because I don't define many new commands. However, many authors' papers contain \def macros that cannot even be processed by de-macro. To get the TeX route to work, assuming I can even get the source files of most papers on arXiv, I would pretty much have to write my own variant of a TeX engine that expands all the required macros and produces a plain-text document.
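To illustrate the easy part of macro expansion, here is a minimal Python sketch. It assumes macros take no arguments and that their bodies nest braces at most one level deep. The function name `expand_simple_macros` is hypothetical, and this is nowhere near what a real TeX engine does (arguments, `\let`, conditionals, catcode changes are all ignored):

```python
import re

# A macro body with at most one level of nested braces (an assumption).
BODY = r"((?:[^{}]|\{[^{}]*\})*)"

# Matches \def\name{body} and \newcommand{\name}{body}, both without arguments.
MACRO_DEF = re.compile(
    r"\\def\s*\\([A-Za-z]+)\s*\{" + BODY + r"\}"
    r"|\\newcommand\s*\{\\([A-Za-z]+)\}\s*\{" + BODY + r"\}"
)


def expand_simple_macros(source, max_passes=10):
    """Collect argument-less macro definitions, remove them, expand their uses."""
    macros = {}
    for m in MACRO_DEF.finditer(source):
        if m.group(1) is not None:      # \def form
            macros[m.group(1)] = m.group(2)
        else:                           # \newcommand form
            macros[m.group(3)] = m.group(4)

    text = MACRO_DEF.sub("", source)

    # Repeat so that macros defined in terms of other macros get expanded too;
    # the pass limit guards against recursive definitions.
    for _ in range(max_passes):
        previous = text
        for name, body in macros.items():
            # The lookahead keeps \R from matching inside \Re.
            use = re.compile(r"\\" + name + r"(?![A-Za-z])")
            # A function replacement avoids backslash processing in the body.
            text = use.sub(lambda _m, body=body: body, text)
        if text == previous:
            break
    return text


src = r"\def\R{\mathbb{R}}\newcommand{\e}{\varepsilon}Let $x \in \R$ and $\e > 0$."
print(expand_simple_macros(src))
```

Definitions that take arguments, such as `\newcommand{\foo}[1]{...}`, are deliberately left untouched by this sketch, which is exactly where the hard cases start.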



Is there any other way to solve this problem? Currently, my preferred target format is essentially plain text plus math symbols written in LaTeX, with no formatting other than what is semantically significant, such as \mathcal{A} and A being separate entities. I could learn to set up and train a neural network to recognize these printed math symbols, assuming my laptop is powerful enough. There are fewer than 200 symbols for the network to learn, and their shapes should be very easy to recognize because they vary so little. Should I do that?
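To make the target format a bit more concrete, here is a small Python sketch of the kind of normalization meant above: purely visual spacing commands are dropped, while semantically significant commands such as `\mathcal` are kept. The function name `normalize_math` and the choice of which commands count as "formatting only" are assumptions for illustration, not a finished specification:

```python
import re

# Spacing commands treated as purely visual (an assumed, incomplete list):
# \,  \;  \:  \!  \quad  \qquad
# The lookbehind skips a backslash that is itself escaped, as in "\\,".
SPACING = re.compile(r"(?<!\\)\\(?:[,;:!]|q?quad(?![A-Za-z]))")


def normalize_math(latex):
    """Drop visual spacing commands and collapse runs of whitespace."""
    without_spacing = SPACING.sub("", latex)
    return re.sub(r"\s+", " ", without_spacing).strip()


print(normalize_math(r"\mathcal{A} \, \subset \quad A"))
```

Under this sketch, `\mathcal{A}` and `A` stay distinct after normalization, which is the property the target format needs.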







pdf latex ps mathml














edited Nov 20 at 4:12

























asked Nov 20 at 4:04









Ying Zhou













  • This is an extremely broad request that requires extremely accurate results without any examples of what to work with in general.
    – Werner
    Nov 21 at 1:51










  • @Werner Sure. My goal is to convert the text in a random paper such as arxiv.org/abs/1802.00001 to plain text while retaining all the semantically significant information.
    – Ying Zhou
    Nov 21 at 18:40






























1 Answer
Yes, you can try that: recognize the symbols, then transform them into LaTeX (for example, write \sqrt for every square root).
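As a rough sketch of the second step, assuming the recognizer already outputs one label per symbol in reading order, the transformation into LaTeX can start as a plain lookup table. The table contents and the function name `symbols_to_latex` are hypothetical, and the sketch ignores two-dimensional layout (what sits under a radical, superscripts, fractions), which a real system would have to recover:

```python
# Hypothetical mapping from recognized symbols to LaTeX commands.
SYMBOL_TO_LATEX = {
    "√": r"\sqrt",
    "∑": r"\sum",
    "∫": r"\int",
    "∈": r"\in",
    "≤": r"\leq",
    "α": r"\alpha",
}


def symbols_to_latex(tokens):
    """Map each recognized token to LaTeX; unknown tokens pass through unchanged."""
    return " ".join(SYMBOL_TO_LATEX.get(token, token) for token in tokens)


print(symbols_to_latex(["α", "≤", "√", "2"]))
```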



For more on the recognition problem, see this paper:

https://www.sciencedirect.com/science/article/abs/pii/003132039090113Y

Recognition of handwritten symbols

Torfinn Taxt, Jórunn B. Ólafsdóttir, Morten Dæhlen

http://neuralnetworksanddeeplearning.com/chap1.html - here you can find out more, with code samples, about applying a neural network to handwriting recognition.

















answered Nov 20 at 10:56









Farid Hasanov













  • Thanks a lot! I don't even need to process handwritten symbols, just typical arXiv papers, which use a very limited range of fonts.
    – Ying Zhou
    Nov 20 at 16:30












  • Oh, that makes the issue even easier!
    – Farid Hasanov
    Nov 21 at 8:03










  • With handwritten symbols, every symbol is subject to human error and inaccuracy in writing. You seem to be spared that problem.
    – Farid Hasanov
    Nov 21 at 8:04

















